Databricks Labs Data Generator Documentation
The Databricks Labs Data Generator project provides a convenient way to generate large volumes of synthetic data from within a Databricks notebook (or a regular Spark application).
By defining a data generation spec, either in conjunction with an existing schema or by creating a schema on the fly, you can control how synthetic data is generated.
Because the data generator produces a PySpark DataFrame, it is simple to create a view over it to expose the generated data to Scala- or R-based Spark applications as well.
Because it is installable via %pip install, it can also be used in environments such as Delta Live Tables.
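For example, here is a minimal sketch of that workflow, assuming a Spark session named `spark` is available (as in a Databricks notebook); the column names, row count, and value ranges are purely illustrative:

```python
import dbldatagen as dg

# Define a data generation spec for some hypothetical device readings
data_spec = (
    dg.DataGenerator(spark, name="device_readings", rows=100000, partitions=4)
    .withIdOutput()                                              # include the generated id column
    .withColumn("device_id", "long", minValue=1, maxValue=1000)  # values drawn from the given range
    .withColumn("temperature", "float", minValue=-20.0, maxValue=50.0, random=True)
)

# Build the spec into an ordinary PySpark DataFrame
df = data_spec.build()

# Expose the generated data to Scala- or R-based Spark code via a view
df.createOrReplaceTempView("device_readings")
```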
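A typical install cell, assuming the package name `dbldatagen` on PyPI:

```python
# Install the library into the notebook-scoped Python environment
%pip install dbldatagen
```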
- Get Started Here
- Installation instructions
- Generating column data
- Using standard datasets
- Using data ranges
- Generating text data
- Using data distributions
- Options for column specification
- Repeatable Data Generation
- Revisiting the IOT data example
- Using constraints to control data generation
- Using streaming data
- Generating JSON and structured column data
- Generating synthetic data from existing data
- Generating Change Data Capture (CDC) data
- Using multiple tables
- Extending text generation
- Use with Delta Live Tables
- Troubleshooting data generation