Databricks Labs Data Generator documentation
The Databricks Labs Data Generator project provides a convenient way to generate large volumes of synthetic test data from within a Databricks notebook (or a regular Spark application).
By defining a data generation spec, either in conjunction with an existing schema or by creating a schema on the fly, you can control how the synthetic data is generated.
Because the data generator produces a PySpark DataFrame, it is simple to create a view over it and expose the data to Scala- or R-based Spark applications.
Because it is installable via %pip install, it can also be incorporated into environments such as Delta Live Tables.
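The workflow described above can be sketched roughly as follows. This is a minimal illustration, assuming the package has been installed under the name `dbldatagen` (for example with `%pip install dbldatagen`) and that a Spark session is available as `spark`. The column names, value ranges, view name, and the helper functions `build_test_data` and `publish_view` are hypothetical and chosen only for this example.

```python
# Hypothetical column definitions for illustration: (name, type, options).
COLUMN_DEFS = [
    ("device_id", "long", {"minValue": 1, "maxValue": 1000, "random": True}),
    ("status", "string", {"values": ["active", "inactive", "error"], "random": True}),
    ("reading", "double", {"minValue": 0.0, "maxValue": 100.0, "random": True}),
]


def build_test_data(spark, rows=100_000, partitions=4):
    """Define a data generation spec from COLUMN_DEFS and build a PySpark DataFrame."""
    # Imported here so this module can be loaded even where Spark is not installed.
    import dbldatagen as dg

    spec = dg.DataGenerator(spark, name="example_data", rows=rows, partitions=partitions)
    for name, col_type, options in COLUMN_DEFS:
        # Each call adds one generated column to the spec (schema created on the fly).
        spec = spec.withColumn(name, col_type, **options)
    return spec.build()


def publish_view(df, view_name="synthetic_devices"):
    """Register a temporary view so Scala or R code in the same session can query it."""
    df.createOrReplaceTempView(view_name)
    return view_name
```

In a notebook, calling `publish_view(build_test_data(spark))` would make the generated rows queryable by view name from other languages sharing the same Spark session. The sections listed below cover the available column options in detail.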
Getting Started
- Get Started Here
- Installation instructions
- Generating column data
- Using data ranges
- Generating text data
- Using data distributions
- Options for column specification
- Repeatable data generation
- Revisiting the IOT data example
- Using streaming data
- Generating JSON and structured column data
- Generating Change Data Capture (CDC) data
- Using multiple tables
- Extending text generation
- Troubleshooting data generation
Development
License