Databricks Labs Data Generator Documentation

The Databricks Labs Data Generator project provides a convenient way to generate large volumes of synthetic data from within a Databricks notebook (or a regular Spark application).

By defining a data generation spec, either in conjunction with an existing schema or through creating a schema on the fly, you can control how synthetic data is generated.
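As a rough illustration of what a data generation spec can look like, the sketch below builds one with a schema created on the fly. It assumes the package is importable as `dbldatagen`. The function name `build_device_spec`, the column names, and the value ranges are hypothetical and chosen only for this example. The import sits inside the function so the definition can be loaded even where the library is not yet installed.

```python
def build_device_spec(spark, rows=100_000, partitions=4):
    """Return a data generation spec for a hypothetical device-events table.

    `spark` is an active SparkSession (predefined in a Databricks notebook).
    All column names and ranges below are illustrative assumptions.
    """
    # Imported lazily: requires the dbldatagen package to be installed.
    import dbldatagen as dg

    spec = (
        dg.DataGenerator(spark, name="device_events", rows=rows, partitions=partitions)
        # Random integer identifier within a fixed range.
        .withColumn("device_id", "long", minValue=1, maxValue=1000, random=True)
        # Categorical column drawn from a fixed list of values.
        .withColumn("status", "string", values=["active", "idle", "offline"], random=True)
        # Random floating-point measurement.
        .withColumn("temperature", "float", minValue=10.0, maxValue=40.0, random=True)
    )
    return spec
```

In a notebook, calling `build_device_spec(spark).build()` would then produce the PySpark DataFrame described by the spec.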

Because the data generator produces a PySpark DataFrame, it is simple to create a view over it and expose the data to Scala or R-based Spark applications.
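One way to create such a view is with the standard PySpark `createOrReplaceTempView` method. The helper below is a minimal sketch: the function name `expose_as_view` and the default view name `synthetic_data` are assumptions made for this example, and `df` can be any PySpark DataFrame, including one built by the data generator.

```python
def expose_as_view(df, view_name="synthetic_data"):
    """Register `df` as a temporary view and return the view name.

    The view is scoped to the current Spark session, so it is visible to
    other languages only when they share that session.
    """
    df.createOrReplaceTempView(view_name)
    return view_name
```

Code in another language that shares the same Spark session could then read the data by name, for example with `spark.table("synthetic_data")` in Scala or with a SQL query against the view.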

Because it can be installed with %pip install, it can also be used in environments such as Delta Live Tables.

License

Indices and tables