Databricks Labs Synthetic Data Generator Release Notes
Change History
All notable changes to the Databricks Labs Data Generator will be documented in this file.
Version 0.4.0
Changed
Updated minimum pyspark version to be 3.2.1, compatible with Databricks runtime 10.4 LTS or later
Modified data generator to allow specification of constraints to the data generation process
Updated documentation for generating text data.
Modified data distributions to use abstract base classes
Migrated data distribution tests to use pytest
Additional standard datasets
Added
Added classes for constraints on the data generation via new package dbldatagen.constraints
Added support for standard data sets via the new package dbldatagen.datasets
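As a rough conceptual sketch of what the new constraints support does (this is deliberately not the dbldatagen.constraints API, whose class and method names are not reproduced here), a constraint can be thought of as a predicate that rows of the generated data must satisfy:

```python
# Conceptual sketch only -- not the actual dbldatagen.constraints API.
# A data-generation constraint restricts the output to rows that satisfy
# one or more predicates.
def constrain(rows, *predicates):
    """Keep only the rows that satisfy every constraint predicate."""
    return [row for row in rows if all(pred(row) for pred in predicates)]

# Example: keep only rows with a positive amount
sample = [{"id": i, "amount": i * 10 - 20} for i in range(5)]
positive = constrain(sample, lambda r: r["amount"] > 0)
```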
Version 0.3.6 Post 1
Changed
Updated docs for complex data types / JSON to correct code examples
Updated license file in public docs
Fixed
Fixed scenario where DataAnalyzer is used on a dataframe containing a column named summary
Version 0.3.6
Changed
Updated readme to include details on which versions of Databricks runtime support Unity Catalog shared access mode
Updated code to use default parallelism of 200 when using a shared Spark session
Updated code to use Spark’s SQL function element_at instead of array indexing due to incompatibility
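The distinction matters because element_at uses 1-based indexing (with negative indices counting from the end of the array), while bracket indexing is 0-based. A plain-Python sketch of the element_at semantics, for illustration only:

```python
def element_at(arr, index):
    """Plain-Python sketch of Spark SQL's element_at semantics for arrays:
    1-based indexing, negative indices count from the end, index 0 is an
    error, and (under Spark's non-ANSI behaviour) out-of-range indices
    yield NULL (None here)."""
    if index == 0:
        raise ValueError("element_at index must not be 0")
    pos = index - 1 if index > 0 else len(arr) + index
    if 0 <= pos < len(arr):
        return arr[pos]
    return None
```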
Notes
This version marks the change of the minimum supported Databricks runtime to 10.4 LTS and later releases.
While there are no known incompatibilities with Databricks 9.1 LTS, we will not test against this release.
Version 0.3.5
Changed
Added formatting of generated code as HTML for script methods
Allow use of inferred types on the withColumn method when the expr attribute is used
Added withStructColumn method to allow simplified generation of struct and JSON columns
Modified pipfile to use newer version of package specifications
Version 0.3.4 Post 3
Changed
Build now uses Python 3.8.12. Updated build process to reflect that.
Version 0.3.4 Post 2
Fixed
Fix for use of values in columns of type array, map and struct
Fix for generation of arrays via numFeatures and structType attributes when numFeatures has a value of 1
Version 0.3.4 Post 1
Fixed
Fix for use and configuration of root logger
Acknowledgements
Thanks to Marvin Schenkel for the contribution
Version 0.3.4
Changed
Modified option to allow a range when specifying numFeatures with structType='array', to allow generation of a varying number of columns
When generating multi-column or array-valued columns, compute the random seed with a different name for each column
Additional build ordering enhancements to reduce circumstances where explicit base column must be specified
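One way to picture the per-column seed change above (an illustrative sketch, not the library's actual implementation): derive a distinct but reproducible seed for each column by combining the base seed with the column name:

```python
import hashlib

def column_seed(base_seed, column_name):
    """Illustrative sketch (not dbldatagen's actual implementation):
    derive a distinct, reproducible random seed per column by hashing
    the base seed together with the column name."""
    digest = hashlib.sha256(f"{base_seed}:{column_name}".encode()).hexdigest()
    return int(digest[:8], 16)
```

This avoids the problem of multiple columns sharing one seed and therefore producing identical random sequences.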
Added
Scripting of data generation code from schema (Experimental)
Scripting of data generation code from dataframe (Experimental)
Added top-level random attribute to the data generator specification constructor
Version 0.3.3 Post 2
Changed
Fixed use of logger in _version.py and in spark_singleton.py
Fixed template issues
Document reformatting and updates, related code comment changes
Fixed
Apply pandas optimizations when generating multiple columns using the same withColumn or withColumnSpec
Added
Added use of prospector to build process to validate common code issues
Version 0.3.2
Changed
Adjusted column build phase separation (i.e. which select statement is used to build columns) so that a column with a SQL expression can refer to previously created columns without use of a baseColumn attribute
Changed build labelling to comply with PEP 440
Fixed
Fixed compatibility of build with older versions of runtime that rely on pyparsing version 2.4.7
Added
Parsing of SQL expressions to determine column dependencies
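A minimal sketch of the dependency-detection idea (the library's actual parsing, which relies on pyparsing, is more robust than this): scan a SQL expression for identifier tokens and intersect them with the set of already-defined columns:

```python
import re

def referenced_columns(sql_expr, known_columns):
    """Illustrative sketch only: determine which previously defined
    columns a SQL expression depends on, by scanning the expression
    for identifier tokens."""
    tokens = set(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", sql_expr))
    return sorted(tokens & set(known_columns))
```

The resulting dependency set is what lets the build process place a SQL-expression column in a later build phase than the columns it refers to.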
Notes
The enhancements to build ordering do not change the actual order of column building - they adjust which phase columns are built in
Version 0.3.1
Changed
Refactoring of template text generation for better performance via vectorized implementation
Additional migration of tests to use of pytest
Fixed
Added type parsing support for binary types and constructs such as nvarchar(10)
Fixed error occurring when schema contains map, array or struct.
Added
Ability to change the name of the seed column to a custom name (defaults to id)
Added type parsing support for structs, maps and arrays and combinations of the above
Notes
Column definitions for map, struct or array must use the expr attribute to initialize the field. Defaults to NULL
Version 0.3.0
Changes
Validation for use in Delta Live Tables
Documentation updates
Minor bug fixes
Changes to build and release process to improve performance
Modified dependencies to base release on package versions used by Databricks Runtime 9.1 LTS
Updated to Spark 3.2.1 or later
Unit test updates - migration from unittest to pytest for many tests
Version 0.2.1
Features
Uses pipenv for main build process
Supports Conda based development build process
Uses pytest-cov to track unit test coverage
Added HTML help - use make docs to create it
Added support for weights with non-random columns
Added support for generation of streaming data sets
Added support for multiple dependent base columns
Added support for scripting of table create statements
Resolved many PEP8 style issues
Resolved / triaged prospector / pylint issues
Changed help to RTD scheme and added additional help content
Moved docs to docs folder
Added support for specific distributions
Renamed packaging to dbldatagen
Releases now available at https://github.com/databrickslabs/dbldatagen/releases
Code tidy-up and rename of options
Added text generation plugin support for Python functions and 3rd-party libraries
Use of data generator to generate static and streaming data sources in Databricks Delta Live Tables
Added support for install from PyPi
General Requirements
See the contents of the file python/require.txt to see the Python package dependencies
The code for the Databricks Data Generator has the following dependencies:
Requires Databricks runtime 9.1 LTS or later
Requires Spark 3.1.2 or later
Requires Python 3.8.10 or later
While the data generator framework does not require all libraries used by the runtimes, where a library from the Databricks runtime is used, it will use the version found in the Databricks runtime for 9.1 LTS or later. You can use older versions of the Databricks Labs Data Generator by referring to that explicit version.
The recommended method to install the package is to use pip install in your notebook to install the package from PyPi. For example:
%pip install dbldatagen
To use an older DB runtime version in your notebook, you can use the following code in your notebook:
%pip install git+https://github.com/databrickslabs/dbldatagen@dbr_7_3_LTS_compat
See the Databricks runtime release notes for the full list of dependencies used by the Databricks runtime.
This can be found at: https://docs.databricks.com/release-notes/runtime/releases.html