Contributing to the Databricks Labs Data Generator
We happily welcome contributions to dbldatagen.
We use GitHub Issues to track community reported issues and GitHub Pull Requests for accepting changes.
License
When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project’s Databricks license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project’s Databricks license and warrant that you have the legal authority to do so.
Building the code
Package Dependencies
See the contents of the file python/require.txt for the Python package dependencies.
Dependent packages are not installed automatically by the dbldatagen package.
Python compatibility
The code has been tested with Python 3.8.12 and later.
Older releases were tested with Python 3.7.5, but as of this release, the Databricks Runtime 9.1 LTS or later is required.
Checking your code for common issues
Run make dev-lint from the project root directory to run various code style checks.
These are based on the use of prospector, pylint and related tools.
Setting up your build environment
Run make buildenv from the root of the project directory to set up a pipenv based build environment.
Alternatively, run make create-dev-env from the root of the project directory to set up a conda based virtualized Python build environment in the project directory.
You can use alternative build virtualization environments or simply install the requirements directly in your environment.
Build steps
Our recommended mechanism for building the code is to use a conda or pipenv based development process, but the code can be built with any Python virtual environment.
Spark dependencies
The builds have been tested against Spark 3.2.1. This requires OpenJDK 1.8.56 or a later version of Java 8. The Databricks runtimes use the Azul Zulu build of OpenJDK 8, and we have used these in local testing. These are not installed automatically by the build process, so you will need to install them separately.
Building with Conda
To build with conda, perform the following steps:
- Run make create-dev-env from the main project directory to create your conda environment, if using conda
- Activate the conda environment, e.g. conda activate dbl_testdatagenerator
- Install the necessary dependencies in your conda environment via make install-dev-dependencies
- To build and run the tests with a coverage report, run make dev-test-with-html-report from the main project directory
- To make the distributable, run make dev-dist from the main project directory
- The resulting wheel file will be placed in the dist subdirectory
Building with Pipenv
To build with pipenv, perform the following steps:
- Run make buildenv from the main project directory to create your pipenv environment, if using pipenv
- Install the necessary dependencies in your pipenv environment via make install-dev-dependencies
- To build and run the tests with a coverage report, run make test-with-html-report from the main project directory
- To make the distributable, run make dist from the main project directory
- The resulting wheel file will be placed in the dist subdirectory
The resulting build has been tested against Spark 3.2.1
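After installing the built wheel into a Python environment, a quick sanity check might look like the following sketch; it uses only the standard library to confirm the import works and to look up the installed version:

import importlib.metadata

import dbldatagen  # verifies the package imports cleanly

# report the installed distribution version (importlib.metadata requires Python 3.8+)
print(importlib.metadata.version("dbldatagen"))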
Creating the HTML documentation
Run make docs from the main project directory.
The main HTML document will be in the file ./docs/docs/build/html/index.html (relative to the root of the build directory).
Building the Python wheel
Run make clean dist from the main project directory.
Testing
Developing new tests
New tests should be created using pytest, with classes combining multiple pytest tests.
Existing test code contains tests based on Python’s unittest framework, but these are run with pytest rather than unittest.
To get a spark instance for test purposes, use the following code:
import dbldatagen as dg
spark = dg.SparkSingleton.getLocalInstance("<name to flag spark instance>")
The name used to flag the spark instance should be the test module or test class name.
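For example, a minimal pytest test module might look like the following sketch; the class name, test name and data specification are illustrative and not part of the existing test suite:

import dbldatagen as dg

spark = dg.SparkSingleton.getLocalInstance("test_example_generation")

class TestExampleGeneration:
    def test_row_count(self):
        # illustrative spec: generate 100 rows and check the expected count
        df = dg.DataGenerator(sparkSession=spark, name="test_data", rows=100).build()
        assert df.count() == 100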
Running unit / integration tests
If using an environment with multiple Python versions, make sure to use a virtual environment or similar to pick up the correct Python version. The make target create-dev-env, described above, will create a suitable environment for you.
If necessary, set PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to point to the correct versions of Python.
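Exporting these variables in your shell before invoking the make targets is the usual approach; as a hedged alternative for local runs, they can also be set from Python before the local Spark instance is created, for example at the top of a conftest.py:

import os
import sys

# point the driver and workers at the interpreter running the tests;
# this must happen before the local Spark session is created
os.environ.setdefault("PYSPARK_PYTHON", sys.executable)
os.environ.setdefault("PYSPARK_DRIVER_PYTHON", sys.executable)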
To run the tests using a conda environment:
- Run make dev-test from the main project directory to run the unit tests.
- Run make dev-test-with-html-report to generate a test coverage report in htmlcov/index.html.
To run the tests using a pipenv environment:
- Run make test from the main project directory to run the unit tests.
- Run make test-with-html-report to generate a test coverage report in htmlcov/index.html.
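During development it can also be convenient to run a single test module directly with pytest from within the activated environment. The sketch below does this programmatically (the test file name is hypothetical); running python -m pytest with the same file from the shell is equivalent:

import pytest

# hypothetical module path - substitute an actual file from the tests directory
pytest.main(["-v", "tests/test_quick.py"])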
Using the Databricks Labs data generator
The recommended method for installation is to install from the PyPI package.
You can install the library as a notebook-scoped library when working within the Databricks notebook environment through the use of a %pip cell in your notebook.
To install as a notebook-scoped library, create and execute a notebook cell with the following text:
%pip install dbldatagen
This installs from the PyPI package.
You can also install from the release binaries or directly from the GitHub sources.
The release binaries can be accessed at:
Databricks Labs GitHub Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases
The %pip install method also works on the Databricks Community Edition.
Alternatively, you can download a wheel file and use the Databricks library installation mechanism to install the wheel into your workspace.
The %pip install method can also install a specific binary release.
For example, the following command installs release v0.2.1:
%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl
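Once installed, a minimal sketch like the following can confirm that the library works; the column name, row count and partition count are illustrative. In a Databricks notebook the spark session already exists, so the SparkSingleton line can be omitted:

import dbldatagen as dg

spark = dg.SparkSingleton.getLocalInstance("install check")  # not needed in a Databricks notebook

# illustrative spec: 1000 rows with the default id column plus a random float column
df = (dg.DataGenerator(sparkSession=spark, name="install_check", rows=1000, partitions=4)
      .withColumn("score", "float", minValue=0.0, maxValue=100.0, random=True)
      .build())

df.show(5)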
Coding Style
The code follows the PySpark coding conventions.
In essence, it follows the Python PEP 8 coding conventions, except that method and argument names use mixed case starting with a lowercase letter, rather than underscore-separated names, following the PySpark conventions.
See https://legacy.python.org/dev/peps/pep-0008/
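As an illustration (the names below are hypothetical and not part of the library), new methods and arguments should use the first form rather than the second:

class ExampleSpec:
    # preferred: camelCase method and argument names, per the PySpark conventions
    def withRandomValues(self, minValue=0, maxValue=100):
        ...

    # avoid: underscore_separated names (plain PEP 8 style)
    def with_random_values(self, min_value=0, max_value=100):
        ...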