# Contributing to the Databricks Labs Data Generator

We happily welcome contributions to *dbldatagen*. We use GitHub Issues to track community-reported issues and GitHub Pull Requests for accepting changes.

## License

When you contribute code, you affirm that the contribution is your original work and that you license the work to the project under the project's Databricks license. Whether or not you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's Databricks license and warrant that you have the legal authority to do so.

# Building the code

## Package Dependencies

See the contents of the file `python/require.txt` for the Python package dependencies. Dependent packages are not installed automatically by the `dbldatagen` package.

## Python compatibility

The code has been tested with Python 3.8.12 and later. Older releases were tested with Python 3.7.5, but as of this release the library requires Databricks Runtime 9.1 LTS or later.

## Checking your code for common issues

Run `./lint.sh` from the project root directory to run various code style checks. These are based on the use of `prospector`, `pylint` and related tools.

## Setting up your build environment

Run `make buildenv` from the root of the project directory to set up a `pipenv`-based build environment.

Run `make create-dev-env` from the root of the project directory to set up a conda-based virtualized Python build environment in the project directory.

You can use alternative build virtualization environments or simply install the requirements directly in your environment.

## Build steps

Our recommended mechanism for building the code is to use a `conda` or `pipenv` based development process, but the code can be built with any Python virtualization environment.

### Building with Conda

To build with `conda`, perform the following steps:

- Run `make create-dev-env` from the main project directory to create your conda environment.
- Activate the conda environment, e.g. `conda activate dbl_testdatagenerator`.
- Install the necessary dependencies in your conda environment via `make install-dev-dependencies`.
- To build and run the tests with a coverage report, run `make dev-test-with-html-report` from the main project directory.
- To make the distributable, run `make dev-dist` from the main project directory. The resulting wheel file will be placed in the `dist` subdirectory.

### Building with Pipenv

To build with `pipenv`, perform the following steps:

- Run `make buildenv` from the main project directory to create your pipenv environment.
- Install the necessary dependencies in your pipenv environment via `make install-dev-dependencies`.
- To build and run the tests with a coverage report, run `make test-with-html-report` from the main project directory.
- To make the distributable, run `make dist` from the main project directory. The resulting wheel file will be placed in the `dist` subdirectory.

The resulting build has been tested against Spark 3.0.1.

## Creating the HTML documentation

Run `make docs` from the main project directory. The main HTML document will be in the file (relative to the root of the build directory) `./docs/docs/build/html/index.html`.

## Building the Python wheel

Run `make clean dist` from the main project directory. The resulting wheel file will be placed in the `dist` subdirectory.
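A quick way to check that the built wheel works is to install it into your environment (for example, with `pip install dist/dbldatagen-*.whl`) and run a small generation script. The following is a minimal sketch only; the spec name, column names, and column definitions are illustrative, not a prescribed test.

```python
import dbldatagen as dg

# Local Spark session via the helper described in the Testing section below
spark = dg.SparkSingleton.getLocalInstance("wheel smoke test")

# Illustrative data spec: a couple of columns and a small row count, purely to
# confirm that the installed wheel imports and generates data end to end
df = (
    dg.DataGenerator(spark, name="smoke_test", rows=1000, partitions=4)
    .withColumn("code", "int", minValue=1, maxValue=20)
    .withColumn("category", "string", values=["a", "b", "c"])
    .build()
)

assert df.count() == 1000
```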
# Testing

## Developing new tests

New tests should be created using pytest, with classes combining multiple `pytest` tests. Existing test code contains tests based on Python's `unittest` framework, but these are run via `pytest` rather than `unittest`.

To get a `spark` instance for test purposes, use the following code:

```python
import dbldatagen as dg

# Pass the test module or test class name to flag the Spark instance
spark = dg.SparkSingleton.getLocalInstance("")
```

The name used to flag the spark instance should be the test module or test class name.

## Running unit / integration tests

If using an environment with multiple Python versions, make sure to use virtual env or similar to pick up the correct Python version. The make target `create-dev-env` described above sets up such an environment. If necessary, set `PYSPARK_PYTHON` and `PYSPARK_DRIVER_PYTHON` to point to the correct versions of Python.

To run the tests using a `conda` environment:

- Run `make dev-test` from the main project directory to run the unit tests.
- Run `make dev-test-with-html-report` to generate a test coverage report in `htmlcov/index.html`.

To run the tests using a `pipenv` environment:

- Run `make test` from the main project directory to run the unit tests.
- Run `make test-with-html-report` to generate a test coverage report in `htmlcov/index.html`.

# Using the Databricks Labs data generator

The recommended method for installation is to install from the PyPI package.

You can install the library as a notebook-scoped library when working within the Databricks notebook environment through the use of a `%pip` cell in your notebook. To install as a notebook-scoped library, create and execute a notebook cell with the following text:

> `%pip install dbldatagen`

This installs from the PyPI package.

You can also install from release binaries or directly from the GitHub sources. The release binaries can be accessed at:

- Databricks Labs GitHub Data Generator releases - https://github.com/databrickslabs/dbldatagen/releases

The `%pip install` method also works on the Databricks Community Edition.

Alternatively, you can download a wheel file and use the Databricks library installation mechanism to install a wheel-based library into your workspace.

The `%pip install` method can also download a specific binary release. For example, the following code downloads the release v0.2.1:

> `%pip install https://github.com/databrickslabs/dbldatagen/releases/download/v021/dbldatagen-0.2.1-py3-none-any.whl`

# Coding Style

The code follows the PySpark coding conventions. Essentially it follows the Python PEP 8 coding conventions, except that method and argument names use mixed case starting with a lower-case letter, rather than underscores, following the PySpark coding conventions.

See https://legacy.python.org/dev/peps/pep-0008/
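For example, a helper written in this style might look like the sketch below. The function, column names, and expression here are hypothetical and only illustrate the camelCase naming convention; they are not part of the project's API.

```python
# Hypothetical example of the project's naming style: the function and its
# parameters use camelCase (PySpark style) rather than PEP 8 snake_case.
def withUppercaseColumn(dataSpec, colName, baseColumn):
    """Add an illustrative derived column containing the upper-cased base column."""
    return dataSpec.withColumn(colName, "string", baseColumn=baseColumn,
                               expr=f"upper({baseColumn})")
```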