DQX Installation Wizard

When you install DQX as a tool in a Databricks workspace using databricks labs install dqx, the installer runs an interactive wizard to collect settings for the "default" run configuration. This page documents each installer prompt with its expected input, default value, and the behavior it controls.

When you run the installation wizard:

  • Each prompt shows its default value in brackets. Press Enter to accept the default value.
  • Some prompts appear only when an earlier answer makes them relevant (for example, the streaming, quarantine, metrics, and job-cluster prompts).
  • When configuring locations (e.g. the input, output, or quarantine locations), the special value skipped instructs the installer not to configure the location.

Changing settings after installation

All responses map to fields in the generated configuration file. Add run configurations or change any of these settings by editing config.yml (open it with databricks labs dqx open-remote-config).
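
As a rough illustration of how answers map onto configuration fields, the sketch below nests the dotted property names used in the tables on this page into a dictionary and drops any section whose location was answered with skipped. The helper name build_config, the table name main.dq.orders_checked, and the resulting layout are assumptions made for illustration; this is not the installer's code, and the generated config.yml may be structured differently.

```python
from typing import Any


def build_config(answers: dict[str, Any]) -> dict[str, Any]:
    """Nest dotted property names (e.g. 'input_config.location') into a dict.

    Hypothetical helper: a section whose location is 'skipped' is left out
    entirely, mirroring how the wizard omits unconfigured locations.
    """
    skipped_sections = {
        key.split(".")[0]
        for key, value in answers.items()
        if key.endswith(".location") and value == "skipped"
    }
    config: dict[str, Any] = {}
    for key, value in answers.items():
        section, _, field = key.partition(".")
        if section in skipped_sections:
            continue  # the whole section was skipped at its location prompt
        if field:
            config.setdefault(section, {})[field] = value
        else:
            config[section] = value  # top-level property such as log_level
    return config


# Example answers keyed by the Property column; the table name is made up.
answers = {
    "log_level": "INFO",
    "input_config.location": "skipped",
    "input_config.format": "delta",
    "output_config.location": "main.dq.orders_checked",
    "output_config.mode": "append",
}
config = build_config(answers)
print(config)
```

Because the input location is skipped, the resulting dictionary contains only log_level and output_config.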

General settings

Configures the logging settings used by the DQX jobs that the installer deploys.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Log level | log_level | Logging verbosity for DQX workflows. | INFO | Accepts standard levels such as DEBUG, INFO, WARN, ERROR. |

Input data

Defines the input_config from which the source data is read. The input is optional during installation: leave the location as skipped to omit it and configure it later.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Should the input data be read using streaming? | input_config.is_streaming | Whether the input is read as a streaming source rather than a batch source. | no | When enabled, additional streaming trigger options are requested for the output and quarantine tables. |
| Provide location for the input data | input_config.location | Source data, as a path or a catalog.schema.table / schema.table name. | skipped | Use skipped to omit the input configuration. |
| Provide format for the input data | input_config.format | Input data format, e.g. delta, parquet, csv, json. | delta | Only asked when an input location is provided. |
| Provide schema for the input data | input_config.schema | Optional explicit schema, e.g. col1 int, col2 string. | skipped | Only asked when an input location is provided. Use skipped to let the format infer the schema. |
| Provide additional options for reading the input data | input_config.options | Reader options as a JSON object, e.g. {"versionAsOf": "0"}. | {} | Only asked when an input location is provided. |
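
Several prompts on this page expect a JSON object as the answer. The sketch below shows one way to prepare such an answer and check it before pasting it into the wizard. The helper parse_json_object is hypothetical and not part of DQX, and treating an empty answer as {} is an assumption based on the default shown in the table.

```python
import json
from typing import Any


def parse_json_object(answer: str) -> dict[str, Any]:
    """Parse a prompt answer and require that it is a JSON object."""
    value = json.loads(answer or "{}")  # assumption: empty answer means {}
    if not isinstance(value, dict):
        raise ValueError(f"expected a JSON object, got {type(value).__name__}")
    return value


# Reader options taken from the example in the table above.
reader_options = {"versionAsOf": "0"}
answer = json.dumps(reader_options)
print(answer)  # prints {"versionAsOf": "0"}

# Round-trip check: the string parses back to the same object.
assert parse_json_object(answer) == reader_options
```

Generating the string with json.dumps avoids common mistakes such as single quotes or trailing commas, which are not valid JSON.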

Output data

Defines an output_config where checked data (with the _errors / _warnings reporting columns) is written. Leave the location as skipped to omit it and skip writing valid data (e.g. when you only want to write invalid rows to a quarantine table).

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Provide output table | output_config.location | Output table, as catalog.schema.table / schema.table. | skipped | Use skipped to omit writing valid data. The output table can only be skipped if a quarantine table is provided. |
| Provide write mode for output table | output_config.mode | How results are written. | append | One of append or overwrite. |
| Provide format for the output data | output_config.format | Output data format. | delta | |
| Provide additional options for writing the output data | output_config.options | Writer options as a JSON object, e.g. {"mergeSchema": "true"}. | {} | |
| Provide additional options for writing the output data using streaming | output_config.trigger | Streaming trigger options, e.g. {"availableNow": true}. | {} | Only asked when streaming is enabled. |

Quarantine data

Defines a quarantine_config where rows that fail one or more DQX checks are written. If the location is left as skipped, invalid rows are written to the output table instead.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Provide quarantined table | quarantine_config.location | Quarantine table, as catalog.schema.table / schema.table. | skipped | Use skipped to keep invalid rows in the output table. The remaining quarantine questions are then not asked. |
| Provide write mode for quarantine table | quarantine_config.mode | How quarantined rows are written. | append | Only asked when a quarantine table is provided. One of append or overwrite. |
| Provide format for the quarantine data | quarantine_config.format | Quarantine data format. | delta | Only asked when a quarantine table is provided. |
| Provide additional options for writing the quarantine data | quarantine_config.options | Writer options as a JSON object. | {} | Only asked when a quarantine table is provided. |
| Provide additional options for writing the quarantine data using streaming | quarantine_config.trigger | Streaming trigger options. | {} | Only asked when a quarantine table is provided and streaming is enabled. |
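
The interaction between the output and quarantine locations can be summarized in a few lines. The sketch below encodes the rules stated on this page: at least one of the two tables must be provided, and invalid rows fall back to the output table when no quarantine table is set. The function name describe_routing, the wording of its return values, and the table names are illustrative assumptions, not DQX code.

```python
SKIPPED = "skipped"


def describe_routing(output_location: str, quarantine_location: str) -> str:
    """Describe where checked rows end up for a pair of location answers."""
    has_output = output_location != SKIPPED
    has_quarantine = quarantine_location != SKIPPED
    if not has_output and not has_quarantine:
        # The output table can only be skipped if a quarantine table is provided.
        raise ValueError("provide an output table, a quarantine table, or both")
    if has_output and has_quarantine:
        return "valid rows -> output table, invalid rows -> quarantine table"
    if has_output:
        return "valid and invalid rows -> output table"
    return "invalid rows -> quarantine table, valid rows not written"


# Hypothetical table names, used only to exercise the three valid combinations.
print(describe_routing("main.dq.orders_checked", "main.dq.orders_quarantine"))
print(describe_routing("main.dq.orders_checked", SKIPPED))
print(describe_routing(SKIPPED, "main.dq.orders_quarantine"))
```

Skipping both locations raises an error in this sketch, which corresponds to the combination that the output table note rules out.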

Summary metrics

Defines the metrics_config and custom_metrics used to track and write per-run summary metrics produced by the quality checker.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Do you want to store summary metrics from data quality checking in a table? | metrics_config | Whether summary metrics are persisted. | no | When no, the remaining metrics questions are not asked. |
| Provide table for storing summary metrics | metrics_config.location | Metrics table, as catalog.schema.table / schema.table. | required | Only asked when storing summary metrics; must be provided. |
| Provide write mode for metrics table | metrics_config.mode | How metrics are written. | append | Only asked when storing summary metrics. One of append or overwrite. |
| Provide format for the metrics data | metrics_config.format | Metrics data format. | delta | Only asked when storing summary metrics. |
| Provide additional options for writing the metrics data | metrics_config.options | Writer options as a JSON object. | {} | Only asked when storing summary metrics. |
| Provide custom metrics | custom_metrics | Optional list of Spark SQL aggregate expressions to track, e.g. ["avg(salary) as avg_salary"]. | [] | Only asked when storing summary metrics. Leave blank to track only the default data quality metrics. |
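
The custom metrics answer is a JSON list of strings rather than a JSON object. A minimal sketch of preparing it follows. The second expression and its department column are made up for illustration, and the check that every expression carries an "as" alias is an assumption modeled on the example in the table, not a documented requirement.

```python
import json

# Spark SQL aggregate expressions; the first comes from the table above,
# the second is a hypothetical addition.
custom_metrics = [
    "avg(salary) as avg_salary",
    "count(distinct department) as department_count",
]

# Assumption: each expression names its result with an alias.
missing_alias = [expr for expr in custom_metrics if " as " not in expr.lower()]
if missing_alias:
    raise ValueError(f"expressions without an alias: {missing_alias}")

# The string to paste at the "Provide custom metrics" prompt.
answer = json.dumps(custom_metrics)
print(answer)
```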

Quality checks location

Defines a checks_location where quality check definitions are stored. The check definitions can be stored in a table or file.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Provide location of the quality checks definitions | checks_location | Where quality checks (rules) are stored. | checks.yml | Accepts a file name (relative to the installation folder), a catalog.schema.table / schema.table table, or a full /Volumes/.../<file> path. |
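
To make the three accepted forms concrete, the sketch below classifies a checks_location value. It is an illustration only: the function classify_checks_location, the list of file extensions, and the order of the tests are assumptions, and the installer's own detection logic may differ. The table and volume paths in the usage lines are hypothetical.

```python
def classify_checks_location(location: str) -> str:
    """Guess which of the three documented forms a checks_location value uses."""
    if location.startswith("/Volumes/"):
        return "volume file"
    # Test file extensions before counting dots, because "checks.yml"
    # would otherwise look like a two-part schema.table name.
    if location.lower().endswith((".yml", ".yaml", ".json")):
        return "installation-folder file"
    if len(location.split(".")) in (2, 3):
        return "table"
    raise ValueError(f"unrecognized checks location: {location!r}")


print(classify_checks_location("checks.yml"))
print(classify_checks_location("main.dq.checks"))
print(classify_checks_location("/Volumes/main/dq/rules/checks.yml"))
```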

Profiler

Defines a file path where the DQX Profiler writes summary statistics about the profiled datasets.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Provide filename for storing profile summary statistics | profiler_config.summary_stats_file | File produced by the profiler workflow. | profile_summary_stats.yml | |

Compute

Controls the compute used by the profiler, quality checker, and end-to-end workflows. Serverless is recommended; choosing job clusters unlocks per-workflow Spark configuration.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Do you want to use standard job clusters for the workflows execution (not Serverless)? | serverless_clusters | Compute type for the workflows. | no (use Serverless) | Answer no to keep Serverless (recommended). Answer yes to use job clusters, which triggers the per-workflow prompts below. |
| Optional spark conf to use with the profiler / data quality / end-to-end workflow | profiler_spark_conf, quality_checker_spark_conf, e2e_spark_conf | Per-workflow Spark configuration as a JSON object, e.g. {"spark.sql.ansi.enabled": "true"}. | {} | Only asked when not using Serverless. Asked once per workflow. |
| Optional Cluster ID to use for the profiler / data quality / end-to-end workflow | profiler_override_clusters, quality_checker_override_clusters, e2e_override_clusters | An existing cluster to reuse, e.g. {"default": "<existing-cluster-id>"}. | {} | Only asked when not using Serverless. If left empty, a job cluster is created automatically when the job runs. |

Reference tables

Configures reference tables used by DQX checks (e.g. for schema validation or dataset comparison checks).

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Provide reference tables to use for checks | reference_tables | Reference datasets for checks such as referential integrity, as a JSON map of name to an input specification. | {} | The specification accepts location, format, schema, options, and is_streaming. Example: {"reference_vendor": {"location": "catalog.schema.table", "format": "delta"}}. |
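
Because the reference tables answer is a nested JSON map, it can help to check it before running the wizard. The sketch below validates an answer against the specification keys listed in the table. The function validate_reference_tables is hypothetical and not part of DQX, and requiring a location for every entry is an assumption.

```python
import json

# Keys the input specification accepts, as listed in the table above.
ALLOWED_KEYS = {"location", "format", "schema", "options", "is_streaming"}


def validate_reference_tables(answer: str) -> dict[str, dict]:
    """Parse the answer and check every entry against the allowed keys."""
    tables = json.loads(answer or "{}")
    if not isinstance(tables, dict):
        raise ValueError("reference tables must be a JSON object")
    for name, spec in tables.items():
        if not isinstance(spec, dict):
            raise ValueError(f"{name}: specification must be a JSON object")
        unknown = set(spec) - ALLOWED_KEYS
        if unknown:
            raise ValueError(f"{name}: unsupported keys {sorted(unknown)}")
        if "location" not in spec:  # assumption: location is mandatory
            raise ValueError(f"{name}: missing location")
    return tables


# The example answer from the table above.
answer = '{"reference_vendor": {"location": "catalog.schema.table", "format": "delta"}}'
tables = validate_reference_tables(answer)
print(sorted(tables))  # prints ['reference_vendor']
```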

Custom check functions

Configures a mapping used to reference custom check functions written in PySpark. Custom checks should be stored in your Databricks workspace as Python modules. Each key is a check function defined in the associated module.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Provide custom check functions | custom_check_functions | Custom check functions, as a JSON map of function name to a Python module path in the workspace or a volume. | {} | Example: {"my_func": "/Workspace/Shared/my_module.py"}. |

Dashboard SQL warehouse

Configures a Databricks SQL warehouse for serving DQX's built-in dashboard.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Select PRO or SERVERLESS SQL warehouse to run data quality dashboards on | warehouse_id | SQL warehouse used by the quality dashboard. | select from list | Choose an existing PRO or SERVERLESS warehouse, or create a new PRO warehouse. |

Dependencies

Configures installation of DQX from files in the workspace instead of PyPI. Useful for installing DQX jobs in air-gapped environments with no access to PyPI.

| Prompt | Property | What it configures | Default | Notes |
| --- | --- | --- | --- | --- |
| Does the given workspace block Internet access? | upload_dependencies | Whether DQX dependencies are uploaded to the workspace instead of being fetched from PyPI at runtime. | no | Answer yes for workspaces without Internet egress (air-gapped). |