AI-Assisted Quality Checks Generation

DQX can generate data quality rule candidates using AI/LLM assistance based on natural language descriptions of your data quality requirements. This feature leverages Large Language Models (LLMs) to automatically create appropriate data quality checks from business descriptions, significantly reducing the time and effort required to define quality rules.

Overview

The AI-assisted quality checks generation is available via the DQGenerator class and supports:

  • Analyzing natural language descriptions of data quality requirements.
  • Optionally inspecting table schemas to understand data structure.
  • Generating appropriate data quality rules that match the requirements using built-in and custom check functions.
  • Validating the generated rules to ensure they are syntactically correct.

This approach is particularly useful when you have clear business requirements, but need help translating them into technical data quality rules. It's also useful for generating technical rules without requiring knowledge of DQX-specific syntax.

When to use AI-Assisted generation

AI-assisted generation is ideal for:

  • Translating business requirements into technical quality rules.
  • Quickly prototyping quality checks for new datasets.
  • Generating comprehensive checks based on compliance requirements.
  • Supplementing profiler-generated rules with business-specific rules.

For automated discovery of data patterns and statistics-based rules, consider using the Data Profiling approach instead. Future versions of the AI-assisted rules generation will also be able to leverage data profiles generated by the profiler.

Model Access

The feature requires access to an LLM model. DQX supports:

  • Databricks Foundation Model APIs (recommended): Use Databricks-hosted models like databricks/databricks-claude-sonnet-4-5 (default).
  • Custom API endpoints: Any OpenAI-compatible API endpoint.
  • Local models: Any model supported by DSPy.

You'll need appropriate API credentials to access the model endpoint. The api_key and api_base are not required when using Databricks foundation models.

Using AI-Assisted Generation programmatically

Prerequisites

To use the AI-assisted quality checks generation feature, you need to install DQX with the LLM extra dependencies:

pip install 'databricks-labs-dqx[llm]'

This will install the required packages including DSPy and other LLM-related dependencies.
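
To confirm the optional dependencies are available before running the generator, a quick import check like the following can help (a minimal sketch; the dspy package is installed by the [llm] extra):

try:
    import dspy  # installed by the [llm] extra
    print("DSPy is available, AI-assisted generation can be used.")
except ImportError:
    print("LLM extras missing - install with: pip install 'databricks-labs-dqx[llm]'")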

Basic Usage

Here's a simple example of generating quality rules from a natural language description:

from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient

# Initialize the generator
ws = WorkspaceClient()
generator = DQGenerator(workspace_client=ws)

# Generate rules from natural language description
user_input = """
Username should not start with 's' if age is less than 18.
All users must have a valid email address.
Age should be between 0 and 120.
"""

checks = generator.generate_dq_rules_ai_assisted(user_input=user_input)

print(checks)
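
The result is a list of checks in DQX's metadata format. The exact rules, functions, arguments, and criticality depend on the model response, but the shape is along these lines (illustrative only):

[
    {
        "criticality": "error",
        "check": {
            "function": "is_not_null",
            "arguments": {"column": "email"},
        },
    },
    ...
]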

Using with Table Schema

For better results, you can provide a fully qualified table name of the input data. The LLM will analyze the table schema to generate more accurate rules:

from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()
generator = DQGenerator(workspace_client=ws)

# Generate rules with table schema awareness
user_input = """
All customer records must have complete contact information.
Email addresses must be valid.
Phone numbers should follow standard format.
Registration date should not be in the future.
"""

checks = generator.generate_dq_rules_ai_assisted(
    user_input=user_input,
    table_name="catalog1.schema1.customers"
)

print(checks)
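
Once generated, the checks can be applied like any other DQX metadata checks. A minimal sketch, assuming a Spark session and the table above:

from databricks.labs.dqx.engine import DQEngine

dq_engine = DQEngine(ws)

# Apply the generated checks; the result DataFrame flags rows that fail any check
input_df = spark.read.table("catalog1.schema1.customers")
validated_df = dq_engine.apply_checks_by_metadata(input_df, checks)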

Providing custom check functions

DQX provides a collection of predefined built-in quality rules (check functions). Additionally, you can define your own custom check functions in Python to meet specific requirements (see here). Custom Python check functions can be passed to the generator, allowing the LLM to use them as needed.

from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.check_funcs import make_condition, register_rule
import pyspark.sql.functions as F

@register_rule("row")
def not_ends_with_suffix(column: str, suffix: str):
"""
Example of custom python row-level check function.
"""
return make_condition(
F.col(column).endswith(suffix), f"Column {column} ends with {suffix}", f"{column}_ends_with_{suffix}"
)

custom_check_functions = {"ends_with_suffix": not_ends_with_suffix}

# Initialize the generator
ws = WorkspaceClient()
generator = DQGenerator(
workspace_client=ws,
custom_check_functions=custom_check_functions
)

# Generate rules from natural language description
user_input = """
Username should not start with 's' if age is less than 18.
All users must have a valid email address.
Age should be between 0 and 120.
Email address must not end with '@gmail.com'.
"""

checks = generator.generate_dq_rules_ai_assisted(user_input=user_input)

print(checks)
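
When you later apply the generated checks, the same mapping of custom functions must be supplied so DQX can resolve them by name. A minimal sketch, assuming an input DataFrame input_df and passing the custom-function mapping as the third argument of apply_checks_by_metadata:

from databricks.labs.dqx.engine import DQEngine

dq_engine = DQEngine(ws)

# Custom functions must also be provided when applying metadata checks
validated_df = dq_engine.apply_checks_by_metadata(input_df, checks, custom_check_functions)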

Using Custom Model Configuration

You can configure the generator to use custom models or API endpoints:

from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.sdk import WorkspaceClient
from databricks.labs.dqx.config import LLMModelConfig

ws = WorkspaceClient()

# Option 1: Using Databricks Foundation Model API
model_config = LLMModelConfig(
    model_name="databricks/databricks-claude-sonnet-4-5"  # default
)
generator = DQGenerator(workspace_client=ws, llm_model_config=model_config)

# Option 2: Using Databricks Foundation Model API with explicit credentials
model_config = LLMModelConfig(
    model_name="databricks/databricks-claude-sonnet-4-5",
    api_key="your-api-key",
    api_base="https://your-workspace.azuredatabricks.net/serving-endpoints"
)
generator = DQGenerator(workspace_client=ws, llm_model_config=model_config)

# Option 3: Using an arbitrary LLM model
model_config = LLMModelConfig(
    model_name="databricks/databricks-llama-4-maverick",
    api_key="your-api-key",
    api_base="https://your-workspace.azuredatabricks.net/serving-endpoints"
)
generator = DQGenerator(workspace_client=ws, llm_model_config=model_config)

user_input = "All fields should contain data and have no empty values"
checks = generator.generate_dq_rules_ai_assisted(user_input=user_input)

Complete Example with Storage

Here's a complete workflow showing how to generate, review, and save AI-assisted quality rules:

from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import WorkspaceFileChecksStorageConfig
from databricks.sdk import WorkspaceClient
import yaml

# Initialize workspace client and generator
ws = WorkspaceClient()
generator = DQGenerator(workspace_client=ws)

# Define business requirements in natural language
business_requirements = """
Data Quality Requirements for User Registration Table:

1. All user IDs must be unique and not null
2. Usernames should not contain special characters
3. Email addresses must be valid and unique
4. Age must be between 18 and 100 for adult users
5. Country must be from a list of supported countries
6. Join date should not be in the future
7. Verified users must have a non-zero followers count
"""

# Generate quality rules using AI
checks = generator.generate_dq_rules_ai_assisted(
    user_input=business_requirements,
    table_name="production.users.registrations"
)

# Review the generated checks
print("Generated Quality Checks:")
print(yaml.safe_dump(checks, default_flow_style=False))

# Initialize DQEngine
dq_engine = DQEngine(ws)

# Save the generated checks to a workspace file
dq_engine.save_checks(
    checks=checks,
    config=WorkspaceFileChecksStorageConfig(
        location="/Shared/DataQuality/user_registration_checks.yml"
    )
)

print("Quality checks saved successfully!")

Combining AI-Assisted rules generation with Profiler-Based Rules

You can combine AI-assisted rules with profiler-generated rules for comprehensive coverage:

from databricks.labs.dqx.profiler.profiler import DQProfiler
from databricks.labs.dqx.profiler.generator import DQGenerator
from databricks.labs.dqx.engine import DQEngine
from databricks.labs.dqx.config import WorkspaceFileChecksStorageConfig
from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()

# Step 1: Profile the data to generate statistics-based (technical) rules
profiler = DQProfiler(ws)
input_df = spark.read.table("catalog1.schema1.sales_data")
summary_stats, profiles = profiler.profile(input_df)

generator = DQGenerator(ws)

# Generate profiler-based rules
profiler_checks = generator.generate_dq_rules(profiles)

# Step 2: Generate business logic rules using AI
business_requirements = """
Sales data must follow these business rules:
- Transaction amount must be positive
- Discount cannot exceed 50% of original price
- Customer ID must exist in the customer master table
- Sale date should be within the last 2 years
"""

ai_checks = generator.generate_dq_rules_ai_assisted(
    user_input=business_requirements,
    table_name="catalog1.schema1.sales_data"
)

# Step 3: Combine both sets of rules
all_checks = profiler_checks + ai_checks

# Step 4: Save combined checks
dq_engine = DQEngine(ws)
dq_engine.save_checks(
    checks=all_checks,
    config=WorkspaceFileChecksStorageConfig(
        location="/Shared/DataQuality/combined_sales_checks.yml"
    )
)

print(f"Combined {len(profiler_checks)} profiler rules with {len(ai_checks)} AI rules")
print(f"Total: {len(all_checks)} quality checks")

Example Use Cases

Use Case 1: Compliance Requirements

Generate quality rules from compliance documentation:

compliance_requirements = """
GDPR Compliance Requirements:
- User email addresses must be validated and properly formatted
- User consent date must not be null for any user
- Data retention: user records older than 7 years should be flagged
- Personal data fields must not contain placeholder values like 'N/A' or 'Unknown'
"""

checks = generator.generate_dq_rules_ai_assisted(
    user_input=compliance_requirements,
    table_name="gdpr.users.personal_data"
)

Use Case 2: Financial Data Validation

Generate rules for financial data:

financial_requirements = """
Financial Transaction Rules:
- Transaction amount must be non-negative
- Account balance should not go below minimum threshold
- Currency code must be valid ISO 4217 code
- Transaction timestamp should be within business hours (9 AM - 5 PM)
- Suspicious transactions above $10,000 should be flagged
"""

checks = generator.generate_dq_rules_ai_assisted(
    user_input=financial_requirements,
    table_name="finance.transactions.daily"
)

Use Case 3: Manufacturing

Generate rules for IoT sensor readings:

iot_requirements = """
IoT Sensor Data Quality Rules:
- Temperature readings should be between -40°C and 125°C
- Humidity levels must be between 0% and 100%
- Sensor reading timestamps should not have gaps longer than 5 minutes
- Battery level should not drop below 10% without alert
- All sensor IDs must be registered in the device registry
"""

checks = generator.generate_dq_rules_ai_assisted(
    user_input=iot_requirements,
    table_name="iot.sensors.readings"
)

Using AI-Assisted Generation in a no-code approach (Profiler Workflow)

You can run the profiler workflow to perform both statistics-based rules generation and AI-assisted rules generation. The profiler workflow automatically saves the generated checks to the checks location defined in the configuration file. To make the profiler workflow available, you need to install DQX as a tool in the workspace (see installation guide). More information about the profiler workflow can be found here.

Configuration options

The following fields from the configuration file are used for the AI-assisted rules generation:

  • checks_user_requirements: (optional) user input for AI-assisted rule generation
  • llm_config: (optional) configuration for the LLM-assisted features

The AI-assisted rules generation is only supported within the workflow if DQX is installed with serverless clusters used for execution, i.e. the serverless_clusters setting is enabled in the configuration file.

Example of the configuration file (relevant fields only):

serverless_clusters: true  # AI-assisted rules generation via workflow requires a serverless cluster for execution
llm_config:
  model:
    model_name: "databricks/databricks-claude-sonnet-4-5"
    api_key: xxx  # optional API key for the model as secret in the format: secret_scope/secret_key. Not required by foundation models
    api_base: xxx  # optional API base for the model as secret in the format: secret_scope/secret_key. Not required by foundation models
run_configs:
  - name: default
    checks_user_requirements: "business rules description"  # user input
    custom_check_functions:  # optional dict of custom check functions that the LLM can use in addition to the built-in checks
      my_func: custom_checks/my_funcs.py  # can be relative or absolute workspace path or UC Volume path

The api_key and api_base are not required when using Databricks foundation models. For security and easier workflow usage, store them as Databricks secrets referenced from the configuration file, using the slash-separated secret_scope/secret_key format, e.g. api_key: secret_scope/secret_key.
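
For example, the referenced secrets could be created with the Databricks SDK (a minimal sketch; the scope and key names below are placeholders):

from databricks.sdk import WorkspaceClient

ws = WorkspaceClient()

# Create a secret scope and store the model credentials (names are placeholders)
ws.secrets.create_scope(scope="dqx_llm")
ws.secrets.put_secret(scope="dqx_llm", key="api_key", string_value="your-api-key")
ws.secrets.put_secret(scope="dqx_llm", key="api_base", string_value="https://your-endpoint/serving-endpoints")

The configuration file would then reference them as api_key: dqx_llm/api_key and api_base: dqx_llm/api_base.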

Best Practices

  1. Be Specific: Provide clear and specific business requirements in your input. The more detailed your description, the better the generated rules. However, be aware that the model endpoint you use has token limitations (see Databricks Model APIs limits and quotas).
  2. Include Table Name: When possible, provide the fully qualified table name to help the LLM understand the actual data structure.
  3. Review Generated Rules: Always review the generated rules before applying them to production data. The AI may not perfectly understand all nuances of your requirements, so treat the generated rules as candidates and review them.
  4. Combine Approaches: Use AI-assisted generation for business logic rules and profiler-based generation for statistical (technical) rules.
  5. Iterate: If the generated rules don't match your expectations, refine your input description and regenerate, or update manually.

Troubleshooting

Error: DSPy compiler not available

Problem: You receive an error saying "DSPy compiler not available".

Solution: Install the LLM dependencies:

pip install 'databricks-labs-dqx[llm]'

Generated Rules Don't Match Requirements

Problem: The AI-generated rules don't align with your business requirements.

Solution:

  • Make your input description more specific and detailed.
  • Provide the table name so the LLM can analyze the actual schema.
  • Break complex requirements into simpler, more focused descriptions.
  • Include examples in your description.

Model Access Issues

Problem: Unable to connect to the LLM model endpoint.

Solution:

  • Verify your API credentials are correct.
  • Check that the API base URL is accessible from your workspace.
  • Ensure you have permissions to access the model endpoint.
  • Try using the default Databricks Foundation Model API.

Validation Errors in Generated Rules

Problem: The generated rules fail validation.

Solution:

  • Check the validation error messages for specific issues (see the sketch after this list).
  • Verify that the column names mentioned in your requirements exist in the table.
  • Ensure the rule types requested are supported by DQX (see Quality Checks Reference).
  • Simplify your requirements and regenerate.
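
To inspect the validation messages programmatically, the generated checks can be run through DQX's validator. A minimal sketch (the exact status fields may differ slightly by version):

from databricks.labs.dqx.engine import DQEngine

# Validate the generated checks and print any reported problems
status = DQEngine.validate_checks(checks)
if status.has_errors:
    print(status.errors)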

Limitations

  • The AI-assisted generation requires network access to the LLM model endpoint.
  • Generated rules quality depends on the clarity of the input description.
  • Complex business logic may require manual refinement of generated rules.
  • The feature requires additional LLM dependencies, which increases the package size.
  • LLM inference may take a few seconds to complete.