databricks.labs.dqx.llm.llm_engine

DQLLMEngine Objects

class DQLLMEngine()

High-level interface for LLM-based data quality rule generation.

This class acts as a facade: it exposes a small, simple interface and hides the complexity of the underlying LLM system.

__init__

def __init__(model_config: LLMModelConfig,
spark: SparkSession | None = None,
custom_check_functions: dict[str, Callable] | None = None)

Initialize the LLM engine.

The constructor configures the DSPy model once, then creates the components that rely on that global configuration.

Arguments:

  • model_config - Configuration for the LLM model.
  • spark - Optional Spark session. If None, a new session is created.
  • custom_check_functions - Optional custom check functions to include.

detect_business_rules_with_llm

def detect_business_rules_with_llm(
user_input: str,
schema_info: str = "") -> dspy.primitives.prediction.Prediction

Detect DQX rules from a natural-language request, optionally guided by a table schema.

If schema_info is empty (the default), the schema is inferred automatically from user_input before the rules are generated.

Arguments:

  • user_input - Natural language description of data quality requirements.
  • schema_info - Optional JSON string containing table schema. If empty (default), triggers schema inference.
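Since schema_info is passed as a JSON string, one way to prepare it is to build a plain dictionary and serialize it. The sketch below is illustrative only: the "columns" layout, the column names, and the `engine` variable in the commented call are assumptions, not a format confirmed by this reference.

```python
import json

# Hypothetical schema layout; the exact JSON shape the engine expects
# is an assumption made for this example.
schema = {
    "columns": [
        {"name": "order_id", "type": "bigint"},
        {"name": "customer_email", "type": "string"},
        {"name": "amount", "type": "decimal(10,2)"},
    ]
}

# schema_info must be a string, so serialize the dictionary to JSON.
schema_info = json.dumps(schema)
user_input = "Order IDs must be unique and amounts must be positive."

# With an already configured engine (not constructed here), the call would be:
# prediction = engine.detect_business_rules_with_llm(user_input, schema_info=schema_info)
```

Omitting schema_info (or passing an empty string) would instead trigger the schema inference path described above.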

Returns:

A Prediction object containing:

  • quality_rules: The generated DQ rules
  • reasoning: Explanation of the rules
  • guessed_schema_json: The inferred schema (if schema was inferred)
  • assumptions_bullets: Assumptions made (if schema was inferred)
  • schema_info: The final schema used (if schema was inferred)
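Because three of these fields are present only when the schema was inferred, calling code may want to read them defensively. The following is a minimal sketch: `summarize_prediction` is a hypothetical helper, and a `SimpleNamespace` with made-up values stands in for the real Prediction object so the example runs without a model endpoint.

```python
from types import SimpleNamespace
from typing import Any


def summarize_prediction(prediction: Any) -> dict[str, Any]:
    """Collect the documented fields, tolerating those that are absent."""
    summary = {
        "quality_rules": getattr(prediction, "quality_rules", None),
        "reasoning": getattr(prediction, "reasoning", None),
    }
    # Documented as present only when the schema was inferred.
    for name in ("guessed_schema_json", "assumptions_bullets", "schema_info"):
        value = getattr(prediction, name, None)
        if value is not None:
            summary[name] = value
    return summary


# Stand-in for a result produced when a schema was supplied by the caller.
# The field values are invented for illustration.
fake_prediction = SimpleNamespace(
    quality_rules='[{"criticality": "error", "check": {"function": "is_not_null"}}]',
    reasoning="order_id identifies each order, so it must be present.",
)

summary = summarize_prediction(fake_prediction)
print(sorted(summary))
```

Reading optional fields with a default avoids an AttributeError when the caller supplied the schema and no inference took place.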

detect_primary_keys_with_llm

def detect_primary_keys_with_llm(table: str) -> dict[str, Any]

Detect primary keys using LLM-based analysis.

This method analyzes table schema and metadata to identify primary key columns.

Arguments:

  • table - The table name to analyze.

Returns:

A dictionary containing the primary key detection result with the following keys:

  • table: The table name
  • success: Whether detection was successful
  • primary_key_columns: List of detected primary key columns (if successful)
  • confidence: Confidence level (high/medium/low)
  • reasoning: LLM reasoning for the selection
  • has_duplicates: Whether duplicates were found (if validation performed)
  • duplicate_count: Number of duplicate combinations (if validation performed)
  • error: Error message (if failed)
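Several of these keys are conditional, so a consumer should check success first and use defaults for the rest. The sketch below shows one way to do that; `describe_pk_result` is a hypothetical helper, and the sample dictionaries are hand-written to match the documented keys rather than real engine output.

```python
from typing import Any


def describe_pk_result(result: dict[str, Any]) -> str:
    """Turn a primary key detection result into a one-line summary."""
    table = result.get("table", "<unknown>")
    if not result.get("success"):
        # On failure only the error message is expected to be meaningful.
        return f"{table}: detection failed ({result.get('error', 'no error message')})"

    columns = ", ".join(result.get("primary_key_columns", []))
    line = f"{table}: primary key ({columns}), confidence={result.get('confidence', 'unknown')}"
    # Duplicate information is present only if validation was performed.
    if result.get("has_duplicates"):
        line += f", {result.get('duplicate_count', 0)} duplicate combinations"
    return line


# Hand-written samples; table name and values are invented for illustration.
sample_ok = {
    "table": "main.sales.orders",
    "success": True,
    "primary_key_columns": ["order_id"],
    "confidence": "high",
    "reasoning": "order_id is unique and never null.",
    "has_duplicates": False,
    "duplicate_count": 0,
}
sample_failed = {
    "table": "main.sales.orders",
    "success": False,
    "error": "table not found",
}

print(describe_pk_result(sample_ok))
print(describe_pk_result(sample_failed))
```

In real use the dictionary would come from detect_primary_keys_with_llm(table) on a configured engine.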