databricks.labs.dqx.llm.table_manager
TableDataProvider Objects
class TableDataProvider(Protocol)
Protocol defining the interface for table data access operations.
get_table_columns
def get_table_columns(table: str) -> DataFrame
Retrieve table column definitions.
Arguments:
table - Fully qualified table name.
Returns:
DataFrame with columns: col_name, data_type, comment.
get_existing_primary_key
def get_existing_primary_key(table: str) -> str | None
Retrieve existing primary key constraint from table properties.
Arguments:
table - Fully qualified table name.
Returns:
Primary key constraint string if one exists, None otherwise.
get_table_properties
def get_table_properties(table: str) -> DataFrame
Retrieve table properties/metadata.
Arguments:
table - Fully qualified table name.
Returns:
DataFrame with columns: key, value containing table properties.
get_column_statistics
def get_column_statistics(table: str) -> DataFrame
Retrieve column-level statistics and metadata.
Arguments:
table - Fully qualified table name.
Returns:
DataFrame with columns: col_name, data_type, and other statistics.
get_table_column_names
def get_table_column_names(table: str) -> list[str]
Get list of column names for a table.
Arguments:
table - Fully qualified table name.
Returns:
List of column names.
execute_query
def execute_query(query: str) -> DataFrame
Execute a SQL query and return results.
Arguments:
query - SQL query string.
Returns:
DataFrame containing query results.
Raises:
ValueError - If query execution fails.
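Because TableDataProvider is a Protocol, any object whose methods structurally match satisfies it; no inheritance is required. A minimal sketch of a hypothetical in-memory provider for tests, assuming pandas DataFrames are an acceptable DataFrame implementation (all names and canned values below are illustrative, not part of the library):

```python
import pandas as pd


class StaticTableDataProvider:
    """Hypothetical test double; satisfies TableDataProvider structurally."""

    def get_table_columns(self, table: str) -> pd.DataFrame:
        # Canned DESCRIBE-style output: col_name, data_type, comment.
        return pd.DataFrame(
            {
                "col_name": ["id", "name"],
                "data_type": ["bigint", "string"],
                "comment": [None, None],
            }
        )

    def get_existing_primary_key(self, table: str) -> str | None:
        return "PRIMARY KEY (id)"

    def get_table_properties(self, table: str) -> pd.DataFrame:
        return pd.DataFrame({"key": [], "value": []})

    def get_column_statistics(self, table: str) -> pd.DataFrame:
        return self.get_table_columns(table)

    def get_table_column_names(self, table: str) -> list[str]:
        return ["id", "name"]

    def execute_query(self, query: str) -> pd.DataFrame:
        # A static provider has nothing to execute against.
        raise ValueError(f"Cannot execute query against static data: {query}")
```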
SparkTableDataProvider Objects
class SparkTableDataProvider()
Spark implementation of the TableDataProvider protocol.
This class encapsulates all Spark SQL operations for table metadata retrieval, providing a clean interface for accessing table data and structure.
Attributes:
spark - SparkSession instance for executing SQL queries.
__init__
def __init__(spark: SparkSession | None = None) -> None
Initialize the Spark data provider.
Arguments:
spark - SparkSession instance. If None, gets or creates a session.
get_table_columns
def get_table_columns(table: str) -> DataFrame
Retrieve table column definitions from DESCRIBE TABLE EXTENDED.
Arguments:
table - Fully qualified table name.
Returns:
Pandas DataFrame with columns: col_name, data_type, comment.
Raises:
ValueError - If the table is not found.
TypeError - If a type error occurs while processing the results.
get_existing_primary_key
def get_existing_primary_key(table: str) -> str | None
Retrieve existing primary key from table properties.
Arguments:
table - Fully qualified table name.
Returns:
Primary key constraint string if one exists, None otherwise.
get_table_properties
def get_table_properties(table: str) -> DataFrame
Retrieve table properties using SHOW TBLPROPERTIES.
Arguments:
table - Fully qualified table name.
Returns:
Pandas DataFrame with columns: key, value.
get_column_statistics
def get_column_statistics(table: str) -> DataFrame
Retrieve column statistics from DESCRIBE TABLE EXTENDED.
Arguments:
table - Fully qualified table name.
Returns:
Pandas DataFrame with column information.
get_table_column_names
def get_table_column_names(table: str) -> list[str]
Get list of column names for a table.
Arguments:
table - Fully qualified table name.
Returns:
List of column names.
execute_query
def execute_query(query: str) -> DataFrame
Execute a SQL query and return Spark DataFrame.
Note: Returns a Spark DataFrame, not a Pandas DataFrame, for compatibility with existing code that calls toPandas() on the result.
Arguments:
query - SQL query string.
Returns:
Spark DataFrame containing query results.
Raises:
Exception - If query execution fails.
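A short usage sketch of the Spark-backed provider; the table name main.sales.orders is illustrative:

```python
from pyspark.sql import SparkSession

from databricks.labs.dqx.llm.table_manager import SparkTableDataProvider

spark = SparkSession.builder.getOrCreate()
provider = SparkTableDataProvider(spark)  # or SparkTableDataProvider() to auto-resolve a session

# Metadata methods return Pandas DataFrames.
cols = provider.get_table_columns("main.sales.orders")
print(cols[["col_name", "data_type"]])

pk = provider.get_existing_primary_key("main.sales.orders")
print(pk or "no primary key constraint found")

# execute_query returns a Spark DataFrame; convert explicitly if needed.
result = provider.execute_query("SELECT COUNT(*) AS n FROM main.sales.orders")
print(result.toPandas())
```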
TableDefinitionBuilder Objects
class TableDefinitionBuilder()
Builder for constructing table definition strings.
This class uses the Builder pattern to construct complex table definition strings step by step, separating the construction logic from representation.
__init__
def __init__() -> None
Initialize the builder with empty state.
add_columns
def add_columns(columns: list[str]) -> "TableDefinitionBuilder"
Add column definitions to the table.
Arguments:
columns - List of column definition strings (e.g., "id bigint").
Returns:
Self for method chaining.
add_primary_key
def add_primary_key(primary_key: str | None) -> "TableDefinitionBuilder"
Add primary key constraint information.
Arguments:
primary_key - Primary key constraint string, or None if no primary key exists.
Returns:
Self for method chaining.
build
def build() -> str
Build and return the final table definition string.
Returns:
Formatted table definition string.
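A minimal sketch of the chained builder calls; the column strings and constraint are illustrative:

```python
from databricks.labs.dqx.llm.table_manager import TableDefinitionBuilder

definition = (
    TableDefinitionBuilder()
    .add_columns(["id bigint", "name string", "created_at timestamp"])
    .add_primary_key("PRIMARY KEY (id)")  # pass None when no constraint exists
    .build()
)
print(definition)  # the formatted table definition string
```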
MetadataFormatter Objects
class MetadataFormatter(ABC)
Abstract base class for metadata formatting strategies.
This uses the Strategy pattern to allow different formatting approaches for various types of metadata.
format
@abstractmethod
def format(data: DataFrame) -> list[str]
Format metadata from a DataFrame into string lines.
Arguments:
data - DataFrame containing metadata to format.
Returns:
List of formatted string lines.
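Because MetadataFormatter is the Strategy interface, a new formatting approach is just a subclass that overrides format. A hypothetical sketch that renders each (key, value) row as a key=value line, assuming pandas-style iteration over the input DataFrame:

```python
import pandas as pd

from databricks.labs.dqx.llm.table_manager import MetadataFormatter


class KeyValueFormatter(MetadataFormatter):
    """Hypothetical strategy: one 'key=value' line per row."""

    def format(self, data: pd.DataFrame) -> list[str]:
        return [f"{row.key}={row.value}" for row in data.itertuples(index=False)]
```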
PropertyMetadataFormatter Objects
class PropertyMetadataFormatter(MetadataFormatter)
Formatter for table property metadata.
Extracts and formats useful properties like row counts, data sizes, and constraint information.
format
def format(data: DataFrame) -> list[str]
Extract useful properties from table properties DataFrame.
Arguments:
data - DataFrame with columns: key, value.
Returns:
List of formatted property strings.
ColumnStatisticsFormatter Objects
class ColumnStatisticsFormatter(MetadataFormatter)
Formatter for column statistics and type distribution.
Categorizes columns by data type and formats distribution information.
format
def format(data: DataFrame) -> list[str]
Format column type distribution from column statistics.
Arguments:
data - DataFrame with columns: col_name, data_type.
Returns:
List of formatted column distribution strings.
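The two built-in formatters can be combined with a data provider to assemble a metadata summary; a sketch (the table name is illustrative):

```python
from databricks.labs.dqx.llm.table_manager import (
    ColumnStatisticsFormatter,
    PropertyMetadataFormatter,
    SparkTableDataProvider,
)

provider = SparkTableDataProvider()
table = "main.sales.orders"

lines: list[str] = []
lines += PropertyMetadataFormatter().format(provider.get_table_properties(table))
lines += ColumnStatisticsFormatter().format(provider.get_column_statistics(table))
print("\n".join(lines))
```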
ColumnDefinitionExtractor Objects
class ColumnDefinitionExtractor()
Extracts and formats column definitions from DESCRIBE TABLE results.
This class handles the parsing of DESCRIBE TABLE output and converts it into formatted column definition strings.
extract_columns
@staticmethod
def extract_columns(describe_df: DataFrame) -> list[str]
Extract column definitions from DESCRIBE TABLE DataFrame.
Arguments:
describe_df - DataFrame from DESCRIBE TABLE EXTENDED query.
Returns:
List of formatted column definition strings.
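Because extract_columns is a staticmethod, no instance is needed. A sketch that feeds it DESCRIBE output obtained through the Spark provider, assuming get_table_columns returns the DESCRIBE-shaped DataFrame the extractor expects (the table name is illustrative):

```python
from databricks.labs.dqx.llm.table_manager import (
    ColumnDefinitionExtractor,
    SparkTableDataProvider,
)

provider = SparkTableDataProvider()
describe_df = provider.get_table_columns("main.sales.orders")

for line in ColumnDefinitionExtractor.extract_columns(describe_df):
    print(line)  # formatted column definitions, e.g. "id bigint"
```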
TableManager Objects
class TableManager()
Facade for table operations providing schema retrieval and metadata checking.
This class acts as a simplified interface (Facade pattern) that coordinates between the data repository and formatters. It delegates actual operations to specialized components while maintaining backward compatibility with the existing API.
Attributes:
repository - Data provider for table operations (defaults to SparkTableDataProvider).
property_formatter - Formatter for table property metadata.
stats_formatter - Formatter for column statistics and distribution.
__init__
def __init__(spark: SparkSession | None = None, repository=None) -> None
Initialize TableManager with optional dependency injection.
Arguments:
spark - SparkSession instance. Used if repository is not provided.
repository - Optional TableDataProvider implementation. If None, creates a SparkTableDataProvider with the provided spark session.
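Dependency injection keeps the facade testable: any TableDataProvider implementation can be passed as repository. A hypothetical sketch with a stub, assuming the facade delegates get_table_column_names straight to the repository:

```python
from databricks.labs.dqx.llm.table_manager import TableManager


class StubRepository:
    """Hypothetical stand-in; implement only the methods the test exercises."""

    def get_table_column_names(self, table: str) -> list[str]:
        return ["id", "name"]


manager = TableManager(repository=StubRepository())
print(manager.get_table_column_names("any.catalog.table"))  # ['id', 'name']
```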
get_table_definition
def get_table_definition(table: str) -> str
Retrieve table definition using repository and formatters.
This method coordinates between the repository for data access and the builder/extractor for formatting the result.
Arguments:
table - Fully qualified table name.
Returns:
Formatted table definition string with columns and primary key.
get_table_metadata_info
def get_table_metadata_info(table: str) -> str
Get additional metadata information to help with primary key detection.
This method coordinates multiple formatters to build comprehensive metadata information from the repository.
Arguments:
table - Fully qualified table name.
Returns:
Formatted metadata information string.
get_table_column_names
def get_table_column_names(table: str) -> list[str]
Get table column names.
Arguments:
table - Fully qualified table name.
Returns:
List of column names.
run_sql
def run_sql(query: str)
Run a SQL query and return the result DataFrame.
Arguments:
query - SQL query string.
Returns:
Spark DataFrame containing query results.
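Putting the facade together, a usage sketch (the table name is illustrative; the exact output formatting comes from the builder and formatters described above):

```python
from databricks.labs.dqx.llm.table_manager import TableManager

manager = TableManager()  # builds a SparkTableDataProvider internally
table = "main.sales.orders"

print(manager.get_table_definition(table))     # columns plus primary key info
print(manager.get_table_metadata_info(table))  # properties and type distribution
print(manager.get_table_column_names(table))   # e.g. ['id', 'name', ...]

manager.run_sql(f"SELECT COUNT(*) AS n FROM {table}").show()  # Spark DataFrame
```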