databricks.labs.dqx.utils

get_column_name_or_alias

def get_column_name_or_alias(
column: "str | Column | ConnectColumn",
normalize: bool = False,
allow_simple_expressions_only: bool = False) -> str

Extracts the column alias or name from a PySpark Column or ConnectColumn expression.

PySpark does not provide direct access to the alias of an unbound column, so this function parses the alias from the column's string representation.

  • Supports columns with one or multiple aliases.
  • Ensures the extracted expression is truncated to 255 characters.
  • Provides an optional normalization step for consistent naming.
  • Supports ConnectColumn when PySpark Connect is available (falls back gracefully when not available).

Arguments:

  • column - Column, ConnectColumn (if PySpark Connect available), or string representing a column.
  • normalize - If True, normalizes the column name (removes special characters, converts to lowercase).
  • allow_simple_expressions_only - If True, raises an error if the column expression is not a simple expression. Complex PySpark expressions (e.g., conditionals, arithmetic, or nested transformations such as F.col("a") + F.lit(1)) cannot be fully reconstructed when converted to a string. In certain situations this is acceptable, e.g., when the output is only used for reporting purposes.

Returns:

The extracted column alias or name.

Raises:

  • InvalidParameterError - If the column expression is invalid or unsupported.
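Since the alias is parsed from the column's string representation, the core idea can be sketched in plain Python. This is a hypothetical illustration, not DQX's actual implementation; the `Column<'…'>` format mirrors PySpark's `repr`, and the regex and truncation rule are assumptions:

```python
import re

def get_alias_sketch(col_repr: str) -> str:
    # str(F.col("a").alias("b")) renders as "Column<'a AS b'>" in PySpark
    match = re.match(r"Column<'(.*)'>", col_repr)
    expr = match.group(1) if match else col_repr
    # with one or multiple aliases, the outermost (last) alias wins
    parts = re.split(r"\s+AS\s+", expr)
    # truncate to 255 characters for metastore compatibility
    return parts[-1][:255]

get_alias_sketch("Column<'a AS b'>")   # -> "b"
get_alias_sketch("Column<'name'>")     # -> "name"
```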

get_columns_as_strings

def get_columns_as_strings(
columns: list[str | Column],
allow_simple_expressions_only: bool = True) -> list[str]

Extracts column names from a list of strings, PySpark Columns, or ConnectColumn expressions.

This function processes each column, ensuring that only valid column names are returned. Supports ConnectColumn when PySpark Connect is available (falls back gracefully when not available).

Arguments:

  • columns - List of columns, ConnectColumns (if PySpark Connect available), or strings representing columns.
  • allow_simple_expressions_only - If True, raises an error if the column expression is not a simple expression.

Returns:

List of column names as strings.

Raises:

  • InvalidParameterError - If any column expression is invalid or unsupported.
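The behavior can be sketched as follows. This is a hypothetical stand-in: plain strings pass through, anything else is converted via `str()` in place of the real alias extraction, and the disallowed-character check mimics `allow_simple_expressions_only=True`:

```python
def get_columns_as_strings_sketch(columns) -> list[str]:
    # characters that mark a complex (non-simple) expression
    disallowed = set(" ,;{}()\n\t=")
    names = []
    for col in columns:
        name = col if isinstance(col, str) else str(col)
        if disallowed & set(name):
            # stands in for InvalidParameterError in the real library
            raise ValueError(f"not a simple expression: {name!r}")
        names.append(name)
    return names

get_columns_as_strings_sketch(["order_id", "amount"])  # -> ["order_id", "amount"]
```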

is_simple_column_expression

def is_simple_column_expression(col_name: str) -> bool

Returns True if the column name does not contain any disallowed characters: space, comma, semicolon, curly braces, parentheses, newline, tab, or equals sign.

Arguments:

  • col_name - Column name to validate.

Returns:

True if the column name is valid, False otherwise.
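The check reduces to scanning for the disallowed characters listed above; a minimal sketch (not the library's exact implementation):

```python
import re

# disallowed: space, comma, semicolon, curly braces, parentheses,
# newline, tab, or equals sign (\s covers space, newline, and tab)
_DISALLOWED = re.compile(r"[\s,;{}()=]")

def is_simple_column_expression_sketch(col_name: str) -> bool:
    return _DISALLOWED.search(col_name) is None

is_simple_column_expression_sketch("order_id")       # -> True
is_simple_column_expression_sketch("concat(a, b)")   # -> False
```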

normalize_bound_args

def normalize_bound_args(val: Any) -> Any

Normalize a value or collection of values for consistent processing.

Handles primitives, dates, and column-like objects. Lists, tuples, and sets are recursively normalized with type preserved.

Arguments:

  • val - Value or collection of values to normalize.

Returns:

Normalized value or collection.

Raises:

  • TypeError - If a column type is unsupported.
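A rough sketch of the recursive normalization, with collection types preserved. Serializing dates as ISO strings is an assumption made for illustration; the real function's normalization rules may differ:

```python
import datetime

def normalize_bound_args_sketch(val):
    # recursively normalize collections, preserving their type
    if isinstance(val, (list, tuple, set)):
        return type(val)(normalize_bound_args_sketch(v) for v in val)
    # dates rendered as ISO strings (an assumption for illustration)
    if isinstance(val, (datetime.date, datetime.datetime)):
        return val.isoformat()
    # primitives pass through unchanged
    if val is None or isinstance(val, (str, int, float, bool)):
        return val
    raise TypeError(f"unsupported type: {type(val).__name__}")

normalize_bound_args_sketch((1, datetime.date(2024, 5, 1)))  # -> (1, "2024-05-01")
```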

normalize_col_str

def normalize_col_str(col_str: str) -> str

Normalizes a string to be compatible with metastore column names by applying the following transformations:

  • remove special characters
  • convert to lowercase
  • truncate to 255 characters

Arguments:

  • col_str - String representing a column.

Returns:

Normalized column name.
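The transformations can be sketched with the standard library. Replacing runs of special characters with underscores is an assumption here; the real function may strip them instead:

```python
import re

def normalize_col_str_sketch(col_str: str) -> str:
    # lowercase, then replace runs of special characters with underscores
    # (the exact replacement character is an assumption)
    normalized = re.sub(r"[^0-9a-z_]+", "_", col_str.lower())
    # truncate to 255 characters for metastore compatibility
    return normalized[:255]

normalize_col_str_sketch("Total Amount")  # -> "total_amount"
```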

safe_json_load

def safe_json_load(value: str)

Safely load a JSON string, returning the original value if it fails to parse. This allows specifying a plain string value without needing to escape the quotes.

Arguments:

  • value - The value to parse as JSON.

Returns:

The parsed JSON value, or the original string if it is not valid JSON.
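The behavior can be sketched with the standard library (a minimal sketch; the real function may handle additional error types):

```python
import json

def safe_json_load_sketch(value: str):
    try:
        return json.loads(value)
    except json.JSONDecodeError:
        # not valid JSON: return the raw string unchanged
        return value

safe_json_load_sketch('{"threshold": 5}')     # -> {"threshold": 5}
safe_json_load_sketch("col1 is not null")     # -> "col1 is not null"
```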

safe_strip_file_from_path

def safe_strip_file_from_path(path: str) -> str

Safely removes the file name from a given path, treating it as a directory if no file extension is present.

  • Hidden directories (e.g., .folder) are preserved.
  • Hidden files with extensions (e.g., .file.yml) are treated as files.

Arguments:

  • path - The input path from which to remove the file name.

Returns:

The path without the file name, or the original path if it is already a directory.
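The hidden-file rules fall out naturally from extension splitting, as this sketch shows (an illustration, not the library's exact implementation; POSIX-style paths assumed):

```python
import posixpath

def safe_strip_file_from_path_sketch(path: str) -> str:
    base = posixpath.basename(path)
    # splitext(".folder")   -> (".folder", ""):     hidden dir, no extension
    # splitext(".file.yml") -> (".file", ".yml"):   hidden file with extension
    _, ext = posixpath.splitext(base)
    if ext:
        # an extension is present, so treat the last component as a file
        return posixpath.dirname(path)
    # no extension: treat the whole path as a directory
    return path

safe_strip_file_from_path_sketch("/a/b/config.yml")  # -> "/a/b"
safe_strip_file_from_path_sketch("/a/b/.folder")     # -> "/a/b/.folder"
```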

list_tables

@rate_limited(max_requests=100)
def list_tables(workspace_client: WorkspaceClient,
patterns: list[str] | None,
exclude_matched: bool = False,
exclude_patterns: list[str] | None = None) -> list[str]

Gets a list of table names from Unity Catalog given a list of wildcard patterns.

Arguments:

  • workspace_client WorkspaceClient - Databricks SDK WorkspaceClient.
  • patterns list[str] | None - A list of wildcard patterns to match against the table name.
  • exclude_matched bool - Specifies whether to include tables matched by the pattern. If True, matched tables are excluded. If False, matched tables are included.
  • exclude_patterns list[str] | None - A list of wildcard patterns to exclude from the table names.

Returns:

  • list[str] - A list of fully qualified table names.

Raises:

  • NotFound - If no tables are found matching the include or exclude criteria.
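The include/exclude pattern semantics can be sketched with `fnmatch`; this hypothetical helper filters an in-memory list of names standing in for tables listed from Unity Catalog, and does not touch the Databricks SDK:

```python
from fnmatch import fnmatch

def filter_tables_sketch(tables, patterns=None, exclude_matched=False,
                         exclude_patterns=None):
    # tables matched by any include pattern (no patterns -> match all)
    matched = [t for t in tables
               if patterns is None or any(fnmatch(t, p) for p in patterns)]
    # exclude_matched inverts the include selection
    result = ([t for t in tables if t not in matched]
              if exclude_matched else matched)
    # drop anything matching an exclude pattern
    if exclude_patterns:
        result = [t for t in result
                  if not any(fnmatch(t, p) for p in exclude_patterns)]
    if not result:
        raise ValueError("no tables found")  # stands in for NotFound
    return result

tables = ["main.sales.orders", "main.sales.items", "main.hr.staff"]
filter_tables_sketch(tables, ["main.sales.*"])
# -> ["main.sales.orders", "main.sales.items"]
```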