databricks.labs.dqx.pii.pii_detection_funcs
does_not_contain_pii
@register_rule("row")
def does_not_contain_pii(
    column: str | Column,
    language: str = "en",
    threshold: float = 0.7,
    entities: list[str] | None = None,
    nlp_engine_config: NLPEngineConfig | dict | None = None) -> Column
Check if a column contains personally-identifying information (PII). Uses Microsoft Presidio to detect various named entities (e.g. PERSON, ADDRESS, EMAIL_ADDRESS). If PII is detected, the message includes a JSON string with the entity types, location within the string, and confidence score from the model.
Arguments:
column
- Column to check; can be a string column name or a column expression.
language
- Optional language of the text (default: 'en').
threshold
- Confidence threshold for PII detection (0.0 to 1.0, default: 0.7). Higher values are less sensitive and produce fewer false positives; lower values are more sensitive and produce more potential false positives.
entities
- Optional list of entities to detect.
nlp_engine_config
- Optional NLP engine configuration used for PII detection; can be an NLPEngineConfig or a dict in the format:
{
"nlp_engine_name": "spacy",
"models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}
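As a rough illustration of how these arguments fit together, the sketch below collects them into a plain keyword-argument dict and checks that the threshold is within the documented range. The helper name `build_pii_check_kwargs` and the column name `description` are hypothetical and not part of DQX; only the parameter names, defaults, and the `nlp_engine_config` layout are taken from the signature and description above.

```python
from __future__ import annotations

from typing import Any


def build_pii_check_kwargs(
    column: str,
    language: str = "en",
    threshold: float = 0.7,
    entities: list[str] | None = None,
    nlp_engine_config: dict | None = None,
) -> dict[str, Any]:
    """Hypothetical helper: collect arguments for does_not_contain_pii."""
    # The documented range for the confidence threshold is 0.0 to 1.0.
    if not 0.0 <= threshold <= 1.0:
        raise ValueError(f"threshold must be between 0.0 and 1.0, got {threshold}")

    kwargs: dict[str, Any] = {
        "column": column,
        "language": language,
        "threshold": threshold,
    }
    # Optional arguments are only included when the caller sets them.
    if entities is not None:
        kwargs["entities"] = list(entities)
    if nlp_engine_config is not None:
        kwargs["nlp_engine_config"] = dict(nlp_engine_config)
    return kwargs


# Dict-form engine configuration, in the format shown above.
spacy_small = {
    "nlp_engine_name": "spacy",
    "models": [{"lang_code": "en", "model_name": "en_core_web_sm"}],
}

pii_kwargs = build_pii_check_kwargs(
    "description",  # hypothetical column name
    threshold=0.8,
    entities=["PERSON", "EMAIL_ADDRESS"],
    nlp_engine_config=spacy_small,
)
```

With DQX and its PII dependencies installed, a dict like `pii_kwargs` could presumably be unpacked as `does_not_contain_pii(**pii_kwargs)`; that call is left out here because it requires Spark and Presidio.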
Returns:
Column object for condition that fails when PII is detected
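To make the effect of `threshold` and `entities` concrete, here is a minimal sketch of how a list of detections might be reduced to a pass/fail result and a JSON message. It does not call Presidio or DQX: the detections are hand-written sample data, and the function name `summarize_detections` and the field names `entity_type`, `start`, `end`, and `score` are assumptions chosen for illustration, not the library's actual message format.

```python
from __future__ import annotations

import json


def summarize_detections(
    detections: list[dict],
    threshold: float = 0.7,
    entities: list[str] | None = None,
) -> tuple[bool, str | None]:
    """Return (passes, message) for one text value (illustrative only)."""
    kept = [
        d
        for d in detections
        # Assumption: a detection counts when its score is at or above the threshold.
        if d["score"] >= threshold
        # When an entity list is given, only those entity types are considered.
        and (entities is None or d["entity_type"] in entities)
    ]
    if not kept:
        return True, None  # no PII above the threshold: the check passes
    return False, json.dumps(kept)  # PII found: the check fails with details


# Hand-written sample detections for a single string (not real model output).
sample = [
    {"entity_type": "PERSON", "start": 0, "end": 8, "score": 0.85},
    {"entity_type": "EMAIL_ADDRESS", "start": 19, "end": 35, "score": 1.0},
    {"entity_type": "LOCATION", "start": 40, "end": 46, "score": 0.4},
]

# Default threshold: the low-confidence LOCATION detection is ignored.
ok_default, msg_default = summarize_detections(sample)

# Higher threshold restricted to PERSON: nothing qualifies, so the value passes.
ok_strict, msg_strict = summarize_detections(sample, threshold=0.9, entities=["PERSON"])
```

Raising the threshold shrinks the set of detections that count, which matches the "less sensitive, fewer false positives" behaviour described for the `threshold` argument.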