How I Built a PII Detection Engine with LLMs
The architecture behind an ML-powered system that continuously scans codebases, databases, and SaaS integrations for sensitive data.
The Problem
Sensitive data ends up everywhere. Customer emails in log files. Social security numbers in staging databases. API keys in Slack messages. If you’ve worked at any company handling user data, you know this isn’t a hypothetical — it’s Tuesday.
My goal was to build a system that could continuously scan codebases, databases, and SaaS integrations to find PII before it became a compliance liability. The hard part wasn’t finding obvious patterns like 123-45-6789. The hard part was everything else: names that look like city names, addresses embedded in free-text fields, medical record numbers that look like order IDs.
Manual audits don’t scale. You can’t ask engineers to tag every field correctly — they won’t, and even when they try, data drifts. I needed something automated, accurate, and fast enough to run continuously.
Stage 1: Regex and Heuristics
The first pass was straightforward. Pattern matching catches the low-hanging fruit: SSNs, credit card numbers, phone numbers, email addresses. These have predictable formats, and regex handles them well.
I built a rule engine with about 40 patterns covering common PII formats across US, EU, and APAC regions. It was fast — scanning a million records in under a minute — and caught roughly 70% of true PII.
The problem was the other 30%. And the false positive rate was brutal. Phone numbers matched order IDs. ZIP codes matched internal codes. The string “John Smith” in a comment about a fictional user triggered alerts just like a real customer name in a database column.
A 30% miss rate and a 15% false positive rate is not a detection engine. It’s a suggestion engine.
Stage 2: Named Entity Recognition
To handle contextual entities — names, addresses, organizations — I added a Named Entity Recognition layer using spaCy with a fine-tuned model. NER understands that “Springfield” after “123 Main St” is probably an address, while “Springfield” in a column called server_region is not.
I trained the model on a labeled dataset of about 50,000 examples drawn from anonymized production data. The key insight was including the surrounding context: column names, neighboring fields, file paths. A value of “M” means nothing alone, but in a column called gender next to patient_name, it’s clearly PII.
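A sketch of that context encoding, assuming a simple bracket-tag serialization (hypothetical; the real feature set also folded in file paths and source metadata):

```python
def build_training_text(value: str, column_name: str, neighbors: dict) -> str:
    """Serialize a value with its schema context so the NER model can
    condition on it. The bracket-tag format here is illustrative."""
    parts = [f"[column={column_name}]"]
    parts += [f"[{k}={v}]" for k, v in neighbors.items()]
    parts.append(value)
    return " ".join(parts)
```

This is what turns a bare “M” into something classifiable: the model sees the `gender` column tag and the neighboring `patient_name` field, not just a single letter.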
This brought recall up to about 88% and dropped false positives to around 8%. Good, but not good enough for a system that would generate alerts engineers actually needed to act on.
Stage 3: LLM Review for Ambiguous Cases
The final stage handles the hard cases — the ones where regex says “maybe” and NER says “I’m not sure.” These get routed to an LLM with a structured prompt that includes the value, its context (schema, neighboring data, source system), and a request for a classification with a confidence score.
```python
import json
import re
from dataclasses import dataclass
from enum import Enum

import openai
import spacy


class PIICategory(Enum):
    SSN = "ssn"
    EMAIL = "email"
    PHONE = "phone"
    NAME = "name"
    ADDRESS = "address"
    MEDICAL_ID = "medical_id"
    NONE = "none"


@dataclass
class DetectionResult:
    value: str
    category: PIICategory
    confidence: float
    stage: str
    context: dict


class PIIDetectionPipeline:
    def __init__(self):
        self.nlp = spacy.load("en_pii_custom_v3")
        self.regex_patterns = self._load_patterns()
        self.llm_threshold = 0.6  # route to LLM below this confidence

    def scan(self, value: str, context: dict) -> DetectionResult:
        # Stage 1: Regex -- fast, high precision for structured PII
        regex_result = self._regex_scan(value)
        if regex_result and regex_result.confidence > 0.95:
            return regex_result

        # Stage 2: NER -- contextual entity detection
        ner_result = self._ner_scan(value, context)
        if ner_result and ner_result.confidence > self.llm_threshold:
            return ner_result

        # Stage 3: LLM -- ambiguous cases only
        if regex_result or ner_result:
            best_guess = regex_result or ner_result
            if best_guess.confidence > 0.3:
                return self._llm_review(value, context, best_guess)

        return DetectionResult(value, PIICategory.NONE, 0.0, "none", context)

    def _regex_scan(self, value: str) -> DetectionResult | None:
        for pattern_name, pattern in self.regex_patterns.items():
            if match := re.search(pattern, value):
                return DetectionResult(
                    value=match.group(),
                    category=PIICategory(pattern_name),
                    confidence=0.97,
                    stage="regex",
                    context={},
                )
        return None

    def _ner_scan(self, value: str, context: dict) -> DetectionResult | None:
        # Prepend the column name so the model can condition on schema context
        enriched = f"[{context.get('column_name', '')}] {value}"
        doc = self.nlp(enriched)
        if doc.ents:
            top = max(doc.ents, key=lambda e: e._.confidence)
            return DetectionResult(
                value=top.text,
                category=PIICategory(top.label_.lower()),
                confidence=top._.confidence,
                stage="ner",
                context=context,
            )
        return None

    def _llm_review(
        self, value: str, context: dict, prior: DetectionResult
    ) -> DetectionResult:
        prompt = f"""Classify whether this value is PII.
Value: {value}
Source: {context.get('source', 'unknown')}
Column/Field: {context.get('column_name', 'unknown')}
Neighboring fields: {context.get('neighbors', [])}
Prior classification: {prior.category.value} (confidence: {prior.confidence:.2f})
Respond with JSON: {{"category": "...", "confidence": 0.0-1.0, "reasoning": "..."}}"""
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0,
        )
        parsed = json.loads(response.choices[0].message.content)
        return DetectionResult(
            value=value,
            category=PIICategory(parsed["category"]),
            confidence=parsed["confidence"],
            stage="llm",
            context={**context, "reasoning": parsed["reasoning"]},
        )
```
The critical design decision: the LLM only sees cases that the cheaper stages couldn’t resolve confidently. In production, roughly 5% of scanned values reach Stage 3. This keeps costs manageable and latency low — the median scan takes 12ms, with LLM-routed cases adding ~800ms.
Pipeline Architecture
The system runs as a set of workers pulling from a job queue. Connectors for each data source (PostgreSQL, S3, GitHub, Slack) emit scan jobs. Each job contains the value, its context, and metadata about the source. Workers run the three-stage pipeline and write results to a findings database.
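A minimal single-process sketch of that worker loop. `ScanJob` is a hypothetical job shape, and `pipeline` is anything exposing the `scan(value, context)` interface described above; production workers pull from a durable queue and write to a findings database rather than an in-memory list:

```python
import queue
from dataclasses import dataclass


@dataclass
class ScanJob:
    """One unit of work emitted by a source connector (hypothetical shape)."""
    value: str
    context: dict  # column_name, neighbors, etc.
    source: str    # e.g. "postgres", "s3", "github", "slack"


def worker_loop(jobs: queue.Queue, pipeline, findings: list) -> None:
    """Drain the queue, run the three-stage pipeline, keep positive findings."""
    while True:
        try:
            job = jobs.get_nowait()
        except queue.Empty:
            return
        result = pipeline.scan(job.value, {**job.context, "source": job.source})
        if result.category.value != "none":
            findings.append(result)
```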
A scheduler triggers full scans weekly and incremental scans on every commit or database migration. Alerts go to Slack with enough context for engineers to triage without switching tools.
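The alert payload might look something like this. The format is hypothetical, but the idea is to inline the triage context so engineers never have to leave Slack:

```python
def format_alert(result) -> dict:
    """Build a Slack message payload with triage context inline.
    Hypothetical format; `result` is a DetectionResult-like object."""
    return {
        "text": (
            f":rotating_light: PII detected: {result.category.value} "
            f"({result.confidence:.0%} confidence, stage: {result.stage})\n"
            f"Source: {result.context.get('source', 'unknown')} | "
            f"Field: {result.context.get('column_name', 'unknown')}"
        )
    }
```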
What Worked
The multi-stage approach was the right call. Regex handles 70% of cases at near-zero cost. NER picks up another 20%. The LLM covers the remaining ambiguous cases: roughly 10% of values are still uncertain after the first two stages, and the ~5% that clear the confidence floor get sent to the model. Overall precision landed at 94% with 96% recall — good enough that engineers trusted the alerts and actually fixed findings.
What Didn’t
Fine-tuning the NER model was a bigger time investment than I expected. Labeling 50,000 examples took three weeks, and the model needed retraining every quarter as new data patterns emerged. If I were starting over, I’d invest earlier in an active learning loop that surfaces uncertain cases for human labeling.
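The loop I have in mind is plain uncertainty sampling. A sketch with illustrative thresholds (not production values):

```python
def select_for_labeling(results, low=0.35, high=0.75, budget=100):
    """Uncertainty sampling: pick mid-confidence predictions for human review.

    `results` are DetectionResult-like objects with a `confidence` field.
    Thresholds and budget are illustrative.
    """
    uncertain = [r for r in results if low <= r.confidence <= high]
    # Predictions closest to 0.5 are the most informative to label next
    uncertain.sort(key=lambda r: abs(r.confidence - 0.5))
    return uncertain[:budget]
```

Each retraining round would then train on the newly labeled batch instead of a one-shot 50,000-example labeling effort.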
The LLM stage also introduced a dependency I wasn’t thrilled about. API rate limits, model version changes, and cost unpredictability are real operational concerns. I’m exploring local models as a replacement for this stage, but the accuracy gap is still meaningful as of late 2024.
The system has been running in production for eight months. It’s caught PII in places nobody expected — debug logs, analytics event payloads, even README files. The lesson: sensitive data doesn’t stay where you put it, so you need a system that doesn’t assume it will.