AI-Powered PII Detection Engine
Built an ML + LLM system to automatically discover and classify sensitive data across codebases, databases, and SaaS integrations.
The Ghost PII Problem
Sensitive data does not stay where you put it. Engineers copy production email addresses into staging databases, log request bodies containing social security numbers, and pass API keys through third-party analytics tools, all without malicious intent. The result is “ghost PII”: personally identifiable information and other sensitive data scattered across systems in ways that no one documents and compliance audits routinely miss. Traditional regex-based scanners catch the obvious patterns but fail on unstructured data, ambiguous fields, and the creative ways real-world systems handle sensitive information.
Multi-Stage Detection Pipeline
I designed the detection engine around a three-stage pipeline that balances precision, recall, and cost. The first stage applies fast heuristic classifiers and pattern matchers to identify candidate fields across codebases, databases, and SaaS integrations. Candidates then pass through a Named Entity Recognition (NER) model trained on data-privacy-specific labels, filtering out false positives from the heuristic stage. The final stage sends ambiguous cases to an LLM for contextual review, where the model evaluates field names, surrounding schema, and sample values to make a classification decision. This layered approach keeps LLM costs low while maintaining high accuracy on edge cases.
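The staged routing described above can be sketched roughly as follows. This is a minimal illustration, not the engine's actual code: the patterns, the `low`/`high` thresholds, and the `Field`, `Verdict`, and `classify` names are all assumptions made for the example, and the NER model and LLM are stand-in functions so that the sketch runs on its own.

```python
import re
from dataclasses import dataclass
from typing import Callable

# Hypothetical stage-1 patterns; a real scanner would carry many more.
FIELD_NAME_HINT = re.compile(r"(email|ssn|phone|dob|address)", re.IGNORECASE)
VALUE_PATTERNS = {
    "email": re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$"),
    "ssn": re.compile(r"^\d{3}-\d{2}-\d{4}$"),
}


@dataclass
class Field:
    name: str    # column or attribute name
    sample: str  # one sample value


@dataclass
class Verdict:
    is_pii: bool
    stage: str   # which stage made the final decision


def heuristic_candidate(field: Field) -> bool:
    """Stage 1: cheap name and value checks, tuned for recall."""
    if FIELD_NAME_HINT.search(field.name):
        return True
    return any(p.match(field.sample) for p in VALUE_PATTERNS.values())


def classify(
    field: Field,
    ner_score: Callable[[Field], float],
    llm_review: Callable[[Field], bool],
    low: float = 0.2,   # assumed threshold: at or below this, NER rejects
    high: float = 0.8,  # assumed threshold: at or above this, NER accepts
) -> Verdict:
    if not heuristic_candidate(field):
        return Verdict(False, "heuristic")
    score = ner_score(field)            # Stage 2: filter false positives
    if score >= high:
        return Verdict(True, "ner")
    if score <= low:
        return Verdict(False, "ner")
    # Stage 3: only the ambiguous middle band reaches the LLM.
    return Verdict(llm_review(field), "llm")


# Stand-ins for the real NER model and LLM call (assumptions for the demo).
def fake_ner(field: Field) -> float:
    return {"user_email": 0.95, "address_family": 0.05}.get(field.name, 0.5)


def fake_llm(field: Field) -> bool:
    return "@" in field.sample


if __name__ == "__main__":
    for f in [
        Field("user_email", "a@b.com"),
        Field("address_family", "ipv4"),
        Field("notes", "jane@example.com"),
        Field("retry_count", "3"),
    ]:
        print(f.name, classify(f, fake_ner, fake_llm))
```

In this sketch, `address_family` trips the name heuristic but is rejected by the second stage, and only the `notes` field, where the name says nothing and the value looks like an email, is escalated to the most expensive stage.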
Zero-Config Discovery
A key design principle was zero-config operation. The engine performs AST analysis on application code to identify ORM patterns and trace how data flows from models to API endpoints to external services. Teams connect their repositories and databases, and the system builds a map of where sensitive data lives and how it moves — without requiring engineers to annotate schemas or maintain data inventories. This approach surfaces PII in places that manual audits consistently overlook: log pipelines, caching layers, and third-party webhook payloads.
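As a rough illustration of the ORM-detection step only, the sketch below uses Python's standard `ast` module to find model classes and their fields without importing or running the application code. It assumes a Python codebase with Django-style models (classes that inherit from something named `Model`); the write-up does not say which languages or ORMs the engine supports, and the `SENSITIVE_HINTS` list and function names are hypothetical. Tracing data flow from models to endpoints and external services is not shown.

```python
import ast
from typing import Dict, List, Tuple

# Hypothetical name hints used to flag fields for the later pipeline stages.
SENSITIVE_HINTS = ("email", "ssn", "phone", "birth")

# Sample input: parsed as text only, so Django does not need to be installed.
SAMPLE = '''
from django.db import models

class Customer(models.Model):
    email = models.EmailField()
    plan = models.CharField(max_length=20)
    date_of_birth = models.DateField()

class Helper:
    email = "not a model"
'''


def _base_name(node: ast.expr) -> str:
    """Return the last component of a base class, e.g. 'Model' for models.Model."""
    if isinstance(node, ast.Attribute):
        return node.attr
    if isinstance(node, ast.Name):
        return node.id
    return ""


def find_model_fields(source: str) -> Dict[str, List[str]]:
    """Map each Django-style model class to the names of its declared fields."""
    found: Dict[str, List[str]] = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        if not any(_base_name(base) == "Model" for base in node.bases):
            continue
        fields: List[str] = []
        for stmt in node.body:
            # A field declaration looks like `name = models.SomeField(...)`.
            if isinstance(stmt, ast.Assign) and isinstance(stmt.value, ast.Call):
                for target in stmt.targets:
                    if isinstance(target, ast.Name):
                        fields.append(target.id)
        found[node.name] = fields
    return found


def flag_sensitive(models: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Return (class, field) pairs whose field name contains a sensitive hint."""
    return [
        (cls, field)
        for cls, fields in models.items()
        for field in fields
        if any(hint in field.lower() for hint in SENSITIVE_HINTS)
    ]


if __name__ == "__main__":
    print(flag_sensitive(find_model_fields(SAMPLE)))
```

Because the source is only parsed, never executed, this kind of scan can run against a connected repository without a working build or any annotations from the owning team. The `Helper` class is skipped because it does not inherit from a model base.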
Impact
The system reduced compliance audit preparation cycles by 40%, primarily by eliminating the manual data discovery phase that previously consumed the bulk of audit effort. Security teams shifted from periodic sweeps to continuous monitoring, catching new PII exposure within hours of deployment rather than at the next quarterly review.
The hardest PII to protect is the PII you do not know exists. Automated discovery turns compliance from a periodic scramble into a continuous posture.