The problem
Most PII redaction either ships sensitive documents to a cloud service (which defeats the point) or relies on brittle hand-rolled regexes (which miss everything a regex can't express). I wanted a local-first pipeline with ML-grade entity detection, without the "oops, a document with SSNs just went to a third party" failure mode.
Approach
- Extract: IBM Granite-Docling-258M handles layout-aware PDF extraction. Tables come out as structured cells, body text as runs — no OCR spaghetti.
- Detect (two-pass): Presidio for 22 built-in entity types plus custom financial recognizers. GLiNER zero-shot NER as a fallback for anything Presidio missed, gated by a false-positive blocklist.
- Review: Every proposed redaction gets an approve/deny prompt with surrounding context, so the human is the final filter — not the model.
- Output: Pretty-printed markdown with aligned tables, stable token replacements (<PERSON_1>, <LOCATION_1>), or pure extraction with no redaction at all.
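To make the two-pass detection and the stable tokens concrete, here is a minimal sketch in plain Python. It does not call Presidio or GLiNER. The `Span` type and the `spans_for`, `merge_passes`, and `redact` helpers are hypothetical names made up for illustration, and the overlap and blocklist rules are assumptions about how such a merge could work, not the project's actual logic.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str
    text: str


def spans_for(text, phrase, label):
    """Stand-in for real detector output: every literal match of `phrase`."""
    return [Span(m.start(), m.end(), label, m.group())
            for m in re.finditer(re.escape(phrase), text)]


def overlaps(a, b):
    return a.start < b.end and b.start < a.end


def merge_passes(primary, fallback, blocklist):
    """Keep all primary spans; add a fallback span only if it is not
    blocklisted and does not overlap anything already accepted.
    Assumes the primary spans do not overlap each other."""
    accepted = list(primary)
    for span in fallback:
        if span.text.strip().lower() in blocklist:
            continue  # known false positive
        if any(overlaps(span, kept) for kept in accepted):
            continue  # first pass already covers this text
        accepted.append(span)
    return sorted(accepted, key=lambda s: s.start)


def redact(text, spans):
    """Replace each span with a token; the same (label, text) pair
    always maps to the same token within one document."""
    tokens, counters = {}, {}
    out, cursor = [], 0
    for span in spans:  # must be sorted and non-overlapping
        key = (span.label, span.text.lower())
        if key not in tokens:
            counters[span.label] = counters.get(span.label, 0) + 1
            tokens[key] = f"<{span.label}_{counters[span.label]}>"
        out.append(text[cursor:span.start])
        out.append(tokens[key])
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


text = "Alice Smith met Bob in Paris. Alice Smith paid the invoice."
primary = spans_for(text, "Alice Smith", "PERSON")
fallback = (spans_for(text, "Alice", "PERSON")      # overlaps first pass
            + spans_for(text, "Bob", "PERSON")
            + spans_for(text, "Paris", "LOCATION")
            + spans_for(text, "invoice", "ORG"))    # false positive
merged = merge_passes(primary, fallback, blocklist={"invoice"})
result = redact(text, merged)
print(result)
# <PERSON_1> met <PERSON_2> in <LOCATION_1>. <PERSON_1> paid the invoice.
```

In the real pipeline the approve/deny review would sit between `merge_passes` and `redact`, filtering the merged list before any text is replaced.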
Architecture
- FastAPI web server + CLI share the same underlying pipeline.
- CUDA auto-detected, CPU fallback for machines without a GPU.
- Per-cell redaction on tables, so a name in cell A doesn't leak its redaction token into cell B.
- Interactive review runs in a local browser window.
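The per-cell redaction point can be sketched as follows. This is an illustration under stated assumptions, not the project's code: the table is modelled as a list of rows of strings, `detect_ssn` is a regex stand-in for the real detectors, and `TokenRegistry`, `redact_cell`, and `redact_table` are hypothetical names. Sharing one registry across cells, so that a repeated value keeps the same token, is also an assumption.

```python
import re
from typing import Callable, List, Tuple

Detection = Tuple[int, int, str]  # (start, end, label) within one cell

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def detect_ssn(cell: str) -> List[Detection]:
    """Regex stand-in for a real detector; sees one cell at a time."""
    return [(m.start(), m.end(), "US_SSN") for m in SSN_RE.finditer(cell)]


class TokenRegistry:
    """Maps (label, value) to a stable token such as <US_SSN_1>."""

    def __init__(self) -> None:
        self._tokens = {}
        self._counts = {}

    def token_for(self, label: str, value: str) -> str:
        key = (label, value)
        if key not in self._tokens:
            self._counts[label] = self._counts.get(label, 0) + 1
            self._tokens[key] = f"<{label}_{self._counts[label]}>"
        return self._tokens[key]


def redact_cell(cell: str,
                detect: Callable[[str], List[Detection]],
                registry: TokenRegistry) -> str:
    """Redact a single cell; offsets never refer to any other cell."""
    out, cursor = [], 0
    for start, end, label in sorted(detect(cell)):
        if start < cursor:
            continue  # skip a detection that overlaps the previous one
        out.append(cell[cursor:start])
        out.append(registry.token_for(label, cell[start:end]))
        cursor = end
    out.append(cell[cursor:])
    return "".join(out)


def redact_table(rows: List[List[str]],
                 detect: Callable[[str], List[Detection]]) -> List[List[str]]:
    registry = TokenRegistry()  # shared so repeated values reuse a token
    return [[redact_cell(cell, detect, registry) for cell in row]
            for row in rows]


rows = [
    ["Name", "SSN", "Notes"],
    ["J. Doe", "123-45-6789", "verified 123-45-6789 by phone"],
    ["A. Roe", "987-65-4321", ""],
]
redacted = redact_table(rows, detect_ssn)
for row in redacted:
    print(row)
```

Because detection and replacement both run on one cell's string, a span can never straddle a cell boundary, and the table keeps its shape: the output has the same number of rows and columns as the input.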
Status & links
Alpha. No cloud deployment: local-only by design. The whole point is that documents never leave the machine.