docredact — James Gault

The problem

Most PII redaction tools either ship sensitive documents to a cloud service (which defeats the point) or rely on brittle regexes (which miss anything a pattern can't express). I wanted a local-first pipeline with ML-grade entity detection and without the "oops, a document with SSNs just went to a third party" failure mode.

Approach

Extract: IBM Granite-Docling-258M handles layout-aware PDF extraction. Tables come out as structured cells, body text as runs — no OCR spaghetti.
Detect (two-pass): Presidio covers 22 built-in entity types plus custom financial recognizers. GLiNER zero-shot NER runs as a fallback for anything Presidio misses, gated by a false-positive blocklist.
Review: Every proposed redaction gets an approve/deny prompt with surrounding context, so the human is the final filter — not the model.
Output: Pretty-printed markdown with aligned tables and stable token replacements (<PERSON_1>, <LOCATION_1>), or pure extraction with no redaction at all.
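
To make the two-pass merge and the stable token replacement concrete, here is a minimal Python sketch. It is not docredact's actual code: `Span`, `merge_passes`, and `redact` are hypothetical names, and the spans are assumed to have already been produced by the primary detector (Presidio) and the fallback (GLiNER). The overlap rule and the blocklist matching on lowercased text are also assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One detected entity: character offsets, entity label, matched text."""
    start: int
    end: int
    label: str
    text: str


def merge_passes(primary, fallback, blocklist):
    """Keep every primary span; add a fallback span only if its text is not
    blocklisted and it does not overlap a span that was already kept."""
    merged = list(primary)
    for cand in fallback:
        if cand.text.lower() in blocklist:
            continue  # known false positive, drop it
        if any(cand.start < kept.end and kept.start < cand.end for kept in merged):
            continue  # overlaps an earlier span, which takes priority
        merged.append(cand)
    return sorted(merged, key=lambda s: s.start)


def redact(text, spans):
    """Replace each span with a stable token: the same (label, text) pair
    always maps to the same token, e.g. <PERSON_1>. Assumes the spans do
    not overlap (merge_passes guarantees this for fallback spans)."""
    tokens = {}    # (label, text) -> token
    counters = {}  # label -> highest index handed out so far
    out, cursor = [], 0
    for span in sorted(spans, key=lambda s: s.start):
        key = (span.label, span.text)
        if key not in tokens:
            counters[span.label] = counters.get(span.label, 0) + 1
            tokens[key] = f"<{span.label}_{counters[span.label]}>"
        out.append(text[cursor:span.start])
        out.append(tokens[key])
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out), tokens
```

In this sketch, a second mention of the same name reuses its first token, so the redacted output still shows which mentions refer to the same entity. The human review step would sit between `merge_passes` and `redact`, filtering the merged list before anything is replaced.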

Architecture

The FastAPI web server and the CLI share the same underlying pipeline.
CUDA auto-detected, CPU fallback for machines without a GPU.
Per-cell redaction on tables, so a name in cell A doesn't leak its redaction token into cell B.
Interactive review runs in a local browser window.
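
Two of the points above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the project's implementation: `pick_device`, `redact_table`, and `toy_ssn_detector` are hypothetical names, the table is assumed to be a list of rows of cell strings, and the regex detector is only a stand-in for the real Presidio and GLiNER passes.

```python
import re


def pick_device():
    """Return "cuda" when PyTorch is installed and sees a GPU, else "cpu"."""
    try:
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


def toy_ssn_detector(cell):
    """Stand-in detector: yields (start, end, label) for SSN-shaped strings."""
    return [(m.start(), m.end(), "US_SSN")
            for m in re.finditer(r"\b\d{3}-\d{2}-\d{4}\b", cell)]


def redact_table(rows, detect):
    """Run detection and replacement one cell at a time, so offsets never
    cross a cell boundary. The token map is shared across the table, so the
    same value gets the same token wherever it appears. Assumes `detect`
    returns non-overlapping spans."""
    tokens = {}    # (label, text) -> token
    counters = {}  # label -> highest index handed out so far

    def redact_cell(cell):
        out, cursor = [], 0
        for start, end, label in sorted(detect(cell)):
            key = (label, cell[start:end])
            if key not in tokens:
                counters[label] = counters.get(label, 0) + 1
                tokens[key] = f"<{label}_{counters[label]}>"
            out.append(cell[cursor:start])
            out.append(tokens[key])
            cursor = end
        out.append(cell[cursor:])
        return "".join(out)

    return [[redact_cell(cell) for cell in row] for row in rows]
```

Because each cell is processed as its own string, a span found in one cell cannot be applied at the wrong offset in a neighbouring cell, which is the failure mode that flattening a table to one text blob invites.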

Status & links

Alpha. No cloud deployment: local-only by design. The whole point is that documents never leave the machine.
