docredact — local-first PII redaction.

A local FastAPI app that runs PDFs through layout-aware extraction, two-pass PII detection, and an interactive approve/deny review — before anything leaves the machine.

Role: Solo (design + build)
Timeline: Apr 2026 (WIP)
Stack: Python 3.11+, FastAPI, PyTorch, IBM Granite-Docling-258M, Presidio, GLiNER, spaCy
Status: Alpha, local-only

The problem

Most PII redaction either ships sensitive documents to a cloud service (which defeats the point) or relies on brittle regexes, which miss anything that lacks a fixed pattern, such as names or street addresses. I wanted a local-first pipeline with ML-grade entity detection, without the "oops, a document with SSNs just went to a third party" failure mode.

Approach

  • Extract: IBM Granite-Docling-258M handles layout-aware PDF extraction. Tables come out as structured cells, body text as runs — no OCR spaghetti.
  • Detect (two-pass): Presidio covers 22 built-in entity types plus custom financial recognizers. GLiNER zero-shot NER then runs as a fallback for anything Presidio misses, gated by a false-positive blocklist.
  • Review: Every proposed redaction gets an approve/deny prompt with surrounding context, so the human is the final filter — not the model.
  • Output: Pretty-printed markdown with aligned tables, with PII replaced by stable tokens (<PERSON_1>, <LOCATION_1>), or pure extraction with no redaction at all.
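
The detect and output steps above can be sketched in plain Python. This is a minimal illustration, not docredact's actual code: `Span`, `find_all`, `merge_passes`, and `apply_tokens` are hypothetical names, and the two span lists are hand-written stand-ins for what Presidio and GLiNER might return. The merge rule (drop a fallback span if it is blocklisted or overlaps a primary span) and the case-insensitive token reuse are assumptions.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    start: int
    end: int
    label: str   # e.g. "PERSON", "LOCATION"
    source: str  # "primary" (first pass) or "fallback" (second pass)


def find_all(text: str, needle: str, label: str, source: str) -> list[Span]:
    """Toy detector: every literal occurrence of `needle` becomes a span."""
    return [Span(m.start(), m.end(), label, source)
            for m in re.finditer(re.escape(needle), text)]


def overlaps(a: Span, b: Span) -> bool:
    return a.start < b.end and b.start < a.end


def merge_passes(text: str, primary: list[Span], fallback: list[Span],
                 blocklist: set[str]) -> list[Span]:
    """Keep all primary spans; add fallback spans only where nothing was found."""
    kept = list(primary)
    for span in fallback:
        surface = text[span.start:span.end].lower()
        if surface in blocklist:
            continue  # known false positive
        if any(overlaps(span, k) for k in kept):
            continue  # an earlier span already covers this text
        kept.append(span)
    return sorted(kept, key=lambda s: s.start)


def apply_tokens(text: str, spans: list[Span]) -> str:
    """Replace each span with <LABEL_n>; repeated surface text reuses its token."""
    tokens: dict[tuple[str, str], str] = {}
    counters: dict[str, int] = {}
    out, cursor = [], 0
    for span in spans:  # assumed sorted and non-overlapping
        key = (span.label, text[span.start:span.end].lower())
        if key not in tokens:
            counters[span.label] = counters.get(span.label, 0) + 1
            tokens[key] = f"<{span.label}_{counters[span.label]}>"
        out.append(text[cursor:span.start])
        out.append(tokens[key])
        cursor = span.end
    out.append(text[cursor:])
    return "".join(out)


text = "Alice Smith met Bob in Paris. Alice Smith paid the Invoice."
primary = (find_all(text, "Alice Smith", "PERSON", "primary")
           + find_all(text, "Paris", "LOCATION", "primary"))
fallback = (find_all(text, "Bob", "PERSON", "fallback")
            + find_all(text, "Invoice", "PERSON", "fallback")  # false positive
            + find_all(text, "Smith", "PERSON", "fallback"))   # already covered
merged = merge_passes(text, primary, fallback, blocklist={"invoice"})
redacted = apply_tokens(text, merged)
print(redacted)
```

In this sketch both mentions of "Alice Smith" map to the same token, which is one plausible reading of "stable" replacements. The approve/deny review step would sit between `merge_passes` and `apply_tokens`, filtering `merged` down to the spans the reviewer accepted.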

Architecture

  • FastAPI web server + CLI share the same underlying pipeline.
  • CUDA auto-detected, CPU fallback for machines without a GPU.
  • Per-cell redaction on tables, so a name in cell A doesn't leak its redaction token into cell B.
  • Interactive review runs in a local browser window.
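
Per-cell redaction can be illustrated with a short sketch. It assumes the table arrives as a list of rows of cell strings and that detection runs on each cell separately, so span offsets never cross a cell boundary. `redact_cell`, `redact_table`, and `toy_detect` are hypothetical names, and sharing one token registry across cells (so the same name keeps the same token everywhere) is an assumption rather than documented behavior.

```python
import re
from typing import Callable

# A detector takes one cell's text and returns (start, end, label) spans.
Detector = Callable[[str], list[tuple[int, int, str]]]


def toy_detect(cell: str) -> list[tuple[int, int, str]]:
    """Stand-in for the real detectors: matches two hard-coded names."""
    return [(m.start(), m.end(), "PERSON")
            for m in re.finditer(r"Alice Smith|Bob Lee", cell)]


def redact_cell(cell: str, detect: Detector,
                registry: dict[tuple[str, str], str]) -> str:
    """Redact one cell in isolation; offsets are relative to this cell only."""
    out, cursor = [], 0
    for start, end, label in sorted(detect(cell)):
        key = (label, cell[start:end])
        if key not in registry:
            # Next free index for this label across the whole table.
            n = sum(1 for (lab, _) in registry if lab == label) + 1
            registry[key] = f"<{label}_{n}>"
        out.append(cell[cursor:start])
        out.append(registry[key])
        cursor = end
    out.append(cell[cursor:])
    return "".join(out)


def redact_table(rows: list[list[str]], detect: Detector) -> list[list[str]]:
    """Apply redaction cell by cell so a token never spills into a neighbor."""
    registry: dict[tuple[str, str], str] = {}
    return [[redact_cell(cell, detect, registry) for cell in row]
            for row in rows]


rows = [
    ["Name", "Notes"],
    ["Alice Smith", "Referred by Bob Lee"],
    ["Bob Lee", "Paid in full"],
]
redacted_rows = redact_table(rows, toy_detect)
for row in redacted_rows:
    print(" | ".join(row))
```

Because each cell is rewritten as its own string, the table keeps its shape and a replacement in one column cannot shift or overwrite text in another.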

Status & links

Alpha. No cloud deployment: local-only by design. The whole point is that documents never leave the machine.