The problem
Two annoyances with commercial assistants I wanted to solve for my own daily use:
- Cost. API calls stack up fast on everyday work — research, summaries, scheduling. The ceiling isn't capability; it's a running bill.
- Amnesia. Every context window fills eventually. The usual fix is either truncation (forget old turns) or lossy summarization (compress them to a gist). Both drop information I wanted to keep.
Routing
Two-layer router. An O(1) keyword pre-filter dispatches ~80% of requests immediately — "remind me," "search," "open my calendar" don't need a model to classify. The ambiguous rest go to Qwen3-0.6B with JSON-grammar enforcement, so a 0.6B model reliably emits valid routing decisions without retry loops.
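A minimal sketch of the two-layer idea, assuming the pre-filter matches on the leading words of a request. The trigger table, the tool names, and the `classify` callback (a stand-in for the Qwen3-0.6B call) are all hypothetical and only illustrate the shape:

```python
from typing import Callable, Dict, Optional

# Hypothetical trigger table: leading phrase -> tool name.
KEYWORD_ROUTES: Dict[str, str] = {
    "remind me": "reminder",
    "search": "web_search",
    "open my calendar": "calendar",
}
# Longest trigger, in words; bounds the number of dict probes per request.
MAX_TRIGGER_WORDS = max(len(k.split()) for k in KEYWORD_ROUTES)


def prefilter(request: str) -> Optional[str]:
    """Return a tool name if the request starts with a known trigger."""
    words = request.lower().split()
    # Try the longest leading phrase first; each probe is one hash lookup,
    # so the cost does not grow with the size of the trigger table.
    for n in range(min(MAX_TRIGGER_WORDS, len(words)), 0, -1):
        hit = KEYWORD_ROUTES.get(" ".join(words[:n]))
        if hit is not None:
            return hit
    return None


def route(request: str, classify: Callable[[str], str]) -> Dict[str, str]:
    """Keyword layer first; only ambiguous requests reach the model."""
    hit = prefilter(request)
    if hit is not None:
        return {"route": hit, "via": "keyword"}
    return {"route": classify(request), "via": "model"}
```

With this shape, "Remind me to call Sam at 5" never touches the model, while "What should I cook tonight?" falls through to `classify`.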
The router emits either a direct answer or a step sequence (tool → model → tool) with variable substitution between steps. Intermediate results are stored, so the chain is inspectable after the fact.
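The step-sequence idea can be sketched roughly as follows. The `{{name}}` placeholder syntax, the step fields (`id`, `run`, `input`), and the handler names are assumptions made for illustration, not the project's actual plan format:

```python
import re
from typing import Any, Callable, Dict, List

# Assumed placeholder syntax: {{step_id}} refers to an earlier step's result.
PLACEHOLDER = re.compile(r"\{\{(\w+)\}\}")


def substitute(template: str, results: Dict[str, Any]) -> str:
    """Replace each {{step_id}} with the stored result of that step."""
    return PLACEHOLDER.sub(lambda m: str(results[m.group(1)]), template)


def run_plan(
    steps: List[Dict[str, str]],
    handlers: Dict[str, Callable[[str], Any]],
) -> Dict[str, Any]:
    """Run steps in order and keep every intermediate result for inspection."""
    results: Dict[str, Any] = {}
    for step in steps:
        arg = substitute(step["input"], results)
        results[step["id"]] = handlers[step["run"]](arg)
    return results


# Usage with stub handlers standing in for a real tool and a real model.
handlers = {
    "fetch": lambda url: f"<page {url}>",
    "summarize": lambda text: f"summary of {text}",
}
plan = [
    {"id": "page", "run": "fetch", "input": "https://example.com"},
    {"id": "gist", "run": "summarize", "input": "{{page}}"},
]
trace = run_plan(plan, handlers)
```

Because `run_plan` returns the whole results dictionary rather than only the final value, every intermediate output stays available for inspection afterwards.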
Memory: Lossless Context Management
Older turns are never deleted — they're compressed into a DAG of summary nodes at increasing depths. The fresh tail of the conversation always goes to the model raw. If the summarization policy ever changes, the raw turns are still there and the DAG can be regenerated from scratch.
Because each summary node stands in for many raw turns, hundreds of conversation turns stay addressable within a single context window without exceeding the token budget.
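One way to picture the structure is the sketch below. It uses a plain tree (the simplest DAG), and the `fanout`, `tail`, and `max_summaries` parameters, along with the `summarize` callback standing in for a model call, are assumptions for illustration only:

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    text: str
    depth: int  # 0 = raw turn, 1+ = summary of the nodes below it
    children: List["Node"] = field(default_factory=list)


def build_level(nodes: List[Node], summarize: Callable[[List[str]], str],
                fanout: int) -> List[Node]:
    """Group up to `fanout` nodes under one summary node, one level deeper."""
    out = []
    for i in range(0, len(nodes), fanout):
        group = nodes[i:i + fanout]
        text = summarize([n.text for n in group])
        out.append(Node(text, group[0].depth + 1, group))
    return out


def build_context(turns: List[str], summarize: Callable[[List[str]], str],
                  tail: int = 4, fanout: int = 3,
                  max_summaries: int = 2) -> Tuple[List[Node], List[str]]:
    """Return (summary nodes for old turns, raw fresh tail).

    Assumes tail >= 1 and fanout >= 2.
    """
    raw = [Node(t, 0) for t in turns]
    old, fresh = raw[:-tail], raw[-tail:]
    level = old
    # Keep summarizing upward until the old history fits in a few nodes.
    while len(level) > max_summaries:
        level = build_level(level, summarize, fanout)
    return level, [n.text for n in fresh]


def expand(node: Node) -> List[str]:
    """Walk back down to the raw turns; nothing was deleted on the way up."""
    if node.depth == 0:
        return [node.text]
    return [t for child in node.children for t in expand(child)]
```

Since `build_context` works only from the raw turns, changing the summarization policy amounts to calling it again with a different `summarize`.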
Architecture
- Fast path: Qwen3-0.6B with JSON grammar for routing decisions (tight, cheap).
- Reasoning path: Qwen3-4B in full reasoning mode for complex tasks. The router, not the caller, decides which path a request takes.
- Tools: 16 implementations — web scraping, PDF handling, notes, deep research, calendar, budget, etc.
- Storage: SQLite with WAL mode for session state; vector store over nomic-embed-text for semantic notes.
- Frontend: static HTML/JS with WebSocket streaming of orchestrator events (routing decision → tool call → model inference).
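For the storage bullet, a minimal sketch of opening a SQLite database in WAL mode with Python's standard `sqlite3` module. The `turns` table and its columns are hypothetical; only the `PRAGMA journal_mode=WAL` statement is standard SQLite:

```python
import sqlite3
from typing import Tuple


def open_session_db(path: str) -> Tuple[sqlite3.Connection, str]:
    """Open (or create) the database and switch it to write-ahead logging."""
    conn = sqlite3.connect(path)
    # The pragma returns the journal mode actually in effect.
    # WAL needs a file-backed database; in-memory databases stay in "memory" mode.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # Hypothetical schema for session state.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS turns (
            id         INTEGER PRIMARY KEY,
            session_id TEXT NOT NULL,
            role       TEXT NOT NULL,
            content    TEXT NOT NULL
        )
        """
    )
    conn.commit()
    return conn, mode


def append_turn(conn: sqlite3.Connection, session_id: str,
                role: str, content: str) -> int:
    """Insert one turn and return its row id."""
    cur = conn.execute(
        "INSERT INTO turns (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )
    conn.commit()
    return cur.lastrowid
```

WAL lets readers keep reading while a writer appends, which suits a setup where one process writes session state while another streams it to a frontend.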
Status & links
Operational locally and actively evolving. Runs as a local FastAPI app with no cloud dependency in the default config — a 0.6B router and 4B reasoner keep inference cheap enough for constant daily use.