MyAssistant — a local orchestrator with long-context memory.

A FastAPI + Ollama assistant that routes requests across 16 tools and two Qwen3 models (a 0.6B router and a 4B reasoner), with DAG-based context compression so that long conversations stay within the token budget while every raw turn remains recoverable.

Role
Solo — orchestration + memory layer
Timeline
Apr 2026 — ongoing
Stack
Python, FastAPI, Ollama, Qwen3-0.6B + Qwen3-4B, nomic-embed-text, SQLite (WAL)
Status
Operational locally

The problem

Two annoyances with commercial assistants I wanted to solve for my own daily use:

  1. Cost. API calls stack up fast on everyday work — research, summaries, scheduling. The ceiling isn't capability; it's a running bill.
  2. Amnesia. Every context window fills eventually. The fix is either truncation (forget old turns) or lossy summarization (compress them to a gist). Both drop information I wanted kept.

Routing

Two-layer router. An O(1) keyword pre-filter dispatches ~80% of requests immediately — "remind me," "search," "open my calendar" don't need a model to classify. The ambiguous rest go to Qwen3-0.6B with JSON-grammar enforcement, so a 0.6B model reliably emits valid routing decisions without retry loops.
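
A minimal sketch of the two-layer idea, not the project's actual code. The phrase table, the tool names, and the `model_router` callable (standing in for the Qwen3-0.6B call) are all assumptions made for illustration.

```python
from typing import Callable, Optional

# Hypothetical keyword table: leading phrase -> tool name.
# The real phrases and tool names are not given in this write-up.
KEYWORD_ROUTES: dict[str, str] = {
    "remind me": "reminders",
    "search": "web_search",
    "open my calendar": "calendar",
}


def keyword_route(message: str) -> Optional[str]:
    """Return a tool name if the message starts with a known phrase."""
    text = message.lower().strip()
    # The table is small and fixed, so this scan is cheap and needs no model.
    for phrase, tool in KEYWORD_ROUTES.items():
        if text.startswith(phrase):
            return tool
    return None


def route(message: str, model_router: Callable[[str], str]) -> str:
    """Try the keyword layer first; fall back to the small model otherwise."""
    tool = keyword_route(message)
    if tool is not None:
        return tool
    return model_router(message)
```

Passing the model call in as a plain callable keeps the fast path testable without a running Ollama instance.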

The router emits either a direct answer or a step sequence (tool → model → tool) with variable substitution between steps. Intermediate results are stored, so the chain is inspectable after the fact.
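
One way the step sequence could be executed, sketched under assumptions: the step shape (`id`, `run`, `args`), the `{{step_id}}` placeholder syntax, and the handler names are hypothetical and chosen only to show substitution between steps.

```python
import re
from typing import Any, Callable

Step = dict[str, Any]

# Hypothetical placeholder syntax: "{{s1}}" refers to the result of step "s1".
_VAR = re.compile(r"\{\{(\w+)\}\}")


def substitute(value: str, results: dict[str, str]) -> str:
    """Replace every {{step_id}} with the stored result of that step."""
    return _VAR.sub(lambda m: results[m.group(1)], value)


def run_plan(
    steps: list[Step], handlers: dict[str, Callable[..., str]]
) -> dict[str, str]:
    """Run steps in order and keep every intermediate result by step id."""
    results: dict[str, str] = {}
    for step in steps:
        args = {k: substitute(v, results) for k, v in step["args"].items()}
        results[step["id"]] = handlers[step["run"]](**args)
    return results
```

Because `run_plan` returns the whole `results` mapping rather than only the last value, each step's output can be inspected after the chain finishes.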

Memory: Lossless Context Management

Older turns are never deleted — they're compressed into a DAG of summary nodes at increasing depths. The fresh tail of the conversation always goes to the model raw. If the summarization policy ever changes, the raw turns are still there and the DAG can be regenerated from scratch.

This keeps hundreds of conversation turns addressable within a single context window without exceeding the token budget.
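
A toy sketch of the scheme described above, assuming fixed chunk sizes and a caller-supplied `summarize` function. The class names, the `tail`, `chunk`, and `fanout` parameters, and the roll-up policy are illustrative assumptions. This simplified version builds a tree, which is a special case of a DAG.

```python
from dataclasses import dataclass, field
from typing import Callable

Summarizer = Callable[[list[str]], str]


@dataclass
class SummaryNode:
    text: str
    depth: int            # 1 = summary of raw turns, 2 = summary of summaries
    children: list[int]   # turn indices at depth 1, node indices above that


@dataclass
class Memory:
    turns: list[str] = field(default_factory=list)          # raw, never deleted
    nodes: list[SummaryNode] = field(default_factory=list)
    roots: list[int] = field(default_factory=list)          # nodes sent to the model
    covered: int = 0                                        # leading turns summarized

    def compact(self, summarize: Summarizer, tail: int = 4,
                chunk: int = 4, fanout: int = 2) -> None:
        """Summarize turns older than the fresh tail, then roll summaries up."""
        while len(self.turns) - self.covered >= tail + chunk:
            ids = list(range(self.covered, self.covered + chunk))
            text = summarize([self.turns[i] for i in ids])
            self._add_node(text, 1, ids, len(self.roots))
            self.covered += chunk
        self._rollup(summarize, fanout)

    def _add_node(self, text: str, depth: int,
                  children: list[int], position: int) -> None:
        self.nodes.append(SummaryNode(text, depth, children))
        self.roots.insert(position, len(self.nodes) - 1)

    def _rollup(self, summarize: Summarizer, fanout: int) -> None:
        """Merge the oldest `fanout` roots of equal depth into a deeper node."""
        merged = True
        while merged:
            merged = False
            for depth in sorted({self.nodes[i].depth for i in self.roots}):
                group = [i for i in self.roots
                         if self.nodes[i].depth == depth][:fanout]
                if len(group) == fanout:
                    position = self.roots.index(group[0])
                    self.roots = [r for r in self.roots if r not in group]
                    text = summarize([self.nodes[i].text for i in group])
                    self._add_node(text, depth + 1, group, position)
                    merged = True
                    break

    def context(self) -> list[str]:
        """Summaries for old material, followed by the raw fresh tail."""
        return [self.nodes[i].text for i in self.roots] + self.turns[self.covered:]
```

Since `turns` is only ever appended to, changing the summarizer means clearing `nodes`, `roots`, and `covered` and calling `compact` again to rebuild the summaries from the raw turns.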

Architecture

  • Fast path: Qwen3-0.6B with JSON grammar for routing decisions (tight, cheap).
  • Reasoning path: Qwen3-4B in full reasoning mode for complex tasks. The router, not the caller, decides when to use it.
  • Tools: 16 implementations — web scraping, PDF handling, notes, deep research, calendar, budget, etc.
  • Storage: SQLite with WAL mode for session state; vector store over nomic-embed-text for semantic notes.
  • Frontend: static HTML/JS with WebSocket streaming of orchestrator events (routing decision → tool call → model inference).
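
For the storage bullet, a minimal sketch of opening a SQLite database in WAL mode with Python's standard `sqlite3` module. The `turns` table and its columns are assumptions for illustration, not the project's actual schema.

```python
import sqlite3


def open_session_db(path: str) -> sqlite3.Connection:
    """Open (or create) a session database with write-ahead logging enabled."""
    conn = sqlite3.connect(path)
    # WAL lets readers keep reading while a single writer appends to the log.
    # The setting is persistent: it is stored in the database file itself.
    conn.execute("PRAGMA journal_mode=WAL")
    # Hypothetical schema: one row per raw conversation turn.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS turns (
            id INTEGER PRIMARY KEY,
            session_id TEXT NOT NULL,
            role TEXT NOT NULL,
            content TEXT NOT NULL
        )
        """
    )
    conn.commit()
    return conn
```

WAL applies to file-backed databases only; an in-memory database (`":memory:"`) reports its journal mode as `memory` instead.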

Status & links

Operational locally and actively evolving. Runs as a local FastAPI app with no cloud dependency in the default config — a 0.6B router and 4B reasoner keep inference cheap enough for constant daily use.