Cost and amnesia · James Gault

I use a personal AI assistant every day. For a long time it lived in a commercial app. I eventually built a replacement — not because the commercial one was bad at any single task, but because two failures compounded over the course of daily use until I couldn't ignore them.

Failure one: cost

Commercial assistants bill per token. For occasional use, that's fine — the per-query price is a rounding error. For daily use, the same shape generates a different outcome. If you reach for the assistant dozens of times a day, across searches, summaries, scheduling, drafting, research, error diagnosis, the bill is a direct function of how useful the tool became to you. The better it got, the more it cost.

That's not a bug in how the pricing works. It's an economic incentive that's fine for the vendor and fine for the casual user. It's bad for the power user, because it puts a tax on the exact behavior the tool was trying to encourage.

Failure two: amnesia

Every assistant has a context window. Once it fills, something has to go. There are two common options, and both throw away things I wanted kept.

Truncation. Drop the oldest turns. Simple. Also: you've now forgotten the thing you told the assistant at the start of the session, which is frequently the reason you were using it in the first place.
Lossy summarization. Compress old turns into a gist. Slightly better. Also: the gist is a summary of what the model decided mattered — which is not always what you decided mattered. Facts vanish, nuance collapses, and you can't tell until you ask a question that hinges on the detail that got dropped.

The combined effect: I was paying a recurring premium for a system that was quietly discarding the things I used it hardest for.

What changes when you go local

Local inference has tradeoffs. It is not universally better. But for the specific shape of "daily personal assistant," the tradeoffs flip:

Cost collapses. Running a small-enough model on your own hardware means the marginal cost per token trends toward the cost of electricity. You stop flinching before each prompt.
Memory becomes a design space. When you're not paying per token on history, you can keep the history. Summarization stops being a forced economy and starts being an engineering choice.
Routing becomes possible. You can send trivial requests to a tiny fast model and reserve a bigger model for the hard ones. The whole two-tier design only makes sense when you're not paying API rates on every hop.

A memory idea worth stealing

Once you're not forced to summarize for cost reasons, you can rethink how long-running memory should actually work. My current approach: keep every raw turn forever, and layer a DAG of summaries on top at increasing depths. The fresh tail of a conversation always goes to the model raw. Older material is compressed — but the raw turns are still there, so if the summarization policy changes, the summaries can be regenerated from source.

This is the inverse of commercial-assistant memory. Those throw away source and keep summaries. This keeps source and treats summaries as derivations. Nothing is lost on purpose.

The tradeoff you take on

Going local means maintaining infrastructure that was somebody else's problem. That's real. You need a machine that can run the model, a few gigabytes of weights, and some plumbing. For the class of work this covers — daily personal use, not team-scale — the infrastructure is small. A laptop with a GPU, two or three model files, and a router between them.

The ongoing cost, after setup, is electricity. The ongoing memory cost, after setup, is disk — which is effectively free. The productivity ceiling, which used to sit at "how often can I afford to prompt," is now somewhere else entirely.

Who should consider this

Not everyone. If you use an assistant a few times a week, commercial is fine. The friction of running your own stack won't pay for itself.