I use a personal AI assistant every day. For a long time it lived in a commercial app. I eventually built a replacement — not because the commercial one was bad at any single task, but because two failures compounded over the course of daily use until I couldn't ignore them.

Failure one: cost

Commercial assistants bill per token. For occasional use, that's fine: the per-query price is a rounding error. For daily use, the same pricing model produces a different outcome. If you reach for the assistant dozens of times a day for searches, summaries, scheduling, drafting, research, and error diagnosis, the bill is a direct function of how useful the tool has become to you. The better it gets, the more it costs.

That's not a bug in how the pricing works. It's an economic incentive that's fine for the vendor and fine for the casual user. It's bad for the power user, because it puts a tax on the exact behavior the tool was trying to encourage.
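To make the scaling concrete, here is a minimal sketch of a per-token bill. The price and the tokens-per-query figure are hypothetical numbers picked for illustration, not quotes from any vendor; the only point is that the total is linear in how often you prompt.

```python
def monthly_cost(queries_per_day, tokens_per_query, usd_per_million_tokens, days=30):
    """Per-token bill: linear in every factor, including how often you prompt."""
    total_tokens = queries_per_day * tokens_per_query * days
    return total_tokens * usd_per_million_tokens / 1_000_000


# Hypothetical figures, for illustration only.
PRICE = 10.0    # USD per million tokens (assumed blended rate)
TOKENS = 4_000  # tokens per query, including resent context (assumed)

casual = monthly_cost(1, TOKENS, PRICE)   # one query a day
heavy = monthly_cost(60, TOKENS, PRICE)   # dozens of queries a day

print(f"casual: ${casual:.2f}/month, heavy: ${heavy:.2f}/month")
# prints: casual: $1.20/month, heavy: $72.00/month
```

Same tool, same price per token. The only thing that changed between the two lines is how much the user relies on it.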

Failure two: amnesia

Every assistant has a context window. Once it fills, something has to go. There are two common options, dropping the oldest turns or replacing them with a summary, and both throw away things I wanted kept.
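A rough sketch of what "something has to go" looks like in practice, assuming the two policies are truncation (drop the oldest turns) and summarization (replace them with a digest). The function names, the whitespace word count standing in for a tokenizer, and the placeholder summarizer are all illustrative assumptions, not any vendor's implementation.

```python
def truncate(turns, budget):
    """Keep the most recent turns that fit in the budget; older turns are dropped."""
    kept, used = [], 0
    for turn in reversed(turns):
        cost = len(turn.split())  # crude token count: whitespace-separated words
        if used + cost > budget:
            break
        kept.append(turn)
        used += cost
    return list(reversed(kept))


def summarize_overflow(turns, budget, summarize):
    """Replace every turn that does not fit with one summary string.

    The summary itself is not counted against the budget in this sketch.
    """
    recent = truncate(turns, budget)
    overflow = turns[: len(turns) - len(recent)]
    if not overflow:
        return recent
    return [summarize(overflow)] + recent


turns = [
    "my server ip is 10.0.0.5",
    "remind me to renew the domain",
    "what is the weather",
    "draft an email to sam",
]

# Placeholder summarizer; a real one would call a model and be lossy in its own way.
placeholder = lambda old: f"[summary of {len(old)} earlier turns]"

dropped = truncate(turns, budget=10)
summarized = summarize_overflow(turns, budget=10, summarize=placeholder)

print(dropped)     # the two oldest turns are gone
print(summarized)  # the two oldest turns are replaced by a placeholder
```

In both outputs the server address from the first turn is no longer available to the model, which is the kind of detail a daily user tends to want back weeks later.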

The combined effect: I was paying a recurring premium for a system that was quietly discarding the things I used it hardest for.

What changes when you go local

Local inference has tradeoffs. It is not universally better. But for the specific shape of a daily personal assistant, the tradeoffs flip.

A memory idea worth stealing

Once you're not forced to summarize for cost reasons, you can rethink how long-running memory should actually work. My current approach: keep every raw turn forever, and layer a DAG of summaries on top at increasing depths. The fresh tail of a conversation always goes to the model raw. Older material is compressed — but the raw turns are still there, so if the summarization policy changes, the summaries can be regenerated from source.

This is the inverse of commercial-assistant memory. Those throw away source and keep summaries. This keeps source and treats summaries as derivations. Nothing is lost on purpose.
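Here is one way the idea could be sketched, under stated assumptions. This is not my production code: the class names, the `fanout` and `tail` parameters, and the fixed-size grouping are hypothetical, and the structure below is a plain tree, which is the simplest kind of DAG. What it is meant to show is the invariant: raw turns are append-only, and every summary can be rebuilt from them.

```python
from dataclasses import dataclass, field
from typing import Callable, List

Summarizer = Callable[[List[str]], str]


@dataclass
class SummaryNode:
    depth: int           # 1 summarizes raw turns; 2+ summarizes summaries
    children: List[int]  # indices into the layer below (raw turns for depth 1)
    text: str


@dataclass
class Memory:
    summarize: Summarizer  # pluggable policy; swapping it never touches raw
    fanout: int = 4        # hypothetical: items compressed into one summary
    tail: int = 4          # hypothetical: most recent turns always sent raw
    raw: List[str] = field(default_factory=list)  # append-only source of truth
    layers: List[List[SummaryNode]] = field(default_factory=list)

    def append(self, turn: str) -> None:
        self.raw.append(turn)

    def rebuild(self) -> None:
        """Regenerate every summary layer from the raw turns older than the tail."""
        if self.fanout < 2:
            raise ValueError("fanout must be at least 2")
        self.layers = []
        cutoff = max(0, len(self.raw) - self.tail)
        texts, depth = self.raw[:cutoff], 1
        while len(texts) > 1:
            layer = []
            for start in range(0, len(texts), self.fanout):
                chunk = texts[start:start + self.fanout]
                ids = list(range(start, start + len(chunk)))
                layer.append(SummaryNode(depth, ids, self.summarize(chunk)))
            self.layers.append(layer)
            texts = [node.text for node in layer]
            depth += 1

    def context(self) -> List[str]:
        """Deepest summaries first, then the fresh tail, raw."""
        cutoff = max(0, len(self.raw) - self.tail)
        if self.layers:
            head = [node.text for node in self.layers[-1]]
        else:
            head = self.raw[:cutoff]  # zero or one older turn: send it as is
        return head + self.raw[cutoff:]


mem = Memory(summarize=lambda items: f"<{len(items)} items>", fanout=2, tail=2)
for i in range(7):
    mem.append(f"t{i}")
mem.rebuild()
first = mem.context()

# Change the summarization policy and regenerate from source.
mem.summarize = lambda items: " | ".join(items)
mem.rebuild()
second = mem.context()

print(first)   # ['<2 items>', 't5', 't6']
print(second)  # ['t0 | t1 | t2 | t3 | t4', 't5', 't6']
```

The second call to `rebuild()` is the part a summary-only memory cannot do: the policy changed, and the summaries changed with it, because the source was never discarded.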

The tradeoff you take on

Going local means maintaining infrastructure that was somebody else's problem. That's real. You need a machine that can run the model, a few gigabytes of weights, and some plumbing. For the class of work this covers — daily personal use, not team-scale — the infrastructure is small. A laptop with a GPU, two or three model files, and a router between them.
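The "router" can be very small. A minimal sketch, assuming three local models and a keyword-plus-length heuristic; the model labels, the hint list, and the 200-word threshold are all hypothetical placeholders, not a recommendation or a description of any particular tool.

```python
# Hypothetical labels standing in for local model files.
SMALL = "small-fast"
LARGE = "large-slow"
CODE = "code-tuned"

# Assumed hints that a prompt is about code or error diagnosis.
CODE_HINTS = ("traceback", "stack trace", "def ", "error:", "exception")


def route(prompt: str) -> str:
    """Pick a model label for a prompt using cheap, local heuristics."""
    text = prompt.lower()
    if any(hint in text for hint in CODE_HINTS):
        return CODE
    if len(text.split()) > 200:  # long inputs go to the larger model
        return LARGE
    return SMALL


print(route("What's on my calendar today?"))
print(route("Traceback (most recent call last): ..."))
print(route("word " * 300))
```

A real router would likely weigh more signals, but the shape is the same: a cheap decision in front of a handful of model files.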

The ongoing cost, after setup, is electricity. The ongoing memory cost, after setup, is disk — which is effectively free. The productivity ceiling, which used to sit at "how often can I afford to prompt," is now somewhere else entirely.

Who should consider this

Not everyone. If you use an assistant a few times a week, commercial is fine. The friction of running your own stack won't pay for itself.

But if you reach for a model constantly, the way a senior developer reaches for a terminal, you're exactly the person commercial pricing penalizes and commercial memory quietly fails. You'd save money and frustration by going local, and the friction of setup is a one-time cost. It's worth asking whether the default stack is still the right one for you.