Kenneth Ballenegger (@kob)
The more I build Klaw, the more I think durable agents are codebases with language interfaces, not prompts with tool access.
The default agent pattern still feels like:
Write a big prompt.
Give the model a pile of tools.
Hope it reasons through the workflow correctly every time.
That works for demos. It’s not a great foundation for anything you want running every day.
For repeatable work, the agent should not be rediscovering the procedure from scratch. It should be calling something known.
A script.
A CLI.
A queue worker.
A typed adapter.
A deterministic parser.
A database query.
A narrow classifier.
A job with logs, retries, validation, and boring failure modes.
Then the language model does the part it’s actually good at: summarizing messy inputs, drafting text, classifying ambiguous cases, ranking options, explaining results, or choosing between known paths.
This sounds less magical. I think it’s much closer to how useful personal agents actually work.
Code gathers the data.
Code validates it.
Code computes the numbers.
Code checks source-of-truth state.
Code handles retries and side effects.
Then, when needed, the model gets a narrow job:
“Summarize this.”
“Classify this into one of these categories.”
“Explain these options.”
“Draft the reply, using this evidence.”
“Choose the next step from this list.”
That split matters.
If an LLM is doing the math, checking the state, deciding which source of truth matters, and writing the final answer all in one big mushy pass, you’ll eventually get weird failures, with no clean way to tell which step went wrong.
If code computes the answer and the LLM explains it, the system is much easier to trust.
Same for email, travel, finance, contacts, reminders, dashboards, approvals. Basically anything personal enough that being wrong is annoying or expensive.
The job of the agent is not to be clever at every step.
The job is to know which parts should be deterministic and which parts need judgment.
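Here is a minimal sketch of that split, assuming a spending summary as the task. Every name in it (`SpendSummary`, `compute_summary`, `build_prompt`, `explain`, and the `llm` callable) is hypothetical and chosen for illustration, not taken from Klaw. Code validates the input and computes the numbers; the model only receives a prompt built from results that are already final.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical stand-in for a model call: takes a prompt, returns text.
ExplainFn = Callable[[str], str]


@dataclass(frozen=True)
class SpendSummary:
    total_cents: int
    largest_category: str
    by_category: dict[str, int]


def compute_summary(transactions: list[tuple[str, int]]) -> SpendSummary:
    """Deterministic part: validate the input and compute the numbers in code."""
    if not transactions:
        raise ValueError("no transactions to summarize")
    by_category: dict[str, int] = {}
    for category, cents in transactions:
        if cents < 0:
            raise ValueError(f"negative amount for {category!r}")
        by_category[category] = by_category.get(category, 0) + cents
    largest = max(by_category, key=lambda name: by_category[name])
    return SpendSummary(sum(by_category.values()), largest, by_category)


def build_prompt(summary: SpendSummary) -> str:
    """Narrow job for the model: explain numbers that are already computed."""
    lines = [
        f"- {name}: {cents / 100:.2f}"
        for name, cents in sorted(summary.by_category.items())
    ]
    return (
        "Explain this spending summary in two sentences. "
        "Do not recompute or change any number.\n"
        f"Total: {summary.total_cents / 100:.2f}\n"
        f"Largest category: {summary.largest_category}\n"
        + "\n".join(lines)
    )


def explain(summary: SpendSummary, llm: ExplainFn) -> str:
    """Judgment part: the model phrases the result; it never produces the numbers."""
    return llm(build_prompt(summary))
```

Because `llm` is just a function from prompt to text, the deterministic half can be unit-tested with no model at all, and a fake can stand in for the model when testing the prompt.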
This also changes how “memory” works.
A prompt-first agent wants to stuff more context into the model.
A code-first agent asks:
Where is the source of truth?
What query should retrieve it?
What is the minimum useful context?
What evidence should be attached to the result?
What should be logged so we can debug this later?
That is a very different product.
It’s cheaper.
It’s faster.
It’s easier to test.
It’s easier to audit.
It fails in more obvious ways.
And when something breaks, you fix the primitive instead of rewriting vibes into a longer system prompt.
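The code-first questions above (source of truth, query, minimum context, evidence, logging) can be sketched roughly like this. It assumes a SQLite table of reminders as the source of truth. The schema, the `Evidence` type, and the `reminders_due` function are made up for the example; only the `sqlite3` and `logging` calls are real standard-library APIs.

```python
import logging
import sqlite3
from dataclasses import dataclass

log = logging.getLogger("agent.context")


@dataclass(frozen=True)
class Evidence:
    source: str  # where the fact came from (table and row id)
    text: str    # the minimal snippet handed to the model


def open_demo_db() -> sqlite3.Connection:
    """Builds a tiny in-memory source of truth (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reminders (id INTEGER PRIMARY KEY, due TEXT, note TEXT)"
    )
    conn.executemany(
        "INSERT INTO reminders (due, note) VALUES (?, ?)",
        [
            ("2024-05-01", "Renew passport"),
            ("2024-05-03", "Pay rent"),
            ("2024-06-10", "Book flights"),
        ],
    )
    return conn


def reminders_due(
    conn: sqlite3.Connection, start: str, end: str, limit: int = 5
) -> list[Evidence]:
    """Retrieve only the rows the task needs, each tagged with its origin."""
    rows = conn.execute(
        "SELECT id, due, note FROM reminders "
        "WHERE due BETWEEN ? AND ? ORDER BY due LIMIT ?",
        (start, end, limit),
    ).fetchall()
    # Log the query parameters and row count so a bad answer can be traced later.
    log.info("reminders_due start=%s end=%s rows=%d", start, end, len(rows))
    return [
        Evidence(source=f"reminders#{row_id}", text=f"{due}: {note}")
        for row_id, due, note in rows
    ]
```

The model would see at most `limit` short lines instead of the whole table, and each line carries a `source` that can be shown next to the answer or checked when something looks wrong.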
Natural language still matters. A lot.
The whole point is that I can ask Klaw for an outcome in plain English, and it can assemble context, choose the right workflow, run it, and explain what happened.
But once a pattern repeats, it should graduate out of prompt-land and into code.
That’s where agents start becoming infrastructure instead of chat sessions.
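One way to picture a pattern graduating into code is a small registry of named workflows, where the model's only job is to pick a name from the list. This is a sketch under assumptions, not a description of how Klaw does it: the `workflow` decorator, the two example workflows, and the `choose` callable standing in for a model call are all hypothetical.

```python
from typing import Callable

Workflow = Callable[[], str]
# Hypothetical model call: given the request and the allowed names, return one name.
ChooseFn = Callable[[str, list[str]], str]

WORKFLOWS: dict[str, Workflow] = {}


def workflow(name: str) -> Callable[[Workflow], Workflow]:
    """Registers a function as a known, repeatable workflow."""
    def register(fn: Workflow) -> Workflow:
        WORKFLOWS[name] = fn
        return fn
    return register


@workflow("weekly_spend_report")
def weekly_spend_report() -> str:
    return "spend report generated"


@workflow("inbox_triage")
def inbox_triage() -> str:
    return "inbox triaged"


def run_request(request: str, choose: ChooseFn) -> str:
    """The model chooses between known paths; code checks the choice and runs it."""
    options = sorted(WORKFLOWS)
    choice = choose(request, options)
    if choice not in WORKFLOWS:
        # Fail loudly instead of letting the model improvise a procedure.
        raise ValueError(
            f"model chose unknown workflow {choice!r}; expected one of {options}"
        )
    return WORKFLOWS[choice]()
```

When a new pattern starts repeating, it becomes one more registered function rather than another paragraph in the system prompt, and an invalid choice is an ordinary exception with an ordinary stack trace.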
And there’s a second-order effect I think people underweight: once the agent is not just a pile of prompts, you can build real software on top of it.
The interface does not have to be a chatbot.
It can be chat.
It can be a mobile app.
It can be a dashboard.
It can be a button.
It can be a background job that just does the thing.
All the automations, permissions, adapters, databases, logs, and weird little connections between systems already exist underneath. The parts that need judgment can still call the LLM through the harness. But the product surface can be whatever makes sense.
That’s the part I keep coming back to.
The best personal agents will probably feel conversational at the edge and boring underneath.
And once they’re boring underneath, they stop being just agents.
They become a way to build software.