Tom Ballard

4.8K posts

@tcballard

building tools I wanted, and open sourcing them…

Github 👉 Joined May 2024
1.9K Following · 871 Followers
Pinned Tweet
Tom Ballard
Tom Ballard@tcballard·
Most LLM routers decide where to send a prompt by… calling another model. Wayfinder doesn’t. It reads the prompt's structure - deterministic, offline, microseconds - to keep cheap prompts on your local model and send hard ones to the frontier ones. Self-hosted, bring your own keys, OpenAI-compatible.
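A rough sketch of what "reading the prompt's structure" could look like, to make the idea concrete. This is not Wayfinder's code: the feature set, weights, threshold, and every name below are assumptions chosen for illustration. The only property it tries to show is the one the post claims, that the decision is a pure function of the text with no model call and no network access.

```typescript
// Hypothetical sketch: names, features, and weights are illustrative, not Wayfinder's.
type Route = "local" | "frontier";

interface PromptFeatures {
  chars: number;       // raw length of the prompt
  codeBlocks: number;  // number of fenced code blocks
  listItems: number;   // bullet or numbered list lines
  questions: number;   // question marks
}

// Built at runtime so this sketch does not contain a literal fence marker.
const FENCE = "`".repeat(3);

// Pure structural features: same input always yields the same output.
function extractFeatures(prompt: string): PromptFeatures {
  const lines = prompt.split("\n");
  const fenceMarkers = prompt.split(FENCE).length - 1;
  return {
    chars: prompt.length,
    codeBlocks: Math.floor(fenceMarkers / 2),
    listItems: lines.filter((l) => /^\s*(?:[-*]|\d+\.)\s+/.test(l)).length,
    questions: (prompt.match(/\?/g) ?? []).length,
  };
}

// Weighted score in [0, 1]; each feature is capped so no single one dominates.
function scorePrompt(f: PromptFeatures): number {
  return (
    Math.min(f.chars / 2000, 1) * 0.4 +
    Math.min(f.codeBlocks / 2, 1) * 0.3 +
    Math.min(f.listItems / 5, 1) * 0.2 +
    Math.min(f.questions / 3, 1) * 0.1
  );
}

// Assumed threshold: at or above it, the prompt goes to the frontier model.
function routePrompt(prompt: string, threshold = 0.35): Route {
  return scorePrompt(extractFeatures(prompt)) >= threshold ? "frontier" : "local";
}

console.log(routePrompt("What is 2 + 2?"));
```

A real router would presumably tune features and thresholds against observed traffic; the fixed weights here only keep the example small.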
Tom Ballard@tcballard·
@ashwingop this is why I built Lore back in late May, so we preserve context and decisions in a deterministic, versioned manner. And have built Skills that write to that corpus, so you preserve your context. github.com/itsthelore/rac…
Ashwin Gopinath@ashwingop·
Claude Tag is a Trojan horse. Not because Anthropic is doing anything evil. Because the incentives are obvious. Day one, this looks like a great feature: tag Claude in Slack, let it follow the thread, remember context, connect to tools, break down tasks, chase work, and act like a teammate. But that is exactly the problem. The moment your AI vendor becomes a shared coworker, it stops being just a model provider. It starts becoming the place where work is interpreted, remembered, routed, and eventually executed. That is not model lock-in. That is context lock-in. You are now renting your company back from them. Models can be swapped. Agents can be copied. But the memory of how your company actually works is much harder, maybe impossible, to move: the Slack scar tissue, the exception paths, the customer promises, the unfinished threads, the weird workflows, the implicit owners, the “we tried that in Q2 and it failed” knowledge. Once that lives inside one vendor’s agent layer, you are not renting intelligence anymore. You are renting your company’s operating memory. And the pricing model makes it even more dangerous. A human coworker has a salary. Claude has unbounded tokenized activity. The more work moves through it, the more the vendor captures not just IT spend, but labor spend. This is the enterprise bargain people will regret: convenience now, and a rapid descent into dependency. The right architecture is simple: rent the best intelligence from whoever is best this month. OpenAI, Anthropic, Gemini, open source, whatever. But own the context layer. Your company memory should be inspectable, permissioned, portable, and model-neutral. It should not be buried inside the same vendor that sells you the intelligence and the workflow surface. Claude Tag is useful. That is why it is dangerous. Rent the intelligence, but own the context. Or, regret later.
Claude@claudeai

Introducing Claude Tag, a new way for teams to work with Claude. In Slack, Claude joins as a team member with access to the channels and tools you choose. Tag Claude in and delegate tasks to it while you focus on other work.

Tom Ballard@tcballard·
@NathanFlurry use lore so it captures the decisions and then doesn't spiral when using goal... bit more time upfront to design and make decisions, but if you then want to leave your agents running for 6hrs you can be safe knowing the bounds are respected... github.com/itsthelore/rac…
Nathan Flurry 🔩@NathanFlurry·
/goal honestly sucks. in claude, it's too defensive and never finishes; in codex, it runs indefinitely and gets lost. ralph still wins bc of the structured plan, at the cost of up front planning. there needs to be a middle ground of the flexibility of goal and planning of ralph
Tom Ballard@tcballard·
Point Claude Code at a router that sends easy prompts to your local model and hard ones to the cloud - decided offline, deterministically, with no extra model call. Hit your budget and it degrades to local instead of failing. Open source, BYO-key. github.com/itsthelore/way…
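To illustrate the "degrades to local instead of failing" behaviour, here is a minimal sketch of a budget-aware routing decision. It is an assumption about how such a rule could be written, not Wayfinder's implementation: the `Budget` shape, the cost estimate, and all names are hypothetical.

```typescript
// Hypothetical sketch: types and names are illustrative, not Wayfinder's API.
type Difficulty = "easy" | "hard";
type Backend = "local" | "cloud";

interface Budget {
  limitUsd: number; // total spend allowed for the period
  spentUsd: number; // spend recorded so far
}

interface Decision {
  backend: Backend;
  reason: string;
}

// Decide where a prompt goes. The function never throws on budget exhaustion:
// a hard prompt that cannot be afforded falls back to the local model.
function chooseBackend(
  difficulty: Difficulty,
  estimatedCostUsd: number,
  budget: Budget,
): Decision {
  if (difficulty === "easy") {
    return { backend: "local", reason: "easy prompt stays local" };
  }
  const remainingUsd = budget.limitUsd - budget.spentUsd;
  if (estimatedCostUsd > remainingUsd) {
    return { backend: "local", reason: "budget exhausted, degraded to local" };
  }
  return { backend: "cloud", reason: "hard prompt within budget" };
}

console.log(chooseBackend("hard", 0.25, { limitUsd: 10, spentUsd: 9.9 }));
```

The design point is that the budget check changes the destination rather than raising an error, so a long agent session keeps running, at lower quality, once the cloud budget is gone.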
Tom Ballard@tcballard·
Took this to the extreme @RhysSullivan ... and extended Lore.

Lore Proofkeeper

Autonomous verification for the Lore family. Proofkeeper keeps the proof - the stable test plus its replayable trace - so an agent's work is verified by reading the committed test and its trace in the pull request, not by a local run.

Status: v0.0.1 prototype. The full drive→compile→fidelity→run pipeline now works end-to-end. The coverage read-model (below) runs against a real corpus graph.

The local Playwright runner and the fidelity gate are real: the runner drives an actual browser, parses Playwright's JSON report into typed results, and emits a replayable trace per run; the gate accepts a test only after N green re-runs.

The session→test compiler is real: a Recorder captures real browser actions (recording each only after it succeeds) and a deterministic emitter compiles that trace into a .spec.ts.

And the autonomous drive is real: a bring-your-own-model agent loop observes the page, decides the next action, drives the product through the Recorder, and produces a session that compiles into a fidelity-gated test, proven end-to-end with a model deciding actions from page observations. Proofkeeper bundles no model; you supply a ModelClient.

And the ## Verified By write-back is real: it merges the verification links into a requirement artifact and proposes them as a human-reviewed pull request (never a direct commit to the base branch). The merged artifact validates clean against the real engine (rac validate + rac relationships --validate), and the resulting verified_by edge flips the capability from unverified to verified in the coverage report (see Scope).

Given real developer tools - a browser and a terminal, bring your own model - Proofkeeper:

The coverage signal

Proofkeeper's free, local hook into a Lore corpus is the coverage read-model. A requirement node in rac export --graph is a product capability. The engine emits a typed, directed verified_by edge from a capability to each test/trace that verifies it. Because those targets are external files (not corpus artifacts), the edge is always emitted with resolved: false and the literal reference as its target. A capability is unverified when it has no outgoing verified_by edge. That's the whole signal - pure, deterministic, no browser and no model required.

```sh
# Report unverified capabilities from a graph export
proofkeeper coverage --graph-file graph.json

# Machine-readable, for CI gating
proofkeeper coverage --graph-file graph.json --json

# Convenience: shell out to `rac export --graph` if `rac` is on PATH
proofkeeper coverage --corpus path/to/rac/
```

Exit codes: 0 every capability is verified, 1 one or more are unverified (so it gates cleanly in CI), 2 usage or parse error.

Autonomous drive (bring your own model)

The AutonomousDriver observes the page, asks your model for the next action, and drives the product through the Recorder, recording only what succeeds. Proofkeeper bundles no model: you implement ModelClient against your provider. A reference adapter for the Anthropic Claude API ships in the box - ClaudeModelClient - but it is optional. @anthropic-ai/sdk is an optional peer dependency, imported lazily, so installing Proofkeeper never pulls in a model SDK. Use the adapter, or implement ModelClient directly for any provider:

```typescript
import { chromium } from "@playwright/test";
import {
  AutonomousDriver,
  CodegenCompiler,
  PlaywrightRunner,
  assessFidelity,
  ClaudeModelClient, // optional reference adapter (needs `npm i @anthropic-ai/sdk` + ANTHROPIC_API_KEY)
} from "@itsthelore/proofkeeper";

// Option A: the reference Claude adapter (defaults to claude-opus-4-8):
const model = new ClaudeModelClient({ /* apiKey?, model?, thinking?, effort? */ });

// Option B: bring your own provider by implementing ModelClient:
const customModel = {
  async complete(request) {
    /* call your LLM with request.transcript and request.tools */
    return { toolCalls: [/* { name, arguments } */] };
  },
};

const page = await (await chromium.launch()).newPage();
const { session, finished } = await new AutonomousDriver(page, model, {
  capabilityId: "REQ-VERIFY",
  title: "verify interaction flips status to verified",
  startUrl: "http://localhost:3000/",
  goal: "Click Verify and confirm the status changes to 'verified'.",
}).drive();

// Compile the recorded session and keep it only if it is stable.
const candidate = await new CodegenCompiler({ outDir: "tests/generated" }).compile(session);
const verdict = await assessFidelity(new PlaywrightRunner(), candidate, {
  n: 5,
  target: { name: "dev", baseURL: "http://localhost:3000/" },
});
```

Write-back (propose ## Verified By)

Once a test is stable, Proofkeeper proposes linking it to the capability it verifies by opening a human-reviewed pull request against the target's Lore corpus. It never commits to the base branch (ADR-065): it branches, commits the merged artifact to the branch, and opens a PR base ← head. The merge is pure and idempotent; re-proposing an already-present link opens no PR. Repository operations go through an injected RepoGateway, so there is no hard GitHub dependency - wire it to Octokit, the gh CLI, or a GitHub MCP client:

```typescript
import { GitHubWriteBackProposer, linksFromResults } from "@itsthelore/proofkeeper";

const proposer = new GitHubWriteBackProposer(gateway /* your RepoGateway */, { baseBranch: "main" });
const result = await proposer.propose({
  capabilityId: "REQ-VERIFY",
  targetPath: "rac/requirements/verify.md",
  links: linksFromResults(candidate, verdict.stable ? runResults : []),
});
// result: { status: "proposed", url, number, headBranch } | { status: "no-change", reason }
```

The merged artifact validates against the real engine (rac validate and rac relationships --validate stay clean), and the emitted verified_by edge turns the capability from unverified to verified in proofkeeper coverage.

Install & develop

```sh
npm install
npm run typecheck   # strict TypeScript
npm test            # vitest unit tests (fast, no browser)
npm run build       # emit dist/

# Browser-driven end-to-end checks (real Chromium):
npx playwright install chromium
npx playwright test                     # run the seed spec
PROOFKEEPER_E2E=1 npx vitest run \
  tests/runner.integration.test.ts      # runner + fidelity gate, real browser
```

The default npm test is fully hermetic: no browser required. The runner and fidelity-gate integration tests launch a real browser and are gated behind PROOFKEEPER_E2E so they run only when you opt in (and in the CI e2e job).

Requires Node ≥ 20. Published as @itsthelore/proofkeeper (npm). A lore-proofkeeper PyPI counterpart may follow; the npm package is the Playwright-native primary.

v0.0.1 scope

In:
- repo scaffold (packaging, Apache-2.0 + DCO, CI);
- the coverage read-model end-to-end;
- a real local Playwright runner (drives a browser, parses the JSON report into typed results, emits a replayable trace) gated by the fidelity gate over N green re-runs;
- a real session→test compiler: a Recorder that captures faithful browser actions and a deterministic emitter that compiles them into a .spec.ts;
- a real autonomous drive: a BYO-model agent loop (AutonomousDriver) that observes the page, decides the next action, and drives the product through the Recorder, proven end-to-end by a model deciding actions from observations through compile + a 3× green fidelity pass;
- a real ## Verified By write-back: an idempotent artifact merge (validated clean against the real engine) proposed as a human-reviewed pull request through an injected RepoGateway, never a direct commit to the base branch.

It ships an optional reference ModelClient adapter for the Anthropic Claude API (ClaudeModelClient), behind the bring-your-own-model boundary: the model SDK is an optional peer dependency, never a hard one.

Deferred (named, not silently dropped):
- a bundled RepoGateway (the write-back is gateway-agnostic: wire Octokit/gh/GitHub MCP, like the model adapter);
- reference ModelClient adapters for other providers;
- a terminal tool surface (the drive is browser-only today);
- generalization of the recorder/tool set beyond the core actions;
- the cross-target/cross-OS matrix and VM-fabric runner;
- Proofkeeper Cloud (the hosted commercial tier);
- a Lore MCP client.
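The coverage rule described above (a capability is unverified when it has no outgoing verified_by edge, with exit codes 0, 1, and 2) is small enough to sketch. What follows is not Proofkeeper's source: the JSON shape of the graph export (`nodes`, `edges`, `source`, `target`) and all function names are assumptions for illustration, and the real `rac export --graph` format may differ.

```typescript
// Hypothetical graph shape: the real `rac export --graph` schema may differ.
interface GraphNode { id: string; type: string; }
interface GraphEdge { type: string; source: string; target: string; resolved: boolean; }
interface Graph { nodes: GraphNode[]; edges: GraphEdge[]; }

// Returns null on malformed input so the caller can map it to a parse error.
function parseGraph(json: string): Graph | null {
  try {
    const parsed = JSON.parse(json);
    if (parsed && Array.isArray(parsed.nodes) && Array.isArray(parsed.edges)) {
      return parsed as Graph;
    }
    return null;
  } catch {
    return null;
  }
}

// A requirement node is a capability; it is unverified when no verified_by
// edge leaves it. Pure and deterministic: no browser, no model.
function unverifiedCapabilities(graph: Graph): string[] {
  const verified = new Set(
    graph.edges.filter((e) => e.type === "verified_by").map((e) => e.source),
  );
  return graph.nodes
    .filter((n) => n.type === "requirement" && !verified.has(n.id))
    .map((n) => n.id)
    .sort();
}

// Mirrors the documented exit codes: 0 all verified, 1 some unverified, 2 parse error.
function coverageExitCode(json: string): 0 | 1 | 2 {
  const graph = parseGraph(json);
  if (graph === null) return 2;
  return unverifiedCapabilities(graph).length === 0 ? 0 : 1;
}

const example: Graph = {
  nodes: [
    { id: "REQ-VERIFY", type: "requirement" },
    { id: "REQ-EXPORT", type: "requirement" },
  ],
  edges: [
    // External target, so the edge carries resolved: false and a literal reference.
    { type: "verified_by", source: "REQ-VERIFY", target: "tests/generated/verify.spec.ts", resolved: false },
  ],
};
console.log(unverifiedCapabilities(example));
```

Because the check only reads edges, it can run in CI on every pull request without launching a browser, which matches the "free, local hook" framing in the post.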
Rhys@RhysSullivan·
sharing my ideas too, if you build this i will be your first customer

i am desperate for an 'OpenDevin' but not in the background agent sense, in their autonomous QA sense

the idea here is you give the agent the same tools that you use to develop and use your product, think codex computer use but if it could be turned into e2e tests as well. you should be able to verify an agent's work without having to run anything locally, purely by looking at the e2e test and test output

misc requirements:
- the agent gets real developer tools to test your product (chrome, a terminal, etc)
- the agent is able to develop the app using those tools, and then once it's done developing can turn them into tests in the repo
- you're able to plug in multiple 'targets' to the same / similar tests, i.e "test the dev server, test production"
- be able to run the tests quickly and also against different environments (i.e against MacOS, Windows, Linux)
- needs to be open source and run locally
- needs to output videos that can be played back to see the output

for monetization you can likely turn this into a hosted product with vms to run the tests on but i'm also so desperate for a good version of this that i'd do github sponsors

i've implemented an ok version of this in github.com/RhysSullivan/e… but it has so much more potential to be better

if you have built this already, have your agent open a PR to my repo with your product and i'll take a look, if you're interested in building this i will get on a call with you to explain it in more detail
Theo - t3.gg@theo

I have a lot of ideas. I wish I could build them all. I don't have the time. I decided to give them all away in hopes of someone else building them.

Michael@michael_chomsky·
theo's ideas for anyone who didn't watch the vid yet:
- better npm/npx that fixes security + publishing drama
- git but actually good
- dropbox designed for devs (r2 wrapper)
- entirely new mobile platform cuz current ones suck
- better slack
- improved benchmarking tool

now go start a /loop and make a billion dollars
Theo - t3.gg@theo

I have a lot of ideas. I wish I could build them all. I don't have the time. I decided to give them all away in hopes of someone else building them.

Morgan@morganlinton·
Absolutely, VulcanBench is free and open source so you can use the engine to run any benchmark you want. Likely the best way to do this would be to download the source code, and then pop into whatever you use for agentic coding and tell it you want to update VulcanBench to run your benchmark. Here's the repo: github.com/morganlinton/V…
Morgan@morganlinton·
And, my first benchmark with VulcanBench is now complete ✅ Not totally thrilled with the results tbh, I still think there's quite a bit more work to do refining the actual tasks themselves, and making sure they cover a wider array of difficulty. My thinking is the easy and medium difficulty tasks are actually too easy for these models, and with only three hard tasks, I'm probably missing the opportunity to get a lot more signal out of the benchmark. What is interesting is that GLM 5.2 turned out to be the most expensive to run, Opus and GPT 5.5 came out at almost the exact same cost. A lot to dive into here, but I'm off to bed, probably not much to see here, but a lot for me to learn from. Either way, feels good to have one benchmark completed so I can really zero in on where I can improve. As the saying goes, live long and benchmark 🖖
VulcanBench@vulcanbench

Okay, the very first run of our full 52 test benchmark suite, comparing GLM 5.2, Opus 4.8 and GPT 5.5 is done. Came in under the cost estimate by quite a bit, which is nice, but I think there's still quite a bit of work to do before this is very meaningful. Overall, while there are 52 tests total, I think the easy and medium were likely too easy, and there weren't enough hard tasks. So, I don't think this is anything to write home about yet, but still worth sharing the results, and now back to work improving the tasks, a lot to analyze from this first run. Will be converting the results into a more readable, sharable format tomorrow so anyone that wants to can do a deeper dive. But like I said, first run, and I don't think I quite have things dialed in yet, so might not be too interesting, yet.

Tom Ballard@tcballard·
@mattepstein soon you will just be able to think and Claude will do things for you
Tom Ballard@tcballard·
@ryancarson @JPoehnelt this is the sort of thing which will burn google long term when big engineering talent do cool stuff, and then they get forced out… impacting other programs
Ryan Carson@ryancarson·
I can't believe Google did this. It's the perfect demonstration of the dysfunction of large orgs. @JPoehnelt - wherever you choose to work next is going to be very lucky to have you.
Justin Poehnelt@JPoehnelt

Two months ago I was fired by Google for creating the Google Workspace CLI. It went viral, hit #1 on Hacker News, gained thousands of GitHub stars and many thousands of actual users in just a couple days. It was an incredible, confusing journey, from directors and leaders asking what they could learn from the tool to getting grilled by legal about why the Google logo and brand colors are on the Google Workspace GitHub code repositories. I think the cause was that Workspace and certain leaders (and projects) were afraid of being disrupted. But the fear wasn't specific to my CLI, it was a broader fear in what agents meant for Workspace. Either way, the irony of my termination was the announcement at Google Cloud Next two days before I was fired that an official Workspace CLI was coming. I want this out there because it is easier for me to explain my story and it is an experience I want to fully own. It's also part of my healing. Nearly 7 years at Google was an incredible opportunity for me and I was fortunate to have wonderful teammates and a manager that fully supported me through these last few months. Thank you.

Tom Ballard@tcballard·
@hasantoxr I’d augment the RAG layer to include Lore, which is a deterministic layer that feeds into those graphs… grounding in truth as opposed to synthesised LLM authored/maintained data github.com/itsthelore/rac…
Hasan Toor@hasantoxr·
The 3 architectures every AI agent builder needs to know.
Tom Ballard@tcballard·
@sqs it’s why I built Wayfinder… as I was getting a bit frustrated about switching/routing between models
Quinn Slack@sqs·
The trillion-dollar question
Michael McNair@michaeljmcnair

I want to explain why developing frontier models is such an awful business right now. The core problem is that frontier models have close substitutes, and those substitutes have a materially lower cost structure. Let me explain why that is a nightmare combination for profit capture. Most industries deal with substitutes. For ex, Gillette faces online razor brands and private label competition. That limits pricing power, but Gillette still has a superior cost structure that allows it to earn attractive profits. The LLM business is in a much worse position. The market clearing price is set by substitutes. But the frontier labs have a materially higher cost structure bc they must bear the full cost of discovery (R&D, training, failed experiments, infrastructure, etc). Meanwhile, distillation allows their outputs to become the raw material to create slightly inferior substitutes at a fraction of the cost. That's a uniquely brutal competitive dynamic. It might actually be the worst free-rider problem I’ve ever seen. If open weight models can deliver most of the capability at a fraction of the cost, then customers don’t just have to prefer the frontier model. They have to believe the frontier output is worth many, many times more on the marginal task. That is an extremely high bar. The hope is that frontier models will keep their advantage in the highest value tasks. Bc the more valuable the task, the more valuable the superior output becomes. But test-time scaling threatens that assumption too. If performance on long-horizon tasks keeps improving with more test-time compute, then a slightly inferior but much cheaper model can run longer for the same cost. So it can afford to spend more on search, verification, retries, decomposition, and tool use. In that world, cost efficiency becomes a capability advantage. 
That will matter most in domains where answers can be checked like, coding, math, data analysis, structured research, optimization problems, tool based workflows, and agentic tasks with objective feedback loops. In other words, the frontier labs’ current bread and butter. So with these long-horizon tasks, the cheaper substitute actually becomes even more of a threat. An industry structure where close substitutes have dramatically lower cost structures doesn’t just risk capping frontier model profitability. Once the full costs are factored in, it can prevent frontier labs from earning any economic profit at all. Frontier labs may eventually find a way to differentiate enough to earn real pricing power. But so far the opposite is happening…the substitutes are closing the gap.
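The test-time compute point can be made concrete with a small calculation. This is a sketch under loud assumptions that are not in the thread: attempts are independent, a verifier can check answers perfectly, and the success rates and prices below are made-up illustrative numbers, not measurements of any real model.

```typescript
// Illustrative only: probabilities and prices are invented for the example.
interface ModelProfile {
  name: string;
  successPerAttempt: number; // assumed chance a single attempt is correct
  costCentsPerAttempt: number;
}

// With independent attempts and a perfect verifier, the chance that at least
// one of k attempts succeeds is 1 - (1 - p)^k.
function successWithinBudget(model: ModelProfile, budgetCents: number): number {
  const attempts = Math.floor(budgetCents / model.costCentsPerAttempt);
  return 1 - Math.pow(1 - model.successPerAttempt, attempts);
}

const frontier: ModelProfile = { name: "frontier", successPerAttempt: 0.6, costCentsPerAttempt: 100 };
const cheaper: ModelProfile = { name: "cheaper", successPerAttempt: 0.4, costCentsPerAttempt: 10 };

const budgetCents = 100; // the same spend for both models
for (const m of [frontier, cheaper]) {
  console.log(m.name, successWithinBudget(m, budgetCents).toFixed(3));
}
```

Under these assumed numbers the weaker model gets ten checked attempts for the price of one frontier attempt, so it ends up with the higher chance of producing a verified answer. The effect depends entirely on answers being checkable, which is why the argument singles out coding, math, and similar domains.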

Chris Ashby@chris_bgp·
Markdown is the new programming language. But writing markdown sucks. So I created a beautiful app to write and edit markdown files with agents, locally. It's called Skribe, an open-source, beautiful markdown editor and you can try it for free today.

Just:
- Open your codebase
- Only markdown files are shown
- Edit and write markdown alongside Claude Code or Codex within a beautiful editor
Tom Ballard@tcballard·
agree. It’s the VC cycle funding themselves again. Building open source, like I am, is more of a passion project… yet I am paying for the MAX plan to keep myself moving forward. Maybe @claudeai can provide some new method for foundational OS not backed by Foundations to keep going?
Jason Kneen@jasonkneen·
You know who doesn’t need free api credits. VC backed founders. You know who does? All the unpaid open source developers whose products and tools are used by the VC backed founders. Stop treating OSS maintainers and creators as second class citizens.
Ksenia Moskalenko@kseniam0s

FOUNDERS: You've been paying to build on Claude. @AnthropicAI launched a program to change that.

@Claudeai for Startups - free API credits and priority rate limits for early-stage VC-backed founders:
- Free Claude API credits
- Highest rate limits, no throttling in production
- Hackathons, Founder Days, and meetups
- Early access to new model releases

Build with the full Claude stack: Claude API, Claude Code, Claude Managed Agents, and Claude Cowork.

To qualify: your startup must be early-stage and backed by one of Anthropic's partner VCs. Ask your investors for a unique application link.

Apply → claude.com/programs/start…

P.S. Founders using Claude to build - when you're ready to raise, @ThePageform is where your data room lives → pageform.io
