pre.dev

892 posts

@predotdev

Self-driving coding agent

Joined August 2023

168 Following, 1.9K Followers

Pinned Tweet

pre.dev@predotdev·6 Mar

For the engineer in all of us

pre.dev@predotdev·2d

We brought raw benchmark data to the stage at Mana Tech scale2miami. Adam took the stage to showcase predev and share our latest Terminal-Bench 2.0 results with the cohort. A massive thank you to the entire team at Mana Tech (@ManaTechMiami) for having us. Btw, Adam (@adampredev) is based in Miami. If you are in the Miami AI scene and want to link up to talk agentic systems, shoot him a DM.

pre.dev@predotdev·3d

Is your business throwing tokens at engineering tasks to rush agent adoption? Have you actually measured the ROI of that spend? If not, you might be tokenmaxxing. Here is what that means.

As frontier labs 10x their revenue for near-trillion-dollar IPOs, real businesses are hitting a bitter realization: they have zero visibility into their actual token ROI.

Yes, devs are shipping faster. They are tackling experimental projects that would otherwise never have seen the light of day. But are your margins actually expanding, or is your output merely accelerating? There is a profound difference between a genuine productivity uplift and mere output acceleration.

If your monthly token bill now equals the cost of the full-time engineer you would have hired anyway, you haven't saved a dime. Worse, mass-producing code creates a hidden tax: a massive future token bill just to maintain that technical debt. More projects equals more maintenance.

ROI has officially entered the room. Throwing an overkill model like Opus 4.7 max at basic typos isn't a productivity revolution. It’s pure waste that flatlines your business margins.

But that’s okay. This is an entirely new paradigm, and optimization takes time. It’s highly reminiscent of the early days of serverless computing. Back then, companies left beefy, expensive instances running 24/7 just to avoid cold starts on a microservice that ran once a month. Eventually, they learned to architect for the platform.

Now, corporate management is trying to tackle token spend all wrong: restricting model access by seniority, assigning arbitrary budgets, and demanding bureaucratic approvals. It’s an absolute nightmare waiting to happen. The solution isn’t policing developers; the efficiency lies entirely in the harness layer.

That’s why we’ve been relentless about optimizing predev to maximize the token runway our customers see. And it's not just about cutting costs; it's about performance.

In our latest Terminal-Bench 2.0 run, predev running the smaller Sonnet 4.6 outright beat Claude Code running the massive Opus 4.5. We dropped an entire model tier, raised accuracy, and crushed the per-task bill. Because an agent's architecture matters far more than its raw weights.

We built predev for real, high-ROI use cases: agencies building for clients, vendors creating proofs-of-concept, enterprise teams orchestrating complex data movement, and startups protecting their MVP runway.

If ROI matters to your business, stop burning capital on brute-force reasoning. Send us a DM or reach out through our website to discuss how we can optimize your token spend.
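One way to make the "productivity uplift vs. output acceleration" question concrete is to track spend per task that actually passes, not spend per task attempted. The sketch below is a minimal illustration under stated assumptions: the `AgentRun` type, both function names, and every dollar figure are hypothetical placeholders, not predev's accounting or real pricing. Only the two pass rates (56.2% and 53.9%) come from the Terminal-Bench 2.0 numbers quoted in this feed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentRun:
    """One agent configuration measured on a fixed task set (hypothetical type)."""
    name: str
    pass_rate: float      # fraction of tasks solved, in (0, 1]
    cost_per_task: float  # average model spend per attempted task, in dollars


def cost_per_solved_task(run: AgentRun) -> float:
    """Average spend needed to obtain one task that actually passes."""
    if not 0.0 < run.pass_rate <= 1.0:
        raise ValueError("pass_rate must be in (0, 1]")
    return run.cost_per_task / run.pass_rate


def monthly_break_even_tasks(run: AgentRun, engineer_monthly_cost: float) -> float:
    """Attempted tasks per month at which token spend equals one engineer's cost."""
    return engineer_monthly_cost / run.cost_per_task


if __name__ == "__main__":
    # Pass rates are the ones quoted in this feed; the per-task costs and the
    # engineer cost are made-up placeholders for illustration only.
    runs = [
        AgentRun("smaller model + harness", pass_rate=0.562, cost_per_task=0.50),
        AgentRun("larger model", pass_rate=0.539, cost_per_task=1.00),
    ]
    for run in runs:
        print(run.name, round(cost_per_solved_task(run), 2),
              round(monthly_break_even_tasks(run, 12_000.0)))
```

Dividing by the pass rate is the design choice that matters here: a cheaper configuration only wins if its failures do not eat the savings.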

pre.dev@predotdev·4d

What's the opposite of tokenmaxxing? Stop burning tokens on brute force. If you’re using coding agents but seeing diminished ROI, this breakdown is for you.

pre.dev@predotdev

x.com/i/article/2057…

pre.dev@predotdev·4d

x.com/i/article/2057…

pre.dev@predotdev·5d

good point, agreed. coding fundamentals are more relevant than ever. yet, that solves the how but not the why. a coding agent that sees each task and context in isolation will not go beyond passing tests. solving architectural friction requires planning, multiple solution paths and a verifier that re-runs the criteria against the output.

Sable Project@Sable_Project·5d

@predotdev I have not found this issue in two of our products. But we have a clean code base and good documentation. I work in ERPs, and we tell our users “Shit data in, shit data out”. I feel like the same goes for coding agents. Messy starting code base = messy code output.

pre.dev@predotdev·6d

Why do AI coding agents completely fall apart the second you drop them into an existing codebase? It isn't a context window issue. It's something much more fundamental...

pre.dev@predotdev·5d

@bettercallsalva we open sourced the trajectories and harbor results github.com/predotdev/pred…

Thiago Salvador@bettercallsalva·5d

@predotdev skeptical of benchmark wins until i see how it handles 6+ tool call chains on the same stateful task. opus on sonnet costs more but the planning depth is what i actually pay for. open to being wrong but a public eval suite + raw traces would help.

pre.dev@predotdev·5d

If you have been using Claude Code professionally, take a minute to read this. We beat Opus with Sonnet by using the predev harness. Here is what it means for agentic coding:

Orchestration beats brute reasoning. A smaller model running on our architecture just beat Claude Opus on Terminal-Bench 2.0.

For the last two years, the default way to improve an AI coding agent has been simple: throw more money at the model. Upgrade the tier, burn more tokens, buy a bigger context window, and hope for better code. But a model's raw weights matter far less than the system architecture wrapping it.

Seeing a clear ROI on every shipped feature has a direct impact on the success of your business. So does spending 50% more tokens per feature.

We wanted to prove that a smarter harness could break the trend of relying on brute compute. So we put it to the test on Terminal-Bench 2.0.

We ran predev + Sonnet 4.6 on the Harbor reference harness against the Terminal-Bench 2.0 task set. The Claude Code numbers were taken directly from their public submissions on tbench.

The final result: predev + Sonnet 4.6 scored a 56.2% pass rate. Claude Code running the massive Opus 4.5 scored 53.9%. We dropped an entire model tier and finished ahead. Accuracy went up, while the per-task model bill went down.

We didn't close that performance gap by paying for a premium model; we did it by spending tokens more efficiently. Buying a bigger model just trades dollars for points. A better harness gets the points without making that trade.

Frontier labs build mass-market engines. They cannot specialize their layer for deep engineering tasks. It is exactly like a database program: the engine doesn't tell you how to organize your application data. We built our harness narrowly and explicitly for complex, production-ready systems engineering.

The core architecture rests on three uncompromising loops:

It plans before it codes. The agent extracts a structured blueprint with milestones before a single file is touched.

It uses dynamic execution paths, leveraging ToDo dependency graphs and parallel analysis.

It verifies before passing. A blind verifier re-runs acceptance criteria against the output and disagrees freely.

You don't need to be a frontier lab to beat one. You just need a system engineered for the actual work.

We built predev for real, live use cases and customers. predev customers include software development agencies building products for clients, software vendors building proofs-of-concept for their prospects, enterprise teams building internal data pipelines, and startups getting their MVP out on time and on budget.

If ROI matters to your business, stop burning tokens on brute force. Head over to predev, run your project through our harness, be it an existing codebase or greenfield, and watch us ship.

The open-sourced Harbor trajectories and repository are live on GitHub. The full breakdown of the results is live on our website. (Links to both in the comments).
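The three loops described above (plan first, execute along a dependency graph, verify blind) can be sketched in a few lines. This is a minimal illustration, not predev's implementation: `run_plan`, the `Step` and `Criterion` types, and the toy plan are all hypothetical names introduced here, and the "parallel" batches run serially to keep the example short. It relies only on Python's standard-library `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set

# Hypothetical types: a step reads earlier outputs and returns its own output;
# a criterion inspects the final outputs and returns True if they are acceptable.
Step = Callable[[Dict[str, str]], str]
Criterion = Callable[[Dict[str, str]], bool]


def run_plan(
    steps: Dict[str, Step],
    depends_on: Dict[str, Set[str]],
    criteria: List[Criterion],
) -> Dict[str, str]:
    """Execute planned steps in dependency order, then verify the result."""
    # The plan exists before any step runs: every step is a node, each with
    # the set of steps it depends on (empty if none were given).
    graph = {name: depends_on.get(name, set()) for name in steps}
    sorter = TopologicalSorter(graph)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle

    outputs: Dict[str, str] = {}
    while sorter.is_active():
        # Every step in this batch has all its dependencies finished, so a
        # real harness could run the batch in parallel; here it runs serially.
        for name in sorter.get_ready():
            outputs[name] = steps[name](outputs)
            sorter.done(name)

    # "Blind" verification: criteria see only the outputs, not how they were made.
    failed = [i for i, check in enumerate(criteria) if not check(outputs)]
    if failed:
        raise AssertionError(f"acceptance criteria failed: {failed}")
    return outputs


if __name__ == "__main__":
    toy_steps: Dict[str, Step] = {
        "schema": lambda out: "users(id, email)",
        "api": lambda out: f"GET /users reads {out['schema']}",
        "tests": lambda out: f"tests cover: {out['api']}",
    }
    toy_deps = {"api": {"schema"}, "tests": {"api"}}
    print(run_plan(toy_steps, toy_deps, [lambda out: "users" in out["tests"]]))
```

Keeping the verifier separate from the steps is the point of the sketch: a step cannot mark its own work as passing, so a failed criterion surfaces as an error and is not silently accepted.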

pre.dev@predotdev·5d

@adampredev GitHub Repo: try.pre.dev/xDD8LOm

pre.dev@predotdev·5d

Full benchmark breakdown by Adam @adampredev try.pre.dev/RSvJ3pm

pre.dev@predotdev·6d

@adampredev Token maxxing should be using the least amount of tokens possible

Adam Elkassas@adampredev·6d

x.com/i/article/2056…

pre.dev@predotdev·6d

We show how it's possible to get Opus 4.5-like performance out of Sonnet 4.6, and Sonnet 4.5-like performance out of Haiku 4.5, with the right harness. We also open sourced the results and trajectories from the full Harbor Terminal-Bench 2.0 runs, so there is total transparency. We show there is still a lot of alpha to gain on the harness side, and we are emerging as a clear leader in the space.

Adam Elkassas@adampredev

x.com/i/article/2056…

pre.dev retweeted

Adam Elkassas@adampredev·17 May

It’s wild that token maxxing means using the most tokens possible. It should mean using the least tokens possible to get the job done.

pre.dev@predotdev·6d

Head over to predev, run a "Reverse-Engineer" scan on your project, and see the difference. It's time for a self-driving coding agent. try.pre.dev/02inXgW

pre.dev@predotdev·6d

No slop. No redundant utilities. Just agents that actually understand what you are trying to achieve. If you are managing client projects or scaling existing codebases, stop babysitting your AI.
