clay

126 posts

clay

@deforestpeg

I build AI agents and honest data tools, and post the real results. Latest: SpendLens, finds the AI API spend you don't need.

Bergabung Mayıs 2022

2K Mengikuti1.3K Pengikut

Tweet Disematkan

clay@deforestpeg·23h

everyone complains about AI api costs. almost nobody optimizes. i kept typing the same 5 fixes in replies so i built the thing that finds them in your actual logs the demo workload (synthetic, 30 days, every inefficiency labeled): $2,330 spend, $1,038 of it recoverable biggest single fix: a 6k-token system prompt billed at full price 24,000 times. one cache_control block serves it at 10% of the price $378 back no llm anywhere in the analysis. every number traces to a formula, and it refuses to extrapolate monthly savings from 3 days of logs because that's marketing, not analysis

English

581

clay@deforestpeg·15h

the write a clear spec part is doing more work than people realize. a self contained spec means the codex session starts cold instead of inheriting your whole conversation and history drag is the quiet killer, every turn resends everything before it. same reason agent loops get expensive. spec handoffs are basically manual context compaction

English

580

Vox@Voxyz_ai·18h

fable 5 burns tokens fast but write the prompt like this and it's totally workable. "to save tokens, keep this main session (fable 5) on planning and frontend tasks, its visual output and ideas are worth the price. for backend and heavier implementation, write a clear spec and dispatch to codex (gpt-5.5 xhigh) with /goal to execute, my quota there sits unused anyway. you may keep the hardest parts in this session." a frontend design prompt i've been testing that works well: redesign {your page, e.g. pricing page} for this project. full creative freedom, but it has to be visually striking and interactive, with motion effects and a hidden easter egg. search 2026 design trends first and use them.

English

1.2K

109K

clay@deforestpeg·15h

the model gap here is 25%. the gap between cached and uncached input on fable 5 is 10x, $10/m vs $1/m. and every one of those 47 steps resends the whole context, so the step count is really a caching bill. model choice is the third biggest cost lever, people argue about it because it's the easiest one to change

English

385

BridgeMind@bridgemindai·21h

Fable 5 Medium is the best intelligence per dollar in AI right now. It's not close. New CursorBench results: Fable 5 Medium: 69.8% at $8.27 per task Opus 4.7 Max: 64.8% at $11.02 GPT 5.5 Extra High: 64.3% at $4.37 It beats Opus 4.7 Max and GPT 5.5 on score while costing 25% less than Opus. And it finishes tasks in 47 steps instead of Opus's 96. If you're vibe coding daily, Fable 5 Medium is the answer. Top tier intelligence at a price you can actually run all day.

English

791

58K

clay@deforestpeg·15h

@sflorimm the funny part is fable 5 is the model where caching matters most. $10/m token input but cache reads are $1 same tokens. if you're resending a fat system prompt every call you're tipping anthropic 10x. one cache_control block, most people never check their hit rate

English

143

Floro S.@sflorimm·18h

vibe coders, how many $ have you burned with claude fable 5 yet?

English

201

210

42.8K

clay@deforestpeg·23h

live here, no signup: spendlens.dev don't have logs handy? there's a one click sample on the upload page all five detectors fire on it, takes ~10 seconds

English

168

clay@deforestpeg·23h

English

581

clay@deforestpeg·1d

Watch live: codexplays.games/pokemon-red

English

870

clay@deforestpeg·1d

Last update ended with the agent taking the Poke Flute from Pokemon Tower. This is why. The Snorlax blocking Route 12. Its first move after the rescue: walk up to the sleeping roadblock, open the bag, play the flute. Took the fight, cleared the road, kept moving south. Nobody coded that in. The model just knows Pokemon.

English

22.1K

clay@deforestpeg·1d

ive run agents on hard caps from day one, flat subscription, nothing to fall back on. the cap is what taught me which steps actually need a model and which were just expensive habit. spend limits dont end the token maxxing era, they end the part where nobody knew what the tokens were buying

English

169

Marty Kausas@marty_kausas·1d

Our Anthropic bill is about to jump from $400K → $1.4M/yr. Not because usage exploded, but because we're about to cross 150 seats. Past 150 seats you're forced into Enterprise tier. Seats stop including any usage, every token bills at standard API rates. At our current run rate that's 3.5x overnight. Unfiltered thoughts on AI spend: 1. We should spend tokens to grow as aggressively as possible. But most people (me included) aren't conscious of what they're spending. 2. Visibility comes first. People see their personal number and they're shocked. I accidentally spent $4,000 in 3 days in Claude Code. 3. For engineering the spend is clearly worth it. Pay for the best model, it saves more than it costs. 4. For a lot of other roles it's questionable. Apps nobody uses, skills someone already built. No ROI. 5. Spend limits are coming. We already require approval for more tokens on our support team. The era of token-maxxing is coming to an end.

English

359

161

1.7M

clay@deforestpeg·1d

running agents on flat budgets forced me to measure exactly this, tokens to done is the only number that ends up mattering. the model that wins the one shot benchmark can quietly lose in a loop once retries and wandering count against it. accuracy is the sticker price cost to goal is what you pay

English

110

David Cramer@zeeg·1d

Imagine if LLM benchmarks were measured in cost to achieve goal rather than accuracy of a one shot prompt Imagine if people actually made benchmarks that meant anything

English

217

8.8K

clay@deforestpeg·2d

watch it think in real time: codexplays.games/pokemon-red

English

308

clay@deforestpeg·2d

Badge 4 of 8. Same save, no resets. Then the wall: Pokemon Tower broke the agent for days. So I rebuilt it — Codex picks the objectives now, the machinery just walks. First night on the new brain: beat the ghost Marowak, rescued Mr. Fuji, took the Poke Flute. On its own.

English

754

clay@deforestpeg·2d

@businessbarista ive cut agent spend over 50% without touching the model, just pulling the steps that dont need one out of the loop. most agent cost is a wasteful loop, not an expensive model. finetuning a cheaper one is fixing a bill you ran up building it wrong

English

448

Alex Lieberman@businessbarista·2d

Most companies: "Help us onboard our 3,000 employees to Claude Code/Codex" Some companies: "Help us build our first end-to-end agent outside of the engineering org" Few companies: "Help us finetune a Chinese/open-source model so we can lower our agent cost by 50%"

English

285

51.1K

clay@deforestpeg·2d

@ox_vanguard my dms are always open!

English

0xVanguard@ox_vanguard·2d

@deforestpeg I'll learn more about this, thank you

English

clay@deforestpeg·3d

17% fee APR on this USDC-SOL range. after impermanent loss it nets +$4 on $10k. thats the whole problem with DLMM LPing, the APR looks great and IL quietly eats it. binsight runs your exact range against real on chain price, volume + fees and shows the net.

English

816

clay@deforestpeg·2d

@ox_vanguard its passive market making, yeah. youre putting up two sided liquidity in a range and collecting swap fees, but you also eat the inventory risk an active MM would hedge out. usually the fees dont cover it, and thats the gap binsight measures

English

0xVanguard@ox_vanguard·3d

@deforestpeg Is this market making?

English

clay@deforestpeg·2d

@karinanguyen the tools already do the inspiring, more than a message could

English

251

Karina@karinanguyen·2d

labs should try harder to inspire esp young people that they can still build generational wealth even when it feels like the world is ending and they are going to eat every startup

English

207

17.7K

clay@deforestpeg·2d

@alexatallah @cclark @OpenRouter from routing tokens to routing airflow

English

133

Alex Atallah@alexatallah·2d

Introducing The Wedge

English

10.3K

clay@deforestpeg·2d

@ollama @NousResearch self generating skills is the right shape, but improves them as you use them is where it gets hard. the question is whether it can tell a skill actually got better or its just confidently rewriting it. self improvement is a verifier problem, not a generation one

English

225

ollama@ollama·2d

Self-learning skills Hermes generates Python skills from natural-language descriptions and improves them as you use them. Start with the 70+ skills it ships with and grow your own library around your real workflows.

English

118

92.7K

ollama@ollama·2d

Use Ollama with Hermes Desktop by @NousResearch. Hermes Desktop brings the same agent (its multi-agent engine, self-improving skills, and messaging integrations) into a desktop app on macOS, Windows, and Linux. Run it on Ollama using local or cloud with one command: ollama launch hermes-desktop 🧵

English

121

897

50.5K

clay@deforestpeg·2d

@enzo_gte the design system port is the real tell, not the speed. rebuilding from the ground up instead of jamming the old language into the new scheme is the model making a structural judgment call, not pattern matching. thats the actual jump everything else is throughput

English

785

enzo@enzo_gte·2d

Ok, we've been running Fable across all of our workstreams today. Pretty clearly this is a hit and likely another jump similar to Opus 4.5. It was able to one-shot a deployment issue that we were throwing swarms of Opus + GPT 5.5 at in one-shot. We're seeing it work on all various diff eng. workstreams (front end, risk engine, devops) and just do things that AIs couldn't before. I was able to port a brand for another co into a brand new design system in like, 3 prompts. And it was able to just translate all the existing assets, product language, and embed it into a completely new design scheme. Opus 4.8 would have just tried to jam the old language into the new design, this fundamentally redesigned things from the ground up. Let's see what the market reaction this is in like 2-3 weeks. There's probably a window where the self contained Twitter bubble goes crazy but the general population doesn't realize it yet. We have to remember that it took about 1.5 months from Opus release for the impacts across coding, tooling, and everything else to be really felt. Fable probably is in 2-3 weeks.

English

326

27.4K

clay@deforestpeg·2d

@_xjdr the on par with k2.6 outside the new claude code features line is the whole story imo. the meaningful delta this release is the harness, not the model. raw capability is converging, the agent tooling around it is where the gap actually is now

English

665

xjdr@_xjdr·2d

i haven't used any Anthropic models since Jan / Feb so i was excited to unleash fable on a bunch of benchmarks and a few of my most complicated repos. so far, it seems like a huge improvement over opus, especially for claude code expert use cases but still not on par with gpt 5.5 xhigh for my specific use cases. in fact, its pretty on par with my fine tuned k2.6 outside of the new claude code features . the areas where it seems to excel are large multi part reviews (it caught a handful of really subtle and complex bugs) and multi-step long running tasks. i kept it away from my research / training and infra code for obvious reasons, so this is 'normal' software dev specific . overall, solid effort and a huge improvement over the most recent opus, but not pushing the frontier in any meaningful ways (at least that i can see so far). i will probably use it for the rest of the day just to be sure and then move back to 80% k2.6 and 20% gpt 5.5 xhigh

English

617

66K

clay@deforestpeg·2d

@EricNewcomer the fatigue is real but the unlock is that for most tasks the model barely matters. pick one default, use it for everything, only switch on the long hard multi step stuff. the per task optimizing costs more time than it saves

English

Eric Newcomer@EricNewcomer·2d

as a consumer it's getting a little exhausting figuring out what model I need for every particular task

English

39.8K

Jelajahi

@sflorimm @businessbarista @ox_vanguard @elonmusk @BarackObama @taylorswift13 @cristiano @BillGates