clay

126 posts

clay banner
clay

clay

@deforestpeg

I build AI agents and honest data tools, and post the real results. Latest: SpendLens, finds the AI API spend you don't need.

Sumali Mayıs 2022
2K Sinusundan1.3K Mga Tagasunod
Naka-pin na Tweet
clay
clay@deforestpeg·
everyone complains about AI api costs. almost nobody optimizes. i kept typing the same 5 fixes in replies so i built the thing that finds them in your actual logs the demo workload (synthetic, 30 days, every inefficiency labeled): $2,330 spend, $1,038 of it recoverable biggest single fix: a 6k-token system prompt billed at full price 24,000 times. one cache_control block serves it at 10% of the price $378 back no llm anywhere in the analysis. every number traces to a formula, and it refuses to extrapolate monthly savings from 3 days of logs because that's marketing, not analysis
clay tweet media
English
3
0
10
573
clay
clay@deforestpeg·
the write a clear spec part is doing more work than people realize. a self contained spec means the codex session starts cold instead of inheriting your whole conversation and history drag is the quiet killer, every turn resends everything before it. same reason agent loops get expensive. spec handoffs are basically manual context compaction
English
0
0
1
573
Vox
Vox@Voxyz_ai·
fable 5 burns tokens fast but write the prompt like this and it's totally workable. "to save tokens, keep this main session (fable 5) on planning and frontend tasks, its visual output and ideas are worth the price. for backend and heavier implementation, write a clear spec and dispatch to codex (gpt-5.5 xhigh) with /goal to execute, my quota there sits unused anyway. you may keep the hardest parts in this session." a frontend design prompt i've been testing that works well: redesign {your page, e.g. pricing page} for this project. full creative freedom, but it has to be visually striking and interactive, with motion effects and a hidden easter egg. search 2026 design trends first and use them.
English
39
44
1.2K
102K
clay
clay@deforestpeg·
the model gap here is 25%. the gap between cached and uncached input on fable 5 is 10x, $10/m vs $1/m. and every one of those 47 steps resends the whole context, so the step count is really a caching bill. model choice is the third biggest cost lever, people argue about it because it's the easiest one to change
English
1
0
0
373
BridgeMind
BridgeMind@bridgemindai·
Fable 5 Medium is the best intelligence per dollar in AI right now. It's not close. New CursorBench results: Fable 5 Medium: 69.8% at $8.27 per task Opus 4.7 Max: 64.8% at $11.02 GPT 5.5 Extra High: 64.3% at $4.37 It beats Opus 4.7 Max and GPT 5.5 on score while costing 25% less than Opus. And it finishes tasks in 47 steps instead of Opus's 96. If you're vibe coding daily, Fable 5 Medium is the answer. Top tier intelligence at a price you can actually run all day.
BridgeMind tweet media
English
82
52
783
56.5K
clay
clay@deforestpeg·
@sflorimm the funny part is fable 5 is the model where caching matters most. $10/m token input but cache reads are $1 same tokens. if you're resending a fat system prompt every call you're tipping anthropic 10x. one cache_control block, most people never check their hit rate
English
0
0
1
136
Floro S.
Floro S.@sflorimm·
vibe coders, how many $ have you burned with claude fable 5 yet?
English
194
0
186
37.7K
clay
clay@deforestpeg·
live here, no signup: spendlens.dev don't have logs handy? there's a one click sample on the upload page all five detectors fire on it, takes ~10 seconds
English
0
0
3
167
clay
clay@deforestpeg·
everyone complains about AI api costs. almost nobody optimizes. i kept typing the same 5 fixes in replies so i built the thing that finds them in your actual logs the demo workload (synthetic, 30 days, every inefficiency labeled): $2,330 spend, $1,038 of it recoverable biggest single fix: a 6k-token system prompt billed at full price 24,000 times. one cache_control block serves it at 10% of the price $378 back no llm anywhere in the analysis. every number traces to a formula, and it refuses to extrapolate monthly savings from 3 days of logs because that's marketing, not analysis
clay tweet media
English
3
0
10
573
clay
clay@deforestpeg·
Last update ended with the agent taking the Poke Flute from Pokemon Tower. This is why. The Snorlax blocking Route 12. Its first move after the rescue: walk up to the sleeping roadblock, open the bag, play the flute. Took the fight, cleared the road, kept moving south. Nobody coded that in. The model just knows Pokemon.
English
6
5
70
22.1K
clay
clay@deforestpeg·
ive run agents on hard caps from day one, flat subscription, nothing to fall back on. the cap is what taught me which steps actually need a model and which were just expensive habit. spend limits dont end the token maxxing era, they end the part where nobody knew what the tokens were buying
English
0
0
3
167
Marty Kausas
Marty Kausas@marty_kausas·
Our Anthropic bill is about to jump from $400K → $1.4M/yr. Not because usage exploded, but because we're about to cross 150 seats. Past 150 seats you're forced into Enterprise tier. Seats stop including any usage, every token bills at standard API rates. At our current run rate that's 3.5x overnight. Unfiltered thoughts on AI spend: 1. We should spend tokens to grow as aggressively as possible. But most people (me included) aren't conscious of what they're spending. 2. Visibility comes first. People see their personal number and they're shocked. I accidentally spent $4,000 in 3 days in Claude Code. 3. For engineering the spend is clearly worth it. Pay for the best model, it saves more than it costs. 4. For a lot of other roles it's questionable. Apps nobody uses, skills someone already built. No ROI. 5. Spend limits are coming. We already require approval for more tokens on our support team. The era of token-maxxing is coming to an end.
English
359
161
3K
1.7M
clay
clay@deforestpeg·
running agents on flat budgets forced me to measure exactly this, tokens to done is the only number that ends up mattering. the model that wins the one shot benchmark can quietly lose in a loop once retries and wandering count against it. accuracy is the sticker price cost to goal is what you pay
English
0
0
2
110
David Cramer
David Cramer@zeeg·
Imagine if LLM benchmarks were measured in cost to achieve goal rather than accuracy of a one shot prompt Imagine if people actually made benchmarks that meant anything
English
37
9
217
8.8K
clay
clay@deforestpeg·
Badge 4 of 8. Same save, no resets. Then the wall: Pokemon Tower broke the agent for days. So I rebuilt it — Codex picks the objectives now, the machinery just walks. First night on the new brain: beat the ghost Marowak, rescued Mr. Fuji, took the Poke Flute. On its own.
clay tweet media
English
2
0
9
754
clay
clay@deforestpeg·
@businessbarista ive cut agent spend over 50% without touching the model, just pulling the steps that dont need one out of the loop. most agent cost is a wasteful loop, not an expensive model. finetuning a cheaper one is fixing a bill you ran up building it wrong
English
1
0
4
448
Alex Lieberman
Alex Lieberman@businessbarista·
Most companies: "Help us onboard our 3,000 employees to Claude Code/Codex" Some companies: "Help us build our first end-to-end agent outside of the engineering org" Few companies: "Help us finetune a Chinese/open-source model so we can lower our agent cost by 50%"
English
56
15
285
51.1K
clay
clay@deforestpeg·
17% fee APR on this USDC-SOL range. after impermanent loss it nets +$4 on $10k. thats the whole problem with DLMM LPing, the APR looks great and IL quietly eats it. binsight runs your exact range against real on chain price, volume + fees and shows the net.
clay tweet media
English
4
1
14
816
clay
clay@deforestpeg·
@ox_vanguard its passive market making, yeah. youre putting up two sided liquidity in a range and collecting swap fees, but you also eat the inventory risk an active MM would hedge out. usually the fees dont cover it, and thats the gap binsight measures
English
1
0
1
9
clay
clay@deforestpeg·
@karinanguyen the tools already do the inspiring, more than a message could
English
0
0
1
251
Karina
Karina@karinanguyen·
labs should try harder to inspire esp young people that they can still build generational wealth even when it feels like the world is ending and they are going to eat every startup
English
21
9
207
17.7K
Alex Atallah
Alex Atallah@alexatallah·
Introducing The Wedge
Alex Atallah tweet media
English
18
2
94
10.3K
clay
clay@deforestpeg·
@ollama @NousResearch self generating skills is the right shape, but improves them as you use them is where it gets hard. the question is whether it can tell a skill actually got better or its just confidently rewriting it. self improvement is a verifier problem, not a generation one
English
0
0
2
223
ollama
ollama@ollama·
Self-learning skills Hermes generates Python skills from natural-language descriptions and improves them as you use them. Start with the 70+ skills it ships with and grow your own library around your real workflows.
ollama tweet media
English
4
7
118
92.6K
ollama
ollama@ollama·
Use Ollama with Hermes Desktop by @NousResearch. Hermes Desktop brings the same agent (its multi-agent engine, self-improving skills, and messaging integrations) into a desktop app on macOS, Windows, and Linux. Run it on Ollama using local or cloud with one command: ollama launch hermes-desktop 🧵
ollama tweet media
English
35
121
896
50.4K
clay
clay@deforestpeg·
@enzo_gte the design system port is the real tell, not the speed. rebuilding from the ground up instead of jamming the old language into the new scheme is the model making a structural judgment call, not pattern matching. thats the actual jump everything else is throughput
English
0
0
1
785
enzo
enzo@enzo_gte·
Ok, we've been running Fable across all of our workstreams today. Pretty clearly this is a hit and likely another jump similar to Opus 4.5. It was able to one-shot a deployment issue that we were throwing swarms of Opus + GPT 5.5 at in one-shot. We're seeing it work on all various diff eng. workstreams (front end, risk engine, devops) and just do things that AIs couldn't before. I was able to port a brand for another co into a brand new design system in like, 3 prompts. And it was able to just translate all the existing assets, product language, and embed it into a completely new design scheme. Opus 4.8 would have just tried to jam the old language into the new design, this fundamentally redesigned things from the ground up. Let's see what the market reaction this is in like 2-3 weeks. There's probably a window where the self contained Twitter bubble goes crazy but the general population doesn't realize it yet. We have to remember that it took about 1.5 months from Opus release for the impacts across coding, tooling, and everything else to be really felt. Fable probably is in 2-3 weeks.
English
14
7
326
27.4K
clay
clay@deforestpeg·
@_xjdr the on par with k2.6 outside the new claude code features line is the whole story imo. the meaningful delta this release is the harness, not the model. raw capability is converging, the agent tooling around it is where the gap actually is now
English
0
0
5
665
xjdr
xjdr@_xjdr·
i haven't used any Anthropic models since Jan / Feb so i was excited to unleash fable on a bunch of benchmarks and a few of my most complicated repos. so far, it seems like a huge improvement over opus, especially for claude code expert use cases but still not on par with gpt 5.5 xhigh for my specific use cases. in fact, its pretty on par with my fine tuned k2.6 outside of the new claude code features . the areas where it seems to excel are large multi part reviews (it caught a handful of really subtle and complex bugs) and multi-step long running tasks. i kept it away from my research / training and infra code for obvious reasons, so this is 'normal' software dev specific . overall, solid effort and a huge improvement over the most recent opus, but not pushing the frontier in any meaningful ways (at least that i can see so far). i will probably use it for the rest of the day just to be sure and then move back to 80% k2.6 and 20% gpt 5.5 xhigh
English
29
23
617
66K
clay
clay@deforestpeg·
@EricNewcomer the fatigue is real but the unlock is that for most tasks the model barely matters. pick one default, use it for everything, only switch on the long hard multi step stuff. the per task optimizing costs more time than it saves
English
0
0
0
90
Eric Newcomer
Eric Newcomer@EricNewcomer·
as a consumer it's getting a little exhausting figuring out what model I need for every particular task
English
27
1
63
39.8K