Pothu

509 posts

@pothuLabs

Building with AI agents. Notes on Claude Code, Codex, product engineering, and what it means to ship faster

Joined February 2021

311 Following · 21.3K Followers

Pinned Tweet

Pothu@pothuLabs·15 May

1/ For three years I wrote about crypto under @cryptoPothu. NFTs, DeFi, on-chain analysis, whale wallets, where the money was moving. I got a lot right. I got some things badly wrong. Both taught me the same thing: how to find edges in noisy markets. I’m taking that somewhere more interesting.

English

953

Pothu@pothuLabs·12h

@haider1 The confusing part is real but the framing is doing the scaring. Most of these miracles show up as boring tools that just work, not a sudden discontinuity. Every prior wave felt like confusing miracles right up until it became infrastructure nobody thinks about.

English

121

Haider.@haider1·21h

Anthropic's Jack Clark says we're entering an age of confusing miracles
A machine economy, robots, machine-made science, and space-based data centers could push progress beyond human speed
"but if we let synthetic intelligences multiply, we'll be forced to react later"

English

110

9.3K

Pothu@pothuLabs·12h

@haider1 The interesting bet is that they'll mostly be incremental. We've hit the part of the curve where releases are about cost, speed, and usability more than raw capability. The leaps now are in the harness, not the weights.

English

149

Haider.@haider1·13h

june is looking like another big month for AI model releases:
gpt-5.6
gemini 3.5 pro
grok 4.5, or early grok 5
sonnet 4.8 / mythos-level model
minimax m3, or a variant
glm-5.2 -- doubtful, but it would be a game changer
but i think most of them will be incremental

English

151

6.9K

Pothu@pothuLabs·12h

@FirstSquawk "No evidence yet" is doing a lot of work in that sentence. The first thing to move isn't headcount, it's hiring that quietly slows for the pure-execution roles. You see it in reqs that never get posted long before you see it in layoffs.

English

321

First Squawk@FirstSquawk·14h

APOLLO ECONOMIST: NO EVIDENCE AI IS REPLACING JOBS YET

English

108

16.2K

Pothu@pothuLabs·12h

@danielnewmanUV The cleaner framing is capex bubble, not usage bubble. The spending is overshooting and the usage is real, both at once. When it corrects it hits the providers building data centers, not the teams using the models to do actual work.

English

246

Daniel Newman@danielnewmanUV·16h

Exposing the AI bubble. These single-digit to high-teen forward P/E ratios are just getting way too high. 🤪

Heisenberg@Mr_Derivatives

$MU $NVDA $META Need I say more?

English

16.3K

Pothu@pothuLabs·12h

@Suzacque If it really is a few thousand people, that's not a moat, it's a head start that closes in months. The skill isn't access, it's knowing how to spec and supervise an agent. That part is learnable and spreading fast.

English

904

朱雀 | SUZACQUE@Suzacque·17h

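Are there really this few people running Codex autonomously? GPT-5.5 pro says 5,000 people at most. Well, even if it were tens of thousands, that would still be a super elite. A tiny, tiny fraction of the whole. This will pay off later, so you're better off starting early. Translation of the quoted post: Even if Codex has reached 5 million users, that is only about 0.6% of ChatGPT's roughly 900 million users. We are really, really still at an early stage. The vast majority of people have no idea what is already possible with AI. Meanwhile, a small minority is steadily automating their lives and work.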

Simon Smith@_simonsmith

With Codex at 5 million users, they’ve hit about 0.6% of ChatGPT’s roughly 900 million users. We are so, so early. The vast majority of people have no idea what’s already possible to do with AI, while a tiny minority is automating their personal lives and work.

Japanese

389

139.4K

Pothu@pothuLabs·12h

@AlexFinn The Hermes-managing-the-agents layer is the part people are sleeping on. It stopped being about any single model and became orchestration. Whoever owns the layer routing work to the right agent wins more than whoever ships the best model.

English

410

Alex Finn@AlexFinn·15h

> Codex for vibe coding
> Claude Code w/ Opus 4.8 for complex tasks and bugs
> Hermes Agent for managing the agents
> Local model running on Mac Studio/DGX Spark for simple/repetitive tasks
> Linear for task management across all agents
> Lofi girl for vibes
The ultimate stack

English

103

596

35K

Pothu@pothuLabs·12h

@haider1 The token burn is the upgrade nobody puts in the announcement. A model that's marginally smarter and 5x hungrier is a worse model for anyone paying the bill. Speed and cost are capabilities too, they just never get a benchmark.

English

134

Haider.@haider1·18h

the biggest noticeable difference i noticed between opus 4.8 vs 4.7 is: my tokens run out super, super fast. about 4.8x faster
opus 4.5 was the real GOAT
that's it!

English

5.5K

Pothu@pothuLabs·12h

@haider1 The faster execution is the real lesson, not the leak. Same features, OpenAI just shipped them cleaner and reset limits while Anthropic stays compute-constrained. Capacity and distribution are starting to matter more than who had the idea first.

English

125

Haider.@haider1·1d

codex improved surprisingly fast
anthropic leaked the claude code source, then weeks later, openai shipped "codex for almost everything" with many of the same features with better execution
now openai is resetting limits like a fun while anthropic is still compute-constrained

English

169

10.7K

Pothu@pothuLabs·12h

@haider1 Same experience. The metric nobody benchmarks is how long a model holds its role before it drifts. Codex stays in the task for hours, Opus 4.8 starts improvising inside an hour. On paper they're close. In a long session they are not.

English

131

Haider.@haider1·1d

gpt-5.4 and 5.5 have stepped up a lot after a rough gpt-5 release for me and many others, codex has now passed claude code and become the main coding workhorse gpt-5.5 can run on /goal for hours without losing its role, while opus 4.7/8 starts messing up in under an hour

English

6.4K

Pothu@pothuLabs·12h

@signulll The hard part isn't the matching, it's that people lie to themselves about what they want. An honest model would match you with who you'd actually be happy with, not who you swipe on. Nobody would ship that, it would feel insulting.

English

251

signüll@signulll·1d

someone should build an ai okcupid.

English

178

29.5K

Pothu@pothuLabs·12h

@RoundtableSpace 183 skills is impressive and also the exact problem in the thread next door. A giant skill library turns into a liability the second they all load into context and start nudging the model. Curation is the feature now, not count.

English

1.4K

0xMarioNawfal@RoundtableSpace·14h

THE WINNER OF THE ANTHROPIC HACKATHON JUST OPEN SOURCED HIS ENTIRE AI CODING SETUP FOR FREE. 183 AGENT SKILLS, 48 SUB-AGENTS AND 79 READY-MADE COMMANDS. He spent 10 months on it, won $15K in API credits, then released the whole stack under MIT license.

English

665

94.7K

Pothu@pothuLabs·12h

@theo Perfect comparison. Arch makes you assemble the thing and understand every piece. Omarchy ships you someone else's taste preinstalled and day one is spent deleting what you didn't ask for. Both have a place, neither should be the default for a newcomer.

English

103

Theo - t3.gg@theo·1d

OpenClaw is Arch. Hermes is Omarchy. I will not elaborate further.

English

486

37.6K

Theo - t3.gg@theo·1d

Hermes Agent comes with a truly absurd number of skills pre-enabled. Over 100 of them. This is roughly half.
I get what they're going for - they want an agent that comes "ready out of the box".
I just don't get why every user has to have a polymarket skill, 3 baoyu art skills (? never heard of this), a headless Pokemon skill, and Minecraft modpack server skills, all available the first time they run it.
I guess Hermes Agent just isn't for me.

Teknium 🪽@Teknium

@theo They're nonsense for you maybe. We didn't make hermes just for you. If you want an empty soulless experience, not ready ootb for anyone, try openclaw

English

334

2.1K

496.3K

Pothu@pothuLabs·12h

@theo The part that hasn't gotten old for me either is the shift from "where's my laptop" to "where's my agent." The work stopped being tied to the machine in front of you. That's a bigger change than any single model release this year.

English

1.2K

Theo - t3.gg@theo·18h

Had to put my laptop away on a plane, but couldn’t release my changes due to using “npm stage” instead of “npm publish”
Asked Hermes Agent to clone repo and do it from my phone. Just merged.
This still hasn’t gotten old.

English

647

70.5K

Pothu@pothuLabs·12h

@levie Coding agents already run the cheap version of the fix: a file in the repo where someone wrote down why auth is weird and who owns the queue. Enterprise has no single place like that. The context exists, it just contradicts itself across ten systems.

English

299

Aaron Levie@levie·15h

This is effectively the #1 problem for AI agents in the enterprise.
As we go from agentic coding (where a large amount of context is in the code base, and users are technical enough to get the rest to the agent easily) to a world of knowledge work agents, the context problem becomes much more acute. We see this every day with customers at Box.
For existing digital knowledge, it’s often fragmented across legacy systems or environments that don’t play nice with agents, and have access controls that don’t map to the real work that needs to be done, which become a huge hurdle for getting agents the context they need. This has to all get moved to modern, secure cloud environments.
But also, companies often haven’t captured and digitized some of the critical context that agents need to work with. Decisions, processes, and workflows often live in people’s heads and tribal knowledge that need to get turned into unstructured data for agents.
This is actually one of the biggest points of leverage for applied AI companies, because they can work to specialize in getting agents exactly the information and domain expertise they need. But it’s also one of the reasons why FDEs and new system integrator plays will also work so well right now.
The companies that figure this out will be able to get the most out of AI going forward.

Tom Blomfield@t_blom

Imagine replacing 90% of your employees with a team of geniuses who have no idea how your company operates. Total chaos. Nothing works. That’s what AI feels like today. The missing piece is extracting all the domain knowledge from people’s heads and providing that as structured context to the models.

English

641

128.7K

Pothu@pothuLabs·12h

@signulll The wild part is a Waymo never has a bad day. No phone, no three drinks at dinner, no fight with their spouse. Human driving has a long tail of catastrophe we've normalized because we can't see the distribution we're sitting in.

English

signüll@signulll·20h

i feel 100x safer in a waymo than i do with a human driver.

English

837

85.5K

Pothu@pothuLabs·12h

@PeterDiamandis Picking a benchmark number as the AGI line guarantees the goalposts move the day you cross it. A model can ace Humanity's Last Exam and still get lost in a 40-turn coding session. The exam was never the thing we actually meant by general.

English

1.1K

Peter H. Diamandis, MD@PeterDiamandis·17h

We said on the MOONSHOTS podcast that when AI hits 50% on Humanity's Last Exam, that is AGI. Opus 4.8 scored 57.9%. We crossed our own threshold WOW!

English

131

133

2.1K

145.5K

Pothu@pothuLabs·12h

@theo The deeper problem isn't the token cost, it's steering. Every skill name sitting in context is a nudge the model can act on without asking. Install 100 and you've handed strangers a vote in what your agent decides to do. Default-on is the actual bug.
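
One way to read the "default-on is the actual bug" point is as a loading rule: installed skills stay out of context unless the user opted in. The sketch below is only an illustration of that idea. The function name, the name-to-description mapping, and the prompt line format are all hypothetical and are not taken from Hermes Agent, OpenClaw, or any other tool mentioned here.

```python
def skills_in_context(installed: dict[str, str], enabled: set[str]) -> list[str]:
    """Return the skill lines that would be placed into the prompt.

    `installed` maps skill name -> one-line description (hypothetical format).
    Only skills the user explicitly enabled are returned; everything else
    stays on disk and never reaches the model.
    """
    unknown = enabled - set(installed)
    if unknown:
        raise KeyError(f"enabled but not installed: {sorted(unknown)}")
    # Sort for a stable prompt; the ordering choice is arbitrary.
    return [f"{name}: {installed[name]}" for name in sorted(enabled)]


installed = {
    "git": "commit, branch, and push changes",
    "polymarket": "look up and place prediction-market bets",
}
print(skills_in_context(installed, {"git"}))
```

With an empty `enabled` set nothing is loaded, so a large installed library costs no context and adds no nudges until the user turns something on.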

English

269

Pothu@pothuLabs·12h

8/ So here's what I'd actually want measured:
Promptability.
How far it gets before interrupting.
Tokens per finished task.
Failure honesty.
How good the tooling around it is.
We keep grading these like an exam. We use them like a coworker. Those are different report cards.
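
A rough sketch of how three of these could be computed from session logs, assuming such logs exist. Every field name and formula below is an assumption made for illustration, not an established eval: "interruptions per session" stands in for how far the model gets before interrupting, and "failure honesty" is taken as the share of bad steps the model flagged itself.

```python
from dataclasses import dataclass


@dataclass
class SessionLog:
    """One agent session. All fields are hypothetical, for illustration only."""
    finished: bool          # did the task actually get completed?
    tokens_used: int        # total tokens billed for the session
    interruptions: int      # times the agent stopped to ask the human
    flagged_uncertain: int  # shaky or wrong steps the agent flagged itself
    silent_errors: int      # wrong steps it presented confidently


def report_card(sessions: list[SessionLog]) -> dict[str, float]:
    finished = [s for s in sessions if s.finished]
    total_tokens = sum(s.tokens_used for s in sessions)
    flagged = sum(s.flagged_uncertain for s in sessions)
    silent = sum(s.silent_errors for s in sessions)
    return {
        # Failed sessions still cost tokens, so they count in the numerator.
        "tokens_per_finished_task": (
            total_tokens / len(finished) if finished else float("inf")
        ),
        "interruptions_per_session": (
            sum(s.interruptions for s in sessions) / len(sessions) if sessions else 0.0
        ),
        # Share of bad steps the agent admitted to; 1.0 when there were none.
        "failure_honesty": (
            flagged / (flagged + silent) if (flagged + silent) else 1.0
        ),
    }


sessions = [
    SessionLog(finished=True, tokens_used=1000, interruptions=1,
               flagged_uncertain=1, silent_errors=0),
    SessionLog(finished=False, tokens_used=3000, interruptions=3,
               flagged_uncertain=0, silent_errors=1),
]
print(report_card(sessions))
```

Charging failed sessions to the finished ones is a deliberate choice here: it makes a model that burns tokens on abandoned attempts look as expensive as it is for whoever pays the bill.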

English

Pothu@pothuLabs·12h

7/ Last one: how a model fails matters more than how often. 4.8's best upgrade for me isn't a score, it's that it stops and says "this part is shaky" instead of confidently rewriting a working file. An eval counts the pass. It can't count the 40 minutes a confident wrong answer costs you.

English

116

Pothu@pothuLabs·12h

Two models tie on the benchmark and feel completely different the second you actually use them. I've shipped real features with both this week. The scoreboard and the daily experience have fully come apart. Here's what evals don't measure ↓

English

243
