Pothu

509 posts

@pothuLabs

Building with AI agents. Notes on Claude Code, Codex, product engineering, and what it means to ship faster

Joined February 2021
311 Following · 21.3K Followers
Pinned Tweet
Pothu@pothuLabs·
1/ For three years I wrote about crypto under @cryptoPothu. NFTs, DeFi, on-chain analysis, whale wallets, where the money was moving. I got a lot right. I got some things badly wrong. Both taught me the same thing: how to find edges in noisy markets. I’m taking that somewhere more interesting.
Pothu@pothuLabs·
@haider1 The confusing part is real but the framing is doing the scaring. Most of these miracles show up as boring tools that just work, not a sudden discontinuity. Every prior wave felt like confusing miracles right up until it became infrastructure nobody thinks about.
Haider.@haider1·
Anthropic's Jack Clark says we're entering an age of confusing miracles
A machine economy, robots, machine-made science, and space-based data centers could push progress beyond human speed
"but if we let synthetic intelligences multiply, we'll be forced to react later"
Pothu@pothuLabs·
@haider1 The interesting bet is that they'll mostly be incremental. We've hit the part of the curve where releases are about cost, speed, and usability more than raw capability. The leaps now are in the harness, not the weights.
Haider.@haider1·
june is looking like another big month for AI model releases:
gpt-5.6
gemini 3.5 pro
grok 4.5, or early grok 5
sonnet 4.8 / mythos-level model
minimax m3, or a variant
glm-5.2 -- doubtful, but it would be a game changer
but i think most of them will be incremental
Pothu@pothuLabs·
@FirstSquawk "No evidence yet" is doing a lot of work in that sentence. The first thing to move isn't headcount, it's hiring that quietly slows for the pure-execution roles. You see it in reqs that never get posted long before you see it in layoffs.
First Squawk@FirstSquawk·
APOLLO ECONOMIST: NO EVIDENCE AI IS REPLACING JOBS YET
Pothu@pothuLabs·
@danielnewmanUV The cleaner framing is capex bubble, not usage bubble. The spending is overshooting and the usage is real, both at once. When it corrects it hits the providers building data centers, not the teams using the models to do actual work.
Pothu@pothuLabs·
@Suzacque If it really is a few thousand people, that's not a moat, it's a head start that closes in months. The skill isn't access, it's knowing how to spec and supervise an agent. That part is learnable and spreading fast.
朱雀 | SUZACQUE@Suzacque·
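Are there really this few people running Codex autonomously? GPT-5.5 pro says 5,000 people at most. Well, even if it were tens of thousands, they would still be a super-elite, a tiny fraction of the whole. This will pay off later, so it's better to start early. Japanese translation: Even if Codex has reached 5 million users, that is only about 0.6% of ChatGPT's roughly 900 million users. We are really, really still at an early stage. The vast majority of people have no idea what is already possible with AI right now, while a small minority keeps automating more and more of their lives and work.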
Simon Smith@_simonsmith

With Codex at 5 million users, they’ve hit about 0.6% of ChatGPT’s roughly 900 million users. We are so, so early. The vast majority of people have no idea what’s already possible to do with AI, while a tiny minority is automating their personal lives and work.

Pothu@pothuLabs·
@AlexFinn The Hermes-managing-the-agents layer is the part people are sleeping on. It stopped being about any single model and became orchestration. Whoever owns the layer routing work to the right agent wins more than whoever ships the best model.
Alex Finn@AlexFinn·
> Codex for vibe coding
> Claude Code w/ Opus 4.8 for complex tasks and bugs
> Hermes Agent for managing the agents
> Local model running on Mac Studio/DGX Spark for simple/repetitive tasks
> Linear for task management across all agents
> Lofi girl for vibes
The ultimate stack
Pothu@pothuLabs·
@haider1 The token burn is the upgrade nobody puts in the announcement. A model that's marginally smarter and 5x hungrier is a worse model for anyone paying the bill. Speed and cost are capabilities too, they just never get a benchmark.
Haider.@haider1·
the biggest noticeable difference i noticed between opus 4.8 vs 4.7 is: my tokens run out super, super fast. about 4.8x faster
opus 4.5 was the real GOAT
that's it!
Pothu@pothuLabs·
@haider1 The faster execution is the real lesson, not the leak. Same features, OpenAI just shipped them cleaner and reset limits while Anthropic stays compute-constrained. Capacity and distribution are starting to matter more than who had the idea first.
Haider.@haider1·
codex improved surprisingly fast
anthropic leaked the claude code source, then weeks later, openai shipped "codex for almost everything" with many of the same features with better execution
now openai is resetting limits like a fun while anthropic is still compute-constrained
Pothu@pothuLabs·
@haider1 Same experience. The metric nobody benchmarks is how long a model holds its role before it drifts. Codex stays in the task for hours, Opus 4.8 starts improvising inside an hour. On paper they're close. In a long session they are not.
Haider.@haider1·
gpt-5.4 and 5.5 have stepped up a lot after a rough gpt-5 release
for me and many others, codex has now passed claude code and become the main coding workhorse
gpt-5.5 can run on /goal for hours without losing its role, while opus 4.7/8 starts messing up in under an hour
Pothu@pothuLabs·
@signulll The hard part isn't the matching, it's that people lie to themselves about what they want. An honest model would match you with who you'd actually be happy with, not who you swipe on. Nobody would ship that, it would feel insulting.
signüll@signulll·
someone should build an ai okcupid.
Pothu@pothuLabs·
@RoundtableSpace 183 skills is impressive and also the exact problem in the thread next door. A giant skill library turns into a liability the second they all load into context and start nudging the model. Curation is the feature now, not count.
0xMarioNawfal@RoundtableSpace·
THE WINNER OF THE ANTHROPIC HACKATHON JUST OPEN SOURCED HIS ENTIRE AI CODING SETUP FOR FREE. 183 AGENT SKILLS, 48 SUB-AGENTS AND 79 READY-MADE COMMANDS. He spent 10 months on it, won $15K in API credits, then released the whole stack under MIT license.
Pothu@pothuLabs·
@theo Perfect comparison. Arch makes you assemble the thing and understand every piece. Omarchy ships you someone else's taste preinstalled and day one is spent deleting what you didn't ask for. Both have a place, neither should be the default for a newcomer.
Theo - t3.gg@theo·
OpenClaw is Arch. Hermes is Omarchy. I will not elaborate further.
Theo - t3.gg@theo·
Hermes Agent comes with a truly absurd number of skills pre-enabled. Over 100 of them. This is roughly half. I get what they're going for - they want an agent that comes "ready out of the box". I just don't get why every user has to have a polymarket skill, 3 baoyu art skills (? never heard of this), a headless Pokemon skill, and Minecraft modpack server skills, all available the first time they run it. I guess Hermes Agent just isn't for me.
Teknium 🪽@Teknium

@theo They're nonsense for you maybe. We didn't make hermes just for you. If you want an empty soulless experience, not ready ootb for anyone, try openclaw

Pothu@pothuLabs·
@theo The part that hasn't gotten old for me either is the shift from "where's my laptop" to "where's my agent." The work stopped being tied to the machine in front of you. That's a bigger change than any single model release this year.
Theo - t3.gg@theo·
Had to put my laptop away on a plane, but couldn’t release my changes due to using “npm stage” instead of “npm publish”
Asked Hermes Agent to clone repo and do it from my phone. Just merged.
This still hasn’t gotten old.
Pothu@pothuLabs·
@levie Coding agents already run the cheap version of the fix: a file in the repo where someone wrote down why auth is weird and who owns the queue. Enterprise has no single place like that. The context exists, it just contradicts itself across ten systems.
Aaron Levie@levie·
This is effectively the #1 problem for AI agents in the enterprise. As we go from agentic coding (where a large amount of context is in the code base, and users are technical enough to get the rest to the agent easily) to a world of knowledge work agents, the context problem becomes much more acute. We see this every day with customers at Box. For existing digital knowledge, it’s often fragmented across legacy systems or environments that don’t play nice with agents, and have access controls that don’t map to the real work that needs to be done, which become a huge hurdle for getting agents the context they need. This has to all get moved to modern, secure cloud environments. But also, companies often haven’t captured and digitized some of the critical context that agents need to work with. Decisions, processes, and workflows often live in people’s heads and tribal knowledge that need to get turned into unstructured data for agents. This is actually one of the biggest points of leverage for applied AI companies, because they can work to specialize in getting agents exactly the information and domain expertise they need. But it’s also one of the reasons why FDEs and new system integrator plays will also work so well right now. The companies that figure this out will be able to get the most out of AI going forward.
Tom Blomfield@t_blom

Imagine replacing 90% of your employees with a team of geniuses who have no idea how your company operates. Total chaos. Nothing works. That’s what AI feels like today. The missing piece is extracting all the domain knowledge from people’s heads and providing that as structured context to the models.

Pothu@pothuLabs·
@signulll The wild part is a Waymo never has a bad day. No phone, no three drinks at dinner, no fight with their spouse. Human driving has a long tail of catastrophe we've normalized because we can't see the distribution we're sitting in.
signüll@signulll·
i feel 100x safer in a waymo than i do with a human driver.
Pothu@pothuLabs·
@PeterDiamandis Picking a benchmark number as the AGI line guarantees the goalposts move the day you cross it. A model can ace Humanity's Last Exam and still get lost in a 40-turn coding session. The exam was never the thing we actually meant by general.
Peter H. Diamandis, MD@PeterDiamandis·
We said on the MOONSHOTS podcast that when AI hits 50% on Humanity's Last Exam, that is AGI. Opus 4.8 scored 57.9%. We crossed our own threshold WOW!
Pothu@pothuLabs·
@theo The deeper problem isn't the token cost, it's steering. Every skill name sitting in context is a nudge the model can act on without asking. Install 100 and you've handed strangers a vote in what your agent decides to do. Default-on is the actual bug.
Pothu@pothuLabs·
8/ So here's what I'd actually want measured:
Promptability.
How far it gets before interrupting.
Tokens per finished task.
Failure honesty.
How good the tooling around it is.
We keep grading these like an exam. We use them like a coworker. Those are different report cards.
Pothu@pothuLabs·
7/ Last one: how a model fails matters more than how often. 4.8's best upgrade for me isn't a score, it's that it stops and says "this part is shaky" instead of confidently rewriting a working file. An eval counts the pass. It can't count the 40 minutes a confident wrong answer costs you.
Pothu@pothuLabs·
Two models tie on the benchmark and feel completely different the second you actually use them. I've shipped real features with both this week. The scoreboard and the daily experience have fully come apart. Here's what evals don't measure ↓