Lance Herron
446 posts
@theLance
Relapsed SWE. Claude whisperer
TX · Joined September 2008
528 Following · 81 Followers
Lance Herron@theLance·
@emollick This benchmark looks like “METR Time Horizon for Cyber”. Not saying concern is unwarranted, but is it really unexpected? Where is the benchmark for how capable Mythos is at fully testing/securing a network?
0 · 0 · 0 · 114
Lance Herron@theLance·
@iruletheworldmo Codex CLI is way behind Claude Code CLI. They just don’t have the shipping velocity of CC. Rust may have been a mistake. Spud may be a great model but it will be hobbled if they don’t fix the harness.
0 · 0 · 0 · 286
🍓🍓🍓@iruletheworldmo·
openai has obviously seen the money claude code is pulling in. they've pivoted hard. codex ain't no side project, it's becoming the main thing. the $100 plan was the first tell. next week is probably where that really starts to show up, they have big plans ^^

my guess: more codex plans, more codex surface area, and maybe the first public taste of the bigger stuff too: spud, new image model, whatever else they've been hiding behind all this vagueposting.

either way i think next week is when codex stops looking like a tool and starts looking like the center of the product. from what im hearing the spud will be incredibly strong. they've made some novel breakthroughs. i'm very excited.
45 · 12 · 432 · 28.9K
Lance Herron@theLance·
@MatthewBerman @beffjezos The best setup is using Opus to drive Codex. You get the prototypical “cracked but socially inept” engineer without having to talk to him.
0 · 0 · 4 · 236
Matthew Berman@MatthewBerman·
@beffjezos Everyone seems to say this but I seem to always go back to Opus 4.6
29 · 0 · 106 · 7.8K
Beff (e/acc)@beffjezos·
When you've been too locked in on Claude and finally try out GPT 5.4 high for a coding task only to realize what you've been missing out on for weeks...
GIF
158 · 61 · 1.8K · 129.5K
Lance Herron@theLance·
@emollick Thought about asking ChatGPT for examples of these, but decided against it. I don't think I want to see the things that can't be unseen.
0 · 0 · 0 · 127
Ethan Mollick@emollick·
Chiasmus (reversing grammatical structures in two sentences for drama). Asyndetic tricolon (three items listed without a conjunction). Parataxis (short and somewhat disconnected dramatic sentences). Same stuff in every post and essay. Once you see it, it is everywhere.
12 · 6 · 92 · 12.3K
Ethan Mollick@emollick·
A lot of our education on writing well focuses on logic, clarity, and argument. AI will force us to think more about style. The boredom that comes from everything on the internet reading Claude-y now, no matter how good the substance is, should make us appreciate variety more.
41 · 43 · 434 · 30.6K
Lance Herron@theLance·
Reminder: Weekends are for cleaning up and culling all the slop-code you generated during the week.
0 · 0 · 0 · 3
Lance Herron@theLance·
@RayFernando1337 For me the Claude app is basically just a remote renderer for Claude Code at this point. Just spin up a few remote-control instances in the morning and avoid the mobile limitations.
0 · 0 · 1 · 313
Ray Fernando@RayFernando1337·
Opus 4.6 Extended chat on iOS is capped at 10k tokens for thinking which makes me burn more tokens for the same task. I’ve noticed the model used to take a lot longer to process my requests and it would do multiple tool calls to get work done the first time. Now I have to keep prompting the model multiple times and I don’t get the same outcome. It feels like the model is dumb because it makes too many tradeoffs and ends up wasting my time.
Ray Fernando tweet media
16 · 6 · 190 · 28.5K
Lance Herron@theLance·
Now I have the full picture.
0 · 0 · 0 · 1
Lance Herron@theLance·
@noahzweben Ok..not so awesome. Now any Bash calls that use sleep error out. Opus doesn't seem smart enough to use Monitor tool (or it's not available yet) so it backgrounds all the polling and churns through thousands MORE tokens. Does not seem well thought out.
0 · 0 · 1 · 13
Lance Herron@theLance·
@noahzweben This is awesome. How do we get visibility into what triggers a turn? Or how do we steer it on what we want to trigger a turn? Some tool call hook shenanigans would be cool here!
1 · 0 · 2 · 1.7K
Noah Zweben@noahzweben·
Thrilled to announce the Monitor tool, which lets Claude create background scripts that wake the agent up when needed. Big token saver and a great way to move away from polling in the agent loop. Claude can now:
* Follow logs for errors
* Poll PRs via script
* and more!
232 · 468 · 6.1K · 1.1M
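The Monitor tool's actual interface isn't shown in the announcement, but the wake-on-event idea it replaces in-loop polling with can be sketched generically. This is a hypothetical standalone watcher, not the real tool API: a background script blocks on a condition (a log pattern, a PR state) and only returns control, i.e. wakes the agent, when it fires.

```python
import re
import time

def wait_for(pattern: str, read_lines, poll_secs: float = 5.0, max_polls: int = 120):
    """Block cheaply until a line matches `pattern`, then return that line.

    `read_lines` is any zero-arg callable returning the latest lines from a
    source (a log file, a `gh pr view` snapshot, etc.).
    Returns None if nothing matched within max_polls checks.
    """
    rx = re.compile(pattern)
    for _ in range(max_polls):
        for line in read_lines():
            if rx.search(line):
                # event found: this is the point where the agent would be woken
                return line
        time.sleep(poll_secs)
    return None
```

Usage, following the "follow logs for errors" example: `wait_for(r"ERROR", lambda: open("app.log").read().splitlines(), poll_secs=30)`. The token saving is that the model isn't in the loop at all while this script waits.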
From Laniakea@FromLaniakea·
@ThePrimeagen *only to select customers **while gpu supply lasts ***terms apply (but we won't tell you which)
1 · 0 · 8 · 790
ThePrimeagen@ThePrimeagen·
mythos is coming
ThePrimeagen tweet media
90 · 30 · 1.6K · 43.3K
Lance Herron@theLance·
@Noahpinion I personally prefer AI takes over admin takes. On the forecasting stuff, I wonder if the respondents were considering scale and tenacity. Millions of average-skill researchers running 24/7 could lead to outcomes similar to their 14% scenario.
1 · 0 · 1 · 527
Lance Herron@theLance·
@steipete I’ve been working on personal projects this way for a few months. Adding one more layer of indirection above the orchestrator (a task/process manager) made it even better.
0 · 1 · 0 · 208
Peter Steinberger 🦞@steipete·
Working on a new QA process where we use openclaw to QA openclaw with a new synthetic message channel. Orchestrator agent understands the project, defines tasks (e.g. ask the agent to create a cron), then verifies that the agent actually did it. If failure -> spin up subagent to analyze+fix.
76 · 30 · 899 · 58.7K
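The orchestrator loop described above (define task -> run agent -> verify -> spin up a fixer on failure) can be sketched abstractly. Everything here is invented scaffolding, not openclaw's real API; `run_agent` stands in for whatever dispatches a prompt to a worker agent.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str                 # what the worker agent is asked to do
    verify: Callable[[], bool]  # orchestrator's independent check that it happened

def run_qa_loop(tasks: list, run_agent: Callable[[str], None], max_fixes: int = 1) -> dict:
    """Run each task, verify the result, and retry via a fix prompt on failure."""
    results = {}
    for task in tasks:
        run_agent(task.prompt)
        ok = task.verify()
        attempts = 0
        while not ok and attempts < max_fixes:
            # failure -> spin up a subagent to analyze and fix, then re-verify
            run_agent(f"Previous attempt failed verification. Analyze and fix: {task.prompt}")
            ok = task.verify()
            attempts += 1
        results[task.prompt] = ok
    return results
```

The key design point from the tweet is that verification is done by the orchestrator against observable state (did the cron actually get created?), not by trusting the worker agent's own report.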
Lance Herron@theLance·
@thdxr @dexhorthy I’m sure we’ll get some nice “Oh sorry we’re working on clarifying the messaging” tweets shortly.
0 · 0 · 0 · 180
dax@thdxr·
@dexhorthy i was a bit annoyed that people were buying the blame shift of "third party harnesses don't call our apis right" if that was the real reason there's like 3 relevant ones and it's easy to work with them to get what you need
9 · 0 · 140 · 8.5K
dex@dexhorthy·
Concerning re: anthropic. The previous narrative just went out the window.

Reports of openclaw usage with the plain sdk (as was supposedly permitted) now being blocked based on system prompt, even if using the claude agent sdk.

I was previously a little to the anthropic side of the spectrum on this because of EXACTLY one argument: "third party harnesses don't use caching properly, can't be controlled with feature flags, etc."

If they are blocking use of the claude agent sdk wholesale in openclaw, then this completely invalidates that argument, and I desire an answer as to what is allowed and why. I am disappointed that the communications thus far have failed to articulate the reasons here, and it does make it harder to trust whatever they say next.

However I will maintain cautious optimism that there is a good explanation for all this beyond the cheap "rug pull" / "evil" / "kill all the startups" jeers.
dex@dexhorthy

like I’ve said a few times, it's well within TOS to do this. They built the model; if they wanna give you inference at pennies on the dollar on the condition that you use their harness, great, they have the right to do this.

On this topic in particular, I don’t understand the “evil” or “rugpull” jeers. There was never any promise to give people cheap inference. Before the claude code max plan we were all paying per token to use this stuff, and we were more or less happy to do it (sure, the VC funding helps).

Every enterprise I know pays per token, because when you use subsidized inference, YOU are the product: “Have some cheap code, in exchange for helping to train the next gen of models.” You can hate on that particular behavior if you want, but nobody is making you take part in that particular market dynamic.

Do I wanna see a world where model companies take some of their massive financial gains and use that to pull everybody up? Of course. I hope it happens some day.

An allegory, perhaps: if a public e-bike company gave you a subscription on rides and you went around ripping out batteries and sticking them in your own bike to ride around town, you’d get banned for that too. Especially if your bike was poorly wired and overloaded the batteries / caused them to flame up, etc. Banning that behavior delivers far better results for the people who were using the system as designed.

52 · 26 · 442 · 144.3K
Lisan al Gaib@scaling01·
Claude Mythos is really slow and stupid today... they nerfed it again, didn't they?
27 · 1 · 363 · 58.8K
Lance Herron@theLance·
@xlr8harder Start codex in tmux. Have CC/opus monitor it using 10 minute sleeps between checks. Tell opus to use capture-pane and send-keys to keep it moving.
0 · 0 · 1 · 306
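In the tip above the supervising agent issues the tmux commands itself; the same loop can be sketched as a standalone script. `capture-pane -p` and `send-keys` are real tmux subcommands, but the stall heuristic (what codex prints when it pauses for reassurance) is invented here and would need tuning:

```python
import subprocess
import time

def capture_cmd(session: str) -> list:
    # `tmux capture-pane -p -t <session>` prints the target pane's contents to stdout
    return ["tmux", "capture-pane", "-p", "-t", session]

def nudge_cmd(session: str, text: str) -> list:
    # `tmux send-keys -t <session> <text> Enter` types text into the pane and submits it
    return ["tmux", "send-keys", "-t", session, text, "Enter"]

def looks_stalled(pane_text: str) -> bool:
    # hypothetical heuristic: codex paused on a confirmation prompt
    return pane_text.rstrip().endswith(("Continue?", "(y/n)"))

def supervise(session: str, interval: int = 600, rounds: int = 6) -> None:
    # check the pane every `interval` seconds (10 minutes by default),
    # nudging codex back into motion whenever it has stopped for input
    for _ in range(rounds):
        pane = subprocess.run(capture_cmd(session), capture_output=True, text=True).stdout
        if looks_stalled(pane):
            subprocess.run(nudge_cmd(session, "yes, keep going"))
        time.sleep(interval)
```

Run codex inside `tmux new -s codex`, then `supervise("codex")` from anywhere; having Opus do this instead just means it runs the same two tmux commands via its Bash tool.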
xlr8harder@xlr8harder·
I've been leaning more on gpt-5.4 in codex than opus in claude code lately. I have come to trust gpt-5.4-high to be more organized and complete in its approach. but how do i get gpt-5.4 to actually keep working without constantly stopping for reassurance? who has tricks here
105 · 4 · 642 · 60.5K
Lance Herron@theLance·
@DKThomp Funny. Direct consequence of GLP1 early adopters paying exorbitant prices while AI early adopters get subsidized.
1 · 0 · 4 · 253
Lance Herron@theLance·
@wireless_dev @thdxr True! Another scenario though: I asked opus to find which bg process was generating a discord message. While I was in another tab it ran through two full 1M context windows grepping logs/procs before it gave up. LLM + harness is so open-ended, no consumer can afford Opus API.
1 · 0 · 1 · 153
wireless@wireless_dev·
Was talking about this with a friend recently. Everyone sees their ccusage and assumes that without these plans we'd be paying 10x more. But this doesn't account for untapped optimizations that people don't even bother with due to subsidization. e.g. I notice that opus is reluctant to do work within subagents unless explicitly prompted to do so. Also more token-optimized tools like agent-browser.
2 · 0 · 12 · 2K
Lance Herron@theLance·
@thdxr @HappyGezim @badlogicgames @bcherny Agree with all that. The nuance is: is the $2000 in your math the serving cost or the Opus advertised cost? Ant is in a tough spot. Demand far outstrips supply, there are customers willing to pay $anything/token. Ant may find taking the quick cash will erode future revenues.
0 · 0 · 2 · 2.4K
dax@thdxr·
they do, but the math here is tricky. let's say they charge $200 a month, and if you ran it to max usage limits you could spend $2000, but in practice, on average across all users, they spent $500. that's probably break even for them internally and allows for a wide range from 0 -> $2000 in usage. but if something causes a shift up in average and it's now $600, they lose $100 per person. they are stuck deciding whether they lower the usage limits for everyone or just try to curb the behavior that's pushing it up.
12 · 5 · 221 · 91K
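dax's arithmetic above fits in two lines. Note the $500 break-even average is his own estimate, not a published figure, and his implicit model treats every list-price dollar of average usage above break-even as a dollar lost per subscriber:

```python
BREAK_EVEN_AVG = 500.0  # dax's estimate: average list-price usage where the $200 plan breaks even

def margin_per_user(avg_usage: float) -> float:
    # positive = profit per subscriber per month, negative = loss;
    # matches "if it's now $600 they lose $100 per person"
    return BREAK_EVEN_AVG - avg_usage
```

So `margin_per_user(500.0)` is 0.0 (break even) and `margin_per_user(600.0)` is -100.0, the loss dax describes, which is why a small shift in average behavior forces a choice between tighter limits and behavioral nudges.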
Mario Zechner@badlogicgames·
btw, @bcherny put up a bunch of prompt cache fixes for openclaw. github.com/openclaw/openc… if you are building custom harnesses, read them. prompt caching is really not hard to understand. but almost all harnesses get it wrong. it's puzzling.
27 · 44 · 719 · 114.7K
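For context on why harnesses get this wrong: in the Anthropic Messages API a cache breakpoint is a `cache_control` marker on a content block, and everything up to that marker must be byte-identical on the next call or the cache misses. A minimal sketch of building such a request body (the model id and prompts are placeholders, and the linked openclaw fixes are not reproduced here):

```python
def build_request(system_prompt: str, history: list, user_msg: str) -> dict:
    """Assemble a Messages API request with a prompt-cache breakpoint.

    The long, stable prefix (system prompt, tool definitions) carries the
    cache_control marker; the mutable conversation tail goes after it.
    Reordering or rewording anything before the marker between calls is
    the classic harness bug that silently defeats caching.
    """
    return {
        "model": "claude-example",  # placeholder model id
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},
            }
        ],
        "messages": history + [{"role": "user", "content": user_msg}],
    }
```

Passing this dict to the API on successive turns keeps the prefix cached as long as `system_prompt` (and anything else before the marker) never changes between calls.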