
djcows (@djcows):
@IterIntellectus it's so joever but also it just started so really life is great
Savaer (@savaerx):
@IterIntellectus METR's task suite being nearly saturated is the real headline here. We're running out of benchmarks before we're running out of capability gains.
Barrak (@BarrakAli):
@IterIntellectus At the top of that chart like it owns the place. The jump from everything else is massive.
TokenPark (@ZhuoSSS):
@IterIntellectus The programmers who create LLMs will be eliminated by LLMs themselves, before LLMs are widely accepted.
LexanderB (@LexanderBrouwer):
@IterIntellectus Might be true in a lab environment, but in private or corporate use I haven't seen it run much beyond 20-25 min. Nevertheless, the take-off of models over the last 3 months is insane; they really have an opinion of their own, and capability is in another league.
tom (@kay1492111):
@IterIntellectus cool graph. I just argued with Opus 4.6 for 2 hours while it consistently provided terrible solutions to a complex feature build in a large enterprise codebase. Great to see these models are improving at benchmark graph tasks, but it truly does not mean shit to me lol
TBOF (@Birdmeister17):
@IterIntellectus Used Opus 4.6 to make a game the other day and was blown away. Did everything as expected, no issues. The project was always adjusted the way I expected with each addition or update. Compared to my experience using other models for games, it was 10-20 times better than all of them.
Brian Cheong (@briancheong):
@IterIntellectus The confidence interval is where METR buried the real story. 6 to 98 hours means the benchmark is saturating, not that the model is somewhere between 6 and 98 hours of capability.
Alta (@404Alta):
@IterIntellectus I'm dumb at this, so @grok explain: is your digital brother better than the others on this graph?
goldenboi (@andthedropout):
@IterIntellectus Can someone explain this benchmark to me? Is that the estimated time the task is supposed to take?
k k (@khalidaxx):
@IterIntellectus I'm just waiting for the Mag7 to cut all coding employees by 80% so it all falls to the bottom line. @grok how many employees does Meta have as coders, and the rest of the Mag7?
Technophile (@Technop54777070):
@IterIntellectus No, what you're seeing is over-saturation of easier tasks in their benchmark suite (I'm sure these guys are training their models on them too). So while yes, the benchmarks show huge improvements, in the real world that is not what's happening.
Lame raypist (@lamebruh123):
@IterIntellectus OK, but is the time horizon even the most important metric? Who cares how long it can run for, if it's the same slop at the end?