
djcows (@djcows):
@IterIntellectus it's so joever but also it just started so really life is great
Savaer (@savaerx):
@IterIntellectus METR's task suite being nearly saturated is the real headline here. We're running out of benchmarks before we're running out of capability gains.
Barrak (@BarrakAli):
@IterIntellectus At the top of that chart like it owns the place. The jump from everything else is massive.
TokenPark (@ZhuoSSS):
@IterIntellectus The programmers who create LLMs will be eliminated by LLMs themselves, before LLMs are widely accepted.
LexanderB (@LexanderBrouwer):
@IterIntellectus Might be true in a lab environment, but in private or corporate use I haven't seen it run much beyond 20-25 min. Nevertheless, the take-off of models over the last 3 months is insane; they really have an opinion of their own, and capability is in another league.
tom (@kay1492111):
@IterIntellectus cool graph. I just argued with Opus 4.6 for 2 hours while it consistently provided terrible solutions to a complex feature build in a large enterprise codebase. Great to see these models are improving at benchmark graph tasks, but it truly does not mean shit to me lol
TBOF (@Birdmeister17):
@IterIntellectus Used Opus 4.6 to make a game the other day and was blown away. Did everything as expected, no issues. The project was always adjusted the way I expected with each addition or update. Compared to my experience using other models for games, it was 10-20 times better than all of them.
Brian Cheong (@briancheong):
@IterIntellectus The confidence interval is where METR buried the real story. 6 to 98 hours means the benchmark is saturating, not that the model is somewhere between 6 and 98 hours of capability.
Alta (@404Alta):
@IterIntellectus I'm dumb at this, so @grok explain: is your digital brother better than the others on this graph?
goldenboi (@andthedropout):
@IterIntellectus Can someone explain this benchmark to me? Is that the estimated time the task is supposed to take?
k k (@khalidaxx):
@IterIntellectus I'm just waiting for the Mag7 to cut all coding employees by 80% so it all falls to the bottom line. @grok how many employees does Meta have as coders, and the rest of the Mag7?
Technophile (@Technop54777070):
@IterIntellectus No, what you're seeing is over-saturation of easier tasks in their benchmark suite (I'm sure these guys are training their models on them too). So while yes, the benchmarks show huge improvements, in the real world that is not what's happening.
Lame raypist (@lamebruh123):
@IterIntellectus OK, but is the time horizon even the most important metric? Who cares how long it can run for, if it's the same slop at the end?