gavin leech (Non-Reasoning)

12.3K posts

@gleech

context maximiser @ArbResearch

UK · Joined June 2019
608 Following · 10.3K Followers
gavin leech (Non-Reasoning)
@aliceisplaying Treadmill is one thing but I think "learning the limits, realising it still can't do things you thought it could" is the bigger morale effect after like 6 weeks post-launch
0
0
1
52
gavin leech (Non-Reasoning) reposted
alice @aliceisplaying
re claude getting worse: there is definitely a hedonic treadmill with SOTA models and i think this creates a perception issue. on top of that ant tweaking the default effort and adding adaptive thinking didn't help either even though i get it, they don't have the compute
5
1
29
1.4K
gavin leech (Non-Reasoning) reposted
Greg Burnham @GregHBurnham
More MirrorCode thoughts. A big caveat is that dev tasks don't come with a blackbox implementation. But how much do agents need this? I see at least 1411 calls to gotree in Opus 4.6's successful run. That's a lot, but not *so* much more than what you'd ask of a product manager.
Epoch AI @EpochAIResearch

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

1
1
3
1.9K
Jack Crawford @jackcrawford__
one of the most overrated games. brutally mogged since birth by Go which existed over a thousand years beforehand. now brutally mogged in different ways by countless video games
rob🏴@rob_mcrobberson

chess is hilarious because its like a bunch of gamers got together and convinced the world that *their* game is “intellectual” and totally different than other games and its not the same as like spending hours a day playing candy crush or something

10
6
88
5.8K
gavin leech (Non-Reasoning) reposted
James Medlock @jdcmedlock
My nephews (8-12 y/o) are obsessed with computer games but they only have Chromebooks lent to them by school. All the game sites have been blocked, but they have access to Gemini and realized they could vibecode their own custom platformer games.
15
30
1.3K
51K
gavin leech (Non-Reasoning) reposted
Davis Brown @davisbrownr
In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.
3
11
75
8.6K
madeofmistake @madeofmistak3
what's a word/expression that was fabulously offensive to say hundreds of years ago but now is completely benign?
20
0
25
2.3K
interstice @an_interstice
@jackcrawford__ I feel like the verdict is still out, can we even *know* now that there are video games that remain compelling at similar strategic depth? no videogame has yet had such cumulative effort applied to it
3
0
6
680
gavin leech (Non-Reasoning) reposted
Nate Soares ⏹️
If you start killing in the name of a cause, you make leaders feel like cowards caving to terrorists if they support that cause. Screw that. Those signing a treaty to stop the AI race would be heroes saving the world, and should feel like it. Cut out this violence shit.
20
26
350
11.8K
Sneedle @SRamirez68083
@teodorio I feel like everything David Foster Wallace said is undermined by the fact he committed suicide, by this I'm not trying to make a moral judgment against people who commit suicide, but in his specific case it really does seem to be an act of pure incongruence
3
1
5
359
gavin leech (Non-Reasoning) reposted
Tom Reed @mentalgeorge
Incredibly fitting that Grok, despite being pretty mediocre at forecasting overall, is the single most useful addition to an ensemble of frontier models forecasting real world events. Overwhelming empirical evidence in favour of keeping your toxic friend in the group chat
12
17
966
51.3K
Tenobrus @tenobrus
@she_llac i have no clue man it's kinda crazy rarely seen anything that bad
2
0
47
4.3K