gavin leech (Non-Reasoning)

12.3K posts

@gleech

context maximiser @ArbResearch

UK · Joined June 2019
608 Following · 10.3K Followers
gavin leech (Non-Reasoning)
@justanotherlaw I have muted 6000 accounts and now it's great. I think being on Twitter gets me about a year ahead on certain matters like hypothesis crystallisation (personas, linear representations, the evals crisis, max scaffold capabilities, ...)
2 replies · 0 retweets · 10 likes · 117 views
Lawrence Chan (@justanotherlaw)
Can someone... defend the merits of Twitter to me? I feel like every time I come on, I see people that I know to be reasonable and thoughtful people in real life espouse incredibly simplistic (arguably deranged) takes. It seems _something_ about this site is causing this.
8 replies · 0 retweets · 18 likes · 1.8K views
gavin leech (Non-Reasoning)
@aliceisplaying Treadmill is one thing but I think "learning the limits, realising it still can't do things you thought it could" is the bigger morale effect after like 6 weeks post-launch
0 replies · 0 retweets · 3 likes · 102 views
gavin leech (Non-Reasoning) retweeted
alice (@aliceisplaying)
re claude getting worse: there is definitely a hedonic treadmill with SOTA models and i think this creates a perception issue. on top of that ant tweaking the default effort and adding adaptive thinking didn't help either even though i get it, they don't have the compute
4 replies · 2 retweets · 33 likes · 1.6K views
gavin leech (Non-Reasoning) retweeted
Greg Burnham (@GregHBurnham)
More MirrorCode thoughts. A big caveat is that dev tasks don't come with a blackbox implementation. But how much do agents need this? I see at least 1411 calls to gotree in Opus 4.6's successful run. That's a lot, but not *so* much more than what you'd ask of a product manager.
Epoch AI (@EpochAIResearch)

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

1 reply · 1 retweet · 4 likes · 2K views
Jack Crawford (@jackcrawford__)
one of the most overrated games. brutally mogged since birth by Go which existed over a thousand years beforehand. now brutally mogged in different ways by countless video games
rob🏴 (@rob_mcrobberson)

chess is hilarious because its like a bunch of gamers got together and convinced the world that *their* game is “intellectual” and totally different than other games and its not the same as like spending hours a day playing candy crush or something

10 replies · 6 retweets · 89 likes · 5.8K views
gavin leech (Non-Reasoning) retweeted
James Medlock (@jdcmedlock)
My nephews (8-12 y/o) are obsessed with computer games but they only have Chromebooks lent to them by school. All the game sites have been blocked, but they have access to Gemini and realized they could vibecode their own custom platformer games.
15 replies · 30 retweets · 1.4K likes · 51.2K views
gavin leech (Non-Reasoning) retweeted
Davis Brown (@davisbrownr)
In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.
3 replies · 11 retweets · 75 likes · 8.6K views
madeofmistake (@madeofmistak3)
what's a word/expression that was fabulously offensive to say hundreds of years ago but now is completely benign?
20 replies · 0 retweets · 25 likes · 2.3K views
interstice (@an_interstice)
@jackcrawford__ I feel like the verdict is still out, can we even *know* now that there are video games that remain compelling at similar strategic depth? no videogame has yet had such cumulative effort applied to it
3 replies · 0 retweets · 6 likes · 685 views
gavin leech (Non-Reasoning) retweeted
Nate Soares ⏹️
If you start killing in the name of a cause, you make leaders feel like cowards caving to terrorists if they support that cause. Screw that. Those signing a treaty to stop the AI race would be heroes saving the world, and should feel like it. Cut out this violence shit.
20 replies · 26 retweets · 352 likes · 12K views
Sneedle (@SRamirez68083)
@teodorio I feel like everything David Foster Wallace said is undermined by the fact he committed suicide, by this I'm not trying to make a moral judgment against people who commit suicide, but in his specific case it really does seem to be an act of pure incongruence
3 replies · 1 retweet · 5 likes · 359 views