gavin leech (Non-Reasoning)

12.3K posts

@gleech

context maximiser @ArbResearch

UK · Joined June 2019
609 Following · 10.3K Followers
gavin leech (Non-Reasoning) retweeted
campbell 🪄@cambrownwrites·
i thought i liked high school debate but in retrospect i just liked purposely misconstruing philosophy i didn't really understand to make other people annoyed
7
43
710
10.9K
Nora Ammann@AmmannNora·
Has anyone run experiments on whether (and when) Claude is a 1 boxer or a 2 boxer?
2
1
4
508
gavin leech (Non-Reasoning) retweeted
Greg Burnham@GregHBurnham·
More MirrorCode thoughts. A big caveat is that dev tasks don't come with a blackbox implementation. But how much do agents need this? I see at least 1411 calls to gotree in Opus 4.6's successful run. That's a lot, but not *so* much more than what you'd ask of a product manager.
Epoch AI@EpochAIResearch

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

2
1
5
2.4K
gavin leech (Non-Reasoning)
@justanotherlaw I have muted 6000 accounts and now it's great. I think being on Twitter gets me about a year ahead on certain matters like hypothesis crystallisation (personas, linear representations, the evals crisis, max scaffold capabilities, ...)
3
0
59
970
Lawrence Chan@justanotherlaw·
Can someone... defend the merits of Twitter to me? I feel like every time I come on, I see people that I know to be reasonable and thoughtful people in real life espouse incredibly simplistic (arguably deranged) takes. It seems _something_ about this site is causing this.
14
0
31
3.5K
gavin leech (Non-Reasoning)
@aliceisplaying Treadmill is one thing but I think "learning the limits, realising it still can't do things you thought it could" is the bigger morale effect after like 6 weeks post-launch
0
0
8
171
gavin leech (Non-Reasoning) retweeted
alice@aliceisplaying·
re claude getting worse: there is definitely a hedonic treadmill with SOTA models and i think this creates a perception issue. on top of that ant tweaking the default effort and adding adaptive thinking didn't help either even though i get it, they don't have the compute
4
2
40
2.1K
Jack Crawford@jackcrawford__·
one of the most overrated games. brutally mogged since birth by Go which existed over a thousand years beforehand. now brutally mogged in different ways by countless video games
rob🏴@rob_mcrobberson

chess is hilarious because its like a bunch of gamers got together and convinced the world that *their* game is “intellectual” and totally different than other games and its not the same as like spending hours a day playing candy crush or something

10
6
89
5.8K
gavin leech (Non-Reasoning) retweeted
Davis Brown@davisbrownr·
In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.
4
11
76
8.7K
madeofmistake@madeofmistak3·
what's a word/expression that was fabulously offensive to say hundreds of years ago but now is completely benign?
20
0
25
2.3K
interstice@an_interstice·
@jackcrawford__ I feel like the verdict is still out, can we even *know* now that there are video games that remain compelling at similar strategic depth? no videogame has yet had such cumulative effort applied to it
3
0
6
694