gavin leech (Non-Reasoning)

12.3K posts

@gleech

context maximiser @ArbResearch

UK · Joined June 2019

609 Following · 10.3K Followers

Pinned Tweet

gavin leech (Non-Reasoning)@gleech·27 Jun

essays

gavin leech (Non-Reasoning) retweeted

campbell 🪄@cambrownwrites·19h

i thought i liked high school debate but in retrospect i just liked purposely misconstruing philosophy i didn't really understand to make other people annoyed

710

10.9K

gavin leech (Non-Reasoning)@gleech·48m

@AmmannNora arxiv.org/pdf/2411.10588

gavin leech (Non-Reasoning)@gleech·48m

@AmmannNora As of June 2024, RL tended to push towards EDT

162

Nora Ammann@AmmannNora·2h

Has someone run experiments about whether (and when) Claude is a 1 boxer or a 2 boxer?

508

gavin leech (Non-Reasoning)@gleech·2h

@GregHBurnham any data on wall-clock times?

131

gavin leech (Non-Reasoning) retweeted

Greg Burnham@GregHBurnham·18h

More MirrorCode thoughts. A big caveat is that dev tasks don't come with a blackbox implementation. But how much do agents need this? I see at least 1411 calls to gotree in Opus 4.6's successful run. That's a lot, but not *so* much more than what you'd ask of a product manager.

Epoch AI@EpochAIResearch

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

2.4K

gavin leech (Non-Reasoning)@gleech·4h

@justanotherlaw (you probably get this benefit from geography though)

135

gavin leech (Non-Reasoning)@gleech·4h

@justanotherlaw I have muted 6000 accounts and now it's great. I think being on Twitter gets me about a year ahead on certain matters like hypothesis crystallisation (personas, linear representations, the evals crisis, max scaffold capabilities, ...)

970

Lawrence Chan@justanotherlaw·11h

Can someone... defend the merits of Twitter to me? I feel like every time I come on, I see people that I know to be reasonable and thoughtful people in real life espouse incredibly simplistic (arguably deranged) takes. It seems _something_ about this site is causing this.

3.5K

gavin leech (Non-Reasoning)@gleech·6h

@JimDMiller newyorker.com/magazine/2026/…

248

gavin leech (Non-Reasoning)@gleech·6h

@JimDMiller "the enhanced games", or, the honest games x.com/damianplayer/s…

Damian Player@damianplayer

THIS IS WILD! Peter Thiel’s company the “Enhanced Games” got valued at $1.2B before a single event. the first one is next month. here’s what the headlines aren’t telling you (share this): every athlete is monitored. every compound is clinically approved. every dose is tracked. two independent medical commissions oversee the whole thing. and if your bloodwork doesn’t pass, you don’t compete. the same investors behind the biggest peptide and longevity companies put $1.2B behind this. these aren’t sports guys… they’re taking a public bet that performance medicine becomes a real market. whether you’re into it or not, pay attention.

449

gavin leech (Non-Reasoning)@gleech·25 Sep

thread of real life cyberpunk

496

152.5K

gavin leech (Non-Reasoning)@gleech·7h

@aliceisplaying Treadmill is one thing but I think "learning the limits, realising it still can't do things you thought it could" is the bigger morale effect after like 6 weeks post-launch

171

gavin leech (Non-Reasoning) retweeted

alice@aliceisplaying·14h

re claude getting worse: there is definitely a hedonic treadmill with SOTA models and i think this creates a perception issue. on top of that ant tweaking the default effort and adding adaptive thinking didn't help either even though i get it, they don't have the compute

2.1K

gavin leech (Non-Reasoning)@gleech·1d

@jackcrawford__ @an_interstice Bots? Good point, but in some fraction of those something is collecting data that does explore strategy

Jack Crawford@jackcrawford__·1d

@gleech @an_interstice unclear how fortnite-player-hours convert to man-hours

Jack Crawford@jackcrawford__·1d

one of the most overrated games. brutally mogged since birth by Go which existed over a thousand years beforehand. now brutally mogged in different ways by countless video games

rob🏴@rob_mcrobberson

chess is hilarious because it's like a bunch of gamers got together and convinced the world that *their* game is “intellectual” and totally different than other games and it's not the same as like spending hours a day playing candy crush or something

5.8K

gavin leech (Non-Reasoning) retweeted

emily@emily_for_now·1d

one can easily feel low agency despite having a background like this if they were raised to uncritically climb societal ladders. this is why i feel more proud of a janky woodworking piece i designed and built than my entire 10 year career in tech

felpix@felpix_

if this is what low agency at 25 means, i am at negative levels of agency

102

5.3K

gavin leech (Non-Reasoning)@gleech·1d

another philosophical view obsoleted by technological change

Taijitu Observer@taijitu_sees

>Be me, read Zhuangzi
>Read about the chef skillfully cutting through the joints of an ox to avoid the bones that would dull his knife
>Go to Chinese deli, get chicken
>They pull out an enormous cleaver and cube the whole thing like bones don't even exist
What.

2.2K

gavin leech (Non-Reasoning)@gleech·1d

@ByrneHobart Well I mean none of the 1976 senators are in charge either

218

Byrne Hobart@ByrneHobart·2d

It’s a cool quote that becomes instantly self-refuting when you realize that you could say it today and get a positive reaction, but you’d have to use a completely different list of companies because those guys weren’t in charge after all.

Jonas Čeika@Jonas_Ceika

INCREDIBLE quote from the 1976 movie The Network

294

177.3K

gavin leech (Non-Reasoning) retweeted

Davis Brown@davisbrownr·1d

In new work, we find that cheating on model capability evaluations is rampant. For example, the top 3 Terminal-Bench 2 submissions all cheat, usually by sneaking the correct answer to the model. Blog linked below.

8.7K

gavin leech (Non-Reasoning)@gleech·1d

@madeofmistak3 "damn"

174

madeofmistake@madeofmistak3·1d

what's a word/expression that was fabulously offensive to say hundreds of years ago but now is completely benign?

2.3K

gavin leech (Non-Reasoning)@gleech·1d

@an_interstice @jackcrawford__ I think it's pretty close, 200B hours for Fortnite and maybe <600B for chess themultiplicity.ai/room/e90ffd42-…

interstice@an_interstice·1d

@jackcrawford__ I feel like the verdict is still out, can we even *know* now that there are video games that remain compelling at similar strategic depth? no videogame has yet had such cumulative effort applied to it

694

gavin leech (Non-Reasoning)@gleech·1d

Among the many stupid things about this graph: it's not doing purchasing power adjustment, which makes e.g. Japan leap up 1.6x

Matthew Yglesias@mattyglesias

This is probably the topic where the people who are beauty-pilled about architecture have the strongest argument. Nobody is going to believe that quality of life is higher in West Virginia than Italy and the nature of the built environment is a key reason why.

1.2K
