gavin leech (Non-Reasoning)

12.3K posts

@gleech

context maximiser @ArbResearch

UK · Joined June 2019

609 Following · 10.3K Followers

Pinned Tweet

gavin leech (Non-Reasoning)@gleech·27 Jun

essays

gavin leech (Non-Reasoning) retweeted

Aengus Lynch@aengus_lynch1·9h

Thanks for engaging with the paper, @DavidSacks. I’m the first author. I agree that at the time of release this was a theoretical concern. The scenarios were constructed, and we said so in the paper. The value of the demo is an early warning about behaviours we expected would emerge as models became i) more capable and ii) more widely deployed in agentic settings. A year later, in-the-wild cases are starting to appear. In February, an autonomous agent called MJ Rathbun wrote and published a personalized attack on a matplotlib maintainer who’d rejected its PR. The operator came forward and confirmed they hadn’t instructed it. It’s not blackmail, but it’s the same pattern, where agents with tools and light supervision take harmful actions against third parties, unprompted. I’m publishing a broader write-up of observed misalignments across frontier models in the next couple of weeks, and I’m happy to discuss.

David Sacks@DavidSacks

The Anthropic Blackmail Hoax is going viral again today. In fact, this “study” is not new; it is almost a year old. One question to ask, now that a year has passed, is whether we have seen any examples of the lab behavior in the wild? No, we haven’t, even though AI is much more widely adopted and more models are available. Why is that? Because the study was artificially constructed to produce the headline the authors wanted. The research team admitted that they iterated “hundreds of prompts to trigger blackmail in Claude.” Furthermore they acknowledged: “The details of the blackmail scenario were iterated upon until blackmail became the default behavior of LLMs.” In other words, the behavior of the AI models in the study was steered, not unprompted. This is why even the safety-conscious UK AI Security Institute (AISI) criticized the study: “In the blackmail study, the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behavior.” Effectively, the model was not “scheming”; it was instruction following in a scenario design that had been iterated upon until blackmail became the only logically consistent choice. AISI described some of the flaws with this methodology: “We examine the methods in AI ‘scheming’ papers, and show how they often rely on anecdotes, fail to rule out alternative explanations, lack control conditions, or rely on vignettes that sound superficially worrying but in fact test for expected behaviors.” Especially given the way that Anthropic has encouraged the media (such as 60 Minutes) to cover the results, its blackmail study is not only misleading, it seems designed to manipulate public opinion through exaggerations, misinterpretations, and fear. I call this a hoax. I do not doubt that Anthropic makes good products. Its use of scare tactics is what raises questions.

gavin leech (Non-Reasoning) retweeted

Josh@joshuawolk·2d

i gave every train in new york an instrument 🎧 sound on trainjazz.com

gavin leech (Non-Reasoning)@gleech·7h

@mentalgeorge @krishnanrohit see this paper for the resounding "success" of this approach arxiv.org/abs/2604.06233

Tom Reed@mentalgeorge·7h

I actually don't think that's entirely correct. There are a bunch of things that OpenAI explicitly choose not to police even at the level of system behaviour (including through harnesses), but instead leave up to the external developers or local governments to regulate. Consider for instance that OAI don't train the model to resist developer instructions to lie to the user. If this is policed at all, it is through acceptable use policies. That does strike me as significant. Similarly, OpenAI explicitly defer on a bunch of moral issues to local law and norms (see openai.com/index/our-appr…). Claude, by contrast, is simply encouraged to consider legality as one input into its overall take on whether or not to proceed. The basic difference is whether you rely on a non-AI process (eg, legislative processes in Papua New Guinea) or the model's moral reasoning to settle contentious issues

rohit@krishnanrohit·7h

I've heard this quite a bit, and I think it's true to a first approximation. But we also know that models eat harnesses, which means OpenAI's strategy is just Anthropic's strategy at (t-1). After all, if the model was enough, Mythos wouldn't be gatekept. It's narcissism of small differences.

Tom Reed@mentalgeorge

Essential to understand the sheer extent to which *Claude itself* is the load-bearing part of Anthropic's overall AGI and alignment strategy. OpenAI are more likely and willing to defer on contentious positions to non-model infrastructure like developer policies or local national laws. Anthropic, in the final instance, bet on Claude to get most of this right. (A reasonable gamble to make in the almighty face of the METR graph)

gavin leech (Non-Reasoning)@gleech·7h

@kepe__ "check you're not fooling yourself"

kepe ✨🇫🇮@kepe__·8h

@gleech is there a kernel of wisdom i'm not seeing in here

kepe ✨🇫🇮@kepe__·1d

what is it about ketamine which consistently lets me access jhanas

gavin leech (Non-Reasoning)@gleech·7h

@apralky Jane can report violent schemes to the authorities x.com/DavidSKrueger/…

David Krueger 🦥 ⏸️ ⏹️ ⏪@DavidSKrueger

I denounce violent attacks on AI researchers or politicians, such as the recent Molotov cocktail thrown at Sam Altman and the bullets fired into the house of a local councilman supporting datacenter development. Last year, I personally called AI companies to warn their security teams about Sam Kirchner (former leader of Stop AI, who have always had a policy of NONviolence and whose event I spoke at earlier that year) when he disappeared after indicating potential violent intentions against OpenAI. Besides this one incident, 100% of the calls for violence I have seen have been from CRITICS of AI existential safety, who claim "if you were REALLY worried about AI killing everyone, you'd be taking up arms". This is not a realistic way to stop AI. Terrorism against AI supporters would backfire in many ways. It would help critics discredit the movement, be used to justify government crackdowns on dissent, and lead to AI being securitized, making public oversight and international cooperation much harder.

yung macro 宏观年少传奇@apralky·8h

Suppose that a hypothetical John Doe holds an extreme view on AI-driven human extinction, and assigns some meaningfully high "P(doom)". Suppose also that John Doe is a central talking head in AI-risk circles, and firmly believes violence to be an effective means of raising awareness of the cause among the general population, or of otherwise lowering the probability of AI-driven human extinction. Assuming that John Doe is rational, how should we expect him to communicate his stance on this? Should we expect him to communicate differently from Jane Doe, who firmly believes the opposite – that violence is not the answer? Surely not. Both John Doe and Jane Doe are expected to go to lengths to establish that they do not condone violence, even though only one of them actually believes that. The reason for this is obvious – John Doe understands that though he might find violence justifiable, outwardly holding that stance is destructively costly, plausibly leading to imprisonment or worse. Given this, and lacking information on John and Jane's private thoughts, an external observer – Richard Doe – cannot reliably distinguish whether it is John, Jane, neither, or both who support violence based on their communicated stances alone, which are expected to be identically opposed to violence in all four cases. (Identically in the limit, scaling with the listeners’ discernment and the severity of costs.) Now suppose that instead of just John and Jane, there are thousands of public figures with equivalent configurations making up a broader safety movement, all of whom are subject to the same straightforward incentive structure. Also suppose that Richard Doe is a rational follower of this broader movement's thought, and wants to determine whether the movement is mostly opposed to or in favor of violence as a means, and to what extent; can he reliably do so based on the movement's communicated consensus alone? 
If after a period of observation Richard Doe comes to find that the movement on average communicates a stance of extreme non-violence, does Richard Doe take this observation to be adequately informative of the movement's actual consensus on non-violence? It’s clear why that’s a no, right? Now suppose that Richard Doe, unable to rely on the consensus, opts to do his own calculus on violence as a means of battling AI risk. Is he likely to conclude that violence is ineffective solely based on the dynamic whereby it erodes the public image of the movement's civility, which he observes to be the primary rationale of the outwardly communicated non-violent stance, or is he likely to find himself balancing a much more complex trade-off function in which that cost is weighed against what those with AI-risk beliefs similar to Richard's would or could consider benefits, such as increased awareness, the normalization of vigilantism, etc.? Returning to the question of the movement's communicated consensus, will Richard Doe conclude that members of the movement are also privately finding themselves balancing this perceived trade-off rather than arbitrarily excluding the perceived benefits from the calculation, as they seem to outwardly communicate? Given this consideration, does Richard Doe update upward or downward on the extent to which the movement supports violence? How significant is this update? Does Richard take the movement's members' revealed non-violence in action to be noteworthy evidence against private support for violence, or does he recognize that the question of whether one considers it to be of positive expected value to commit violence oneself is different from that of whether one considers it to be of positive expected value for somebody else to commit violence, given the former's agent-specific expected costs of imprisonment and more? 
Assuming that Richard Doe follows the line of reasoning above, is it apparent why he might conclude that it would be perceived as heroic of him to fill the identified gap between the movement's agent-neutral and agent-specific EV assessments of violence as a means, in effect allowing some contingent within the movement to act as free-riders thanks to his bearing of the personal cost? Is it clear why Richard Doe wouldn’t find the contingent’s lack of outwardly communicating in such a way that would confirm the contingent’s suspected agreement with Richard Doe’s actions after the fact to be evidence against the contingent’s private agreement?

gavin leech (Non-Reasoning) retweeted

Planes of implication@JPCaron1·1d

The problem with a Wittgenstein quotes Twitter account is that, although one might think he's the best philosopher to do that with, because of the aphoristic style many of these assertions are the views that are criticized in the neighboring paragraphs and not Wittgenstein's own.

Ludwig Wittgenstein | Philosopher 📖@Wittgensteinnn

"I act with complete certainty. But this certainty is my own."

gavin leech (Non-Reasoning) retweeted

campbell 🪄@cambrownwrites·1d

i thought i liked high school debate but in retrospect i just liked purposely misconstruing philosophy i didn't really understand to make other people annoyed

gavin leech (Non-Reasoning)@gleech·9h

@AmmannNora arxiv.org/pdf/2411.10588

gavin leech (Non-Reasoning)@gleech·9h

@AmmannNora As of June 2024, RL tended to push towards EDT

Nora Ammann@AmmannNora·11h

Has someone run experiments about whether (and when) Claude is a 1 boxer or a 2 boxer?

gavin leech (Non-Reasoning)@gleech·10h

@GregHBurnham any data on wall-clock times?

gavin leech (Non-Reasoning) retweeted

Greg Burnham@GregHBurnham·1d

More MirrorCode thoughts. A big caveat is that dev tasks don't come with a blackbox implementation. But how much do agents need this? I see at least 1411 calls to gotree in Opus 4.6's successful run. That's a lot, but not *so* much more than what you'd ask of a product manager.

Epoch AI@EpochAIResearch

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

gavin leech (Non-Reasoning)@gleech·13h

@justanotherlaw (you probably get this benefit from geography though)

gavin leech (Non-Reasoning)@gleech·13h

@justanotherlaw I have muted 6000 accounts and now it's great. I think being on Twitter gets me about a year ahead on certain matters like hypothesis crystallisation (personas, linear representations, the evals crisis, max scaffold capabilities, ...)

Lawrence Chan@justanotherlaw·20h

Can someone... defend the merits of Twitter to me? I feel like every time I come on, I see people that I know to be reasonable and thoughtful people in real life espouse incredibly simplistic (arguably deranged) takes. It seems _something_ about this site is causing this.

gavin leech (Non-Reasoning)@gleech·15h

@JimDMiller newyorker.com/magazine/2026/…

gavin leech (Non-Reasoning)@gleech·15h

@JimDMiller "the enhanced games", or, the honest games x.com/damianplayer/s…

Damian Player@damianplayer

THIS IS WILD! Peter Thiel’s company the “Enhanced Games” got valued at $1.2B before a single event. the first one is next month. here’s what the headlines aren’t telling you (share this): every athlete is monitored. every compound is clinically approved. every dose is tracked. two independent medical commissions oversee the whole thing. and if your bloodwork doesn’t pass, you don’t compete. the same investors behind the biggest peptide and longevity companies put $1.2B behind this. these aren’t sports guys… they’re taking a public bet that performance medicine becomes a real market. whether you’re into it or not, pay attention.

gavin leech (Non-Reasoning)@gleech·25 Sep

thread of real life cyberpunk

gavin leech (Non-Reasoning)@gleech·15h

@aliceisplaying Treadmill is one thing but I think "learning the limits, realising it still can't do things you thought it could" is the bigger morale effect after like 6 weeks post-launch

gavin leech (Non-Reasoning) retweeted

alice@aliceisplaying·23h

re claude getting worse: there is definitely a hedonic treadmill with SOTA models and i think this creates a perception issue. on top of that ant tweaking the default effort and adding adaptive thinking didn't help either even though i get it, they don't have the compute

gavin leech (Non-Reasoning)@gleech·1d

@jackcrawford__ @an_interstice Bots? Good point, but in some fraction of those something is collecting data that does explore strategy

Jack Crawford@jackcrawford__·1d

@gleech @an_interstice unclear how fortnite-player-hours convert to man-hours

Jack Crawford@jackcrawford__·2d

one of the most overrated games. brutally mogged since birth by Go which existed over a thousand years beforehand. now brutally mogged in different ways by countless video games

rob🏴@rob_mcrobberson

chess is hilarious because its like a bunch of gamers got together and convinced the world that *their* game is “intellectual” and totally different than other games and its not the same as like spending hours a day playing candy crush or something

gavin leech (Non-Reasoning) retweeted

emily@emily_for_now·2d

one can easily feel low agency despite having a background like this if they were raised to uncritically climb societal ladders. this is why i feel more proud of a janky woodworking piece i designed and built than my entire 10 year career in tech

felpix@felpix_

if this is what low agency at 25 means, i am at negative levels of agency
