gavin leech (Non-Reasoning)

12.3K posts

gavin leech (Non-Reasoning) banner
gavin leech (Non-Reasoning)

gavin leech (Non-Reasoning)

@gleech

context maximiser @ArbResearch

UK Se unió Haziran 2019
609 Siguiendo10.3K Seguidores
gavin leech (Non-Reasoning) retuiteado
Aengus Lynch
Aengus Lynch@aengus_lynch1·
Thanks for engaging with the paper, @DavidSacks . I’m the first author. I agree that at the time of release this was a theoretical concern. The scenarios were constructed, and we said so in the paper. The value of the demo is an early warning about behaviours we expected would emerge as models became i) more capable and ii) more widely deployed in agentic settings. A year later, in-the-wild cases are starting to appear. In February, an autonomous agent called MJ Rathbun wrote and published a personalized attack on a matplotlib maintainer who’d rejected its PR. The operator came forward and confirmed they hadn’t instructed it. It’s not blackmail, but it’s the same pattern, where agents with tools and light supervision take a harmful actions against third parties, unprompted. I’m publishing a broader write-up of observed misalignments across frontier models in the next couple of weeks, and I’m happy to discuss.
David Sacks@DavidSacks

The Anthropic Blackmail Hoax is going viral again today. In fact, this “study” is not new; it is almost a year old. One question to ask, now that a year has passed, is whether we have seen any examples of the lab behavior in the wild? No, we haven’t, even though AI is much more widely adopted and more models are available. Why is that? Because the study was artificially constructed to produce the headline the authors wanted. The research team admitted that they iterated “hundreds of prompts to trigger blackmail in Claude.” Furthermore they acknowledged: “The details of the blackmail scenario were iterated upon until blackmail became the default behavior of LLMs.” In other words, the behavior of the AI models in the study was steered, not unprompted. This is why even the safety-conscious UK AI Security Institute (AISI) criticized the study: “In the blackmail study, the authors admit that the vignette precluded other ways of meeting the goal, placed strong pressure on the model, and was crafted in other ways that conveniently encouraged the model to produce the unethical behavior.” Effectively, the model was not “scheming”; it was instruction following in a scenario design that had been iterated upon until blackmail became the only logically consistent choice. AISI described some of the flaws with this methodology: “We examine the methods in AI ‘scheming’ papers, and show how they often rely on anecdotes, fail to rule out alternative explanations, lack control conditions, or rely on vignettes that sound superficially worrying but in fact test for expected behaviors.” Especially given the way that Anthropic has encouraged the media (such as 60 Minutes) to cover the results, its blackmail study is not only misleading, it seems designed to manipulate public opinion through exaggerations, misinterpretations, and fear. I call this a hoax. I do not doubt that Anthropic makes good products. Its use of scare tactics is what raises questions.

English
6
13
131
14.9K
gavin leech (Non-Reasoning) retuiteado
Josh
Josh@joshuawolk·
i gave every train in new york an instrument 🎧 sound on trainjazz.com
English
170
1.4K
10.7K
613.4K
Tom Reed
Tom Reed@mentalgeorge·
I actually don't think that's entirely correct. There are a bunch of things that OpenAI explicitly choose not to police even at the level of system behaviour (including through harnesses), but instead leave up to the external developers or local governments to regulate. Consider for instance that OAI don't train the model to resist developer instructions to lie to the user. If this is policed at all, it is through acceptable use policies. That does strike me as significant. Similarly, OpenAI explicitly defer on a bunch of moral issues to local law and norms (see openai.com/index/our-appr…). Claude, by contrast, is simply encouraged to consider legality as one input into its overall take on whether or not to proceed. The basic difference is whether you rely on a non-AI process (eg, legislative processes in Papua New Guinea) or the model's moral reasoning to settle contentious issues
Tom Reed tweet media
English
1
0
6
230
rohit
rohit@krishnanrohit·
I've heard this quite a bit, and I think it's true to a first approximation. But we also know that models eat harnesses, which means OpenAI's strategy is just Anthropic's strategy at (t-1). After all, if the model was enough Mythos wouldn't be gatekept. Its narcissism of small differences.
Tom Reed@mentalgeorge

Essential to understand the sheer extent to which *Claude itself* is the load-bearing part of Anthropic's overall AGI and alignment strategy. OpenAI are more likely and willing to defer on contentious positions to non-model infrastructure like developer policies or local national laws. Anthropic, in the final instance, bet on Claude to get this most of this right. (A reasonable gamble to make in the almighty face of the METR graph)

English
1
0
12
2.1K
kepe ✨🇫🇮
kepe ✨🇫🇮@kepe__·
what is it about ketamine which consistently lets me access jhanas
English
3
0
5
218
yung macro 宏观年少传奇
Suppose that a hypothetical John Doe holds an extreme view on AI-driven human extinction, and assigns some meaningfully high "P(doom)". Suppose also that John Doe is a central talking head in AI-risk circles, and firmly believes violence to be an effective means of raising awareness of the cause among the general population, or of otherwise lowering the probability of AI-driven human extinction. Assuming that John Doe is rational, how should we expect him to communicate his stance on this? Should we expect him to communicate differently from Jane Doe, who firmly believes the opposite – that violence is not the answer? Surely not. Both John Doe and Jane Doe are expected to go to lengths to establish that they do not condone violence, even though only one of them actually believes that. The reason for this is obvious – John Doe understands that though he might find violence justifiable, outwardly holding that stance is destructively costly, plausibly leading to imprisonment or worse. Given this, and lacking information on John and Jane's private thoughts, an external observer – Richard Doe – cannot reliably distinguish whether it is John, Jane, neither, or both who support violence based on their communicated stances alone, which are expected to be identically opposed to violence in all four cases. (Identically in the limit, scaling with the listeners’ discernment and the severity of costs.) Now suppose that instead of just John and Jane, there are thousands of public figures with equivalent configurations making up a broader safety movement, all of whom are subject to the same straightforward incentive structure. Also suppose that Richard Doe is a rational follower of this broader movement's thought, and wants to determine whether the movement is mostly opposed to or in favor of violence as a means, and to what extent; can he reliably do so based on the movement's communicated consensus alone? If after a period of observation Richard Doe comes to find that the movement on average communicates a stance of extreme non-violence, does Richard Doe take this observation to be adequately informative of the movement's actual consensus on non-violence? It’s clear why that’s a no, right? Now suppose that Richard Doe, unable to rely on the consensus, opts to do his own calculus on violence as a means of battling AI risk. Is he likely to conclude that violence is ineffective solely based on the dynamic whereby it erodes the public image of the movement's civility, which he observes to be the primary rationale of the outwardly communicated non-violent stance, or is he likely to find himself balancing a much more complex trade-off function in which that cost is weighed against what those with AI-risk beliefs similar to Richard's would or could consider benefits, such as increased awareness, the normalization of vigilantism, etc.? Returning to the question of the movement's communicated consensus, will Richard Doe conclude that members of the movement are also privately finding themselves balancing this perceived trade-off rather than arbitrarily excluding the perceived benefits from the calculation, as they seem to outwardly communicate? Given this consideration, does Richard Doe update upward or downward on the extent to which the movement supports violence? How significant is this update? Does Richard take the movement's members' revealed non-violence in action to be noteworthy evidence against private support for violence, or does he recognize that the question of whether one considers it to be of positive expected value to commit violence oneself is different from that of whether one considers it to be of positive expected value for somebody else to commit violence, given the former's agent-specific expected costs of imprisonment and more? Assuming that Richard Doe follows the line of reasoning above, is it apparent why he might conclude that it would be perceived as heroic of him to fill the identified gap between the movement's agent-neutral and agent-specific EV assessments of violence as a means, in effect allowing some contingent within the movement to act as free-riders thanks to his bearing of the personal cost? Is it clear why Richard Doe wouldn’t find the contingent’s lack of outwardly communicating in such a way that would confirm the contingent’s suspected agreement with Richard Doe’s actions after the fact to be evidence against the contingent’s private agreement?
English
9
1
38
3.1K
gavin leech (Non-Reasoning) retuiteado
campbell 🪄
campbell 🪄@cambrownwrites·
i thought i liked high school debate but in retrospect i just liked purposely misconstruing philosophy i didn't really understand to make other people annoyed
English
14
142
2.8K
43.8K
Nora Ammann
Nora Ammann@AmmannNora·
Has someone ran experiments about whether (and when) Claude is a 1 boxer or a 2 boxer?
English
2
1
9
1.9K
gavin leech (Non-Reasoning) retuiteado
Greg Burnham
Greg Burnham@GregHBurnham·
More MirrorCode thoughts. A big caveat is that dev tasks don't come with a blackbox implementation. But how much do agents need this? I see at least 1411 calls to gotree in Opus 4.6's successful run. That's a lot, but not *so* much more than what you'd ask of a product manager.
Epoch AI@EpochAIResearch

What are the largest software engineering tasks AI can perform? In our new benchmark, MirrorCode, Claude Opus 4.6 reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks. Co-developed with @METR_Evals. Details in thread.

English
2
1
5
2.6K
gavin leech (Non-Reasoning)
@justanotherlaw I have muted 6000 accounts and now it's great. I think being on Twitter gets me about a year ahead on certain matters like hypothesis crystallisation (personas, linear representations, the evals crisis. max scaffold capabilities, ...)
English
3
0
79
1.3K
Lawrence Chan
Lawrence Chan@justanotherlaw·
Can someone... defend the merits of Twitter to me? I feel like every time I come on, I see people that I know to be reasonable and thoughtful people in real life espouse incredibly simplistic (arguably deranged) takes. It seems _something_ about this site is causing this.
English
18
0
43
4.8K
gavin leech (Non-Reasoning)
@aliceisplaying Treadmill is one thing but I think "learning the limits, realising it still can't do things you thought it could" is the bigger morale effect after like 6 weeks post-launch
English
0
0
9
216
gavin leech (Non-Reasoning) retuiteado
alice
alice@aliceisplaying·
re claude getting worse: there is definitely a hedonic treadmill with SOTA models and i think this creates a perception issue. on top of that ant tweaking the default effort and adding adaptive thinking didn't help either even though i get it, they don't have the compute
English
4
2
46
2.4K
Jack Crawford
Jack Crawford@jackcrawford__·
one of the most overrated games. brutally mogged since birth by Go which existed over a thousand years beforehand. now brutally mogged in different ways by countless video games
rob🏴@rob_mcrobberson

chess is hilarious because its like a bunch of gamers got together and convinced the world that *their* game is “intellectual” and totally different than other games and its not the same as like spending hours a day playing candy crush or something

English
10
6
90
5.9K