Jascha Sohl-Dickstein (@jaschasd)
When AI fails, will it do so by coherently pursuing the wrong goals? Or will it fail the way humans often fail, taking incoherent actions that don't pursue any consistent goal? In other words, will it fail like a “hot mess”?
How will this change when AI performing limited tasks transitions to AGI performing tasks of unbounded complexity? How does misalignment scale with model intelligence and task complexity?
We measure this using a bias-variance decomposition of AI errors.
Bias = consistent, systematic errors (reliably achieving the wrong goal).
Variance = inconsistent, unpredictable errors.
We define "incoherence" as the fraction of total error that comes from variance (see the sketch below).
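As a rough numerical sketch of this decomposition (the setup, variable names, and numbers here are hypothetical, not from the thread): treat an agent's outcomes on a task as noisy samples around a goal, split its mean squared error into bias² (systematic error) and variance (inconsistent error), and take incoherence = variance / (bias² + variance).

```python
import numpy as np

# Hypothetical setup: an agent's outcomes scatter around the intended goal
# with both a systematic offset (bias) and unpredictable noise (variance).
rng = np.random.default_rng(0)

goal = 0.0               # intended target outcome
systematic_offset = 0.5  # consistent pull toward the wrong goal -> bias
noise_scale = 1.0        # unpredictable scatter in behavior -> variance

# Sample many attempts at the task.
outcomes = goal + systematic_offset + noise_scale * rng.standard_normal(100_000)

bias = outcomes.mean() - goal                  # systematic error
variance = outcomes.var()                      # inconsistent error
total_error = np.mean((outcomes - goal) ** 2)  # ≈ bias**2 + variance

# Incoherence: fraction of total error attributable to variance.
incoherence = variance / total_error
print(f"bias^2={bias**2:.3f}  variance={variance:.3f}  incoherence={incoherence:.3f}")
```

With these (made-up) numbers the agent is mostly a "hot mess": roughly 80% of its error is variance rather than a consistently pursued wrong goal.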
I am very excited about this framing, because it characterizes types of misalignment in a way that should be amenable to simple theoretical models and clean scaling laws.