Alex Hägele

359 posts

@haeggee

PhD Student in ML @ICepfl MLO. MSc/BSc from @ETH_en. Previously: Fellow @AnthropicAI, Student Researcher @Apple MLR.

Lausanne, Switzerland · Joined January 2020
690 Following · 1K Followers
Pinned Tweet
Alex Hägele (@haeggee):
The main project of my time as an @AnthropicAI fellow is finally out: The Hot Mess of AI: How Does Misalignment Scale with Model Intelligence and Task Complexity? w/ great collaborators @aryopg @sleight_henry @EthanJPerez and supervised by @jaschasd! Some personal notes:
Quoting Anthropic (@AnthropicAI):

New Anthropic Fellows research: How does misalignment scale with model intelligence and task complexity? When advanced AI fails, will it do so by pursuing the wrong goals? Or will it fail unpredictably and incoherently—like a "hot mess?" Read more: alignment.anthropic.com/2026/hot-mess-…

3 replies · 12 reposts · 104 likes · 9.7K views
Alex Hägele retweeted
ICML Conference (@icmlconf):
To ensure compliance with peer-review policies, ICML has removed 795 reviews (1% of the total) written by reviewers who used LLMs after explicitly agreeing not to. Consequently, 497 papers (2% of all submissions) by these (reciprocal) reviewers have been desk rejected. Details in the blog post 👇
22 replies · 81 reposts · 608 likes · 226.7K views
Alex Hägele retweeted
ZurichAI (@zurichnlp):
ZurichNLP#20 is on April 1st at the @ETH_AI_Center! Fabian Schaipp (Inria) on recent trends in training algorithms for ML and Valentina Pyatkin (Allen Institute, ETH) on lessons from training open-source LLMs. RSVP below before spots run out!
1 reply · 5 reposts · 18 likes · 1.9K views
Alex Hägele retweeted
Maksym Andriushchenko (@maksym_andr):
Do you think LLM hallucinations are solved? 📢 We introduce HalluHard: a challenging multi-turn, open-ended hallucination benchmark. Even the most recent frontier LLMs like Opus 4.5 with web search hallucinate very frequently on our set of challenging examples.
16 replies · 43 reposts · 237 likes · 24.8K views
Alex Hägele retweeted
Alex Imas (@alexolegimas):
Super interesting paper from @AnthropicAI Fellows Program on model breakdown as task complexity increases. The longer the model has to reason, the more unpredictable it becomes: not consistently wrong, not completely random, just pursuing strange goals that are neither systematically aligned nor misaligned. Reminds me of @keyonV and co.'s human generalization function paper. This research suggests that human beliefs about model performance will be increasingly miscalibrated at longer reasoning lengths.
[Quoting the @AnthropicAI announcement above]
6 replies · 10 reposts · 109 likes · 15.2K views
Alex Hägele retweeted
Andrew Curran (@AndrewCurran_):
New alignment research from Anthropic. 'AI might fail not through systematic misalignment, but through incoherence—unpredictable, self-undermining behavior that doesn't optimize for any consistent objective. That is, AI might fail in the same way that humans often fail, by being a hot mess.'
[Quoting the @AnthropicAI announcement above]
13 replies · 21 reposts · 236 likes · 21.5K views
Alex Hägele retweeted
Ethan Mollick (@emollick):
Anthropic has been releasing an impressive array of papers recently, using a variety of methods, most of which show potential AI issues, rather than just cheerleading about AI. Also, they tend to be very well communicated (with a whiff of Claude about the writing, to be sure).
Quoting Anthropic (@AnthropicAI):

Finding 1: The longer models reason, the more incoherent they become. This holds across every task and model we tested—whether we measure reasoning tokens, agent actions, or optimizer steps.

44 replies · 96 reposts · 1.3K likes · 138.3K views
Alex Hägele retweeted
Teortaxes▶️ (DeepSeek 推特🐋铁粉 2023 – ∞)
Very interesting research, and I don't say that often about Anthropic safety work. Thanks to Jascha Sohl-Dickstein I guess? "Paperclip optimizer" is a hypothetical risk; it's quite hard to get smarter AIs to optimize coherently at all. Omohundro drives? More like "ADHD drift".
[Quoting the @AnthropicAI announcement above]
4 replies · 8 reposts · 76 likes · 9.1K views
Alex Hägele retweeted
Lisan al Gaib (@scaling01):
Not surprising but still interesting. Also, what a great example: "Our results are evidence that future AI failures may look more like industrial accidents than coherent pursuit of goals that were not trained for. (Think: the AI intends to run the nuclear power plant, but gets distracted reading French poetry, and there is a meltdown.)"
[Quoting the @AnthropicAI announcement above]
4 replies · 5 reposts · 74 likes · 9K views
Alex Hägele retweeted
Jascha Sohl-Dickstein (@jaschasd):
When AI fails, will it do so by coherently pursuing the wrong goals? Or will it fail the way humans often fail, and take incoherent actions that don't pursue any consistent goal? In other words, like a "hot mess"?

How will this change when AI performing limited tasks transitions to AGI performing tasks of unbounded complexity? How does misalignment scale with model intelligence and task complexity?

We measure this using a bias-variance decomposition of AI errors. Bias = consistent, systematic errors (reliably achieving the wrong goal). Variance = inconsistent, unpredictable errors. We define "incoherence" as the fraction of error from variance.

I am very excited about this framing, because it characterizes types of misalignment in a way that should be amenable to simple theoretical models and clean scaling laws.
3 replies · 7 reposts · 116 likes · 8.8K views
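A minimal sketch of the bias-variance reading of "incoherence" described in the tweet above, assuming incoherence is computed as the variance share of squared error over repeated attempts at the same task; the paper's exact error metric and aggregation may differ, and the function and example values here are purely illustrative:

```python
import numpy as np

def incoherence(outcomes, target):
    """Decompose the squared error of repeated attempts at one task into
    bias^2 (consistent, systematic error) and variance (inconsistent,
    unpredictable error), and return the fraction of error from variance.

    outcomes : scalar scores from independent attempts at the same task
    target   : the intended (aligned) outcome on the same scale
    """
    outcomes = np.asarray(outcomes, dtype=float)
    bias_sq = (outcomes.mean() - target) ** 2   # reliably achieving the wrong goal
    variance = outcomes.var()                   # scatter across attempts
    total_error = bias_sq + variance            # mean squared error vs. target
    return variance / total_error if total_error > 0 else 0.0

# Coherent but misaligned: reliably lands at the wrong value (high bias, low variance)
print(incoherence([2.0, 2.1, 1.9, 2.0], target=0.0))   # ~0.0

# "Hot mess": right on average but all over the place (low bias, high variance)
print(incoherence([3.0, -2.5, 0.5, -1.0], target=0.0)) # 1.0
```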
Alex Hägele retweeted
Anthropic (@AnthropicAI):
New Anthropic Fellows research: How does misalignment scale with model intelligence and task complexity? When advanced AI fails, will it do so by pursuing the wrong goals? Or will it fail unpredictably and incoherently—like a "hot mess?" Read more: alignment.anthropic.com/2026/hot-mess-…
154 replies · 219 reposts · 1.9K likes · 525.9K views
Alex Hägele retweeted
George Grigorev (@iamgrigorev):
Nice paper that explains how to properly pick the schedule and lr for WSD compared to cosine. 1) The optimal lr for WSD is 51% of cosine's (roughly matches the 55% rule of thumb I always used). 2) lr should scale with 1/sqrt(T), where T is the number of steps; lower the lr if you train longer. This also means that if we start the cooldown sooner, we should use a slightly higher lr during the constant stage. 3) The sudden drop during cooldown is most pronounced if the gradient norms do not go to zero. arxiv.org/abs/2501.18965
3 replies · 14 reposts · 141 likes · 11.2K views
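A rough sketch of the heuristics in the tweet above, assuming a simple warmup-stable-decay (WSD) schedule with a linear cooldown; `base_lr_cosine`, `T_ref`, and the step counts are placeholder values, and the 0.51 factor and 1/sqrt(T) scaling are the rules of thumb quoted in the tweet, not exact prescriptions from the paper:

```python
import math

def wsd_lr(step, total_steps, peak_lr, warmup_steps, cooldown_steps):
    """Warmup-Stable-Decay schedule: linear warmup, constant ("stable") stage,
    then linear cooldown to zero over the final `cooldown_steps` steps."""
    if step < warmup_steps:                        # linear warmup
        return peak_lr * (step + 1) / warmup_steps
    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:                      # constant stage
        return peak_lr
    remaining = total_steps - step                 # linear cooldown
    return peak_lr * remaining / cooldown_steps

# Heuristics from the tweet:
# 1) WSD peak lr ~ 51% of the tuned cosine peak lr
# 2) peak lr should scale like 1/sqrt(T): train 4x longer -> halve the lr
base_lr_cosine = 3e-3        # placeholder: tuned cosine peak lr at T_ref steps
T_ref, T = 10_000, 40_000    # placeholder step counts
peak_lr = 0.51 * base_lr_cosine * math.sqrt(T_ref / T)

schedule = [wsd_lr(s, T, peak_lr, warmup_steps=1_000, cooldown_steps=int(0.2 * T))
            for s in range(T)]
```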