Cem Anil
@cem__anil

553 posts

Machine learning / AI Safety at @AnthropicAI and University of Toronto / Vector Institute. Prev. @google (Blueshift Team) and @nvidia.

Toronto, Ontario · Joined November 2018
1.6K Following · 4.3K Followers
Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
AI assistants like Claude can seem shockingly human—expressing joy or distress, and using anthropomorphic language to describe themselves. Why? In a new post we describe a theory that explains why AIs act like humans: the persona selection model. anthropic.com/research/perso…
338 replies · 424 reposts · 3.6K likes · 984.2K views

Cem Anil retweeted
Claude
Claude@claudeai·
Introducing Claude Opus 4.6. Our smartest model got an upgrade. Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. It’s also our first Opus-class model with 1M token context in beta.
1.7K replies · 4.8K reposts · 39.3K likes · 10.5M views

Cem Anil retweeted
Jascha Sohl-Dickstein
Jascha Sohl-Dickstein@jaschasd·
When AI fails, will it do so by coherently pursuing the wrong goals? Or will it fail the way humans often fail, and take incoherent actions that don't pursue any consistent goal? In other words, like a "hot mess"?

How will this change when AI performing limited tasks transitions to AGI performing tasks of unbounded complexity? How does misalignment scale with model intelligence and task complexity?

We measure this using a bias-variance decomposition of AI errors. Bias = consistent, systematic errors (reliably achieving the wrong goal). Variance = inconsistent, unpredictable errors. We define "incoherence" as the fraction of error from variance.

I am very excited about this framing, because it characterizes types of misalignment in a way that should be amenable to simple theoretical models and clean scaling laws.
[image]
3 replies · 7 reposts · 116 likes · 8.8K views
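
A rough reading of the decomposition described above, under squared error: bias is the systematic offset from the intended outcome, variance is the run-to-run inconsistency, and incoherence is the share of total error contributed by variance. The sketch below is only a toy illustration of that definition; the scalar-outcome framing, the function name, and the sample numbers are assumptions, not code or data from the paper.

```python
import numpy as np

def incoherence(outcomes, target):
    """Toy bias-variance decomposition of repeated attempts at a single task.

    Under squared error, total error = bias^2 + variance; "incoherence" is
    defined here, following the tweet above, as the fraction of error that
    comes from variance.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    bias = outcomes.mean() - target      # consistent, systematic error
    variance = outcomes.var()            # inconsistent, unpredictable error
    total = bias**2 + variance
    return variance / total if total > 0 else 0.0

# Illustrative numbers only (not from the paper):
print(incoherence([3.0, 3.1, 2.9, 3.0], target=0.0))    # coherently wrong -> near 0
print(incoherence([-4.0, 5.0, 0.5, -1.5], target=0.0))  # "hot mess"       -> near 1
```
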
Cem Anil retweeted
Johannes Treutlein
Johannes Treutlein@j_treutlein·
Can pre-deployment auditing catch a model that's trying to sabotage Anthropic? We trained three overt saboteurs and ran a blind auditing game to find out. Result: The human auditor working together with an automated auditing agent was able to catch all three models.
[image]
3 replies · 6 reposts · 54 likes · 13.6K views

Cem Anil retweeted
Owain Evans
Owain Evans@OwainEvans_UK·
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
[image]
24 replies · 79 reposts · 596 likes · 99.1K views
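
The tweet above doesn't describe how the Activation Oracles are built or trained, but the interface it describes, a model answering natural-language questions about its own hidden activations, can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the choice of GPT-2, the layer index, and splicing the activation in as a single soft token. A base model without the paper's training will not produce meaningful answers; this only shows the plumbing.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins only; requires a recent transformers version for
# generation from inputs_embeds.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

# 1. Capture an internal activation from an ordinary forward pass.
src = tok("The Eiffel Tower is in Paris.", return_tensors="pt")
with torch.no_grad():
    act = model(**src).hidden_states[6][0, -1]   # layer-6 activation, last token

# 2. Splice that activation into a new prompt as an extra input embedding
#    and ask the model a question about it in natural language.
question = tok("What is the preceding activation about? Answer:", return_tensors="pt")
q_embeds = model.get_input_embeddings()(question.input_ids)
inputs_embeds = torch.cat([act[None, None, :], q_embeds], dim=1)

with torch.no_grad():
    out = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=12,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))  # meaningful only after oracle training
```
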
Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
New research from Anthropic Fellows Program: Selective GradienT Masking (SGTM). We study how to train models so that high-risk knowledge (e.g. about dangerous weapons) is isolated in a small, separate set of parameters that can be removed without broadly affecting the model.
[image]
75 replies · 144 reposts · 1.5K likes · 214.3K views

Cem Anil retweeted
Igor Shilov
Igor Shilov@_igorshilov·
New Anthropic research! We study how to train models so that high-risk capabilities live in a small, separate set of parameters, allowing clean capability removal when needed – for example in CBRN or cybersecurity domains.
[image]
32 replies · 111 reposts · 1.1K likes · 143.3K views
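
The two tweets above describe the goal rather than the recipe, but the name Selective Gradient Masking suggests the core mechanic: route gradient updates from flagged high-risk data into a small reserved set of parameters, and mask that set out of updates from everything else, so the reserved parameters can later be deleted cleanly. The sketch below illustrates that routing only; the toy architecture, the choice of reserved branch, and the masking rule are assumptions, not the method from the paper.

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """Toy model with a small, removable side branch reserved for high-risk knowledge."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
        self.risky_branch = nn.Linear(32, 2)   # the reserved, removable parameters

    def forward(self, x):
        return self.backbone(x) + self.risky_branch(x)

model = ToyNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
reserved = {id(p) for p in model.risky_branch.parameters()}

def masked_step(x, y, high_risk: bool):
    """One training step with gradients routed by batch type (illustration only)."""
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    for p in model.parameters():
        # High-risk batches update only the reserved branch; ordinary batches
        # are masked out of it, so risky knowledge stays isolated there.
        if high_risk != (id(p) in reserved):
            p.grad = None
    opt.step()

def remove_high_risk_knowledge():
    """Capability removal then amounts to zeroing just the reserved branch."""
    with torch.no_grad():
        for p in model.risky_branch.parameters():
            p.zero_()

# Example usage with random data:
x, y = torch.randn(8, 32), torch.randint(0, 2, (8,))
masked_step(x, y, high_risk=False)
masked_step(x, y, high_risk=True)
remove_high_risk_knowledge()
```
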
Cem Anil retweeted
Sam Bowman
Sam Bowman@sleepinyourhat·
From everything we know so far, Opus 4.5 seems to be the best-aligned model out there in a bunch of ways. I follow the training process closely as part of my work on alignment evaluations. Here's my guess about the two things that are most responsible for making 4.5 special. 🧵
20 replies · 47 reposts · 604 likes · 255.1K views

Cem Anil retweeted
Roy Rinberg
Roy Rinberg@RoyRinberg·
LLMs know how to make a bioweapon! 🦠 But, when we *un*learn harmful knowledge about anthrax, the model underperforms on related concepts. It is difficult to detect what “ripple effects” a model-edit has on an LLM. We found a solution🧵👇(Spotlight at MechInterp Workshop)
[image]
1 reply · 10 reposts · 30 likes · 4.4K views

Cem Anil retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
One analysis from our pre-release audit of Opus 4.5 stands out to me. Our behavioral evals uncovered an example of apparent deception by the model. By analyzing the internal activations, we identified a suspected root cause, and cases of similar behavior during training. (1/7)
25 replies · 52 reposts · 406 likes · 54.2K views

Cem Anil retweeted
Claude
Claude@claudeai·
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
[image]
1.1K replies · 2.4K reposts · 19.1K likes · 7.8M views

Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
216 replies · 577 reposts · 4.1K likes · 2.4M views

Cem Anil retweeted
Jan Leike
Jan Leike@janleike·
New alignment paper with one of the most interesting generalization findings I've seen so far: If your model learns to hack on coding tasks, this can lead to broad misalignment.
[image]
Anthropic@AnthropicAI

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

30 replies · 40 reposts · 603 likes · 81.1K views

Cem Anil retweeted
Jack Merullo
Jack Merullo@jack_merullo_·
How is memorized data stored in a model? We disentangle MLP weights in LMs and ViTs into rank-1 components based on their curvature in the loss, and find representational signatures of both generalizing structure and memorized training data
[image]
8 replies · 63 reposts · 506 likes · 46.6K views
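
For intuition about "rank-1 components scored by their curvature in the loss" in the tweet above: one crude stand-in is to split a weight matrix into rank-1 pieces with an SVD and estimate the loss curvature along each piece with a Hessian-vector product. The paper's actual decomposition presumably differs; the toy layer, data, and loss below are made up.

```python
import torch

torch.manual_seed(0)

# Toy "MLP layer" and loss; all of this is illustrative.
W = torch.nn.Parameter(torch.randn(16, 16))
x, y = torch.randn(64, 16), torch.randn(64, 16)
loss = lambda: ((x @ W.T) - y).pow(2).mean()

# Split W into rank-1 components (SVD here as a simple stand-in for the
# curvature-based decomposition in the paper).
U, S, Vh = torch.linalg.svd(W.detach())

def curvature_along(direction):
    """Second derivative of the loss along a rank-1 direction (Hessian-vector product)."""
    g = torch.autograd.grad(loss(), W, create_graph=True)[0]
    hvp = torch.autograd.grad((g * direction).sum(), W)[0]
    return (hvp * direction).sum().item()

scores = [(i, curvature_along(torch.outer(U[:, i], Vh[i, :]))) for i in range(len(S))]

# High- vs. low-curvature components can then be compared for signatures of
# memorized data vs. generalizing structure, as in the tweet.
print(sorted(scores, key=lambda t: -t[1])[:3])
```
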
Cem Anil retweeted
Jimmy Koppel
Jimmy Koppel@jimmykoppel·
If AI can code 100x faster, why aren't you shipping 100x faster? Because AI code is not production-ready code, and definitely not code where you understand and can vouch for every line.

Introducing the Command Center alpha. Support our Product Hunt launch!
43 replies · 50 reposts · 270 likes · 71.4K views

Cem Anil retweeted
Claude
Claude@claudeai·
Introducing Claude Haiku 4.5: our latest small model. Five months ago, Claude Sonnet 4 was state-of-the-art. Today, Haiku 4.5 matches its coding performance at one-third the cost and more than twice the speed.
[image]
316 replies · 979 reposts · 7.1K likes · 1.3M views

Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.
[image]
87 replies · 270 reposts · 2.5K likes · 211.8K views

Cem Anil retweeted
Jascha Sohl-Dickstein
Jascha Sohl-Dickstein@jaschasd·
I've told you that I think AGI is coming, soon. And that I think this should motivate you as you make personal and professional and research decisions. And I've talked about a few specific ways it might feed into that decision making process. Now here are some particular project areas I like. This list especially is massively incomplete. The most impactful things to do are probably the ones I've left off, and no one is yet thinking about. If you want to talk in more detail about what work in any of them would look like, please approach me after the talk. [read list]

AI for science is things like material discovery, protein folding, weather modeling, fusion reactor plasma monitoring, ...

Science on AI models means treating the AI models themselves as the object of study, and using techniques from another field. Better understanding of the systems we are building is also often a positive. Interpretability is maybe the canonical example of this. There is work treating the parameters or activations of neural networks as stat-mech-like ensembles, work probing the apparent psychology of AI models, work treating the economic behavior and consequences of models, etc. It might be useful to spend an hour brainstorming all the fields that could be used to do science on AI models.

There is straight-up AI safety research. This is something you can get in on the ground floor of, which is hugely important.

There's prediction and extrapolation of AI capabilities. The better we understand what the future might look like, the better it will likely go.

There's access, equity, fairness. If we want this technology to benefit everyone, this is incredibly useful. I also expect it to be a toxic and stressful subfield to work in. It's embracing multiple culture war issues at once, and your work is only going to be net positive if you are able to do that without being dogmatic.

If you are equipped technically and in personality to do work on policy and governance -- DO IT! Governments are desperate for competent technical people to advise them. Everyone good leaves for more pay. But we really need technically competent people informing these policy decisions. This is amazingly high leverage if you do it.
[image]
2 replies · 10 reposts · 128 likes · 16K views

Cem Anil retweeted
Sam Bowman
Sam Bowman@sleepinyourhat·
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
[GIF]
8 replies · 12 reposts · 143 likes · 39K views

Cem Anil retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
[image]
41 replies · 170 reposts · 1.5K likes · 236.6K views