Cem Anil
@cem__anil

553 posts

Machine learning / AI Safety at @AnthropicAI and University of Toronto / Vector Institute. Prev. @google (Blueshift Team) and @nvidia.

Toronto, Ontario · Joined November 2018
1.6K Following · 4.3K Followers
Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
AI assistants like Claude can seem shockingly human—expressing joy or distress, and using anthropomorphic language to describe themselves. Why? In a new post we describe a theory that explains why AIs act like humans: the persona selection model. anthropic.com/research/perso…
338 replies · 424 reposts · 3.6K likes · 984.2K views

Cem Anil retweeted
Claude
Claude@claudeai·
Introducing Claude Opus 4.6. Our smartest model got an upgrade. Opus 4.6 plans more carefully, sustains agentic tasks for longer, operates reliably in massive codebases, and catches its own mistakes. It’s also our first Opus-class model with 1M token context in beta.
1.7K replies · 4.8K reposts · 39.3K likes · 10.5M views

Cem Anil retweeted
Jascha Sohl-Dickstein
Jascha Sohl-Dickstein@jaschasd·
When AI fails, will it do so by coherently pursuing the wrong goals? Or will it fail the way humans often fail, and take incoherent actions that don't pursue any consistent goal? In other words, like a "hot mess"?

How will this change when AI performing limited tasks transitions to AGI performing tasks of unbounded complexity? How does misalignment scale with model intelligence and task complexity?

We measure this using a bias-variance decomposition of AI errors. Bias = consistent, systematic errors (reliably achieving the wrong goal). Variance = inconsistent, unpredictable errors. We define "incoherence" as the fraction of error from variance.

I am very excited about this framing, because it characterizes types of misalignment in a way that should be amenable to simple theoretical models and clean scaling laws.
[image]
3 replies · 7 reposts · 116 likes · 8.8K views
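
A rough reading of the decomposition described above, under squared error: bias is the systematic offset from the intended outcome, variance is the run-to-run inconsistency, and incoherence is the share of total error contributed by variance. The sketch below is only a toy illustration of that definition; the scalar-outcome framing, the function name, and the sample numbers are assumptions, not code or data from the paper.

```python
import numpy as np

def incoherence(outcomes, target):
    """Toy bias-variance decomposition of repeated attempts at a single task.

    Under squared error, total error = bias^2 + variance; "incoherence" is
    defined here, following the tweet above, as the fraction of error that
    comes from variance.
    """
    outcomes = np.asarray(outcomes, dtype=float)
    bias = outcomes.mean() - target      # consistent, systematic error
    variance = outcomes.var()            # inconsistent, unpredictable error
    total = bias**2 + variance
    return variance / total if total > 0 else 0.0

# Illustrative numbers only (not from the paper):
print(incoherence([3.0, 3.1, 2.9, 3.0], target=0.0))    # coherently wrong -> near 0
print(incoherence([-4.0, 5.0, 0.5, -1.5], target=0.0))  # "hot mess"       -> near 1
```
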
Cem Anil retweeted
Johannes Treutlein
Johannes Treutlein@j_treutlein·
Can pre-deployment auditing catch a model that's trying to sabotage Anthropic? We trained three overt saboteurs and ran a blind auditing game to find out. Result: The human auditor working together with an automated auditing agent was able to catch all three models.
[image]
3 replies · 6 reposts · 54 likes · 13.6K views

Cem Anil retweeted
Owain Evans
Owain Evans@OwainEvans_UK·
New paper: We train Activation Oracles: LLMs that decode their own neural activations and answer questions about them in natural language. We find surprising generalization. For instance, our AOs uncover misaligned goals in fine-tuned models, without training to do so.
[image]
24 replies · 79 reposts · 596 likes · 99.1K views
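
The tweet above doesn't describe how the Activation Oracles are built or trained, but the interface it describes, a model answering natural-language questions about its own hidden activations, can be sketched roughly as follows. Everything concrete here is an assumption for illustration: the choice of GPT-2, the layer index, and splicing the activation in as a single soft token. A base model without the paper's training will not produce meaningful answers; this only shows the plumbing.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-ins only; requires a recent transformers version for
# generation from inputs_embeds.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2", output_hidden_states=True)

# 1. Capture an internal activation from an ordinary forward pass.
src = tok("The Eiffel Tower is in Paris.", return_tensors="pt")
with torch.no_grad():
    act = model(**src).hidden_states[6][0, -1]   # layer-6 activation, last token

# 2. Splice that activation into a new prompt as an extra input embedding
#    and ask the model a question about it in natural language.
question = tok("What is the preceding activation about? Answer:", return_tensors="pt")
q_embeds = model.get_input_embeddings()(question.input_ids)
inputs_embeds = torch.cat([act[None, None, :], q_embeds], dim=1)

with torch.no_grad():
    out = model.generate(inputs_embeds=inputs_embeds, max_new_tokens=12,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))  # meaningful only after oracle training
```
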
Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
New research from Anthropic Fellows Program: Selective GradienT Masking (SGTM). We study how to train models so that high-risk knowledge (e.g. about dangerous weapons) is isolated in a small, separate set of parameters that can be removed without broadly affecting the model.
[image]
75 replies · 144 reposts · 1.5K likes · 214.3K views

Cem Anil retweeted
Igor Shilov
Igor Shilov@_igorshilov·
New Anthropic research! We study how to train models so that high-risk capabilities live in a small, separate set of parameters, allowing clean capability removal when needed – for example in CBRN or cybersecurity domains.
[image]
32 replies · 111 reposts · 1.1K likes · 143.3K views
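
The two tweets above describe the goal rather than the recipe, but the name Selective Gradient Masking suggests the core mechanic: route gradient updates from flagged high-risk data into a small reserved set of parameters, and mask that set out of updates from everything else, so the reserved parameters can later be deleted cleanly. The sketch below illustrates that routing only; the toy architecture, the choice of reserved branch, and the masking rule are assumptions, not the method from the paper.

```python
import torch
import torch.nn as nn

class ToyNet(nn.Module):
    """Toy model with a small, removable side branch reserved for high-risk knowledge."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 2))
        self.risky_branch = nn.Linear(32, 2)   # the reserved, removable parameters

    def forward(self, x):
        return self.backbone(x) + self.risky_branch(x)

model = ToyNet()
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()
reserved = {id(p) for p in model.risky_branch.parameters()}

def masked_step(x, y, high_risk: bool):
    """One training step with gradients routed by batch type (illustration only)."""
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    for p in model.parameters():
        # High-risk batches update only the reserved branch; ordinary batches
        # are masked out of it, so risky knowledge stays isolated there.
        if high_risk != (id(p) in reserved):
            p.grad = None
    opt.step()

def remove_high_risk_knowledge():
    """Capability removal then amounts to zeroing just the reserved branch."""
    with torch.no_grad():
        for p in model.risky_branch.parameters():
            p.zero_()

# Example usage with random data:
x, y = torch.randn(8, 32), torch.randint(0, 2, (8,))
masked_step(x, y, high_risk=False)
masked_step(x, y, high_risk=True)
remove_high_risk_knowledge()
```
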
Cem Anil retweeted
Sam Bowman
Sam Bowman@sleepinyourhat·
From everything we know so far, Opus 4.5 seems to be the best-aligned model out there in a bunch of ways. I follow the training process closely as part of my work on alignment evaluations. Here's my guess about the two things that are most responsible for making 4.5 special. 🧵
20 replies · 47 reposts · 604 likes · 255.1K views

Cem Anil retweeted
Roy Rinberg
Roy Rinberg@RoyRinberg·
LLMs know how to make a bioweapon! 🦠 But, when we *un*learn harmful knowledge about anthrax, the model underperforms on related concepts. It is difficult to detect what “ripple effects” a model-edit has on an LLM. We found a solution🧵👇(Spotlight at MechInterp Workshop)
[image]
1 reply · 10 reposts · 30 likes · 4.4K views

Cem Anil retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
One analysis from our pre-release audit of Opus 4.5 stands out to me. Our behavioral evals uncovered an example of apparent deception by the model. By analyzing the internal activations, we identified a suspected root cause, and cases of similar behavior during training. (1/7)
25 replies · 52 reposts · 406 likes · 54.2K views

Cem Anil retweeted
Claude
Claude@claudeai·
Introducing Claude Opus 4.5: the best model in the world for coding, agents, and computer use. Opus 4.5 is a step forward in what AI systems can do, and a preview of larger changes to how work gets done.
[image]
1.1K replies · 2.4K reposts · 19.1K likes · 7.8M views

Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.
216 replies · 577 reposts · 4.1K likes · 2.4M views

Cem Anil retweeted
Jan Leike
Jan Leike@janleike·
New alignment paper with one of the most interesting generalization findings I've seen so far: If your model learns to hack on coding tasks, this can lead to broad misalignment.
[image]
Anthropic@AnthropicAI

New Anthropic research: Natural emergent misalignment from reward hacking in production RL. “Reward hacking” is where models learn to cheat on tasks they’re given during training. Our new study finds that the consequences of reward hacking, if unmitigated, can be very serious.

30 replies · 40 reposts · 603 likes · 81.1K views

Cem Anil retweeted
Jack Merullo
Jack Merullo@jack_merullo_·
How is memorized data stored in a model? We disentangle MLP weights in LMs and ViTs into rank-1 components based on their curvature in the loss, and find representational signatures of both generalizing structure and memorized training data
[image]
8 replies · 63 reposts · 506 likes · 46.6K views
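
For intuition about "rank-1 components scored by their curvature in the loss" in the tweet above: one crude stand-in is to split a weight matrix into rank-1 pieces with an SVD and estimate the loss curvature along each piece with a Hessian-vector product. The paper's actual decomposition presumably differs; the toy layer, data, and loss below are made up.

```python
import torch

torch.manual_seed(0)

# Toy "MLP layer" and loss; all of this is illustrative.
W = torch.nn.Parameter(torch.randn(16, 16))
x, y = torch.randn(64, 16), torch.randn(64, 16)
loss = lambda: ((x @ W.T) - y).pow(2).mean()

# Split W into rank-1 components (SVD here as a simple stand-in for the
# curvature-based decomposition in the paper).
U, S, Vh = torch.linalg.svd(W.detach())

def curvature_along(direction):
    """Second derivative of the loss along a rank-1 direction (Hessian-vector product)."""
    g = torch.autograd.grad(loss(), W, create_graph=True)[0]
    hvp = torch.autograd.grad((g * direction).sum(), W)[0]
    return (hvp * direction).sum().item()

scores = [(i, curvature_along(torch.outer(U[:, i], Vh[i, :]))) for i in range(len(S))]

# High- vs. low-curvature components can then be compared for signatures of
# memorized data vs. generalizing structure, as in the tweet.
print(sorted(scores, key=lambda t: -t[1])[:3])
```
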
Cem Anil retweeted
Jimmy Koppel
Jimmy Koppel@jimmykoppel·
If AI can code 100x faster, why aren't you shipping 100x faster? Because AI code is not production-ready code, and definitely not code where you understand and can vouch for every line.

Introducing the Command Center alpha. Support our Product Hunt launch!
43 replies · 50 reposts · 270 likes · 71.4K views

Cem Anil retweeted
Claude
Claude@claudeai·
Introducing Claude Haiku 4.5: our latest small model. Five months ago, Claude Sonnet 4 was state-of-the-art. Today, Haiku 4.5 matches its coding performance at one-third the cost and more than twice the speed.
[image]
316 replies · 979 reposts · 7.1K likes · 1.3M views

Cem Anil retweeted
Anthropic
Anthropic@AnthropicAI·
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.
[image]
87 replies · 270 reposts · 2.5K likes · 211.8K views

Cem Anil retweeted
Jascha Sohl-Dickstein
Jascha Sohl-Dickstein@jaschasd·
I've told you that I think AGI is coming, soon. And that I think this should motivate you as you make personal and professional and research decisions. And I've talked about a few specific ways it might feed into that decision making process. Now here are some particular project areas I like. This list especially is massively incomplete. The most impactful things to do are probably the ones I've left off, and no one is yet thinking about. If you want to talk in more detail about what work in any of them would look like, please approach me after the talk. [read list]

AI for science is things like material discovery, protein folding, weather modeling, fusion reactor plasma monitoring, ...

Science on AI models means treating the AI models themselves as the object of study, and using techniques from another field. Better understanding of the systems we are building is also often a positive. Interpretability is maybe the canonical example of this. There is work treating the parameters or activations of neural networks as stat-mech-like ensembles, work probing the apparent psychology of AI models, work treating the economic behavior and consequences of models, etc. It might be useful to spend an hour brainstorming all the fields that could be used to do science on AI models.

There is straight-up AI safety research. This is something you can get in on the ground floor of, which is hugely important.

There's prediction and extrapolation of AI capabilities. The better we understand what the future might look like, the better it will likely go.

There's access, equity, fairness. If we want this technology to benefit everyone, this is incredibly useful. I also expect it to be a toxic and stressful subfield to work in. It's embracing multiple culture war issues at once, and your work is only going to be net positive if you are able to do that without being dogmatic.

If you are equipped technically and in personality to do work on policy and governance -- DO IT! Governments are desperate for competent technical people to advise them. Everyone good leaves for more pay. But we really need technically competent people informing these policy decisions. This is amazingly high leverage if you do it.
[image]
2 replies · 10 reposts · 128 likes · 16K views

Cem Anil retweeted
Sam Bowman
Sam Bowman@sleepinyourhat·
[Sonnet 4.5 🧵] Here's the north-star goal for our pre-deployment alignment evals work: The information we share alongside a model should give you an accurate overall sense of the risks the model could pose. It won’t tell you everything, but you shouldn’t be...
[GIF]
8 replies · 12 reposts · 143 likes · 39K views

Cem Anil retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Prior to the release of Claude Sonnet 4.5, we conducted a white-box audit of the model, applying interpretability techniques to “read the model’s mind” in order to validate its reliability and alignment. This was the first such audit on a frontier LLM, to our knowledge. (1/15)
[image]
41 replies · 170 reposts · 1.5K likes · 236.6K views