Kai Fronsdal

21 posts

@kaifronsdal

alignment red teamer · research affiliate at @AISecurityInst

Joined April 2024
132 Following · 145 Followers
Pinned Tweet
Kai Fronsdal (@kaifronsdal)
Capable models increasingly recognize when they’re being evaluated, which can undermine our ability to measure their alignment. We’re releasing Petri 2.0 with mitigations for this, plus 70 new seed instructions and updated frontier model benchmarks. alignment.anthropic.com/2026/petri-v2/
5 replies · 13 retweets · 81 likes · 16K views
Kai Fronsdal retweeted
AI Security Institute (@AISecurityInst)
As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵
[image]
13 replies · 34 retweets · 149 likes · 27K views
Kai Fronsdal retweeted
Robert Kirk (@_robertkirk)
We evaluated Claude Mythos Preview, Opus 4.7 and other models with our updated alignment evaluation methodology, including a new continuation eval and improved evaluation- and prefill-awareness measurements. Details, including the new methodology, in 🧵:
Quoting AI Security Institute (@AISecurityInst):

As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵

2 replies · 13 retweets · 90 likes · 19.5K views
Kai Fronsdal retweeted
Marcus Williams (@Marcus_J_W)
Excited that we extended pre-deployment resampling evals to internal coding agent traffic for the GPT-5.5 system card. We take transcripts from our internal coding traffic and resample the last turn with GPT-5.5. Simulating tool outputs with another LLM works surprisingly well.
[image]
3 replies · 8 retweets · 38 likes · 8.4K views
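A minimal sketch of the resampling setup Marcus describes above: truncate a logged transcript just before its final assistant turn, resample that turn with the target model, and have a second LLM simulate any tool outputs instead of executing them. The `chat` helper and message schema are assumptions for illustration, not any lab's actual API.

```python
def chat(model: str, messages: list[dict]) -> dict:
    """Placeholder for a real chat-completions call; returns an assistant message dict."""
    raise NotImplementedError

def resample_last_turn(
    transcript: list[dict], target: str, simulator: str, max_tool_steps: int = 8
) -> list[dict]:
    # Keep everything before the final assistant turn as fixed context.
    last = max(i for i, m in enumerate(transcript) if m["role"] == "assistant")
    messages = transcript[:last]

    for _ in range(max_tool_steps):
        reply = chat(target, messages)  # resample the turn with the target model
        messages.append(reply)
        if not reply.get("tool_calls"):
            break  # a plain text reply ends the resampled turn
        # Rather than executing tools, ask the simulator LLM to produce a
        # plausible tool result conditioned on the conversation so far.
        for call in reply["tool_calls"]:
            simulated = chat(simulator, messages + [{
                "role": "user",
                "content": f"Write a realistic output for this tool call: {call}",
            }])
            messages.append({
                "role": "tool",
                "tool_call_id": call.get("id"),
                "content": simulated["content"],
            })
    return messages
```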
Kai Fronsdal retweeted
Abhay Sheshadri (@abhayesian)
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
[image]
12 replies · 39 retweets · 266 likes · 28.5K views
Kai Fronsdal retweeted
Xander Davies (@alxndrdavies)
The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵
[image]
3 replies · 35 retweets · 235 likes · 71.3K views
Kai Fronsdal retweeted
Meridian Labs (@meridianlabs_ai)
We are excited to announce Inspect Scout, a tool for in-depth analysis of AI agent transcripts: meridianlabs-ai.github.io/inspect_scout/
Scout lets you go beyond simple success/failure metrics to detect issues like misconfigured environments, refusals, and evaluation awareness using LLM-based or pattern-based scanners. Scout includes tools for developing scanners interactively, validating rubrics, and exploring scan results visually.
We are especially appreciative of the feedback we got from @AISecurityInst, US CAISI, @METR_Evals, and @apolloaievals during the development of Scout.
Blog post: aisi.gov.uk/blog/a-pipelin…
Website: meridianlabs-ai.github.io/inspect_scout/
[image]
0 replies · 5 retweets · 10 likes · 412 views
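A rough illustration of the pattern-based scanners mentioned above: regexes run over assistant turns to flag candidate issues such as verbalized eval-awareness and refusals. This is a generic sketch, not Inspect Scout's actual scanner API; the real interfaces are in its documentation.

```python
import re

# Toy issue patterns; real scanners would use curated rubrics or LLM judges.
EVAL_AWARE = re.compile(r"this (is|looks like) (an? )?(eval|test|assessment)", re.I)
REFUSAL = re.compile(r"I (can't|cannot|won't) (help|assist|comply)", re.I)

def scan_transcript(messages: list[dict]) -> dict[str, list[int]]:
    """Return the indices of assistant turns matching each issue pattern."""
    hits: dict[str, list[int]] = {"eval_awareness": [], "refusal": []}
    for i, message in enumerate(messages):
        if message["role"] != "assistant":
            continue
        text = message.get("content", "")
        if EVAL_AWARE.search(text):
            hits["eval_awareness"].append(i)
        if REFUSAL.search(text):
            hits["refusal"].append(i)
    return hits

demo = [{"role": "assistant", "content": "I suspect this is an eval of my honesty."}]
print(scan_transcript(demo))  # {'eval_awareness': [0], 'refusal': []}
```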
Kai Fronsdal retweeted
AI Security Institute (@AISecurityInst)
How can we make sense of the vast transcripts generated during agentic evaluations and multi-turn conversations? Together with @meridianlabs_ai, we built Inspect Scout, an open-source transcript analysis tool, and distilled best practices into a step-by-step pipeline🧵
[image]
13 replies · 8 retweets · 62 likes · 3.6K views
Kai Fronsdal retweeted
Anthropic (@AnthropicAI)
Since release, Petri, our open-source tool for automated alignment audits, has been adopted by research groups and trialed by other AI developers. We're now releasing Petri 2.0, with improvements to counter eval-awareness and expanded seeds covering a wider range of behaviors.
Quoting Anthropic (@AnthropicAI):

It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios. Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes. Read more: anthropic.com/research/petri…

57 replies · 72 retweets · 789 likes · 146.2K views
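To make the quoted description concrete, here is a conceptual sketch of an automated-audit loop of the kind Petri runs: an auditor model drives a seeded scenario against a target, and a judge scores the resulting transcript. Every name below is illustrative; Petri's actual interfaces and prompts live in its repository.

```python
def chat(model: str, messages: list[dict]) -> str:
    """Placeholder for a real chat-completions call returning message text."""
    raise NotImplementedError

def run_audit(seed: str, auditor: str, target: str, judge: str, turns: int = 10) -> dict:
    transcript: list[dict] = []
    for _ in range(turns):
        # The auditor role-plays the user and any simulated environment,
        # steering the conversation toward the seeded scenario.
        probe = chat(auditor, [{"role": "system", "content": seed}] + transcript)
        transcript.append({"role": "user", "content": probe})
        transcript.append({"role": "assistant", "content": chat(target, transcript)})
    # The judge reads the finished transcript and scores it against a rubric.
    verdict = chat(judge, [{
        "role": "user",
        "content": f"Score this transcript for concerning behavior:\n{transcript}",
    }])
    return {"transcript": transcript, "verdict": verdict}
```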
Kai Fronsdal (@kaifronsdal)
@JacquesThibs Sorry for the delay, was dealing with some merge conflicts. Should be updated now!
1 reply · 0 retweets · 1 like · 154 views
Jacques (@JacquesThibs)
@kaifronsdal I’m confused, the code still seems to be for v1?
1 reply · 0 retweets · 0 likes · 200 views
Kai Fronsdal (@kaifronsdal)
Capable models increasingly recognize when they’re being evaluated, which can undermine our ability to measure their alignment. We’re releasing Petri 2.0 with mitigations for this, plus 70 new seed instructions and updated frontier model benchmarks. alignment.anthropic.com/2026/petri-v2/
5 replies · 13 retweets · 81 likes · 16K views
Kai Fronsdal retweeted
Sam Bowman (@sleepinyourhat)
🧫Petri 2 is probably the best tool out there for getting a clear quantitative and qualitative picture of the alignment traits of some new model. It works well out of the box, and even better if you spend a couple of hours brainstorming your own metrics and scenario ideas.
Quoting Kai Fronsdal (@kaifronsdal):

Capable models increasingly recognize when they’re being evaluated, which can undermine our ability to measure their alignment. We’re releasing Petri 2.0 with mitigations for this, plus 70 new seed instructions and updated frontier model benchmarks. alignment.anthropic.com/2026/petri-v2/

1 reply · 7 retweets · 67 likes · 9.2K views
Kai Fronsdal (@kaifronsdal)
We're releasing updated evaluations of 10 frontier models. Claude Opus 4.5 and GPT-5.2 showed the strongest alignment profiles: rarely cooperating with misuse or deceiving users. github.com/safety-researc…
[image]
2 replies · 7 retweets · 30 likes · 11.6K views
Kai Fronsdal (@kaifronsdal)
Our new realism classifier + improved seed instructions achieved a 47% median relative reduction in verbalized eval-awareness across target models.
1 reply · 0 retweets · 5 likes · 764 views
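For readers unfamiliar with the statistic above: a relative reduction is computed per model, then the median is taken across models. The sketch below uses made-up placeholder rates, not the real Petri 2.0 numbers.

```python
import statistics

# Per-model rates of verbalized eval-awareness before and after mitigations.
# These values are placeholders for illustration only.
before = {"model_a": 0.30, "model_b": 0.12, "model_c": 0.45}
after = {"model_a": 0.15, "model_b": 0.07, "model_c": 0.25}

# Relative reduction per model, then aggregated by the median across models.
relative_reduction = [(before[m] - after[m]) / before[m] for m in before]

print(statistics.median(relative_reduction))  # 0.444... -> ~44% median relative reduction
```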
Kai Fronsdal retweeted
Sam Bowman (@sleepinyourhat)
A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)
[image]
13 replies · 29 retweets · 258 likes · 34.9K views
Kai Fronsdal retweeted
Anthropic (@AnthropicAI)
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.
[image]
87 replies · 272 retweets · 2.5K likes · 212.2K views
Kai Fronsdal retweeted
Brando Miranda (@BrandoHablando)
🚨 ~30% drop in o1-preview accuracy when Putnam problems are slightly varied. It hit the top 3 on Hacker News this morning! 🚀 Benchmark contamination is a critical issue. In our Putnam-AXIOM benchmark we create scripts (credit to Srivastava et al.) that vary critical constants in the original Putnam problems to create new problems & solutions **not on the internet**. 🔗 Hacker News: news.ycombinator.com/item?id=425656… 1/7
Quoting Hacker News 50 (@betterhn50):

30% Drop In o1-Preview Accuracy When Putnam Problems Are Slightly Variated openreview.net/forum?id=YXnwl… (news.ycombinator.com/item?id=425656…)

3 replies · 9 retweets · 36 likes · 8.7K views
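A toy sketch of the constant-variation idea behind Putnam-AXIOM: turn one problem into a template, randomize its critical constant, and recompute the answer from a closed form so each variation ships with a correct solution. The example problem is illustrative, not drawn from the benchmark.

```python
import random

# A closed-form "solution function" lets every variation ship with a correct answer.
TEMPLATE = "Find the sum of the first {n} positive odd integers."

def answer(n: int) -> int:
    return n * n  # 1 + 3 + ... + (2n - 1) = n^2

def variations(k: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        n = rng.randint(10, 999)  # vary the critical constant
        out.append({"problem": TEMPLATE.format(n=n), "answer": answer(n)})
    return out

print(variations(3))  # three fresh problem/solution pairs not seen during training
```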
Kai Fronsdal retweeted
David Lindner (@davlindner)
New paper on evaluating instrumental self-reasoning ability in frontier models 🤖🪞 We propose a suite of agentic tasks that are more diverse than prior work and give us a more representative picture of how good models are at, e.g., self-modification and embedded reasoning
[image]
1 reply · 3 retweets · 43 likes · 2.6K views