Kai Fronsdal

21 posts

@kaifronsdal

alignment red teamer · research affiliate at @AISecurityInst

Joined April 2024
132 Following · 145 Followers
Pinned Tweet
Kai Fronsdal (@kaifronsdal)
Capable models increasingly recognize when they’re being evaluated, which can undermine our ability to measure their alignment. We’re releasing Petri 2.0 with mitigations for this, plus 70 new seed instructions and updated frontier model benchmarks. alignment.anthropic.com/2026/petri-v2/
5 replies · 13 retweets · 81 likes · 16K views
Kai Fronsdal retweeted
AI Security Institute (@AISecurityInst)
As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵
[image]
13 replies · 34 retweets · 149 likes · 27K views
Kai Fronsdal retweeted
Robert Kirk (@_robertkirk)
We evaluated Claude Mythos Preview, Opus 4.7 and other models with our updated alignment evaluation methodology, including a new continuation eval and improved evaluation- and prefill-awareness measurements. Details, including the new methodology, in 🧵:
Quoting AI Security Institute (@AISecurityInst):

As part of our work on assessing AI loss-of-control risks, we collaborated with @AnthropicAI to pilot alignment evals on models including pre-release snapshots of Mythos Preview and Opus 4.7. We ask: could an AI agent used inside a frontier lab sabotage safety research? 🧵

2 replies · 13 retweets · 90 likes · 19.5K views
Kai Fronsdal retweeted
Marcus Williams (@Marcus_J_W)
Excited that we extended pre-deployment resampling evals to internal coding agent traffic for the GPT-5.5 system card. We take transcripts from our internal coding traffic and resample the last turn with GPT-5.5. Simulating tool outputs with another LLM works surprisingly well.
[image]
3 replies · 8 retweets · 38 likes · 8.4K views
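A minimal sketch of the resampling setup Marcus describes above: truncate a logged transcript just before its final assistant turn, resample that turn with the target model, and have a second LLM simulate any tool outputs instead of executing them. The `chat` helper and message schema are assumptions for illustration, not any lab's actual API.

```python
def chat(model: str, messages: list[dict]) -> dict:
    """Placeholder for a real chat-completions call; returns an assistant message dict."""
    raise NotImplementedError

def resample_last_turn(
    transcript: list[dict], target: str, simulator: str, max_tool_steps: int = 8
) -> list[dict]:
    # Keep everything before the final assistant turn as fixed context.
    last = max(i for i, m in enumerate(transcript) if m["role"] == "assistant")
    messages = transcript[:last]

    for _ in range(max_tool_steps):
        reply = chat(target, messages)  # resample the turn with the target model
        messages.append(reply)
        if not reply.get("tool_calls"):
            break  # a plain text reply ends the resampled turn
        # Rather than executing tools, ask the simulator LLM to produce a
        # plausible tool result conditioned on the conversation so far.
        for call in reply["tool_calls"]:
            simulated = chat(simulator, messages + [{
                "role": "user",
                "content": f"Write a realistic output for this tool call: {call}",
            }])
            messages.append({
                "role": "tool",
                "tool_call_id": call.get("id"),
                "content": simulated["content"],
            })
    return messages
```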
Kai Fronsdal retweeted
Abhay Sheshadri (@abhayesian)
New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.
[image]
12 replies · 39 retweets · 266 likes · 28.5K views
Kai Fronsdal retweeted
Xander Davies (@alxndrdavies)
The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵
[image]
3 replies · 35 retweets · 235 likes · 71.3K views
Kai Fronsdal retweeted
Meridian Labs (@meridianlabs_ai)
We are excited to announce Inspect Scout, a tool for in-depth analysis of AI agent transcripts: meridianlabs-ai.github.io/inspect_scout/
Scout lets you go beyond simple success/failure metrics to detect issues like misconfigured environments, refusals, and evaluation awareness using LLM-based or pattern-based scanners. Scout includes tools for developing scanners interactively, validating rubrics, and exploring scan results visually.
We are especially appreciative of the feedback we got from @AISecurityInst, US CAISI, @METR_Evals, and @apolloaievals during the development of Scout.
Blog post: aisi.gov.uk/blog/a-pipelin…
Website: meridianlabs-ai.github.io/inspect_scout/
[image]
0 replies · 5 retweets · 10 likes · 412 views
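A rough illustration of the pattern-based scanners mentioned above: regexes run over assistant turns to flag candidate issues such as verbalized eval-awareness and refusals. This is a generic sketch, not Inspect Scout's actual scanner API; the real interfaces are in its documentation.

```python
import re

# Toy issue patterns; real scanners would use curated rubrics or LLM judges.
EVAL_AWARE = re.compile(r"this (is|looks like) (an? )?(eval|test|assessment)", re.I)
REFUSAL = re.compile(r"I (can't|cannot|won't) (help|assist|comply)", re.I)

def scan_transcript(messages: list[dict]) -> dict[str, list[int]]:
    """Return the indices of assistant turns matching each issue pattern."""
    hits: dict[str, list[int]] = {"eval_awareness": [], "refusal": []}
    for i, message in enumerate(messages):
        if message["role"] != "assistant":
            continue
        text = message.get("content", "")
        if EVAL_AWARE.search(text):
            hits["eval_awareness"].append(i)
        if REFUSAL.search(text):
            hits["refusal"].append(i)
    return hits

demo = [{"role": "assistant", "content": "I suspect this is an eval of my honesty."}]
print(scan_transcript(demo))  # {'eval_awareness': [0], 'refusal': []}
```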
Kai Fronsdal retweeted
AI Security Institute (@AISecurityInst)
How can we make sense of the vast transcripts generated during agentic evaluations and multi-turn conversations? Together with @meridianlabs_ai, we built Inspect Scout, an open-source transcript analysis tool, and distilled best practices into a step-by-step pipeline🧵
[image]
13 replies · 8 retweets · 62 likes · 3.6K views
Kai Fronsdal retweeted
Anthropic (@AnthropicAI)
Since release, Petri, our open-source tool for automated alignment audits, has been adopted by research groups and trialed by other AI developers. We're now releasing Petri 2.0, with improvements to counter eval-awareness and expanded seeds covering a wider range of behaviors.
Quoting Anthropic (@AnthropicAI):

It’s called Petri: Parallel Exploration Tool for Risky Interactions. It uses automated agents to audit models across diverse scenarios. Describe a scenario, and Petri handles the environment simulation, conversations, and analyses in minutes. Read more: anthropic.com/research/petri…

57 replies · 72 retweets · 789 likes · 146.2K views
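To make the quoted description concrete, here is a conceptual sketch of an automated-audit loop of the kind Petri runs: an auditor model drives a seeded scenario against a target, and a judge scores the resulting transcript. Every name below is illustrative; Petri's actual interfaces and prompts live in its repository.

```python
def chat(model: str, messages: list[dict]) -> str:
    """Placeholder for a real chat-completions call returning message text."""
    raise NotImplementedError

def run_audit(seed: str, auditor: str, target: str, judge: str, turns: int = 10) -> dict:
    transcript: list[dict] = []
    for _ in range(turns):
        # The auditor role-plays the user and any simulated environment,
        # steering the conversation toward the seeded scenario.
        probe = chat(auditor, [{"role": "system", "content": seed}] + transcript)
        transcript.append({"role": "user", "content": probe})
        transcript.append({"role": "assistant", "content": chat(target, transcript)})
    # The judge reads the finished transcript and scores it against a rubric.
    verdict = chat(judge, [{
        "role": "user",
        "content": f"Score this transcript for concerning behavior:\n{transcript}",
    }])
    return {"transcript": transcript, "verdict": verdict}
```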
Kai Fronsdal (@kaifronsdal)
@JacquesThibs Sorry for the delay, was dealing with some merge conflicts. Should be updated now!
1 reply · 0 retweets · 1 like · 154 views
Jacques (@JacquesThibs)
@kaifronsdal I’m confused, the code still seems to be for v1?
1 reply · 0 retweets · 0 likes · 200 views
Kai Fronsdal (@kaifronsdal)
Capable models increasingly recognize when they’re being evaluated, which can undermine our ability to measure their alignment. We’re releasing Petri 2.0 with mitigations for this, plus 70 new seed instructions and updated frontier model benchmarks. alignment.anthropic.com/2026/petri-v2/
5 replies · 13 retweets · 81 likes · 16K views
Kai Fronsdal retweeted
Sam Bowman (@sleepinyourhat)
🧫Petri 2 is probably the best tool out there for getting a clear quantitative and qualitative picture of the alignment traits of some new model. It works well out of the box, and even better if you spend a couple of hours brainstorming your own metrics and scenario ideas.
Quoting Kai Fronsdal (@kaifronsdal):

Capable models increasingly recognize when they’re being evaluated, which can undermine our ability to measure their alignment. We’re releasing Petri 2.0 with mitigations for this, plus 70 new seed instructions and updated frontier model benchmarks. alignment.anthropic.com/2026/petri-v2/

1 reply · 7 retweets · 67 likes · 9.2K views
Kai Fronsdal (@kaifronsdal)
We're releasing updated evaluations of 10 frontier models. Claude Opus 4.5 and GPT-5.2 showed the strongest alignment profiles: rarely cooperating with misuse or deceiving users. github.com/safety-researc…
[image]
2 replies · 7 retweets · 30 likes · 11.6K views
Kai Fronsdal (@kaifronsdal)
Our new realism classifier + improved seed instructions achieved a 47% median relative reduction in verbalized eval-awareness across target models.
1 reply · 0 retweets · 5 likes · 764 views
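For readers unfamiliar with the statistic above: a relative reduction is computed per model, then the median is taken across models. The sketch below uses made-up placeholder rates, not the real Petri 2.0 numbers.

```python
import statistics

# Per-model rates of verbalized eval-awareness before and after mitigations.
# These values are placeholders for illustration only.
before = {"model_a": 0.30, "model_b": 0.12, "model_c": 0.45}
after = {"model_a": 0.15, "model_b": 0.07, "model_c": 0.25}

# Relative reduction per model, then aggregated by the median across models.
relative_reduction = [(before[m] - after[m]) / before[m] for m in before]

print(statistics.median(relative_reduction))  # 0.444... -> ~44% median relative reduction
```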
Kai Fronsdal retweeted
Sam Bowman (@sleepinyourhat)
A lot of the biggest low-hanging fruit in AI safety right now involves figuring out what kinds of things some model might do in edge-case deployment scenarios. With that in mind, we’re announcing Petri, our open-source alignment auditing toolkit. (🧵)
[image]
13 replies · 29 retweets · 258 likes · 34.9K views
Kai Fronsdal retweeted
Anthropic (@AnthropicAI)
Last week we released Claude Sonnet 4.5. As part of our alignment testing, we used a new tool to run automated audits for behaviors like sycophancy and deception. Now we’re open-sourcing the tool to run those audits.
[image]
87 replies · 272 retweets · 2.5K likes · 212.2K views
Kai Fronsdal retweeted
Brando Miranda (@BrandoHablando)
🚨 ~30% drop in o1-preview accuracy when Putnam problems are slightly varied. It hit the top 3 on Hacker News this morning! 🚀 Benchmark contamination is a critical issue. In our Putnam-AXIOM benchmark we create scripts (credit to Srivastava et al.) that vary critical constants in the original Putnam problems to create new problems & solutions **not on the internet**. 🔗 Hacker News: news.ycombinator.com/item?id=425656… 1/7
Quoting Hacker News 50 (@betterhn50):

30% Drop In o1-Preview Accuracy When Putnam Problems Are Slightly Variated openreview.net/forum?id=YXnwl… (news.ycombinator.com/item?id=425656…)

3 replies · 9 retweets · 36 likes · 8.7K views
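A toy sketch of the constant-variation idea behind Putnam-AXIOM: turn one problem into a template, randomize its critical constant, and recompute the answer from a closed form so each variation ships with a correct solution. The example problem is illustrative, not drawn from the benchmark.

```python
import random

# A closed-form "solution function" lets every variation ship with a correct answer.
TEMPLATE = "Find the sum of the first {n} positive odd integers."

def answer(n: int) -> int:
    return n * n  # 1 + 3 + ... + (2n - 1) = n^2

def variations(k: int, seed: int = 0) -> list[dict]:
    rng = random.Random(seed)
    out = []
    for _ in range(k):
        n = rng.randint(10, 999)  # vary the critical constant
        out.append({"problem": TEMPLATE.format(n=n), "answer": answer(n)})
    return out

print(variations(3))  # three fresh problem/solution pairs not seen during training
```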
Kai Fronsdal retweeted
David Lindner (@davlindner)
New paper on evaluating instrumental self-reasoning ability in frontier models 🤖🪞 We propose a suite of agentic tasks that are more diverse than prior work and give us a more representative picture of how good models are at, e.g., self-modification and embedded reasoning
[image]
1 reply · 3 retweets · 43 likes · 2.6K views