Erik Jones

137 posts

@ErikJones313

Safety @AnthropicAI. Prev @berkeley_ai and @StanfordAILab. Opinions are my own

Joined February 2018

162 Following · 617 Followers

Erik Jones reposted

Xander Davies@alxndrdavies·1d

I moved to London 3 years ago to join @AISecurityInst, at the time a few people with visitor passes and a whiteboard. Since then AISI has become the world’s largest and best-funded group in gov focused on AI security & safety. Fun to be in @nytimes!

Erik Jones reposted

Anthropic@AnthropicAI·7 Apr

Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing

Erik Jones@ErikJones313·27 Mar

I'm really excited this work is out! We red-team LLMs for character violations that are too rare to show up in static evals, but could still arise during large-scale deployments

Nate Rahn@n8rahn

New Anthropic Fellows research: Abstractive red-teaming of language model character The worst way to find out about a character flaw in your language model is from a viral screenshot. How can we find these issues before deployment, rather than after? In this work, we introduce abstractive red-teaming, a new approach that searches over natural-language categories of queries, rather than individual prompts.

Erik Jones reposted

Jerry Wei@JerryWeiAI·6 Mar

UK AISI have been great collaborators that have helped us improve our safety systems by identifying weaknesses in existing defenses. This is a great opportunity to make meaningful impact on deployed safety across the industry!

Xander Davies@alxndrdavies

The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

Erik Jones reposted

Jerry Wei@JerryWeiAI·28 Feb

The recent events on holding our red lines on mass surveillance and fully-autonomous weapons is, to me, the most-apparently obvious example of Anthropic's ability to stick to our values instead of discarding them for some commercial gain. I'm really proud to be part of a company that holds its ground on its morals and that understands the stakes of the technology that's being built.

Anthropic@AnthropicAI

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

Erik Jones reposted

Anthropic@AnthropicAI·28 Feb

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

Erik Jones reposted

Javier Rando@javirandor·15 Jan

Thanks to ETH for featuring my work on AI safety and security! ethz.ch/en/news-and-ev…

Erik Jones reposted

Jacob Steinhardt@JacobSteinhardt·6 Jan

New blog post out: a position piece on "Turning Compute into Understanding", by training superhuman oversight assistants.

Erik Jones reposted

Ethan Perez@EthanJPerez·23 Dec

Transluce is a top-tier AI safety research lab - I follow their work as closely as work from our own safety teams at Anthropic. They're also well-positioned to become a strong third-party auditor for AI labs. Consider donating if you're interested in helping them out!

Transluce@TransluceAI

Transluce is running our end-of-year fundraiser for 2025. This is our first public fundraiser since launching late last year.

Erik Jones reposted

Jacob Steinhardt@JacobSteinhardt·21 Dec

Cool to see folks building on LatentQA! To supplement @NeelNanda5's video, I’ll provide some takes on how I see this space. (Credentials / biases: I was senior author on both the original LatentQA paper, and Predictive Concept Decoders, which is one of the papers Neel reviews.)

Neel Nanda@NeelNanda5

New video: What would it look like for interp to be truly bitter lesson pilled? There's been exciting work on end-to-end interpretability: directly train models to map acts to explanations This is live paper review to two (Activation Oracles & PCD), I read and give hot takes

Erik Jones reposted

Transluce@TransluceAI·14 Nov

Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow @belindazli, we find that they can—and that models explain themselves better than other models do.

Erik Jones reposted

Tejal Patwardhan@tejalpatwardhan·25 Sep

Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.

OpenAI@OpenAI

Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. openai.com/index/gdpval-v0

Erik Jones reposted

Xander Davies@alxndrdavies·13 Sep

Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6

Erik Jones reposted

Ethan Perez@EthanJPerez·4 Sep

We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵

Erik Jones reposted

Jan Leike@janleike·24 Jul

In March we published a paper on alignment audits: teams of humans were tasked to find the problems in model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.

Anthropic@AnthropicAI

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

Erik Jones reposted

Meena Jagadeesan@mjagadeesan25·24 Jun

I'm so excited to be joining @Penn as an Assistant Professor in CS (@CIS_Penn) in Fall 2026! I’ll be working on machine learning ecosystems, aiming to steer how multi-agent interactions shape performance trends and societal outcomes. I’ll be recruiting PhD students this cycle!

Erik Jones reposted

Siddharth Karamcheti@siddkaramcheti·18 Jun

Thrilled to share that I'll be starting as an Assistant Professor at Georgia Tech (@ICatGT / @GTrobotics / @mlatgt) in Fall 2026. My lab will tackle problems in robot learning, multimodal ML, and interaction. I'm recruiting PhD students this next cycle – please apply/reach out!

Erik Jones@ErikJones313·16 May

I really admire how Ruiqi's methodology is scalable, easy-to-deploy, and superficially simple---so simple that it can be hard to recognize the critical conceptual work to find the right load-bearing primitives and abstractions. Definitely check out his video and thesis :)

Ruiqi Zhong@ZhongRuiqi

Last day of PhD! I pioneered using LLMs to explain dataset&model. It's used by interp at @OpenAI and societal impact @AnthropicAI Tutorial here. It's a great direction & someone should carry the torch :) Thesis available, if you wanna read my acknowledgement section=P

Erik Jones reposted

Ethan Perez@EthanJPerez·17 Apr

@TransluceAI is killing it. Very cool/insightful findings in this thread. Their tool for automatically finding weird model behaviors (Docent) is one of those projects I wish I had thought to do, and looks quite useful for improving models

Transluce@TransluceAI

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/) x.com/OpenAI/status/…

Erik Jones reposted

Xander Davies@alxndrdavies·12 Mar

My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4
