Erik Jones

137 posts

@ErikJones313

Safety @AnthropicAI. Prev @berkeley_ai and @StanfordAILab. Opinions are my own

Joined February 2018
162 Following, 617 Followers
Erik Jones retweeted
Xander Davies@alxndrdavies·
I moved to London 3 years ago to join @AISecurityInst, at the time a few people with visitor passes and a whiteboard. Since then AISI has become the world’s largest and best-funded group in gov focused on AI security & safety. Fun to be in @nytimes!
6
31
328
13.1K
Erik Jones retweeted
Anthropic@AnthropicAI·
Introducing Project Glasswing: an urgent initiative to help secure the world’s most critical software. It’s powered by our newest frontier model, Claude Mythos Preview, which can find software vulnerabilities better than all but the most skilled humans. anthropic.com/glasswing
2K
6.7K
44.1K
31.3M
Erik Jones retweeted
Jerry Wei@JerryWeiAI·
UK AISI have been great collaborators that have helped us improve our safety systems by identifying weaknesses in existing defenses. This is a great opportunity to make meaningful impact on deployed safety across the industry!
Xander Davies@alxndrdavies

The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

0
2
10
1.6K
Erik Jones retweeted
Jerry Wei@JerryWeiAI·
The recent events on holding our red lines on mass surveillance and fully-autonomous weapons are, to me, the most-apparently obvious example of Anthropic's ability to stick to our values instead of discarding them for some commercial gain. I'm really proud to be part of a company that holds its ground on its morals and that understands the stakes of the technology that's being built.
Anthropic@AnthropicAI

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

9
14
282
11.1K
Erik Jones retweeted
Jacob Steinhardt@JacobSteinhardt·
New blog post out: a position piece on "Turning Compute into Understanding", by training superhuman oversight assistants.
5
37
232
30.4K
Erik Jones retweeted
Jacob Steinhardt@JacobSteinhardt·
Cool to see folks building on LatentQA! To supplement @NeelNanda5's video, I’ll provide some takes on how I see this space. (Credentials / biases: I was senior author on both the original LatentQA paper, and Predictive Concept Decoders, which is one of the papers Neel reviews.)
Neel Nanda@NeelNanda5

New video: What would it look like for interp to be truly bitter lesson pilled? There's been exciting work on end-to-end interpretability: directly train models to map acts to explanations. This is a live paper review of two (Activation Oracles & PCD); I read and give hot takes

3
18
110
36.4K
Erik Jones retweeted
Transluce@TransluceAI·
Can LMs learn to faithfully describe their internal features and mechanisms? In our new paper led by Research Fellow @belindazli, we find that they can—and that models explain themselves better than other models do.
5
57
274
67.6K
Erik Jones retweeted
Tejal Patwardhan@tejalpatwardhan·
Understanding the capabilities of AI models is important to me. To forecast how AI models might affect labor, we need methods to measure their real-world work abilities. That’s why we created GDPval.
OpenAI@OpenAI

Today we’re introducing GDPval, a new evaluation that measures AI on real-world, economically valuable tasks. Evals ground progress in evidence instead of speculation and help track how AI improves at the kind of work that matters most. openai.com/index/gdpval-v0

58
186
1.3K
1.1M
Erik Jones retweeted
Xander Davies@alxndrdavies·
Excited to share details on two of our longest running and most effective safeguard collaborations, one with Anthropic and one with OpenAI. We've identified—and they've patched—a large number of vulnerabilities and together strengthened their safeguards. 🧵 1/6
8
61
299
61.2K
Erik Jones retweeted
Ethan Perez@EthanJPerez·
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
10
42
257
69.5K
Erik Jones retweeted
Jan Leike@janleike·
In March we published a paper on alignment audits: teams of humans were tasked with finding the problems in a model we trained to be misaligned. Now we have agents that can do it automatically 42% of the time.
Anthropic@AnthropicAI

New Anthropic research: Building and evaluating alignment auditing agents. We developed three AI agents to autonomously complete alignment auditing tasks. In testing, our agents successfully uncovered hidden goals, built safety evaluations, and surfaced concerning behaviors.

16
26
228
24.1K
Erik Jones retweeted
Meena Jagadeesan@mjagadeesan25·
I'm so excited to be joining @Penn as an Assistant Professor in CS (@CIS_Penn) in Fall 2026! I’ll be working on machine learning ecosystems, aiming to steer how multi-agent interactions shape performance trends and societal outcomes. I’ll be recruiting PhD students this cycle!
38
50
799
66.2K
Erik Jones retweeted
Siddharth Karamcheti@siddkaramcheti·
Thrilled to share that I'll be starting as an Assistant Professor at Georgia Tech (@ICatGT / @GTrobotics / @mlatgt) in Fall 2026. My lab will tackle problems in robot learning, multimodal ML, and interaction. I'm recruiting PhD students this next cycle – please apply/reach out!
72
22
564
61.2K
Erik Jones@ErikJones313·
I really admire how Ruiqi's methodology is scalable, easy-to-deploy, and superficially simple---so simple that it can be hard to recognize the critical conceptual work to find the right load-bearing primitives and abstractions. Definitely check out his video and thesis :)
Ruiqi Zhong@ZhongRuiqi

Last day of PhD! I pioneered using LLMs to explain dataset&model. It's used by interp at @OpenAI and societal impact @AnthropicAI Tutorial here. It's a great direction & someone should carry the torch :) Thesis available, if you wanna read my acknowledgement section=P

English
0
3
19
5.8K
Erik Jones retweeted
Ethan Perez@EthanJPerez·
@TransluceAI is killing it. Very cool/insightful findings in this thread. Their tool for automatically finding weird model behaviors (Docent) is one of those projects I wish I had thought to do, and looks quite useful for improving models
Transluce@TransluceAI

We tested a pre-release version of o3 and found that it frequently fabricates actions it never took, and then elaborately justifies these actions when confronted. We were surprised, so we dug deeper 🔎🧵(1/) x.com/OpenAI/status/…

0
5
59
6K
Erik Jones retweeted
Xander Davies@alxndrdavies·
My team is hiring @AISecurityInst! I think this is one of the most important times in history to have strong technical expertise in government. Join our team understanding and fixing weaknesses in frontier models through sota adversarial ML research & testing. 🧵 1/4
4
35
170
47.8K