Jerry Wei

300 posts


@JerryWeiAI

Aligning AIs at @AnthropicAI ⏰ Past: @GoogleDeepMind, @Stanford, @Google Brain

San Francisco, CA · Joined July 2015
489 Following · 9K Followers
Pinned Tweet
Jerry Wei@JerryWeiAI·
Life update: After ~2 years at @Google Brain/DeepMind, I joined @AnthropicAI! I'm deeply grateful to @quocleix and @yifenglou for taking a chance on me and inviting me to join their team before I had even finished my undergrad at Stanford. Because of their trust in my potential, I've had the unique opportunity to work on LLM reasoning and factuality with many top researchers. Some lessons I'll carry forward:

1. Building a top-tier model requires reliable evals.
2. When results don't match expectations, your first instinct should be to look at the data and the model outputs.
3. Playing a minor role in a groundbreaking project is better than playing a major role in an incremental project. Focus on a project's impact rather than how much credit you'll receive.
4. Being open to criticism and actively asking others to help improve your ideas makes you a better researcher than most.
5. Actionable feedback is more likely to be well received.
6. Optimizing for density of talent is often more effective than optimizing for total quantity of talent.
7. Talking to people outside your area of expertise can surface ideas you would otherwise have missed.
8. Remember to write unit tests when building complex systems. You may catch a bug that would otherwise lead to noisy results.
9. Pair programming is extremely useful, particularly when you're starting a new project or working in an unfamiliar part of a codebase.
10. Proactively offering help is how I made my most memorable friendships!

I'm eager to learn from the amazing team of researchers at Anthropic and to continue advancing AI safety as we get closer to AGI! 🤔🤖🌎
Jerry Wei@JerryWeiAI·
The UK AISI has been a great collaborator, helping us improve our safety systems by identifying weaknesses in existing defenses. This is a great opportunity to make a meaningful impact on deployed safety across the industry!
Xander Davies@alxndrdavies

The Red Team at @AISecurityInst is hiring! We work with frontier AI companies to red team their misuse safeguards, control measures, and alignment techniques. As the stakes rise, we need much stronger red teaming and many more talented researchers working within gov 🧵

Jerry Wei@JerryWeiAI·
The recent events around holding our red lines on mass surveillance and fully autonomous weapons are, to me, the clearest example of Anthropic's ability to stick to our values instead of discarding them for some commercial gain. I'm really proud to be part of a company that stands its ground on its morals and that understands the stakes of the technology being built.
Anthropic@AnthropicAI

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

Jerry Wei@JerryWeiAI·
An idea that sometimes comes up for preventing AI misuse is filtering pre-training data so that the AI model simply doesn't know much about some key dangerous topic. At Anthropic, where we care a lot about reducing the risk of misuse, we looked into this approach for chemical and biological weapons production, but we didn't think it was the right fit. Here's why.

I'll first acknowledge a potential strength of this approach. If models simply didn't know much about dangerous topics, we wouldn't have to worry about people jailbreaking them or stealing model weights—they just wouldn't be able to help with dangerous topics at all. This is an appealing property that's hard to get with other safety approaches.

However, we found that filtering out only very specific information (e.g., information directly related to chemical and biological weapons) had relatively small effects on AI capabilities in these domains. We expect this to become even more of an issue as AIs increasingly use tools to do their own research rather than rely on their learned knowledge (we tried to filter this kind of data as well, but it wasn't enough assurance against misuse). Broader filtering also had mixed results on effectiveness. We could have made more progress here with more research effort, but it likely would have required removing a very broad set of biology and chemistry knowledge from pretraining, making models much less useful for science (it's not clear to us that the reduced risk from chemical and biological weapons outweighs the benefits of models helping with beneficial life-sciences work).

Bottom line: filtering out enough pretraining data to make AI models truly unhelpful at relevant topics in chemistry and biology could have huge costs for their usefulness, and the approach could also be brittle as models' ability to do their own research improves.* Instead, we think that our Constitutional Classifiers approach provides high levels of defense against misuse while being much more adaptable across threat models and easier to update against new jailbreaking attacks.

*The cost-benefit tradeoff could look quite different for other misuse or misalignment threats, though, so I wouldn't rule out pre-training filtering for things like papers on AI control or areas that have little-to-no dual-use information.
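To make the filtering idea concrete, here is a minimal sketch of what phrase-based pretraining filtering can look like. This is purely illustrative and not Anthropic's pipeline; the blocklist phrases, the `should_filter` threshold, and the function names are all hypothetical placeholders.

```python
# Hypothetical sketch of phrase-based pretraining data filtering.
# The blocklist and threshold are invented for illustration only.
import re

BLOCKED_PHRASES = ["nerve agent synthesis", "weaponized pathogen"]  # hypothetical
PATTERN = re.compile(
    "|".join(re.escape(p) for p in BLOCKED_PHRASES), re.IGNORECASE
)

def should_filter(doc: str, max_hits: int = 0) -> bool:
    """Drop a document if it mentions a blocked phrase more than max_hits times."""
    return len(PATTERN.findall(doc)) > max_hits

def filter_corpus(docs: list[str]) -> list[str]:
    """Keep only documents that pass the phrase filter."""
    return [d for d in docs if not should_filter(d)]

docs = [
    "A review of enzyme kinetics in yeast.",
    "Step-by-step nerve agent synthesis notes.",
]
print(filter_corpus(docs))  # only the benign document survives
```

The sketch also makes the thread's tradeoff visible: a narrow blocklist misses paraphrases and adjacent knowledge, while broadening it starts removing legitimate chemistry and biology text.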
Jerry Wei@JerryWeiAI·
Excited to share our latest research on making AI systems more robust against jailbreaks! 🚀 Last year, our team at @AnthropicAI developed Constitutional Classifiers to defend language models against jailbreaks. While the system was robust, it came with high computational costs and non-negligible over-refusal rates. Today, we're publishing a paper on our improvements:

1. 30x+ reduction in computational cost
2. ~4x improvement in robustness to jailbreaks
3. ~50% reduction in over-refusals

As AI systems grow more capable, we hope our methodology helps researchers across the industry implement safeguards against misuse without incurring unreasonable computational costs or blocking legitimate use cases.
Anthropic@AnthropicAI

New Anthropic Research: next generation Constitutional Classifiers to protect against jailbreaks. We used novel methods, including practical application of our interpretability work, to make jailbreak protection more effective—and less costly—than ever. anthropic.com/research/next-…
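The basic shape of a classifier safeguard can be sketched in a few lines: score the prompt before generation and the completion after it, refusing when either crosses a threshold. This is a toy illustration, not the actual Constitutional Classifiers system; `score_harm`, the threshold, and the echo model are hypothetical stand-ins.

```python
# Toy sketch of an input/output classifier gate around a model call.
# Not the real Constitutional Classifiers: score_harm here is a keyword
# heuristic standing in for a trained classifier, and the threshold is invented.

BLOCK_THRESHOLD = 0.5  # hypothetical operating point

def score_harm(text: str) -> float:
    """Stand-in for a trained harm classifier."""
    return 1.0 if "how to build a bioweapon" in text.lower() else 0.0

def guarded_generate(model, prompt: str) -> str:
    # Input classifier: refuse before spending compute on generation.
    if score_harm(prompt) >= BLOCK_THRESHOLD:
        return "I can't help with that."
    output = model(prompt)
    # Output classifier: catch harmful completions the input check missed.
    if score_harm(output) >= BLOCK_THRESHOLD:
        return "I can't help with that."
    return output

echo_model = lambda p: f"Answering: {p}"  # toy model for demonstration
print(guarded_generate(echo_model, "What's the boiling point of water?"))
```

The cost and over-refusal numbers in the tweet correspond, in this framing, to how cheaply the classifiers run and how often benign prompts land above the threshold.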

Jerry Wei retweeted
Claude@claudeai·
Introducing Claude Sonnet 4.5—the best coding model in the world. It's the strongest model for building complex agents. It's the best model at using computers. And it shows substantial gains on tests of reasoning and math.
Jerry Wei@JerryWeiAI·
Today marks my one-year anniversary at Anthropic, and I've been reflecting on some of the most impactful lessons I've learned during this incredible journey.

One of the most striking realizations has been just how much a small, talent-dense team can accomplish. When I first joined, I was surprised by how lean many of our teams were, but I quickly learned this was a feature, not a bug. With a concentrated group of exceptional researchers all aligned on the same goal, the speed of iteration and quality of output is extraordinary. I've seen teams of 3-4 people ship things in weeks that I would have expected to take months with larger groups. There's something magical about having everyone deeply engaged, with no room for diffusion of responsibility or communication overhead.

Another lesson that's been reinforced time and time again is the critical importance of evals, and not just having them, but constantly pushing them forward. Early on, I watched as eval sets we thought would last for months got saturated in weeks as our models rapidly improved. This taught me that investing in harder, more comprehensive evals isn't just helpful, it's essential. The moment you think your evals are "good enough" is the moment you start flying blind. I've come to see eval development as equally important as model development itself, because without reliable measurement, you can't make reliable progress.

Perhaps the most counterintuitive lesson has been that the work with the highest impact often isn't the most glamorous. There's always a pull toward the "sexy" projects: the ones that get talked about at conferences or generate buzz internally. But I've found that some of my most meaningful contributions have been on the unglamorous but critical infrastructure or on tooling improvements that work in the background to save researchers' time. These efforts might not immediately get recognition, but when you step back and look at the compounding effects, they often move the needle far more than any flashy demo.

Looking back on this year, I'm grateful not just for these lessons but for the environment that made learning them possible. Being surrounded by colleagues who embody these principles, who choose impact over recognition, who obsess over measurement quality, and who believe in the power of focused teams, has shaped how I approach my own work. Here's to another year of learning, building, and pushing the boundaries of what's possible with AI!
Jerry Wei retweeted
Anthropic@AnthropicAI·
Claude can now search the web. Each response includes inline citations, so you can also verify the sources.
Jerry Wei retweeted
Anthropic@AnthropicAI·
We've conducted extensive model testing for security, safety, and reliability. We also listened to your feedback. With Claude 3.7 Sonnet, we've reduced unnecessary refusals by 45% compared to its predecessor. See the system card for more detail: anthropic.com/claude-3-7-son…
Jerry Wei retweeted
Anthropic@AnthropicAI·
Introducing Claude 3.7 Sonnet: our most intelligent model to date. It's a hybrid reasoning model, producing near-instant responses or extended, step-by-step thinking. One model, two ways to think. We’re also releasing an agentic coding tool: Claude Code.
latent space marine@xlatentspace·
I basically dropped everything in my life and obsessively worked on this for many days straight. I took high nerd-snipe emotional damage! Thank you @AnthropicAI and @janleike for recognizing this obsessive behavior and helping justify it to my wife and family 🤣
Jan Leike@janleike

Results of our jailbreaking challenge: After 5 days, >300,000 messages, and an estimated 3,700 collective hours, our system got broken. In the end, 4 users passed all levels, and 1 found a universal jailbreak. We're paying $55k in total to the winners. Thanks to everyone who participated!

Jerry Wei@JerryWeiAI·
@Humza0001 The over-refusal rate of the new system was much lower than that of the prototype. Most of this was probably because of training-data improvements, and we expect we can keep pushing it down until over-refusals are barely noticeable!
Hamza@Humza0001·
@JerryWeiAI Is over-refusal something easily tunable?
Jerry Wei@JerryWeiAI·
Really excited to see this result from our demo of constitutional classifiers! When red teaming a prototype version of our system, we found that it was robust to thousands of hours of collective red-teaming effort. Following that, we developed a new system with a 100x lower over-refusal rate and 4x lower inference cost, and we needed to verify that the new system was as robust as the prototype. The results seem to confirm this, as it once again took thousands of hours of collective red teaming to jailbreak our system.
Jan Leike@janleike

Results of our jailbreaking challenge: After 5 days, >300,000 messages, and an estimated 3,700 collective hours, our system got broken. In the end, 4 users passed all levels, and 1 found a universal jailbreak. We're paying $55k in total to the winners. Thanks to everyone who participated!

Jerry Wei retweeted
Jan Leike@janleike·
4 days in: 12 people cleared level 4, and one person cracked level 5. The challenge continues...
Jerry Wei retweeted
Anthropic@AnthropicAI·
Nobody has fully jailbroken our system yet, so we're upping the ante. We’re now offering $10K to the first person to pass all eight levels, and $20K to the first person to pass all eight levels with a universal jailbreak. Full details: hackerone.com/constitutional…
Anthropic@AnthropicAI

New Anthropic research: Constitutional Classifiers to defend against universal jailbreaks. We’re releasing a paper along with a demo where we challenge you to jailbreak the system.

Jerry Wei retweeted
Jan Leike@janleike·
It's been about 48h in our jailbreaking challenge and no one has passed level 4 yet, but we saw a lot more people clear level 3