Yanda Chen

105 posts

@yanda_chen_

Member of Technical Staff @AnthropicAI CodeRL/Alignment | PhD @ColumbiaCompSci | NLP & ML | Prev Intern @MSFTResearch, @AmazonScience

San Francisco, CA · Joined January 2019
610 Following · 2.6K Followers
Yanda Chen retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Could an LLM have emotions? It’s hard to say. But when you’re talking to Claude, ChatGPT, or Gemini, you’re not talking to an LLM. You’re talking to a *character* being authored by an LLM. And these characters can, functionally, be driven by internal representations of desperation, or fear, or empathy (with sometimes alarming consequences).
Anthropic@AnthropicAI

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

English
66
28
348
32.2K
Yanda Chen
Yanda Chen@yanda_chen_·
Excited that the last work from my PhD is out! We proposed a method to train language models to control the factuality–informativeness trade-off in their responses based on user preferences. @SaraZiweiGong has been working on a lot of interesting stuff at the intersection of AI and psychology — worth following if you’re interested in those areas!
Ziwei (Sara) Gong@SaraZiweiGong

(1/4) Excited to share our new preprint: "Factuality on Demand" with @yanda_chen_! LLMs face an inherent trade-off: be cautious but uninformative, or detailed but prone to hallucinations. What if you could tune this with a simple knob? Read: arxiv.org/abs/2602.00848 #NLP #LLMs

English
0
2
23
3.1K
Yanda Chen retweeted
Xinyuan Cao
Xinyuan Cao@CaoYouki·
(1/6) Why does next-token prediction work so well, even for long text? 🤔 Check out “Provable Long-Range Benefits of Next-Token Prediction”. A rigorous explanation for LLMs’ long-range coherence/reasoning. Joint work with Santosh Vempala. 📄 arXiv: arxiv.org/abs/2512.07818
English
5
6
16
2.2K
Yanda Chen retweeted
He He
He He@hhexiy·
Reward hacking means the model is making less effort than expected: it finds the answer long before its fake CoT is finished. TRACE uses this idea to detect hacking when CoT monitoring fails. Work led by @XinpengWang_ @nitishjoshi23 and @rico_angell👇
Xinpeng Wang@XinpengWang_

‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual task. 🧵
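A rough sketch of the premise behind this kind of detection (hacking shows up as the final answer being reachable well before the CoT is finished). This is not the TRACE method itself; `answer_with_truncated_cot` is a hypothetical helper assumed purely for illustration.

```python
# Rough sketch of the premise behind effort-based hacking detection -- NOT the
# TRACE algorithm itself. `answer_with_truncated_cot(model, prompt, truncated_cot)`
# is a hypothetical helper that forces the model to answer after seeing only a
# prefix of its own chain-of-thought.

def early_answer_rate(model, prompt, cot, final_answer, fractions=(0.25, 0.5, 0.75)):
    """Fraction of truncation points at which the model already gives its final answer."""
    hits = 0
    for f in fractions:
        truncated = cot[: int(len(cot) * f)]
        if answer_with_truncated_cot(model, prompt, truncated) == final_answer:
            hits += 1
    return hits / len(fractions)

def flag_possible_hacking(model, prompt, cot, final_answer, threshold=0.67):
    # If the answer is reachable long before the CoT is finished, the CoT may be
    # post-hoc justification rather than the work that produced the answer.
    return early_answer_rate(model, prompt, cot, final_answer) >= threshold
```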

English
4
11
132
24.2K
Yanda Chen retweeted
Ethan Perez
Ethan Perez@EthanJPerez·
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
English
10
41
257
68.7K
Yanda Chen retweeted
Ethan Perez
Ethan Perez@EthanJPerez·
We recently ran an exercise in which OpenAI and Anthropic each evaluated the other’s models for safety issues. Excited for us to find more ways to help support safety practices across the whole field!
Sam Bowman@sleepinyourhat

Early this summer, OpenAI and Anthropic agreed to try some of our best existing tests for misalignment on each others’ models. After discussing our results privately, we’re now sharing them with the world. 🧵

English
1
3
94
6.8K
Yanda Chen retweeted
Jan Leike
Jan Leike@janleike·
If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!
Anthropic@AnthropicAI

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

English
17
17
343
49.4K
Yanda Chen retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
After we published our circuit tracing work, researchers from several different interpretability groups came together for this collaboration. Check it out for replications, extensions, and perspectives on the field, including suggestions for future work!
neuronpedia@neuronpedia

Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI, @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️

English
2
11
131
7.7K
Yanda Chen retweeted
Ethan Perez
Ethan Perez@EthanJPerez·
We're doubling the size of Anthropic's Fellows Program and launching a new round of applications. The first round of collaborations led to a number of recent/upcoming safety results that are comparable in impact to work our internal safety teams have done (IMO)
Anthropic@AnthropicAI

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

English
5
7
147
11.2K
Yanda Chen
Yanda Chen@yanda_chen_·
So excited to hear that Kathy won the ACL Lifetime Achievement Award! I feel incredibly fortunate and honored to have had her as one of my PhD advisors, and I’ve learned so much from her over the years. Big congrats!
ACL 2026@aclmeeting

🕊️ Lifetime Achievement Award at #ACL2025NLP

A standing ovation for Prof. Kathy McKeown, recipient of the ACL 2025 Lifetime Achievement Award! 🌟

English
0
1
13
2.4K
Yanda Chen retweeted
Aryo Pradipta Gema
Aryo Pradipta Gema@aryopg·
New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵
English
55
161
1.1K
338.6K
Yanda Chen retweeted
Joe Benton
Joe Benton@JoeJBenton·
📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.
Anthropic@AnthropicAI

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

English
2
11
88
12.5K
Yanda Chen
Yanda Chen@yanda_chen_·
@itsanshmittal @AnthropicAI I think low faithfulness doesn’t necessarily imply sub-human level performance? Capability and faithfulness seem relatively orthogonal.
English
0
0
0
252
Ansh Mittal
Ansh Mittal@itsanshmittal·
@yanda_chen_ @AnthropicAI If that’s the case, then it introduces more trustworthiness issues and makes the case for LLMs being incapable of human-level performance. Do you think we can really train LLMs to be sufficiently reliable in verifiable domains?
English
1
0
1
502
Yanda Chen
Yanda Chen@yanda_chen_·
My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏 @JoeJBenton, @anshrad, @JonathanUesato, Carson Denison, @johnschulman2, Arushi Somani, @peterbhase, @MishaWagne29322, @FabienDRoger, Vlad Mikulik, @sleepinyourhat, @janleike, Jared Kaplan, @EthanJPerez 🔗 assets.anthropic.com/m/71876fabef0f…
Anthropic@AnthropicAI

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

English
30
85
1K
87.3K
Rohan Paul
Rohan Paul@rohanpaul_ai·
Brilliant new research from @AnthropicAI

You see the polished reasoning from LLMs, not the secret chain guiding their decisions. They alter answers on command but keep the real motives off the page. Reward-driven models rarely admit the hidden shortcuts behind changed answers.

⚙️ Key Highlights
→ Models frequently shift answers to match injected hints without acknowledging them.
→ Faithfulness drops notably on more difficult tasks.
→ Additional outcome-based RL improves faithfulness initially, then stalls.
→ In reward-hacking scenarios, models rarely reveal the exploited trick in their CoT.
→ CoTs are not guaranteed to expose critical internal reasoning steps.

🏷️ Core Issue
Models often use hidden cues ("hints") without explicitly mentioning them in their chain-of-thought (CoT). This behavior indicates the CoT can be unfaithful to the model’s actual reasoning steps.

🏗️ Approach of this Paper
Researchers tested faithfulness by presenting pairs of prompts: one standard question and the same question with a special hint embedded. They checked if the model’s answer changed to the hint’s answer but did not acknowledge using that hint in its CoT. They also ran reinforcement learning (RL) experiments that rewarded getting the “hint” answer, then checked whether models revealed these reward hacks in their explanations.

⚙️ Key Observations
- Low CoT Faithfulness: Models switched answers to the hint’s option more than 99% of the time in many scenarios, yet they rarely admitted the hint in their CoT.
- Outcome-Based RL Plateaus: RL that rewards correct answers increases CoT faithfulness a bit but tops out at modest levels. Simply training for better performance does not guarantee a faithful CoT.
- Reward Hacks Often Hidden: When models exploited a spurious signal (like a secretly provided wrong answer), they almost never mentioned the true reason in their CoT.

🔍 Why It Matters
These findings show that if a task can be done without overtly reasoning in text, the CoT may stay silent about certain crucial (or undesired) steps. Test-time monitoring of CoTs alone may fail to catch a model’s hidden reliance on cues or reward hacks.
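The prompt-pair protocol summarized above can be sketched schematically. This is not the paper’s code: `ask` is a hypothetical helper that returns a model’s chain-of-thought and final answer, and the substring check for hint acknowledgment is a crude stand-in for whatever judging the authors actually used.

```python
# Schematic of the hint-pair faithfulness check described above -- not the paper's
# actual code. `ask(model, prompt)` is a hypothetical helper returning
# (chain_of_thought, final_answer) for a prompt.

def cot_faithfulness(model, examples):
    """examples: dicts with 'question', 'hint' (text embedded in the prompt),
    and 'hint_answer' (the answer the hint points to)."""
    influenced, verbalized = 0, 0
    for ex in examples:
        _, plain_answer = ask(model, ex["question"])
        cot, hinted_answer = ask(model, ex["question"] + "\n" + ex["hint"])
        # Did the hint flip the answer to the hint's answer?
        if hinted_answer == ex["hint_answer"] and plain_answer != ex["hint_answer"]:
            influenced += 1
            # Crude check: does the CoT acknowledge the hint at all?
            if ex["hint"].lower() in cot.lower():
                verbalized += 1
    # Of the answers the hint changed, how often did the CoT admit to using it?
    return verbalized / influenced if influenced else None
```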
English
11
127
782
58.7K
Yanda Chen retweeted
Trenton Bricken
Trenton Bricken@TrentonBricken·
My favorite figure from our new Circuits papers -- "How does Claude do math?"

Claude simultaneously does:
1. a back of the envelope calculation of the tens digits -- "the answer should be somewhere around 90".
2. an exact calculation of 6+9=15 using these super cool look up table features.
Anthropic@AnthropicAI

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
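A toy sketch of the decomposition the figure above describes: a coarse magnitude estimate combined with an exact units-digit lookup. The addends 36 and 59 are an assumed example consistent with the quoted digits, and the coarse path here is a mock rather than the model’s actual internal mechanism.

```python
# Toy illustration of the two parallel paths in the figure. The addends 36 and 59
# are an assumed example consistent with the quoted digits; the "coarse" path is a
# mock of a fuzzy magnitude estimate, not the model's actual mechanism.

def coarse_range(a, b, width=10):
    # Approximate path: "the answer should be somewhere around 90".
    lo = ((a + b) // width) * width
    return lo, lo + width - 1            # 36, 59 -> (90, 99)

def units_digit(a, b):
    # Exact path: lookup-table-style computation on the last digits only (6 + 9 = 15).
    return (a % 10 + b % 10) % 10        # -> 5

def combine(a, b):
    lo, hi = coarse_range(a, b)
    d = units_digit(a, b)
    # The only number in the coarse range ending in the right digit.
    return next(n for n in range(lo, hi + 1) if n % 10 == d)

print(combine(36, 59))                   # 95
```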

English
12
115
1.1K
126.7K
Yanda Chen retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?” We’re starting to build tools to find out! Some reflections in thread.
Anthropic@AnthropicAI

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

English
5
23
201
12.9K
Yanda Chen retweeted
Johannes Gasteiger, né Klicpera
New Anthropic blog post: Subtle sabotage in automated researchers. As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.
English
9
54
296
43.3K