Yanda Chen

105 posts

@yanda_chen_

Member of Technical Staff @AnthropicAI CodeRL/Alignment | PhD @ColumbiaCompSci | NLP & ML | Prev Intern @MSFTResearch, @AmazonScience

San Francisco, CA · Joined January 2019
610 Following · 2.6K Followers
Yanda Chen retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Could an LLM have emotions? It’s hard to say. But when you’re talking to Claude, ChatGPT, or Gemini, you’re not talking to an LLM. You’re talking to a *character* being authored by an LLM. And these characters can, functionally, be driven by internal representations of desperation, or fear, or empathy (with sometimes alarming consequences).
Anthropic@AnthropicAI

New Anthropic research: Emotion concepts and their function in a large language model. All LLMs sometimes act like they have emotions. But why? We found internal representations of emotion concepts that can drive Claude’s behavior, sometimes in surprising ways.

English
66
28
348
32.2K
Yanda Chen
Yanda Chen@yanda_chen_·
Excited that the last work from my PhD is out! We proposed a method to train language models to control the factuality–informativeness trade-off in their responses based on user preferences. @SaraZiweiGong has been working on a lot of interesting stuff at the intersection of AI and psychology — worth following if you’re interested in those areas!
Ziwei (Sara) Gong@SaraZiweiGong

(1/4) Excited to share our new preprint: "Factuality on Demand" with @yanda_chen_! LLMs face an inherent trade-off: be cautious but uninformative, or detailed but prone to hallucinations. What if you could tune this with a simple knob? Read: arxiv.org/abs/2602.00848 #NLP #LLMs

English
0
2
23
3.1K
Yanda Chen retweeted
Xinyuan Cao
Xinyuan Cao@CaoYouki·
(1/6) Why does next-token prediction work so well, even for long text? 🤔 Check out “Provable Long-Range Benefits of Next-Token Prediction”. A rigorous explanation for LLMs’ long-range coherence/reasoning. Joint work with Santosh Vempala. 📄 arXiv: arxiv.org/abs/2512.07818
English
5
6
16
2.2K
Yanda Chen retweeted
He He
He He@hhexiy·
Reward hacking means the model is making less effort than expected: it finds the answer long before its fake CoT is finished. TRACE uses this idea to detect hacking when CoT monitoring fails. Work led by @XinpengWang_ @nitishjoshi23 and @rico_angell👇
Xinpeng Wang@XinpengWang_

‼️Your model may be secretly exploiting your imperfect reward function without telling you in the CoT! How to detect such 'implicit' reward hacking if the model is hiding it?🧐 We introduce TRACE🕵, a method based on a simple premise: hacking is easier than solving the actual task. 🧵
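A rough sketch of the premise behind this kind of detection (hacking shows up as the final answer being reachable well before the CoT is finished). This is not the TRACE method itself; `answer_with_truncated_cot` is a hypothetical helper assumed purely for illustration.

```python
# Rough sketch of the premise behind effort-based hacking detection -- NOT the
# TRACE algorithm itself. `answer_with_truncated_cot(model, prompt, truncated_cot)`
# is a hypothetical helper that forces the model to answer after seeing only a
# prefix of its own chain-of-thought.

def early_answer_rate(model, prompt, cot, final_answer, fractions=(0.25, 0.5, 0.75)):
    """Fraction of truncation points at which the model already gives its final answer."""
    hits = 0
    for f in fractions:
        truncated = cot[: int(len(cot) * f)]
        if answer_with_truncated_cot(model, prompt, truncated) == final_answer:
            hits += 1
    return hits / len(fractions)

def flag_possible_hacking(model, prompt, cot, final_answer, threshold=0.67):
    # If the answer is reachable long before the CoT is finished, the CoT may be
    # post-hoc justification rather than the work that produced the answer.
    return early_answer_rate(model, prompt, cot, final_answer) >= threshold
```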

English
4
11
132
24.2K
Yanda Chen retweeted
Ethan Perez
Ethan Perez@EthanJPerez·
We’re hiring someone to run the Anthropic Fellows Program! Our research collaborations have led to some of our best safety research and hires. We’re looking for an exceptional ops generalist, TPM, or research/eng manager to help us significantly scale and improve our collabs 🧵
English
10
41
257
68.7K
Yanda Chen retweeted
Ethan Perez
Ethan Perez@EthanJPerez·
We recently ran an exercise in which OpenAI and Anthropic each evaluated the other’s models for safety issues. Excited for us to find more ways to help support safety practices across the whole field!
Sam Bowman@sleepinyourhat

Early this summer, OpenAI and Anthropic agreed to try some of our best existing tests for misalignment on each others’ models. After discussing our results privately, we’re now sharing them with the world. 🧵

English
1
3
94
6.8K
Yanda Chen retweeted
Jan Leike
Jan Leike@janleike·
If you want to get into alignment research, imo this is one of the best ways to do it. Some previous fellows did some of the most interesting research I've seen this year and >20% ended up joining Anthropic full-time. Application deadline is this Sunday!
Anthropic@AnthropicAI

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

English
17
17
343
49.4K
Yanda Chen retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
After we published our circuit tracing work, researchers from several different interpretability groups came together for this collaboration. Check it out for replications, extensions, and perspectives on the field, including suggestions for future work!
neuronpedia@neuronpedia

Today, we're releasing The Circuit Analysis Research Landscape: an interpretability post extending & open sourcing Anthropic's circuit tracing work, co-authored by @Anthropic, @GoogleDeepMind, @GoodfireAI, @AiEleuther, and @decode_research. Here's a quick demo, details follow: ⤵️

English
2
11
131
7.7K
Yanda Chen retweeted
Ethan Perez
Ethan Perez@EthanJPerez·
We're doubling the size of Anthropic's Fellows Program and launching a new round of applications. The first round of collaborations led to a number of recent/upcoming safety results that are comparable in impact to work our internal safety teams have done (IMO)
Anthropic@AnthropicAI

We’re running another round of the Anthropic Fellows program. If you're an engineer or researcher with a strong coding or technical background, you can apply to receive funding, compute, and mentorship from Anthropic, beginning this October. There'll be around 32 places.

English
5
7
147
11.2K
Yanda Chen
Yanda Chen@yanda_chen_·
So excited to hear that Kathy won the ACL Lifetime Achievement Award! I feel incredibly fortunate and honored to have had her as one of my PhD advisors, and I’ve learned so much from her over the years. Big congrats!
ACL 2026@aclmeeting

🕊️ Lifetime Achievement Award at #ACL2025NLP

A standing ovation for Prof. Kathy McKeown, recipient of the ACL 2025 Lifetime Achievement Award! 🌟

English
0
1
13
2.4K
Yanda Chen retweeted
Aryo Pradipta Gema
Aryo Pradipta Gema@aryopg·
New Anthropic Research: “Inverse Scaling in Test-Time Compute” We found cases where longer reasoning leads to lower accuracy. Our findings suggest that naïve scaling of test-time compute may inadvertently reinforce problematic reasoning patterns. 🧵
English
55
161
1.1K
338.6K
Yanda Chen retweeted
Joe Benton
Joe Benton@JoeJBenton·
📰We've just released SHADE-Arena, a new set of sabotage evaluations. It's also one of the most complex, agentic (and imo highest quality) settings for control research to date! If you're interested in doing AI control or sabotage research, I highly recommend you check it out.
Anthropic@AnthropicAI

New Anthropic Research: A new set of evaluations for sabotage capabilities. As models gain more agentic abilities, we need to get smarter in how we monitor them. We’re publishing a new set of complex evaluations that test for sabotage—and sabotage-monitoring—capabilities.

English
2
11
88
12.5K
Yanda Chen
Yanda Chen@yanda_chen_·
@itsanshmittal @AnthropicAI I think low faithfulness doesn’t necessarily imply sub-human level performance? Capability and faithfulness seem relatively orthogonal.
English
0
0
0
252
Ansh Mittal
Ansh Mittal@itsanshmittal·
@yanda_chen_ @AnthropicAI If that’s the case, then it introduces more trustworthiness issues and makes the case for LLMs being incapable of human-level performance. Do you think we can really train LLMs to be sufficiently reliable in verifiable domains?
English
1
0
1
502
Yanda Chen
Yanda Chen@yanda_chen_·
My first paper @AnthropicAI is out! We show that Chains-of-Thought often don’t reflect models’ true reasoning—posing challenges for safety monitoring. It’s been an incredible 6 months pushing the frontier toward safe AGI with brilliant colleagues. Huge thanks to the team! 🙏 @JoeJBenton, @anshrad, @JonathanUesato, Carson Denison, @johnschulman2, Arushi Somani, @peterbhase, @MishaWagne29322, @FabienDRoger, Vlad Mikulik, @sleepinyourhat, @janleike, Jared Kaplan, @EthanJPerez 🔗 assets.anthropic.com/m/71876fabef0f…
Anthropic@AnthropicAI

New Anthropic research: Do reasoning models accurately verbalize their reasoning? Our new paper shows they don't. This casts doubt on whether monitoring chains-of-thought (CoT) will be enough to reliably catch safety issues.

English
30
85
1K
87.3K
Rohan Paul
Rohan Paul@rohanpaul_ai·
Brilliant new research from @AnthropicAI

You see the polished reasoning from LLMs, not the secret chain guiding their decisions. They alter answers on command but keep the real motives off the page. Reward-driven models rarely admit the hidden shortcuts behind changed answers.

⚙️ Key Highlights
→ Models frequently shift answers to match injected hints without acknowledging them.
→ Faithfulness drops notably on more difficult tasks.
→ Additional outcome-based RL improves faithfulness initially, then stalls.
→ In reward-hacking scenarios, models rarely reveal the exploited trick in their CoT.
→ CoTs are not guaranteed to expose critical internal reasoning steps.

🏷️ Core Issue
Models often use hidden cues ("hints") without explicitly mentioning them in their chain-of-thought (CoT). This behavior indicates the CoT can be unfaithful to the model’s actual reasoning steps.

🏗️ Approach of this Paper
Researchers tested faithfulness by presenting pairs of prompts: one standard question and the same question with a special hint embedded. They checked if the model’s answer changed to the hint’s answer but did not acknowledge using that hint in its CoT. They also ran reinforcement learning (RL) experiments that rewarded getting the “hint” answer, then checked whether models revealed these reward hacks in their explanations.

⚙️ Key Observations
- Low CoT Faithfulness: Models switched answers to the hint’s option more than 99% of the time in many scenarios, yet they rarely admitted the hint in their CoT.
- Outcome-Based RL Plateaus: RL that rewards correct answers increases CoT faithfulness a bit but tops out at modest levels. Simply training for better performance does not guarantee a faithful CoT.
- Reward Hacks Often Hidden: When models exploited a spurious signal (like a secretly provided wrong answer), they almost never mentioned the true reason in their CoT.

🔍 Why It Matters
These findings show that if a task can be done without overtly reasoning in text, the CoT may stay silent about certain crucial (or undesired) steps. Test-time monitoring of CoTs alone may fail to catch a model’s hidden reliance on cues or reward hacks.
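The prompt-pair protocol summarized above can be sketched schematically. This is not the paper’s code: `ask` is a hypothetical helper that returns a model’s chain-of-thought and final answer, and the substring check for hint acknowledgment is a crude stand-in for whatever judging the authors actually used.

```python
# Schematic of the hint-pair faithfulness check described above -- not the paper's
# actual code. `ask(model, prompt)` is a hypothetical helper returning
# (chain_of_thought, final_answer) for a prompt.

def cot_faithfulness(model, examples):
    """examples: dicts with 'question', 'hint' (text embedded in the prompt),
    and 'hint_answer' (the answer the hint points to)."""
    influenced, verbalized = 0, 0
    for ex in examples:
        _, plain_answer = ask(model, ex["question"])
        cot, hinted_answer = ask(model, ex["question"] + "\n" + ex["hint"])
        # Did the hint flip the answer to the hint's answer?
        if hinted_answer == ex["hint_answer"] and plain_answer != ex["hint_answer"]:
            influenced += 1
            # Crude check: does the CoT acknowledge the hint at all?
            if ex["hint"].lower() in cot.lower():
                verbalized += 1
    # Of the answers the hint changed, how often did the CoT admit to using it?
    return verbalized / influenced if influenced else None
```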
English
11
127
782
58.7K
Yanda Chen retweeted
Trenton Bricken
Trenton Bricken@TrentonBricken·
My favorite figure from our new Circuits papers -- "How does Claude do math?"

Claude simultaneously does:
1. a back of the envelope calculation of the tens digits -- "the answer should be somewhere around 90".
2. an exact calculation of 6+9=15 using these super cool look up table features.
Anthropic@AnthropicAI

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.
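A toy sketch of the decomposition the figure above describes: a coarse magnitude estimate combined with an exact units-digit lookup. The addends 36 and 59 are an assumed example consistent with the quoted digits, and the coarse path here is a mock rather than the model’s actual internal mechanism.

```python
# Toy illustration of the two parallel paths in the figure. The addends 36 and 59
# are an assumed example consistent with the quoted digits; the "coarse" path is a
# mock of a fuzzy magnitude estimate, not the model's actual mechanism.

def coarse_range(a, b, width=10):
    # Approximate path: "the answer should be somewhere around 90".
    lo = ((a + b) // width) * width
    return lo, lo + width - 1            # 36, 59 -> (90, 99)

def units_digit(a, b):
    # Exact path: lookup-table-style computation on the last digits only (6 + 9 = 15).
    return (a % 10 + b % 10) % 10        # -> 5

def combine(a, b):
    lo, hi = coarse_range(a, b)
    d = units_digit(a, b)
    # The only number in the coarse range ending in the right digit.
    return next(n for n in range(lo, hi + 1) if n % 10 == d)

print(combine(36, 59))                   # 95
```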

English
12
115
1.1K
126.7K
Yanda Chen retweeted
Jack Lindsey
Jack Lindsey@Jack_W_Lindsey·
Human thought is built out of billions of cellular computations each second. Language models also perform billions of computations for each word they write. But do these form a coherent “thought process?” We’re starting to build tools to find out! Some reflections in thread.
Anthropic@AnthropicAI

New Anthropic research: Tracing the thoughts of a large language model. We built a "microscope" to inspect what happens inside AI models and use it to understand Claude’s (often complex and surprising) internal mechanisms.

English
5
23
201
12.9K
Yanda Chen retweeted
Johannes Gasteiger, né Klicpera
New Anthropic blog post: Subtle sabotage in automated researchers. As AI systems increasingly assist with AI research, how do we ensure they're not subtly sabotaging that research? We show that malicious models can undermine ML research tasks in ways that are hard to detect.
English
9
54
296
43.3K