Abhay Sheshadri

165 posts

@abhayesian

AI Safety/Security Researcher

Joined August 2016

1.3K Following · 586 Followers

Abhay Sheshadri@abhayesian·1d

@KhoriatyMatthew @ejcgan In the post, we explore what happens when the side task and main task get decoupled, for the same reason you're describing. It usually happens at higher learning rates. Decoupling halts further progress on the side task, but doesn't erase the gains already made.

Matthew Khoriaty @ ICML 2026@KhoriatyMatthew·1d

@abhayesian @ejcgan Does this stably target subset-sum in the limit or will it eventually accidentally not do subset sum, or do it wrong, and get rewarded, erasing this behavior?

Abhay Sheshadri@abhayesian·1d

New Redwood Research post: LLMs can use RL training on one task to teach themselves a different, unrelated capability. In a form of exploration hacking, the model outputs answers that get reward only when it succeeds at a side task it wants to learn. We call it reward laundering.

Abhay Sheshadri@abhayesian·1d

You can read the full post, with additional results, here: lesswrong.com/posts/fPWP4rHP…

Abhay Sheshadri@abhayesian·1d

We think it's at the rigor of a mid-MATS research update. We checked correctness mostly by reviewing writeups, running an automated LLM reviewer, and spot-checking that the released codebase reproduced the results.

Abhay Sheshadri reposted

Howard Lutnick@howardlutnick·Jul 23

CAISI’s latest report shows that Kimi K3 remains behind America’s leading frontier AI models. The United States continues to lead in frontier AI because we’re home to the greatest innovators and technologists the world has ever seen.

U.S. Department of Commerce@CommerceGov

CAISI’s latest blog post evaluates Kimi K3 and its cyber capabilities. Based on a preliminary cyber-focused evaluation, Kimi K3 performed significantly below the leading U.S. frontier AI models. nist.gov/news-events/ne…

Abhay Sheshadri reposted

Dave Banerjee@DaveRBanerjee·Jul 22

Redwood Research just published an AI research project where the experiments were designed and run entirely by an automated research agent. They rate the work "at or slightly below the rigor of a typical MATS project." My friends in DC, please take recursive self-improvement seriously. If AIs can do research at the level of a junior researcher today, expect the rate of AI progress to accelerate. The AIs of 2027 may be unrecognizable from what we have today. lesswrong.com/posts/QL6Si6QA…

Abhay Sheshadri@abhayesian·Jun 19

@a_karvonen @Alan_Cooney_ I've seen several projects where 70B massively improved on null results from newer but smaller models. But I don't think anyone has rigorously compared it with the best new open weights models.

Adam Karvonen@a_karvonen·Jun 18

@Alan_Cooney_ I mostly agree. I believe that I heard Llama 70B is better at picking up on e.g. complex fine tuned behaviors though? @abhayesian

Alan Cooney@Alan_Cooney_·Jun 18

Time to drop Llama 70B? Models like Qwen 3.6 27B are excellent to study - smarter, smaller, easy to train and have reasoning🧵

Abhay Sheshadri@abhayesian·Jun 2

@miclchen congrats!

Michael L. Chen@miclchen·Jun 1

I'm honored to be one of the few Americans chosen for the AI Scientific Panel. I'm excited to contribute technical expertise here and help make sure U.S. perspectives are represented. AI policy for the most capable models can be more thoughtful when there's pragmatic, independent analysis to inform it.

Abhay Sheshadri reposted

Keshav@kshenoy_·Apr 28

Can LLMs simply tell us about unwanted behaviors they’ve picked up in training? We train a single Introspection Adapter (IA) that makes fine-tuned models describe their behaviors. It generalizes to detecting hidden misalignment, backdoors and safeguard removal.

Abhay Sheshadri reposted

Micah Carroll@MicahCarroll·Mar 10

Very curious to see if @OpenAI's auditing pipelines catch these agents! Thank you for doing this work which materially helps with race-to-the-top dynamics

Abhay Sheshadri@abhayesian

New Anthropic Fellows research: Alignment auditing—investigating AI models for unwanted behaviors—is a key challenge for safely deploying frontier models. We're releasing AuditBench, a suite of 56 LLMs with implanted hidden behaviors to measure progress in alignment auditing.

Abhay Sheshadri@abhayesian·Mar 10

📚 Paper: arxiv.org/abs/2602.22755…
📝 Blogpost: alignment.anthropic.com/2026/auditbenc…
🤖 Models & Datasets: huggingface.co/auditing-agents
💻 Codebase: github.com/safety-researc…

Abhay Sheshadri@abhayesian·Mar 10

We’re releasing the AuditBench models, investigator agent, evaluation framework and training pipelines. We hope the community can use these resources to help develop alignment auditing into a quantitative and iterative science.

Abhay Sheshadri@abhayesian·Mar 10
