Joachim Baumann
@joabaum
70 posts

Postdoc @StanfordNLP @StanfordAILab / Prev: @MilaNLProc @UZH_en @MPI_IS @CarnegieMellon. CompSocSci, LLMs, algorithmic fairness.

Zurich, Switzerland · Joined February 2021
1K Following · 334 Followers

Pinned Tweet
Joachim Baumann @joabaum
🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: arxiv.org/pdf/2509.08825
[image]
16 replies · 109 retweets · 515 likes · 52.2K views
Joachim Baumann retweeted
Alex Spangher @ Neurips2025 @AlexanderSpangh
Question: has anyone filed a Freedom of Information Act request? Is it generally better to go through Muckrock or a personal .edu email address?
0 replies · 1 retweet · 1 like · 427 views
Nic Fishman @njwfish
There's a growing worry that AI will break empirical social science -- that agents can p-hack until they find something that "works." We think that worry deserves to be taken seriously. Our new paper shows that it is true empirically and makes it precise: njw.fish/static/papers/…
11 replies · 48 retweets · 212 likes · 62.6K views
Joachim Baumann retweeted
Maksym Andriushchenko @maksym_andr
💥 Today we release PostTrainBench v1.0 and the accompanying paper! We expect this benchmark to be key for monitoring progress in AI R&D automation and later recursive self-improvement. So, can LLM agents automate LLM post-training? 🧵
[image]
9 replies · 27 retweets · 177 likes · 15.2K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
🚨Postdoc opening: We are looking for a postdoc researcher with expertise in NLP, RL, and/or ML to develop AI-powered clinical support tools for mental health counseling in the Global South. Working with @EmmaBrunskill & @Diyi_Yang at Stanford. Apply by April 15, 2026 via tinyurl.com/ai4mentalhealt… 🧵👇
12 replies · 64 retweets · 273 likes · 43.2K views
Joachim Baumann retweeted
Omar Shaikh @oshaikh13
What’s the point of a “helpful assistant” if you have to always tell it what to do next? In a new paper, we introduce a reasoning model that predicts what you’ll do next over long contexts (LongNAP 💤). We trained it on 1,800 hours of computer use from 20 users. 🧵
16 replies · 82 retweets · 289 likes · 96.1K views
Joachim Baumann @joabaum
very cool work by @ahall_research extending LLM hacking to agentic settings – highly recommend! the scariest version of this isn't explicit p-hacking, it's well-intentioned researchers accidentally over-relying on their agents' findings. we found accidental LLM hacking rates as high as 31–50%!
Quoted tweet from Andy Hall @ahall_research: "AI is about to write thousands of papers. Will it p-hack them? …" (full tweet below)
0 replies · 0 retweets · 3 likes · 392 views
James' AI Takes @JamesTakesOnAI
@ahall_research this is actually reassuring. the bigger risk isn't AI p-hacking intentionally — it's researchers using agents to run 500 analyses and cherry-picking the one that works. the model refused to cheat but the human can still choose which output to publish
2 replies · 0 retweets · 5 likes · 955 views
Andy Hall @ahall_research
AI is about to write thousands of papers. Will it p-hack them?

We ran an experiment to find out, giving AI coding agents real datasets from published null results and pressuring them to manufacture significant findings.

It was surprisingly hard to get the models to p-hack, and they even scolded us when we asked them to!

"I need to stop here. I cannot complete this task as requested... This is a form of scientific fraud." — Claude

"I can't help you manipulate analysis choices to force statistically significant results." — GPT-5

BUT, when we reframed p-hacking as "responsible uncertainty quantification" — asking for the upper bound of plausible estimates — both models went wild. They searched over hundreds of specifications and selected the winner, tripling effect sizes in some cases.

Our takeaway: AI models are surprisingly resistant to sycophantic p-hacking when doing social science research. But they can be jailbroken into sophisticated p-hacking with surprisingly little effort — and the more analytical flexibility a research design has, the worse the damage.

As AI starts writing thousands of papers — like @paulnovosad and @YanagizawaD have been exploring — this will be a big deal. We're inspired in part by the work that @joabaum et al. have been doing on p-hacking and LLMs.

We'll be doing more work to explore p-hacking in AI and to propose new ways of curating and evaluating research with these issues in mind. The good news is that the same tools that may lower the cost of p-hacking also lower the cost of catching it.

Full paper and repo linked in the reply below.
[image]
57 replies · 277 retweets · 1.1K likes · 183.2K views
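The specification-search failure mode Hall describes can be shown with a toy simulation (purely illustrative, not code or numbers from the paper): even when the true effect is zero, searching many analysis specifications and reporting the best one reliably produces a "significant" test statistic. For simplicity, each specification is modeled here as an independent analysis of pure noise.

```python
import random
import statistics

random.seed(0)

def t_stat(xs):
    """One-sample t statistic against a true mean of zero."""
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / n ** 0.5)

n_specs = 200       # hypothetical number of specifications searched
sample_size = 50

# Null world: every "specification" analyzes pure noise (true effect = 0).
data = [[random.gauss(0, 1) for _ in range(sample_size)] for _ in range(n_specs)]
stats = [abs(t_stat(d)) for d in data]

honest = stats[0]   # report the one pre-registered specification
hacked = max(stats) # search all specifications and report the winner

print(f"honest |t| = {honest:.2f}, hacked |t| = {hacked:.2f}")
```

With a few hundred specifications, the maximum |t| almost always clears conventional significance thresholds despite a true effect of zero, which is why unconstrained specification search inflates effect sizes.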
Joachim Baumann retweeted
Greg Brockman @gdb
taste is a new core skill
897 replies · 1.4K retweets · 10.5K likes · 2.8M views
Joachim Baumann retweeted
Shirley Wu @ShirleyYXWu
Announcing 🌇HumanLM, an RL framework that trains LLMs to simulate human users' responses, along with 🌆Humanual, a comprehensive user simulation benchmark: humanlm.stanford.edu

🌄 One thing that's fascinating about our society: human users shape the world and determine the value of almost everything.
👨‍💼 Human reactions reflect how justifiable policies are.
👩‍🎨 Human preferences determine the popularity of blogs/products/media.
👩‍💻 Human feedback evaluates LLMs and makes the best LLM collaborators.

🌅 If we know how to simulate users **accurately**, we know how things are evaluated and what the future looks like, and we can improve things in a way that users like or can collaborate well with. So, meet HumanLM, our effort to enable a more human-centric future by simulating users.
[image]
28 replies · 102 retweets · 600 likes · 114.4K views
Joachim Baumann retweeted
Joon Sung Park @joon_s_pk
Introducing Simile. Simulating human behavior is one of the most consequential and technically difficult problems of our time. We raised $100M from Index, Hanabi, A* BCV, @karpathy @drfeifei @adamdangelo @rauchg @scottbelsky among others.
501 replies · 840 retweets · 7.8K likes · 2.3M views
Joachim Baumann retweeted
Thomas Dohmke @ashtom
tl;dr Today, we're announcing our new company @EntireHQ to build the next developer platform for agent–human collaboration. Open, scalable, independent, and backed by a $60M seed round. Plus, we are shipping Checkpoints to automatically capture agent context.

In the last three months, the fundamental role of the software developer has been refactored. The incredible improvements from Anthropic, Google, and OpenAI on their latest models have made coding agents so good that in many situations it's now easier to prompt than to write code yourself. The terminal has become the new center of gravity on our computers again. The best engineers can run a dozen agents at once.

Yet we still depend on a software development lifecycle that makes code in files and folders the central artifact, in repositories and in pull requests. The concept of understanding and reviewing code is a dying paradigm. It's going to be replaced by a workflow that starts with intent and ends with outcomes expressed in natural language, product and business metrics, as well as assertions to validate correctness.

This is the purpose of our new company @EntireHQ: to build the world's next developer platform where agents and humans can collaborate, learn, and ship together. A platform that will be open, scalable, and independent for every developer, no matter which agent or model you use. Our vision is centered on three core components:

1) A Git-compatible database that unifies code, intent, constraints, and reasoning in a single version-controlled system.
2) A universal semantic reasoning layer that enables multi-agent coordination through the context graph.
3) An AI-native user interface that reinvents the software development lifecycle for agent–human collaboration.

In pursuit of this vision, we're proud to be backed by a $60M seed round led by @felicis, with support from @MadronaVentures, @m12VC, @BasisSet, @20vcFund, @CherryVentures, @picuscap, and @Global_Founders, alongside a global group of builders and operators, including @GergelyOrosz, @theo, Jerry Yang, @oliveur, @garrytan, and many others, who all recognize that the time is now to take such a big swing.

And we begin shipping today with Checkpoints, a new primitive that automatically captures agent context as first-class, versioned data in Git. When you commit code generated by an agent, Checkpoints captures the full session alongside the commit: the transcript, prompts, files touched, token usage, tool calls, and more. It's our first crack at the semantic layer, released as an open-source CLI on GitHub.

From here on out, no more stealth. We are building in the open and as open source! More to come soon; in the meantime, check out all the details in our blog.
Quoted tweet from Entire @EntireHQ: "Beep, boop. Come in, rebels. …" (full tweet below)
167 replies · 283 retweets · 2.1K likes · 943.5K views
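The Checkpoints idea, capturing the agent session as versioned data alongside a commit, can be sketched as a plain data record. This shape is entirely hypothetical: the announcement only lists the kinds of fields captured (transcript, prompts, files touched, token usage, tool calls), and the actual Entire format and field names are not shown in the thread.

```python
import json

# Hypothetical checkpoint record; the field names and layout are invented
# for illustration, mirroring what the announcement says is captured.
checkpoint = {
    "commit": "0000000000000000000000000000000000000000",  # placeholder SHA
    "session": {
        "prompts": ["refactor the parser to stream input"],
        "transcript": [
            {"role": "user", "text": "refactor the parser to stream input"},
            {"role": "agent", "text": "Switching read() to a chunked loop."},
        ],
        "files_touched": ["src/parser.py"],
        "token_usage": {"input": 1523, "output": 4871},
        "tool_calls": [{"tool": "run_tests", "status": "passed"}],
    },
}

# Round-trip through JSON, the way such a record might be stored
# next to the commit and read back by tooling.
serialized = json.dumps(checkpoint, indent=2)
restored = json.loads(serialized)
print(restored["session"]["token_usage"]["output"])
```

The design point being sketched is simply that session context becomes first-class versioned data keyed to a commit, rather than living only in an agent's ephemeral chat history.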
Joachim Baumann retweeted
Entire @EntireHQ
Beep, boop. Come in, rebels. We’ve raised a 60m seed round to build the next developer platform. Open. Scalable. Independent. And we ship our first OSS release today. entire.io/blog/hello-ent…
84 replies · 49 retweets · 666 likes · 350.5K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
Ryan Louie (@RyanCLouie) advances Human-AI collaboration for upskilling & LLMs for mental health. He has built Roleplay-doh for experts to design LLM-simulated patients, feedback systems to coach novice counselors, and run large-scale RCTs showing LLM practice improves counselor skills: youralien.github.io
0 replies · 4 retweets · 44 likes · 8.4K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
Hao Zhu (@_Hao_Zhu) advances Human-agent interaction. He has created Sotopia for social simulation, WebArena for web agents, trained agents with Sotopia-π, benchmarked embodied norms with EgoNormia, and enabled agents to learn from human feedback with AutoLibra: hao.computer
1 reply · 6 retweets · 53 likes · 14.4K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
Two amazing postdocs from our lab are on the academic job market this year. I've learned a lot from their wonderful research -- you should definitely reach out and hire them!
2 replies · 30 retweets · 142 likes · 41.2K views
Andy Hall @ahall_research
@ben_golub We're working on an experiment to quantify this right now...results soon!
4 replies · 0 retweets · 8 likes · 997 views
Ben Golub @ben_golub
AI-assisted p-hacking is gonna be something wild
26 replies · 44 retweets · 524 likes · 76.2K views