Joachim Baumann @ ICLR'26

143 posts

@joabaum

Postdoc @StanfordNLP @StanfordAILab / Prev: @MilaNLProc @UZH_en @MPI_IS @CarnegieMellon. CompSocSci, LLMs, algorithmic fairness.

Stanford · Joined February 2021
1.4K Following · 764 Followers
Pinned Tweet
Joachim Baumann @ ICLR'26
We present SWE-chat: the first large-scale dataset of coding agent interactions from real users in the wild. In 40% of real coding sessions, the agent writes ~all the code. Users push back 39% of the time – agents almost never stop to check. Data, paper, & findings in the 🧵👇
Joachim Baumann @ ICLR'26 retweeted
Shashwat Goel @ShashwatGoel7
Continual learning is bottlenecked by realistic evaluations. Introducing FutureSim, which replays real-world events in the temporal order they occurred. We benchmark frontier agents at updating predictions about how our world evolves, in native harnesses like Codex and Claude Code.
Joachim Baumann @ ICLR'26 retweeted
Lujain Ibrahim @lujainmibrahim
New preprint! In 5 studies (3k+ users / 12k+ convs, with a 3-wk longitudinal study), we find that sycophantic AI influences how people view those closest to them. It affects how effortful human interaction seems, how satisfying it is, & who people want to turn to for advice 🧵
Joachim Baumann @ ICLR'26 retweeted
Kevin Li @kevin_x_li
Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages huggingface.co/datasets/Alien…
Changyu Chen @Cameron_Chann
Life update: I'm super excited to join @Stanford as a postdoc working with @Diyi_Yang ! I’ll continue my research on RL, and recently I’ve become especially interested in how RL can contribute to human-AI collaboration and collaborative agents. A new chapter begins, from the sunny island to the sunny state ☀️🏝️
EMNLP 2026 @emnlpmeeting
Submitting to ARR for #EMNLP2026? We're running an opt-in AI Reviewing Experiment. Help us test AI-generated reviews during your ARR submission. 🤖 ✅ Reviewers, ACs, and SACs will not be able to see it ✅ Will not affect decisions 🔗 Read more: 2026.emnlp.org/ai-reviewing-e…
Joachim Baumann @ ICLR'26
@emnlpmeeting Great to see ARR running a rigorous evaluation of AI reviewing! That's exactly the kind of experiment our ICML position paper called for: x.com/joabaum/status…
Joachim Baumann @ ICLR'26 @joabaum

Can you boost your AI review scores by asking an LLM to rewrite your paper? Yes! We call it paper laundering. Our @icmlconf spotlight paper argues current AI reviewers aren't ready to automate peer review, and outlines what a science of peer review automation should look like. 🧵👇

Joachim Baumann @ ICLR'26 retweeted
Danish Pruthi @danish037
I believe one of the most important problems is to detect the nature and extent of AI use. Take paper reviewing, for example, where many conferences allow reviewers to use LLMs to polish their reviews but not to generate their contents. But can such polishing-only policies even be enforced? Our recent #ICML paper answers this question in the negative, showing that even the best AI-text detectors misclassify a non-trivial fraction of LLM-polished reviews as fully AI-generated. This is work led by my amazing students: Rounak Saha (@ahaskanuor), Dayita Chaudhuri (@doyitach) and Naveeja Sajeevan, in collaboration with @GurushaJuneja and Nihar Shah. (1/n)🧵
Joachim Baumann @ ICLR'26 retweeted
Scale Labs @ScaleAILabs
We’ve been sharing a lot lately on where coding agents are headed — now we want to hear from the people building them. If you’re in San Francisco working on coding agents, come hang with us next Wednesday, May 13 at our SFHQ for food, drinks, and convos around all things agentic code. 🤝
Joachim Baumann @ ICLR'26 retweeted
Percy Liang @percyliang
I find myself repeatedly explaining the difference between open-weight (DeepSeek), open-source (Olmo), and open-development (Marin). Let's see if this restaurant analogy helps:
- Open-weight: food is made behind closed doors, server brings you the dish
- Open-source: food is made behind closed doors, server brings you the dish and the recipe
- Open-development: you see the chef make the dish in the kitchen (and can shout suggestions while it's cooking)!
Joachim Baumann @ ICLR'26 retweeted
Kilian Lieret @KLieret
Introducing ProgramBench: 200 whole-repo generation tasks rigorously evaluated in cleanroom settings (no internet, no decompilation, no leaked source, no systracing, ...). Best score is **0**.
John Yang @jyangballin

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

Joachim Baumann @ ICLR'26 retweeted
John Yang @jyangballin
How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
Michael Hu @michahu8
@AlexiGlad Best part is the slop cannons review your papers
Alexi Gladstone @AlexiGlad
looks like there's gonna be around 40k neurips submissions? the biggest exponential in ai right now is slop
Joachim Baumann @ ICLR'26 retweeted
Augmented Mind Podcast @augmind_fm
“In the past, with social media or web search, you are like, here are some specific keywords, here are some posts that I am okay to share with the world; whereas with AI, it feels like you are private, it feels like you are talking to an entity that won’t reveal your information.”

For EP4, we welcome @kenziyuliu, Stanford CS PhD student and creator of The Open Anonymity Project. Ken approaches AI privacy from angles most researchers don't: deep learning, applied cryptography, privacy technologies, and real human behavior all at once. In this episode, he shares how to achieve provable private AI inference, why today's agents are a privacy nightmare (and how to fix it), his vision on intelligence neutrality, and more.

0:00 - Teaser
1:08 - Prelude: Introducing Ken Liu
1:41 - Monologue: The Open Anonymity Project
3:41 - Ken’s Path to Privacy Research
6:31 - The Biggest Privacy Concern for LLM Users
9:39 - Three Perspectives on Tackling AI Privacy
10:57 - “AI Presents a Uniquely Worse Privacy Problem”
13:44 - The Open Anonymity (OA) Project: Unlinkable Inference
17:50 - Blind Signatures as Unlinkable Authentication
20:52 - Secure Inference Proxies
28:31 - Threat Model in the OA Project
31:39 - What If People Give Away Information In Their Prompts
35:58 - OpenClaw, Privacy Nightmare In Agents
43:00 - The Stories Behind the OA Project
50:14 - Intelligence Neutrality
52:22 - Safety Concerns in a World with Private AI Inference
Joachim Baumann @ ICLR'26
@hugo_larochelle we don't have a human control, so we can't rule this out. our point is simply that we should aspire to build non-gameable AI reviewers if we intend to add those to the peer review process. and currently, AI reviewers are being deployed without transparent, rigorous evaluation
Hugo Larochelle @hugo_larochelle
@joabaum I agree about human oversight. I guess my point is that it is not clear that this experiment wouldn't also replicate with human reviewers.
Joachim Baumann @ ICLR'26
Can you boost your AI review scores by asking an LLM to rewrite your paper? Yes! We call it paper laundering. Our @icmlconf spotlight paper argues current AI reviewers aren't ready to automate peer review, and outlines what a science of peer review automation should look like. 🧵👇
Hugo Larochelle @hugo_larochelle
@joabaum Oh, and in case that's interesting: one data point supporting the value of diversity in points of view from our work on ReviewerToo is that we get the best results when we pool more simulated reviewing personas in our system arxiv.org/abs/2510.08867