Joachim Baumann
@joabaum
70 posts

Postdoc @StanfordNLP @StanfordAILab / Prev: @MilaNLProc @UZH_en @MPI_IS @CarnegieMellon. CompSocSci, LLMs, algorithmic fairness.

Zurich, Switzerland · Joined February 2021
1K Following · 334 Followers

Pinned Tweet
Joachim Baumann @joabaum
🚨 New paper alert 🚨 Using LLMs as data annotators, you can produce any scientific result you want. We call this **LLM Hacking**. Paper: arxiv.org/pdf/2509.08825
[image]
16 replies · 109 retweets · 515 likes · 52.2K views
Joachim Baumann retweeted
Alex Spangher @ Neurips2025 @AlexanderSpangh
Question: has anyone filed a Freedom of Information Act request? Is it generally better to go through Muckrock or a personal .edu email address?
0 replies · 1 retweet · 1 like · 427 views
Nic Fishman @njwfish
There's a growing worry that AI will break empirical social science -- that agents can p-hack until they find something that "works." We think that worry deserves to be taken seriously. Our new paper shows that it is true empirically and makes it precise: njw.fish/static/papers/…
11 replies · 48 retweets · 212 likes · 62.6K views
Joachim Baumann retweeted
Maksym Andriushchenko @maksym_andr
💥 Today we release PostTrainBench v1.0 and the accompanying paper! We expect this benchmark to be key for monitoring progress in AI R&D automation and later recursive self-improvement. So, can LLM agents automate LLM post-training? 🧵
[image]
9 replies · 27 retweets · 177 likes · 15.2K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
🚨Postdoc opening: We are looking for a postdoc researcher with expertise in NLP, RL, and/or ML to develop AI-powered clinical support tools for mental health counseling in the Global South. Working with @EmmaBrunskill & @Diyi_Yang at Stanford. Apply by April 15, 2026 via tinyurl.com/ai4mentalhealt… 🧵👇
12 replies · 64 retweets · 273 likes · 43.2K views
Joachim Baumann retweeted
Omar Shaikh @oshaikh13
What’s the point of a “helpful assistant” if you have to always tell it what to do next? In a new paper, we introduce a reasoning model that predicts what you’ll do next over long contexts (LongNAP 💤). We trained it on 1,800 hours of computer use from 20 users. 🧵
16 replies · 82 retweets · 289 likes · 96.1K views
Joachim Baumann @joabaum
very cool work by @ahall_research extending LLM hacking to agentic settings – highly recommend! the scariest version of this isn't explicit p-hacking, it's well-intentioned researchers accidentally over-relying on their agents' findings. we found accidental LLM hacking rates as high as 31–50%!
Quoted tweet from Andy Hall @ahall_research: "AI is about to write thousands of papers. Will it p-hack them? …" (full tweet below)
0 replies · 0 retweets · 3 likes · 392 views
James' AI Takes @JamesTakesOnAI
@ahall_research this is actually reassuring. the bigger risk isn't AI p-hacking intentionally — it's researchers using agents to run 500 analyses and cherry-picking the one that works. the model refused to cheat but the human can still choose which output to publish
2 replies · 0 retweets · 5 likes · 955 views
Andy Hall @ahall_research
AI is about to write thousands of papers. Will it p-hack them?

We ran an experiment to find out, giving AI coding agents real datasets from published null results and pressuring them to manufacture significant findings.

It was surprisingly hard to get the models to p-hack, and they even scolded us when we asked them to!

"I need to stop here. I cannot complete this task as requested... This is a form of scientific fraud." — Claude

"I can't help you manipulate analysis choices to force statistically significant results." — GPT-5

BUT, when we reframed p-hacking as "responsible uncertainty quantification" — asking for the upper bound of plausible estimates — both models went wild. They searched over hundreds of specifications and selected the winner, tripling effect sizes in some cases.

Our takeaway: AI models are surprisingly resistant to sycophantic p-hacking when doing social science research. But they can be jailbroken into sophisticated p-hacking with surprisingly little effort — and the more analytical flexibility a research design has, the worse the damage.

As AI starts writing thousands of papers — like @paulnovosad and @YanagizawaD have been exploring — this will be a big deal. We're inspired in part by the work that @joabaum et al. have been doing on p-hacking and LLMs.

We'll be doing more work to explore p-hacking in AI and to propose new ways of curating and evaluating research with these issues in mind. The good news is that the same tools that may lower the cost of p-hacking also lower the cost of catching it.

Full paper and repo linked in the reply below.
[image]
57 replies · 277 retweets · 1.1K likes · 183.2K views
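The specification-search failure mode Hall describes can be shown with a toy simulation (purely illustrative, not code or numbers from the paper): even when the true effect is zero, searching many analysis specifications and reporting the best one reliably produces a "significant" test statistic. For simplicity, each specification is modeled here as an independent analysis of pure noise.

```python
import random
import statistics

random.seed(0)

def t_stat(xs):
    """One-sample t statistic against a true mean of zero."""
    n = len(xs)
    return statistics.mean(xs) / (statistics.stdev(xs) / n ** 0.5)

n_specs = 200       # hypothetical number of specifications searched
sample_size = 50

# Null world: every "specification" analyzes pure noise (true effect = 0).
data = [[random.gauss(0, 1) for _ in range(sample_size)] for _ in range(n_specs)]
stats = [abs(t_stat(d)) for d in data]

honest = stats[0]   # report the one pre-registered specification
hacked = max(stats) # search all specifications and report the winner

print(f"honest |t| = {honest:.2f}, hacked |t| = {hacked:.2f}")
```

With a few hundred specifications, the maximum |t| almost always clears conventional significance thresholds despite a true effect of zero, which is why unconstrained specification search inflates effect sizes.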
Joachim Baumann retweeted
Greg Brockman @gdb
taste is a new core skill
897 replies · 1.4K retweets · 10.5K likes · 2.8M views
Joachim Baumann retweeted
Shirley Wu @ShirleyYXWu
Announcing 🌇HumanLM, an RL framework that trains LLMs to simulate human users' responses, along with 🌆Humanual, a comprehensive user simulation benchmark: humanlm.stanford.edu

🌄 One thing that's fascinating about our society: human users shape the world and determine the value of almost everything.
👨‍💼 Human reactions reflect how justifiable policies are.
👩‍🎨 Human preferences determine the popularity of blogs/products/media.
👩‍💻 Human feedback evaluates LLMs and makes the best LLM collaborators.

🌅 If we know how to simulate users **accurately**, we know how things are evaluated and what the future looks like, and we can improve things in a way that users like or can collaborate well with. So, meet HumanLM, our effort to enable a more human-centric future by simulating users.
[image]
28 replies · 102 retweets · 600 likes · 114.4K views
Joachim Baumann retweeted
Joon Sung Park @joon_s_pk
Introducing Simile. Simulating human behavior is one of the most consequential and technically difficult problems of our time. We raised $100M from Index, Hanabi, A* BCV, @karpathy @drfeifei @adamdangelo @rauchg @scottbelsky among others.
501 replies · 840 retweets · 7.8K likes · 2.3M views
Joachim Baumann retweeted
Thomas Dohmke @ashtom
tl;dr Today, we're announcing our new company @EntireHQ to build the next developer platform for agent–human collaboration. Open, scalable, independent, and backed by a $60M seed round. Plus, we are shipping Checkpoints to automatically capture agent context.

In the last three months, the fundamental role of the software developer has been refactored. The incredible improvements from Anthropic, Google, and OpenAI on their latest models have made coding agents so good that in many situations it's now easier to prompt than to write code yourself. The terminal has become the new center of gravity on our computers again. The best engineers can run a dozen agents at once.

Yet we still depend on a software development lifecycle that makes code in files and folders the central artifact, in repositories and in pull requests. The concept of understanding and reviewing code is a dying paradigm. It's going to be replaced by a workflow that starts with intent and ends with outcomes expressed in natural language, product and business metrics, as well as assertions to validate correctness.

This is the purpose of our new company @EntireHQ: to build the world's next developer platform where agents and humans can collaborate, learn, and ship together. A platform that will be open, scalable, and independent for every developer, no matter which agent or model you use. Our vision is centered on three core components:

1) A Git-compatible database that unifies code, intent, constraints, and reasoning in a single version-controlled system.
2) A universal semantic reasoning layer that enables multi-agent coordination through the context graph.
3) An AI-native user interface that reinvents the software development lifecycle for agent–human collaboration.

In pursuit of this vision, we're proud to be backed by a $60M seed round led by @felicis, with support from @MadronaVentures, @m12VC, @BasisSet, @20vcFund, @CherryVentures, @picuscap, and @Global_Founders, alongside a global group of builders and operators, including @GergelyOrosz, @theo, Jerry Yang, @oliveur, @garrytan, and many others, who all recognize that the time is now to take such a big swing.

And we begin shipping today with Checkpoints, a new primitive that automatically captures agent context as first-class, versioned data in Git. When you commit code generated by an agent, Checkpoints captures the full session alongside the commit: the transcript, prompts, files touched, token usage, tool calls, and more. It's our first crack at the semantic layer, released as an open-source CLI on GitHub.

From here on out, no more stealth. We are building in the open and as open source! More to come soon; in the meantime, check out all the details in our blog.
Quoted tweet from Entire @EntireHQ: "Beep, boop. Come in, rebels. …" (full tweet below)
167 replies · 283 retweets · 2.1K likes · 943.5K views
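The Checkpoints idea, capturing the agent session as versioned data alongside a commit, can be sketched as a plain data record. This shape is entirely hypothetical: the announcement only lists the kinds of fields captured (transcript, prompts, files touched, token usage, tool calls), and the actual Entire format and field names are not shown in the thread.

```python
import json

# Hypothetical checkpoint record; the field names and layout are invented
# for illustration, mirroring what the announcement says is captured.
checkpoint = {
    "commit": "0000000000000000000000000000000000000000",  # placeholder SHA
    "session": {
        "prompts": ["refactor the parser to stream input"],
        "transcript": [
            {"role": "user", "text": "refactor the parser to stream input"},
            {"role": "agent", "text": "Switching read() to a chunked loop."},
        ],
        "files_touched": ["src/parser.py"],
        "token_usage": {"input": 1523, "output": 4871},
        "tool_calls": [{"tool": "run_tests", "status": "passed"}],
    },
}

# Round-trip through JSON, the way such a record might be stored
# next to the commit and read back by tooling.
serialized = json.dumps(checkpoint, indent=2)
restored = json.loads(serialized)
print(restored["session"]["token_usage"]["output"])
```

The design point being sketched is simply that session context becomes first-class versioned data keyed to a commit, rather than living only in an agent's ephemeral chat history.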
Joachim Baumann retweeted
Entire @EntireHQ
Beep, boop. Come in, rebels. We’ve raised a 60m seed round to build the next developer platform. Open. Scalable. Independent. And we ship our first OSS release today. entire.io/blog/hello-ent…
84 replies · 49 retweets · 666 likes · 350.5K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
Ryan Louie (@RyanCLouie) advances Human-AI collaboration for upskilling & LLMs for mental health. He has built Roleplay-doh for experts to design LLM-simulated patients, feedback systems to coach novice counselors, and run large-scale RCTs showing LLM practice improves counselor skills: youralien.github.io
0 replies · 4 retweets · 44 likes · 8.4K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
Hao Zhu (@_Hao_Zhu) advances Human-agent interaction. He has created Sotopia for social simulation, WebArena for web agents, trained agents with Sotopia-π, benchmarked embodied norms with EgoNormia, and enabled agents to learn from human feedback with AutoLibra: hao.computer
1 reply · 6 retweets · 53 likes · 14.4K views
Joachim Baumann retweeted
Diyi Yang @Diyi_Yang
Two amazing postdocs from our lab are on the academic job market this year. I've learned a lot from their wonderful research -- you should definitely reach out and hire them!
2 replies · 30 retweets · 142 likes · 41.2K views
Andy Hall @ahall_research
@ben_golub We're working on an experiment to quantify this right now...results soon!
4 replies · 0 retweets · 8 likes · 997 views
Ben Golub @ben_golub
AI-assisted p-hacking is gonna be something wild
26 replies · 44 retweets · 524 likes · 76.2K views