Ludwig Schmidt

239 posts

@lschmidt3

Assistant professor at @Stanford and member of the technical staff at @AnthropicAI.

Palo Alto, CA · Joined August 2009
422 Following · 6K Followers
Ludwig Schmidt retweeted
Charlie Ruan @charlie_ruan
Releasing the official SkyRL + Harbor integration: a standardized way to train terminal-use agents with RL. From the creators of Terminal-Bench, Harbor is a widely adopted framework for evaluating terminal-use agents on any task expressible as a Dockerfile + instruction + test script. This integration extends it: the same tasks you evaluate on, you can now RL-train on. Blog: novasky-ai.notion.site/skyrl-harbor 🧵
9 replies · 48 reposts · 237 likes · 32.8K views
Ludwig Schmidt retweeted
Richard Zhuang @RichardZ412
Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵
11 replies · 22 reposts · 181 likes · 43.5K views
Ludwig Schmidt retweeted
Mike A. Merrill @Mike_A_Merrill
The Terminal-Bench paper is here! Read it to learn where frontier models still fail and the secrets of how we sourced hundreds of high quality environments from our open source community. 🧵
23 replies · 103 reposts · 459 likes · 100.4K views
Ludwig Schmidt retweeted
Etash Guha @etash_guha
Building TerminalBench agents in the open is hard. We're making it much easier. OpenThoughts-Agent is our first milestone in open-data pipelines for building agents. We're the best model of our size on TerminalBench 2.0. We're pushing both SFT and RL pipelines for building agents. I'm so excited to see where this project goes! Check it out!
Negin Raoof @NeginRaoof_

How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on TerminalBench, and sets a new bar on our newly released OpenThoughts-TB-Dev benchmark. (1/n)

5 replies · 13 reposts · 48 likes · 13.7K views
Ludwig Schmidt retweeted
Negin Raoof @NeginRaoof_
How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on TerminalBench, and sets a new bar on our newly released OpenThoughts-TB-Dev benchmark. (1/n)
17 replies · 78 reposts · 289 likes · 124.9K views
Ludwig Schmidt retweeted
Stanford Saplings @saplingsphd
Introducing the new Stanford CS Ph.D. Entrepreneurship Club: Saplings🌲🌲!! We're a community for Computer Science PhD students at Stanford to help them turn their innovative ideas into impactful startups, alongside fellow founders and industry leaders. We're going to have many exciting events soon, so stay tuned!
2 replies · 5 reposts · 39 likes · 19.3K views
Ludwig Schmidt retweeted
Anas Awadalla @anas_awadalla
We're releasing🍨Gelato-30B-A3B, a state-of-the-art computer grounding model that delivers immediate performance gains for computer-use agents! Trained on our open-source🖱️Click-100k dataset, Gelato achieves 63.8% on ScreenSpot-Pro and 69.1% on OS-World-G. It outperforms specialized models like GTA1-32B and VLMs ~8× its size like Qwen3-VL-235B. (1/N) 🧵
7 replies · 41 reposts · 236 likes · 34K views
Ludwig Schmidt retweeted
Alex Shaw @alexgshaw
Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
25 replies · 74 reposts · 385 likes · 138.8K views
Ludwig Schmidt retweeted
John Yang @jyangballin
New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test"
But we code to achieve *goals*: maximize revenue, cut costs, win users
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
31 replies · 99 reposts · 411 likes · 95K views
Ludwig Schmidt retweeted
Alex Shaw @alexgshaw
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments and integrating can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now you can use the Terminal-Bench CLI and harness to evaluate on SWE-bench and other popular benchmarks.
1 reply · 24 reposts · 104 likes · 13.6K views
Ludwig Schmidt @lschmidt3
I'm a big fan of the approach to research funding @andykonwinski and the Laude team are taking! Working with them on terminal-bench has been fantastic (thanks @alexgshaw!) and I'm excited that they're going to support more open, impact-oriented research.
Andy Konwinski @andykonwinski

Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including @JeffDean & @jpineau1 on the board, @LaudeInstitute catalyzes research with real-world impact.

2 replies · 5 reposts · 91 likes · 11.6K views
Ludwig Schmidt retweeted
Thao Nguyen @thao_nguyen26
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689
14 replies · 65 reposts · 227 likes · 35.7K views
Ludwig Schmidt @lschmidt3
@giffmana Thanks for the kind words, Lucas! I hope we get a chance to work together some day, I'm a big fan of your work. BTW my lab is always looking for good postdocs. Comp is probably worse than OpenAI, but long-time lab members get to go on runs with @Vaishaal's dog Kaya. He's great!
1 reply · 1 repost · 31 likes · 3.1K views
Lucas Beyer (bl16) @giffmana
This is why I consider Ludwig one of the best academics of recent years. He's been leading the most impactful work that academics usually shy away from. Not just once, not just twice, but many times over! And his pivot from robustness was amazing to witness, hats off.
Ludwig Schmidt @lschmidt3

Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.

9 replies · 40 reposts · 368 likes · 45.2K views
Ludwig Schmidt @lschmidt3
Together with the paper we also release our new dataset OpenThoughts3-1.2M and the corresponding model OpenThinker3-7B, which is currently the best open-data 7B reasoning model.
1 reply · 0 reposts · 28 likes · 5.6K views
Ludwig Schmidt @lschmidt3
Very excited to finally release our paper for OpenThoughts! After DataComp and DCLM, this is the third large open dataset my group has been building in collaboration with the DataComp community. This time, the focus is on post-training, specifically reasoning data.
22 replies · 210 reposts · 1.3K likes · 187.1K views