Ludwig Schmidt

243 posts

@lschmidt3

Assistant professor at @Stanford and member of the technical staff at @AnthropicAI.

Palo Alto, CA · Joined August 2009
423 Following · 6.3K Followers
Ludwig Schmidt retweeted
terminalbench @terminalbench
We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0.
TB2.1 includes
• recalibrated limits
• fixed solutions
• realigned verifiers
Per-task breakdowns in 🧵
We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)
2 replies · 11 retweets · 52 likes · 13.9K views
Ludwig Schmidt retweeted
John Yang @jyangballin
How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵
102 replies · 246 retweets · 1.5K likes · 708.9K views
Ludwig Schmidt retweeted
David Duvenaud @DavidDuvenaud
Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below! with @AlecRad and @status_effects 🧵
200 replies · 457 retweets · 3.6K likes · 1.4M views
Ludwig Schmidt retweeted
Nick Levine @status_effects
New work with @AlecRad and @DavidDuvenaud: Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 text. Vintage models should help us to understand how LMs generalize (e.g., can we teach talkie to code?). Thread:
172 replies · 368 retweets · 2.9K likes · 1.1M views
Ludwig Schmidt retweeted
Charlie Ruan @charlie_ruan
Releasing the official SkyRL + Harbor integration: a standardized way to train terminal-use agents with RL. From the creators of Terminal-Bench, Harbor is a widely adopted framework for evaluating terminal-use agents on any task expressible as a Dockerfile + instruction + test script. This integration extends it: the same tasks you evaluate on, you can now RL-train on. Blog: novasky-ai.notion.site/skyrl-harbor 🧵
9 replies · 46 retweets · 243 likes · 34.3K views
Ludwig Schmidt retweeted
Richard Zhuang @RichardZ412
Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵
11 replies · 21 retweets · 182 likes · 45.2K views
Ludwig Schmidt retweeted
Mike A. Merrill @Mike_A_Merrill
The Terminal-Bench paper is here! Read it to learn where frontier models still fail and the secrets of how we sourced hundreds of high quality environments from our open source community. 🧵
22 replies · 102 retweets · 460 likes · 102.5K views
Ludwig Schmidt retweeted
Etash Guha @etash_guha
Building TerminalBench agents in the open is hard. We're making it much easier. OpenThoughts-Agent is our first milestone in open-data pipelines for building agents. We're the best model of our size on TerminalBench 2.0. We're pushing both SFT and RL pipelines for building agents. I'm so excited to see where this project goes! Check it out!
Quoting Negin Raoof @NeginRaoof_

How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on TerminalBench, and sets a new bar on our newly released OpenThoughts-TB-Dev benchmark. (1/n)

5 replies · 13 retweets · 47 likes · 14K views
Ludwig Schmidt retweeted
Negin Raoof @NeginRaoof_
How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on TerminalBench, and sets a new bar on our newly released OpenThoughts-TB-Dev benchmark. (1/n)
17 replies · 77 retweets · 290 likes · 126.7K views
Ludwig Schmidt retweeted
Stanford Saplings @saplingsphd
Introducing the new Stanford CS Ph.D. Entrepreneurship Club: Saplings🌲🌲!! We're a community for Computer Science PhD students at Stanford to help them turn their innovative ideas into impactful startups, alongside fellow founders and industry leaders. We're going to have many exciting events soon, so stay tuned!
2 replies · 5 retweets · 40 likes · 19.6K views
Ludwig Schmidt retweeted
Anas Awadalla @anas_awadalla
We're releasing🍨Gelato-30B-A3B, a state-of-the-art computer grounding model that delivers immediate performance gains for computer-use agents! Trained on our open-source🖱️Click-100k dataset, Gelato achieves 63.8% on ScreenSpot-Pro and 69.1% on OS-World-G. It outperforms specialized models like GTA1-32B and VLMs ~8× its size like Qwen3-VL-235B. (1/N) 🧵
7 replies · 41 retweets · 235 likes · 34.3K views
Ludwig Schmidt retweeted
Alex Shaw @alexgshaw
Today, we’re announcing the next chapter of Terminal-Bench with two releases:
1. Harbor, a new package for running sandboxed agent rollouts at scale
2. Terminal-Bench 2.0, a harder version of Terminal-Bench with increased verification
25 replies · 74 retweets · 394 likes · 143K views
Ludwig Schmidt retweeted
John Yang @jyangballin
New eval! Code duels for LMs ⚔️
Current evals test LMs on *tasks*: "fix this bug," "write a test"
But we code to achieve *goals*: maximize revenue, cut costs, win users
Meet CodeClash: LMs compete via their codebases across multi-round tournaments to achieve high-level goals
31 replies · 99 retweets · 416 likes · 101.7K views
Ludwig Schmidt retweeted
Alex Shaw @alexgshaw
Evaluating agents on benchmarks is a pain. Each benchmark comes with its own harness, scoring scripts, and environments and integrating can take days. We're introducing the Terminal-Bench dataset registry to solve this problem. Think of it as the npm of agent benchmarks. Now you can use the Terminal-Bench CLI and harness to evaluate on SWE-bench and other popular benchmarks.
1 reply · 22 retweets · 104 likes · 13.6K views
Ludwig Schmidt @lschmidt3
I'm a big fan of the approach to research funding @andykonwinski and the Laude team are taking! Working with them on terminal-bench has been fantastic (thanks @alexgshaw!) and I'm excited that they're going to support more open, impact-oriented research.
Quoting Andy Konwinski @andykonwinski

Today, I’m launching a deeply personal project. I’m betting $100M that we can help computer scientists create more upside impact for humanity. Built for and by researchers, including @JeffDean & @jpineau1 on the board, @LaudeInstitute catalyzes research with real-world impact.

2 replies · 5 retweets · 90 likes · 11.7K views
Ludwig Schmidt retweeted
Thao Nguyen @thao_nguyen26
Web data, the “fossil fuel of AI”, is being exhausted. What’s next?🤔 We propose Recycling the Web to break the data wall of pretraining via grounded synthetic data. It is more effective than standard data filtering methods, even with multi-epoch repeats! arxiv.org/abs/2506.04689
14 replies · 62 retweets · 226 likes · 36K views