Ludwig Schmidt

250 posts

@lschmidt3

Assistant professor at @Stanford and member of the technical staff at @AnthropicAI.

Palo Alto, CA · Joined August 2009

426 Following · 6.5K Followers

Ludwig Schmidt retweeted

Vishaal Udandarao @vishaal_urao · Jul 2

🚀New Paper arxiv.org/abs/2606.28551 Everyone obsesses over VLM architectures & training recipes. But what about the data? Presenting the latest work in the DataComp-series: a testbed for VLM data curation with 1,000+ controlled experiments and some surprising lessons 👀 🧵👇

236

32K

Ludwig Schmidt retweeted

Steven Dillmann ✈️ ICML 2026 @StevenDillmann · Jun 25

Honored to have Terminal-Bench-Science included in Slingshots // THREE, alongside such a strong lineup of researchers and projects. Building a benchmark to evaluate AI agents on computational workflows across the natural sciences — authored and verified by real domain experts. Grateful for the incredible support from @LaudeInstitute & @bradenjhancock, and to all our contributors making this happen. ⚛️🧪 Check out the current progress on our brand-new task submission dashboard: stevendillmann.github.io/tb-science-tas…

Laude Institute @LaudeInstitute

TBench Science / @DillmannSteven, @ryanmart3n, @alexgshaw, @Mike_A_Merril, @AlexGDimakis, @sanmikoyejo, @lschmidt3 (@Stanford) A benchmark for evaluating AI agents on real computational workflows across the natural sciences, with tasks authored and verified by scientific domain experts.

3.3K

Ludwig Schmidt @lschmidt3 · Jun 25

Very excited to release the next project in the DataComp / OpenThoughts line of research! Like OpenThoughts, we worked on post-training data, this time with a focus on agentic models.

Richard Zhuang @RichardZ412

How can we train small agentic models that are highly capable of terminal use and coding? Announcing OpenThoughts-Agent + OpenThinkerAgent-32B, the strongest Qwen-3 based open-data agentic model: 44.8% avg across 7 agentic benchmarks! (1/n)

11.9K

Ludwig Schmidt retweeted

terminalbench @terminalbench · Jun 18

Introducing Terminal-Bench Challenges! A new capability has emerged at the frontier: agents completing large-scale projects autonomously. To test this capability, we felt another flavor of benchmark was needed. Terminal-Bench Challenges are long-horizon, token-intensive, single-task benchmarks. Today we are releasing our first 3 challenges.

5.8K

Ludwig Schmidt retweeted

Steven Dillmann ✈️ ICML 2026 @StevenDillmann · May 20

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

103

502

913.1K

Ludwig Schmidt retweeted

Andrej Karpathy @karpathy · May 19

Personal update: I've joined Anthropic. I think the next few years at the frontier of LLMs will be especially formative. I am very excited to join the team here and get back to R&D. I remain deeply passionate about education and plan to resume my work on it in time.

11.1K

150.1K

27.7M

Ludwig Schmidt retweeted

Nicholas Joseph @nickevanjoseph · May 19

Excited to welcome Andrej to the Pretraining team! He'll be building a team focused on using Claude to accelerate pretraining research itself. I can’t think of anyone better suited to do it — looking forward to what we build together!

Andrej Karpathy @karpathy

150

4.4K

339.1K

Ludwig Schmidt retweeted

terminalbench @terminalbench · May 6

We're releasing Terminal-Bench 2.1 to patch 28 of the 89 tasks in Terminal-Bench 2.0
TB2.1 includes
• recalibrated limits
• fixed solutions
• realigned verifiers
Per-task breakdowns in 🧵
We'll continue to support TB2 and TB2.1 leaderboards (new submission process 🔜)

14.9K

Ludwig Schmidt retweeted

John Yang @jyangballin · May 5

How much of SQLite, FFmpeg, PHP compiler can LMs code from scratch? Given just an executable and no starter code or internet access. Introducing ProgramBench: 200 rigorous, whole-repo generation tasks where models design, build, and ship a working program end to end. 🧵

107

246

1.6K

737.3K

Ludwig Schmidt retweeted

David Duvenaud @DavidDuvenaud · Apr 28

Announcing Talkie: a new, open-weight historical LLM! We trained and finetuned a 13B model on a newly-curated dataset of only pre-1930 data. Try it below! with @AlecRad and @status_effects 🧵

199

458

3.7K

1.5M

Ludwig Schmidt retweeted

Nick Levine @status_effects · Apr 28

New work with @AlecRad and @DavidDuvenaud: Have you ever dreamed of talking to someone from the past? Introducing talkie, a 13B model trained only on pre-1931 text. Vintage models should help us to understand how LMs generalize (e.g., can we teach talkie to code?). Thread:

179

405

3.2K

1.2M

Ludwig Schmidt retweeted

Anthropic @AnthropicAI · Mar 6

A statement from Anthropic CEO Dario Amodei: anthropic.com/news/where-sta…

1.1K

705

5.5K

2.7M

Ludwig Schmidt retweeted

Anthropic @AnthropicAI · Feb 28

A statement on the comments from Secretary of War Pete Hegseth. anthropic.com/news/statement…

2.8K

6.4K

42.2K

17.8M

Ludwig Schmidt retweeted

Charlie Ruan @charlie_ruan · Feb 18

Releasing the official SkyRL + Harbor integration: a standardized way to train terminal-use agents with RL. From the creators of Terminal-Bench, Harbor is a widely adopted framework for evaluating terminal-use agents on any task expressible as a Dockerfile + instruction + test script. This integration extends it: the same tasks you evaluate on, you can now RL-train on. Blog: novasky-ai.notion.site/skyrl-harbor 🧵

244

34.7K

Ludwig Schmidt retweeted

Richard Zhuang @RichardZ412 · Feb 20

Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵

182

46K

Ludwig Schmidt retweeted

vincent sunn chen @vincentsunnchen · Feb 11

x.com/i/article/2021…

330

149.4K

Ludwig Schmidt retweeted

Mike A. Merrill @Mike_A_Merrill · Jan 22

The Terminal-Bench paper is here! Read it to learn where frontier models still fail and the secrets of how we sourced hundreds of high quality environments from our open source community. 🧵

102

459

104.2K

Ludwig Schmidt retweeted

Etash Guha @etash_guha · Dec 6

Building TerminalBench agents in the open is hard. We're making it much easier. OpenThoughts-Agent is our first milestone in open-data pipelines for building agents. We're the best model of our size on TerminalBench 2.0. We're pushing both SFT and RL pipelines for building agents. I'm so excited to see where this project goes! Check it out!

Negin Raoof @NeginRaoof_

How can we make a better TerminalBench agent? Today, we are announcing the OpenThoughts-Agent project. OpenThoughts-Agent v1 is the first TerminalBench agent trained on fully open curated SFT and RL environments. OpenThinker-Agent-v1 is the strongest model of its size on TerminalBench, and sets a new bar on our newly released OpenThoughts-TB-Dev benchmark. (1/n)

14.1K

Ludwig Schmidt retweeted

Negin Raoof @NeginRaoof_ · Dec 6

286

127.6K

Ludwig Schmidt retweeted

Stanford Saplings @saplingsphd · Nov 24

Introducing the new Stanford CS Ph.D. Entrepreneurship Club: Saplings🌲🌲!! We're a community for Computer Science PhD students at Stanford to help them turn their innovative ideas into impactful startups, alongside fellow founders and industry leaders. We're going to have many exciting events soon, so stay tuned!

19.8K
