Steven Dillmann ✈️ ICML 2026

210 posts

@StevenDillmann

Stanford PhD working on #AI4Science and maintaining Terminal-Bench Science @StanfordAILab 🧬🤖🪐

Stanford, CA · Joined January 2020

1.6K Following · 672 Followers

Pinned Tweet

Steven Dillmann ✈️ ICML 2026@StevenDillmann·20 May

📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵

103

502

912.9K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·22h

Learn how to contribute here: tbench.ai/news/tb-scienc…

Steven Dillmann ✈️ ICML 2026@StevenDillmann·22h

Thank you to the organizers John Sous, Anna Gilbert, Arman Cohan, Eliu Huerta, Eun-Ah Kim, Andy Liu, Hao Peng, @grantrotskoff, @MinyangTian1, and Yilun Zhao for the invite.

122

Steven Dillmann ✈️ ICML 2026@StevenDillmann·22h

Wonderful to see so many people reaching out to contribute to Terminal-Bench Science after my contribution call at the @icmlconf AI for Physics Workshop in Seoul! #ICML2026⚛️🇰🇷

1.8K

Steven Dillmann ✈️ ICML 2026 retweeted

Bodhisattwa Majumder@mbodhisattwa·1d

Peter Clark laying out our mission for building discovery machines (Asta) @allen_ai @icmlconf, pic from @ChenhaoTan

721

Steven Dillmann ✈️ ICML 2026 retweeted

Kirill Acharya @ ICML 2026@kirillacharya·3d

Had a lot of fun presenting CertJudge at @icmlconf ICML 2026 AI4Math and Deep Learning for Code Workshops on July 10-11. Thanks to all authors and listeners!

Ethan Hersch@EthanHersch

🚨🚨Announcing CertJudge🚨🚨 But who judges the judge? … New work from Stanford University, Harvard University, Hong Kong Baptist University

Steven Dillmann ✈️ ICML 2026@StevenDillmann·2d

165

Steven Dillmann ✈️ ICML 2026@StevenDillmann·3d

More details on this ongoing community effort here: tbench.ai/news/tb-scienc…

330

Steven Dillmann ✈️ ICML 2026@StevenDillmann·3d

🚨 Shout-out to Terminal-Bench Science at the #ICML2026 AI for Science Conference today! Thanks to the @AI_for_Science organizers @wellingmax, @marinkazitnik, @MengdiWang10, @SherryLixueC, @sungsoo_ahn_, Yixuan Wang, @mia_rosenfeld for the feature & @allen_ai's CEO Peter Clark for the picture :)

9.6K

Steven Dillmann ✈️ ICML 2026 retweeted

Ethan Hersch@EthanHersch·3d

🚨🚨Announcing CertJudge🚨🚨 But who judges the judge? … New work from Stanford University, Harvard University, Hong Kong Baptist University

3.9K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·3d

@RylanSchaeffer @Meta Excited to see what’s next for you, @RylanSchaeffer!

445

Rylan Schaeffer@RylanSchaeffer·4d

This was my last week at TBD / @Meta Superintelligence Labs (MSL). It's been an incredible experience watching the lab assemble and accelerate. I learned an enormous amount and had the privilege of working with some of the best ML/AI researchers on the planet. 1/2

387

54.4K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·3d

@elonmusk tweeting about our work on SWE-Marathon led by @rishi_desai2 was not on my 2026 bingo card

Elon Musk@elonmusk

Grok 4.5 is also rank 1 in SWE marathon

505

Steven Dillmann ✈️ ICML 2026 retweeted

Rishi Desai@rishi_desai2·3d

Surreal watching frontier labs hillclimb on SWE-Marathon.

Elon Musk@elonmusk

Grok 4.5 is also rank 1 in SWE marathon

8.8K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·3d

Ivan is one of the most important maintainers for Terminal-Bench & Terminal-Bench Science. Check out his guide on how to design high-quality tasks for AI agents!

Ivan Bercovich@neversupervised

x.com/i/article/2075…

2.3K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·4d

If you're at ICML and interested in Terminal-Bench Science, come stop by my talk at the AI for Physics Workshop tomorrow at 9:30am (Conference Room S402). Also check out Terminal-Bench Science featured at the AI for Science Workshop. AI for Physics Workshop: ai4physics-workshop.github.io AI for Science Workshop: ai4sciencecommunity.github.io/icml26

6.4K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·6d

Come contribute to Terminal-Bench Science with your hardest scientific problems!

Ben Blaiszik@BenBlaiszik

I've been working on a mechanics of materials benchmark for the Terminal-Bench Science effort led by @StevenDillmann, and I'm genuinely shocked how hard the problems have to be for the agents to fail. Deep, PhD-level capabilities in mechanics are already included in the capabilities of GPT 5.5 and Opus 4.8, and are especially accessible via web search and agent code implementation. No details on the mechanics benchmark yet to avoid leak into the training set. But, I would have loved to have had access to this for my own work :)

1.7K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·6d

Harbor-Index is finally out - by our amazing @harborframework adapters lead @LinShi592021 & co. Go check the blog post: harbor-index.org

Lin Shi@LinShi592021

Introducing Harbor-Index, a compact, diverse, and high-quality benchmark built to challenge frontier agents. We carefully select, audit and fix 82 high-signal tasks out of 6,627 candidates spanning 54 benchmarks. No agent gets above 30%. (1/5)

2.9K

Steven Dillmann ✈️ ICML 2026@StevenDillmann·Jul 6

@justALEXWORTEGA Sorry about that - just opened!

alex nikolic@justALEXWORTEGA·Jul 6

@StevenDillmann Your dms closed(

Steven Dillmann ✈️ ICML 2026@StevenDillmann·Jul 6

Just landed in Seoul for #ICML2026 🇰🇷 Reach out if you want to chat about Terminal-Bench Science, AI for Science or anything else!

1.5K

Steven Dillmann ✈️ ICML 2026 retweeted

Alex Shaw@alexgshaw·Jul 5

An under-appreciated aspect of Harbor is that results are auditable and reproducible so you can make trust-less claims about your agent or model. In the era of SWE, code was the source of truth so OSS built trust. In the era of agents, evals are the source of truth and reproducible evals build trust.

Monk Zero@NoCommas

@alexgshaw @lakshyaag @harborframework It is a good framework. We use it to benchmark all releases we shipped (Public ones at antigma.ai/eval with HarborHub links as provenance)
