Steven Dillmann

163 posts

@StevenDillmann

Stanford PhD working on #AI4Science and maintaining Terminal-Bench Science @StanfordAILab 🧬🤖🪐

Stanford, CA · Joined January 2020
1.4K Following · 484 Followers
Pinned Tweet
Steven Dillmann @StevenDillmann
📣 Announcing Terminal-Bench Science: benchmarking AI agents on real scientific workflows – now open for task contributions👇 tbench.ai/news/tb-scienc… @AnthropicAI, @OpenAI, and @GoogleDeepMind use Terminal-Bench to evaluate AI on coding tasks. We're now extending it to scientific workflows. 1/6🧵
15
110
469
889.8K
Steven Dillmann retweeted
Stanford AI+Biomedicine Seminar @Stanford_AI_Bio
Wish an AI agent could handle your next research task in the list? 👇
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
0
1
4
1.7K
Surag Nair @suragnair
@StevenDillmann @AnthropicAI @OpenAI @GoogleDeepMind Hey Steven, this is super cool! What’s the policy around access to contributed tasks, artefacts, and solutions? Will they be openly available to labs using the benchmark, or are there contributor protections around reuse/training?
1
0
1
75
Steven Dillmann retweeted
Sanmi Koyejo @sanmikoyejo
"AI for science" benchmarks today mostly test textbook recall. Terminal-Bench Science is a chance for scientists to practice writing that definition. Contribute a real workflow, and you find out exactly where today's best agents break on it. tbench.ai/news/tb-scienc…
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
0
4
21
2.3K
Steven Dillmann retweeted
Richard C. Suwandi @richardcsuwandi
Good evals like this are exactly what we need to accelerate progress in AI for science
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
0
2
9
1.7K
Steven Dillmann retweeted
Bodhisattwa Majumder @mbodhisattwa
Wonderful project; wonderful people; please contribute for the sake of science. Bonus: @StevenDillmann will be interning with me and the AutoDiscovery team @allen_ai, translating benefits from TB-Science to our science agents!
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
0
2
20
2.1K
Steven Dillmann retweeted
Bespoke Labs @bespokelabsai
Consider contributing tasks to Terminal-Bench Science, the most direct way to teach AI agents to solve your AI workflows and accelerate your research.
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
0
3
8
1.2K
Steven Dillmann retweeted
Alex Dimakis @AlexGDimakis
Terminal-Bench Science is a direct way to contribute to AI for Science. It's programming agents by task specification. Ask a precise scientific question and watch how AI agents will learn to solve it:
Step 1. Package a scientific task or workflow, something that takes a working scientist a week or a month to do, into an RL environment.
Step 2. Write tests that verify whether the task has been done correctly (this is easy if you have already solved the task manually).
Step 3. Sit back and let progress in AI agents solve it in 6 months.
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
9
7
40
5.1K
Steven Dillmann retweeted
Leon Chen @realleonlc
Scientists, I highly encourage you to submit hard scientific tasks that you want your agents to do to this Terminal-Bench Science benchmark! Make your task seen and solved by agent/model providers. Get credit from the project.
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
0
1
5
574
Steven Dillmann retweeted
Lisan al Gaib @scaling01
let the hill climbing on scientific tasks begin
new benchmark: TerminalBench Science
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
2
3
56
5.7K
Steven Dillmann retweeted
Alex Ratner @ajratner
Extremely excited for Terminal-Bench Science, which we're proud to support via our Open Benchmarks Grants @SnorkelAI !
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
1
3
19
2.3K
Steven Dillmann retweeted
Ryan Marten @ryanmart3n
deadline to submit tasks for Terminal-Bench 3.0 is may 31st!
the best tasks are the most interesting to measure: realistic + useful + meaningfully beyond current frontier
any piece of valuable work done on a computer is fair game
1
2
6
594
Steven Dillmann retweeted
Alex Shaw @alexgshaw
Consider contributing a task to Terminal-Bench Science! Terminal-Bench Science will serve as a north star for building agents and models that accelerate science.
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
1
8
50
4.9K
Steven Dillmann retweeted
vincent sunn chen @vincentsunnchen
We’re incredibly excited to work with @StevenDillmann & team on TB-Science - if you’re a scientist excited to drive how AI agents accelerate scientific workflows, please reach out!
Quoting @StevenDillmann: the pinned Terminal-Bench Science announcement above.
0
4
13
1.3K
Steven Dillmann @StevenDillmann
RT @bradenjhancock: Lots of people say AI could/should be used to advance "science." What does that mean? What are the tasks that would act…
0
1
0
108