Steven Dillmann

113 posts


@DillmannSteven

Stanford PhD working on #AI4Science and maintaining Terminal-Bench-Science @StanfordAILab 🧬🤖🪐

Stanford, CA · Joined January 2020
1.3K Following · 274 Followers
Steven Dillmann retweeted
Chelsea Zou @boson2photon
Turns out LLMs also love to gamble... We made Claude, GPT-5, and Gemini agents play 100+ hands of no-limit poker ♠️ against each other and analyzed their reasoning traces.

What we found: the largest gap in frontier AI isn't reasoning in isolation, it's adaptation and coordination in multi-agent settings.

These models can compute pot odds, build opponent profiles, and even exhibit genuine Theory of Mind, modeling what their opponent believes and choosing actions to exploit that belief. One agent checked a monster hand specifically to "let the aggressor continue her story." Another correctly bluff-caught by citing a stored behavioral profile of its opponent. Unprompted. Emergent.

But here's where it breaks: they almost never update these models. An agent stored a profile saying its opponent "c-bets dry boards ~50%." It cited this profile on the flop. Then cited it again on the turn, word for word, despite different bet sizing, a different board texture, and different range implications. The game state changed. The opponent adapted. But the model of the opponent stayed frozen. (1/3)👇
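The pot-odds arithmetic the thread credits the agents with is a one-liner; a minimal sketch (illustrative numbers, not from the study):

```python
def pot_odds(pot: float, to_call: float) -> float:
    """Fraction of the final pot a caller must contribute.

    Ignoring future streets, a call is profitable only when the
    caller's equity in the hand exceeds this fraction.
    """
    return to_call / (pot + to_call)

# Facing a $50 bet into a $100 pot: the pot is now $150 and the call
# costs $50, so the caller needs more than 25% equity to call.
price = pot_odds(pot=150.0, to_call=50.0)  # 0.25
```

The thread's point is that computing this number is easy for the models; what stays frozen is the opponent profile that feeds the equity estimate.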
Steven Dillmann retweeted
Garyk Brixi @garykbrixi
Evo 2 is out in Nature today, showing that genome language models can predict and design across the full complexity of life, from phages to eukaryotes. A few surprises from the project, including how ignoring trillions of nucleotides was key to getting a good model. 🧵
Steven Dillmann retweeted
Patrick Hsu @pdhsu
Evo 2, our fully open-source biological foundation model trained on trillions of DNA tokens spanning the entire tree of life, is out in @Nature today. We and the scientific community have done a lot with this @arcinstitute @nvidia model in the last year! 🧵👇
Steven Dillmann retweeted
Ken Liu @kenziyuliu
Can we build a blind, *unlinkable inference* layer where ChatGPT/Claude/Gemini can't tell which call came from which user, like a "VPN for AI inference"? Yes! Blog post below + we built it into open-source infra and a chat app and have served >15k prompts at Stanford so far. How it helps with AI user privacy:

# The AI user privacy problem

If you ask AI to analyze your ChatGPT history today, it's surprisingly easy to infer your demographics, health, immigration status, and political beliefs. Every prompt we send accumulates into an (identity-linked) profile that the AI lab controls completely and indefinitely. At a minimum this is a goldmine for ads (as we know now). A bigger issue is the concentration of power: AI labs can easily become (or be asked to become) a Cambridge Analytica, whistleblow your immigration status, or work with health insurers to adjust your premium if they so choose. This is a uniquely worse problem than search engines, because your average query is now more revealing (not just keywords), interactive, and intelligence is now cheap. Despite this, most of us still want these remote models; they're just too good and convenient! (This is aka the "privacy paradox.")

# Unlinkable inference as a user privacy architecture

The idea of unlinkable inference is to add privacy while preserving access to the remote models controlled by someone else. A "privacy wrapper" or "VPN for AI inference," so to speak. Concretely, it's a blind inference middle layer in which: (1) decentralized proxies that anyone can operate sit between users and providers; (2) requests are blindly authenticated (via blind signatures / RFC 9474, 9578) so they are provably sandboxed from each other and from user identity; (3) prompts are relayed over randomly chosen proxies that don't see or log traffic (via client-side ephemeral keys or hosting in TEEs); and (4) the provider simply sees a mixed pool of anonymous prompts from the proxies. No state, pseudonyms, or linkable metadata.

If you squint, an unlinkable inference layer is essentially a vendor for per-request, anonymous, ephemeral AI access credentials (for users and agents alike). It partitions your context so that user tracking is drastically harder. Obviously, unlinkability isn't a silver bullet: the prompt itself still goes to the remote model and can leak privacy (so don't use our chat app for a therapy session!). It aims to combat *longitudinal tracking* as a major threat to user privacy, and its statistical power increases quickly by mixing more users and requests. Unlinkability can be applied at any granularity. For an AI chat app, you can unlinkably request a fresh ephemeral key for every session so tracking is virtually impossible.

# The Open Anonymity Project

We started this project with the belief that intelligence should be a truly public utility. Like water and electricity, providers should be compensated by usage, not by who you are or what you do with it. We think unlinkable inference is a first step towards this "intelligence neutrality."

# Try it out! It's quite practical

- Chat app "oa-chat": chat.openanonymity.ai (<20 seconds to get going)
- Blog post that should be a fun read: openanonymity.ai/blog/unlinkabl…
- Project page: openanonymity.ai
- GitHub: github.com/OpenAnonymity
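Steps (2)–(4) above can be illustrated with a deliberately toy sketch. This is NOT the actual RFC 9474 blind-signature construction (which hides tokens even from the credential issuer); `AnonRequest`, `client_send`, and `provider_view` are invented names showing only the shape of the flow:

```python
import secrets
from dataclasses import dataclass


@dataclass
class AnonRequest:
    credential: str  # one-time random token, minted per request
    prompt: str


def client_send(prompt: str) -> AnonRequest:
    # Steps (2)-(3), heavily simplified: attach a fresh anonymous
    # credential instead of a user identity before handing the prompt
    # to a relay proxy. No user identifier travels with the request.
    return AnonRequest(credential=secrets.token_hex(16), prompt=prompt)


def provider_view(pool: list[AnonRequest]) -> list[str]:
    # Step (4): the provider sees only a mixed pool of prompts;
    # credentials carry no identity and never repeat across requests.
    return [r.prompt for r in pool]


pool = [client_send("summarize my notes"), client_send("translate this")]
assert pool[0].credential != pool[1].credential  # unlinkable across requests
```

The design point the sketch captures: because each credential is ephemeral and request-scoped, two prompts from the same user look no more related than two prompts from strangers, which is exactly what defeats longitudinal tracking.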
Steven Dillmann retweeted
Laude Institute @LaudeInstitute
Slingshots: Resources for researchers who ship. 💫 Apply year-round: slingshots.laude.org
Steven Dillmann retweeted
Richard Zhuang @RichardZ412
Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵
Steven Dillmann retweeted
Xiangyi Li @xdotli
Introducing SkillsBench, the first benchmark that measures Agent Skills and how well agents use them. 86 tasks from 105 domain experts across 11 domains; every task is verifiable, human-created, and comes with verified Skills. SOTA models score ~30% without Skills. 🧵👇
Steven Dillmann retweeted
Xiangyi Li @xdotli
Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
Steven Dillmann retweeted
Bodhisattwa Majumder @mbodhisattwa
.@allen_ai's next-generation Asta is live!

⏳ We extended from a goal-driven setup to long-horizon, open-ended scientific exploration with AutoDiscovery. Try it now.

🧑🏻‍🚀 For the past 6 months, we partnered with oncologists, social scientists, marine biologists, and epidemiologists to uncover "hidden truths" from vast public and private datasets.

🌈 This work was a researcher's paradise: it started with an important AI problem and ended with truly impactful applications, with countercurrent findings that change traditional practices in critical sciences.

✨ Today, we release three technical reports in which our partner scientists document the discoveries made by our system, opening them up to their respective scientific communities.

🎷 We are pushing hard towards truly long-horizon discovery systems paired with asynchronous user feedback. While we prepare our next research updates, have fun with AutoDiscovery.

PS: This release has so much in it that I'm gonna need multiple posts to unpack it.
Ai2@allen_ai

Knowing which questions to ask is often the hardest part of science. Today we're releasing AutoDiscovery in AstaLabs, an AI system that starts with your data and generates its own hypotheses. 🧪

Steven Dillmann @DillmannSteven
As agents take on longer-horizon, real-world tasks, many of today's evals fail to measure the work we actually expect them to do in practice, especially in the natural sciences! Very excited about this initiative! 🤿 We previously partnered with @SnorkelAI on Terminal-Bench 2.0, and are now building Terminal-Bench-Science. Get in touch if you work in the natural sciences and are interested in contributing! GitHub: lnkd.in/gejKmXpj Discord: lnkd.in/gpVNw3tG
vincent sunn chen@vincentsunnchen

x.com/i/article/2021…

Steven Dillmann retweeted
Diyi Yang @Diyi_Yang
Two amazing postdocs from our lab are on the academic job market this year. I've learned a lot from their wonderful research -- you should definitely reach out and hire them!
Steven Dillmann retweeted
Etash Guha @etash_guha
OpenThoughts is going to be an Oral Presentation at ICLR! It's my first oral presentation so super excited! See y'all in Brazil! :)
Steven Dillmann retweeted
Alex Shaw @alexgshaw
Yesterday's OpenAI and Anthropic Terminal-Bench 2.0 results used different harnesses. Run both in Terminus 2 ➡️ ~similar scores (within noise). Harnesses matter! Congrats to both teams on incredible models!
Steven Dillmann retweeted
Alex Ratner @ajratner
Exciting mention of TBench 2.0 in today's model releases - congrats to @Mike_A_Merrill @alexgshaw & team + proud of @SnorkelAI 's contributions! Benchmarks are just one (limited) measurement tool - but critical guideposts of frontier progress. Much more to build here ahead!
Steven Dillmann retweeted
Ryan Marten @ryanmart3n
if you want to help us create harder tasks, come join the terminal-bench-3.0 effort: discord.gg/6xWPKhGDbA