Steven Dillmann

113 posts


@DillmannSteven

Stanford PhD working on #AI4Science and maintaining Terminal-Bench-Science @StanfordAILab 🧬🤖🪐

Stanford, CA · Joined January 2020
1.3K Following · 274 Followers
Steven Dillmann retweeted
Chelsea Zou @boson2photon
Turns out LLMs also love to gamble... We made Claude, GPT-5, and Gemini agents play 100+ hands of no-limit poker ♠️ against each other and analyzed their reasoning traces.

What we found: the largest gap in frontier AI isn't reasoning in isolation, it's adaptation and coordination in multi-agent settings.

These models can compute pot odds, build opponent profiles, and even exhibit genuine Theory of Mind, modeling what their opponent believes and choosing actions to exploit that belief. One agent checked a monster hand specifically to "let the aggressor continue her story." Another correctly bluff-caught by citing a stored behavioral profile of its opponent. Unprompted. Emergent.

But here's where it breaks: they almost never update these models. An agent stored a profile saying its opponent "c-bets dry boards ~50%." It cited this profile on the flop. Then cited it again on the turn, word for word, despite different bet sizing, a different board texture, and different range implications. The game state changed. The opponent adapted. But the model of the opponent stayed frozen. (1/3)👇
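The pot-odds arithmetic the thread credits the agents with is a one-liner; a minimal sketch (illustrative numbers, not from the study):

```python
def pot_odds(pot: float, to_call: float) -> float:
    """Fraction of the final pot a caller must contribute.

    Ignoring future streets, a call is profitable only when the
    caller's equity in the hand exceeds this fraction.
    """
    return to_call / (pot + to_call)

# Facing a $50 bet into a $100 pot: the pot is now $150 and the call
# costs $50, so the caller needs more than 25% equity to call.
price = pot_odds(pot=150.0, to_call=50.0)  # 0.25
```

The thread's point is that computing this number is easy for the models; what stays frozen is the opponent profile that feeds the equity estimate.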
Steven Dillmann retweeted
Garyk Brixi @garykbrixi
Evo 2 is out in Nature today, showing that genome language models can predict and design across the full complexity of life, from phages to eukaryotes. A few surprises from the project, including how ignoring trillions of nucleotides was key to getting a good model. 🧵
Steven Dillmann retweeted
Patrick Hsu @pdhsu
Evo 2, our fully open-source biological foundation model trained on trillions of DNA tokens spanning the entire tree of life, is out in @Nature today. We and the scientific community have done a lot with this @arcinstitute @nvidia model in the last year! 🧵👇
Steven Dillmann retweeted
Ken Liu @kenziyuliu
Can we build a blind, *unlinkable inference* layer where ChatGPT/Claude/Gemini can't tell which call came from which user, like a "VPN for AI inference"? Yes! Blog post below + we built it into open-source infra and a chat app and have served >15k prompts at Stanford so far. How it helps with AI user privacy:

# The AI user privacy problem

If you ask AI to analyze your ChatGPT history today, it's surprisingly easy to infer your demographics, health, immigration status, and political beliefs. Every prompt we send accumulates into an (identity-linked) profile that the AI lab controls completely and indefinitely. At a minimum this is a goldmine for ads (as we know now). A bigger issue is the concentration of power: AI labs can easily become (or be asked to become) a Cambridge Analytica, whistleblow your immigration status, or work with health insurers to adjust your premium if they so choose. This is a uniquely worse problem than search engines, because your average query is now more revealing (not just keywords), interactive, and intelligence is now cheap. Despite this, most of us still want these remote models; they're just too good and convenient! (This is aka the "privacy paradox.")

# Unlinkable inference as a user privacy architecture

The idea of unlinkable inference is to add privacy while preserving access to the remote models controlled by someone else. A "privacy wrapper" or "VPN for AI inference," so to speak. Concretely, it's a blind inference middle layer in which: (1) decentralized proxies that anyone can operate sit between users and providers; (2) requests are blindly authenticated (via blind signatures / RFC 9474, 9578) so they are provably sandboxed from each other and from user identity; (3) prompts are relayed over randomly chosen proxies that don't see or log traffic (via client-side ephemeral keys or hosting in TEEs); and (4) the provider simply sees a mixed pool of anonymous prompts from the proxies. No state, pseudonyms, or linkable metadata.

If you squint, an unlinkable inference layer is essentially a vendor for per-request, anonymous, ephemeral AI access credentials (for users and agents alike). It partitions your context so that user tracking is drastically harder. Obviously, unlinkability isn't a silver bullet: the prompt itself still goes to the remote model and can leak privacy (so don't use our chat app for a therapy session!). It aims to combat *longitudinal tracking* as a major threat to user privacy, and its statistical power increases quickly by mixing more users and requests. Unlinkability can be applied at any granularity. For an AI chat app, you can unlinkably request a fresh ephemeral key for every session so tracking is virtually impossible.

# The Open Anonymity Project

We started this project with the belief that intelligence should be a truly public utility. Like water and electricity, providers should be compensated by usage, not by who you are or what you do with it. We think unlinkable inference is a first step towards this "intelligence neutrality."

# Try it out! It's quite practical

- Chat app "oa-chat": chat.openanonymity.ai (<20 seconds to get going)
- Blog post that should be a fun read: openanonymity.ai/blog/unlinkabl…
- Project page: openanonymity.ai
- GitHub: github.com/OpenAnonymity
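Steps (2)–(4) above can be illustrated with a deliberately toy sketch. This is NOT the actual RFC 9474 blind-signature construction (which hides tokens even from the credential issuer); `AnonRequest`, `client_send`, and `provider_view` are invented names showing only the shape of the flow:

```python
import secrets
from dataclasses import dataclass


@dataclass
class AnonRequest:
    credential: str  # one-time random token, minted per request
    prompt: str


def client_send(prompt: str) -> AnonRequest:
    # Steps (2)-(3), heavily simplified: attach a fresh anonymous
    # credential instead of a user identity before handing the prompt
    # to a relay proxy. No user identifier travels with the request.
    return AnonRequest(credential=secrets.token_hex(16), prompt=prompt)


def provider_view(pool: list[AnonRequest]) -> list[str]:
    # Step (4): the provider sees only a mixed pool of prompts;
    # credentials carry no identity and never repeat across requests.
    return [r.prompt for r in pool]


pool = [client_send("summarize my notes"), client_send("translate this")]
assert pool[0].credential != pool[1].credential  # unlinkable across requests
```

The design point the sketch captures: because each credential is ephemeral and request-scoped, two prompts from the same user look no more related than two prompts from strangers, which is exactly what defeats longitudinal tracking.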
Steven Dillmann retweeted
Laude Institute @LaudeInstitute
Slingshots: Resources for researchers who ship. 💫 Apply year-round: slingshots.laude.org
Steven Dillmann retweeted
Richard Zhuang @RichardZ412
Terminal-Bench is a leading benchmark for agents. Unfortunately it’s hard: most small coding agents get very low scores on TB2, so training/system ablations look flat - you can't tell what's working. Announcing OpenThoughts-TBLite - 100 curated TB2-style tasks, difficulty-calibrated so even 8B models can make progress. It's designed to give researchers measurable signal during development, providing faster feedback for experimental iteration while closely tracking true TB2 performance🧵
Steven Dillmann retweeted
Xiangyi Li @xdotli
Introducing SkillsBench, the first benchmark that measures Agent Skills and how well agents use them. 86 tasks from 105 domain experts across 11 domains; every task is verifiable, human-created, and comes with verified Skills. SOTA models score ~30% without Skills. 🧵👇
Steven Dillmann retweeted
Xiangyi Li @xdotli
Agent Skills are everywhere - Claude Code, Gemini CLI, Codex all support them. But do they actually work? 105 domain experts from Stanford, CMU, Berkeley, Oxford, Amazon, ByteDance & more built SkillsBench to find that out. 86 tasks. 11 domains. 7,308 trajectories. 🧵👇
Steven Dillmann retweeted
Bodhisattwa Majumder @mbodhisattwa
.@allen_ai's next-generation Asta is live!

⏳ We extended from a goal-driven setup to long-horizon, open-ended scientific exploration with AutoDiscovery. Try it now.

🧑🏻‍🚀 For the past 6 months, we partnered with oncologists, social scientists, marine biologists, and epidemiologists to uncover "hidden truths" from vast public and private datasets.

🌈 This work was a researcher's paradise: it started with an important AI problem and ended with truly impactful applications, with countercurrent findings that change traditional practices in critical sciences.

✨ Today, we release three technical reports in which our partner scientists document the discoveries made by our system, opening them up to their respective scientific communities.

🎷 We are pushing hard towards truly long-horizon discovery systems paired with asynchronous user feedback. While we prepare our next research updates, have fun with AutoDiscovery.

PS: This release has so much in it that I'm gonna need multiple posts to unpack it.
Ai2@allen_ai

Knowing which questions to ask is often the hardest part of science. Today we're releasing AutoDiscovery in AstaLabs, an AI system that starts with your data and generates its own hypotheses. 🧪

Steven Dillmann @DillmannSteven
As agents take on longer-horizon, real-world tasks, many of today's evals fail to measure the work we actually expect them to do in practice, especially in the natural sciences! Very excited about this initiative! 🤿 We previously partnered with @SnorkelAI on Terminal-Bench 2.0, and are now building Terminal-Bench-Science. Get in touch if you work in the natural sciences and are interested in contributing! GitHub: lnkd.in/gejKmXpj Discord: lnkd.in/gpVNw3tG
vincent sunn chen@vincentsunnchen

x.com/i/article/2021…

Steven Dillmann retweeted
Diyi Yang @Diyi_Yang
Two amazing postdocs from our lab are on the academic job market this year. I've learned a lot from their wonderful research -- you should definitely reach out and hire them!
Steven Dillmann retweeted
Etash Guha @etash_guha
OpenThoughts is going to be an Oral Presentation at ICLR! It's my first oral presentation so super excited! See y'all in Brazil! :)
Steven Dillmann retweeted
Alex Shaw @alexgshaw
Yesterday's OpenAI and Anthropic Terminal-Bench 2.0 results used different harnesses. Run both in Terminus 2 ➡️ ~similar scores (within noise). Harnesses matter! Congrats to both teams on incredible models!
Steven Dillmann retweeted
Alex Ratner @ajratner
Exciting mention of TBench 2.0 in today's model releases - congrats to @Mike_A_Merrill @alexgshaw & team + proud of @SnorkelAI 's contributions! Benchmarks are just one (limited) measurement tool - but critical guideposts of frontier progress. Much more to build here ahead!
Steven Dillmann retweeted
Ryan Marten @ryanmart3n
if you want to help us create harder tasks, come join the terminal-bench-3.0 effort: discord.gg/6xWPKhGDbA