Kelly Buchanan

1K posts

@ekellbuch

Postdoctoral Fellow @Stanford with @HazyResearch and @Scott_linderman. Working on 🤖🧠 PhD @Columbia @ZuckermanBrain @GoogleAI

Palo Alto, CA · Joined July 2011
2.2K Following · 1.7K Followers
Pinned Tweet
Kelly Buchanan @ekellbuch
Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark! But the rankings survived, absolute scores moved up to 12pp!
27
74
767
83.8K
Kelly Buchanan retweeted
Jeff Dean @JeffDean
1/ Today at #GoogleIO, we’re releasing Gemini 3.5, our latest family of models combining frontier intelligence with action. We’re starting by releasing 3.5 Flash, which is built to help you execute complex, long-horizon agentic workflows. Gemini 3.5 Flash is our strongest model for coding and agents yet. It outscores 3.1 Pro on agentic and coding benchmarks like Terminal-Bench and MCP Atlas, while running 4x faster than other frontier models. Used in Google Antigravity, 3.5 Flash is even further optimized to be up to 12x faster. It’s a powerful engine to deploy sub-agents that collaborate, run high-frequency iterative loops, and solve real-world problems at scale. Some highlights we’re excited about 🔽
79
187
1.4K
116.1K
Kelly Buchanan retweeted
Google @Google
Gemini 3.5 Flash is built to help you execute complex, agentic workflows. 3.5 Flash rivals flagship models to deliver frontier performance for agents and coding, at the lightning speeds you expect from the Flash series.
76
180
2.3K
935.8K
Kelly Buchanan retweeted
Omri Weinstein @WeinsteinOmri
A milestone for Pearl Research Labs: our first major enterprise partnership is live with Together AI. @togethercompute’s inference platform is an ideal demonstration of @prlnet's value proposition: one of the world’s most advanced hyperscalers running AI workloads on Pearl’s 2-for-1 CUDA kernels, turning inference into ¶PRL coins, and reducing consumer LLM price per token. Excited for what we’ll build together.
Together AI @togethercompute

Introducing Gemma-4-31B-it-Pearl on Together AI, Pearl Research Labs’ instruction-tuned checkpoint of Gemma 4 31B powered by @prlnet's Proof of Useful Work protocol. AI natives can now use this Pearl model as a serverless inference endpoint on Together AI, at a 25%+ discount.

7
10
83
13K
Kelly Buchanan retweeted
Epoch AI @EpochAIResearch
We are conducting an AI-assisted review of FrontierMath: Tiers 1-4. This has flagged fatal errors in about a third of problems, and we believe most of these flags to be valid. We will release updated scores on a corrected dataset after completing a thorough human review.
31
67
875
463.3K
Kelly Buchanan retweeted
Kevin Li @kevin_x_li
Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages huggingface.co/datasets/Alien…
19
68
524
77.2K
Kelly Buchanan retweeted
Snorkel AI @SnorkelAI
We're humbled that @SnorkelAI co-authored Continual Learning Bench with @BerkeleySky and partnered with @harvey on LAB and @ekellbuch on TerminalBench 2.1. More coming soon!
vincent sunn chen @vincentsunnchen

last week was a fun week for benchmarks, which advanced the key axes for measuring frontier AI:
- Legal Agent Benchmark (LAB) (from @harvey) → environment complexity: 1200+ tasks covering realistic instructions and work products, with expert rubrics
- Continual Learning Bench (from @BerkeleySky & @SnorkelAI) → autonomy horizon: the first benchmark to capture the ability of AI systems to learn from experience
- ProgramBench (from @Meta & @StanfordAILab) → output complexity: expanding the scope of tasks from patches to entire programs, with a 0% pass rate
- Bonus (from @ekellbuch & @terminalbench): TerminalBench 2.1 released with 28/89 tasks audited & fixed, showing that continuous quality control & task-level rigor are critical for enduring benchmarks!

1
6
34
2.8K
Kelly Buchanan retweeted
Druv Pai @druv_pai
Pretty belated, but a great time to mention that I recently joined @thinkymachines! It’s been a pleasure to work with such kind and brilliant colleagues. We released a preview of real-time multimodal interaction models yesterday — check it out!
Thinking Machines @thinkymachines

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…

3
5
72
5.9K
Kelly Buchanan retweeted
Noam Brown @polynoamial
And, of course, they should be plotted with compute, latency, or cost on the x-axis.
6
1
114
7.1K
Kelly Buchanan retweeted
Unitree @UnitreeRobotics
Unitree Unveils: GD01, A Manned Transformable Mecha, from $650,000 👏 The world's first production-ready manned mecha. It can transform. It's a civilian vehicle. It weighs ~500kg with you inside. Please everyone be sure to use the robot in a Friendly and Safe manner.
1.1K
3.2K
17.1K
9M
Kelly Buchanan retweeted
clem 🤗 @ClementDelangue
We just crossed 1,000,000 public datasets on Hugging Face! That's petabytes of data that millions of AI builders are downloading, analyzing, and training AI models on every day! What's interesting is that we see a clear acceleration since agents started to be good: the number of datasets doubled over the past 8 months (it took 4 years to reach the first 500k). It's becoming easier and faster to build, share and use your own datasets! Many are saying the next bottleneck for more people to build AI themselves (instead of relying on APIs) is better data, so we're just getting started! Thanks everyone for your amazing contributions, we couldn't do it without you!
37
51
281
17K
Kelly Buchanan retweeted
clem 🤗 @ClementDelangue
Local open-weight AI on a laptop has been improving more than twice as fast as Moore's Law! Between May 2024 and May 2026, the most expensive MacBook Pro you could buy stayed at 128 GB of unified memory. The hardware ceiling barely moved. But the smartest open-weight model from @huggingface you could actually run on it went from a score of 10 (Llama 3 70B) to 47 (DeepSeek V4 Flash on @antirez's mixed-Q2 GGUF) on the @ArtificialAnlys Intelligence Index. That is 4.7× in 24 months, or a doubling of intelligence every 10.7 months. Moore's Law (transistor count) doubles every 24 months. Local open-weight AI on a laptop has been improving more than twice as fast as Moore's Law, on completely unchanged hardware.
48
92
613
57.5K
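The doubling-time figure in the post above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming steady exponential growth between the two quoted scores (10 and 47 over 24 months); the function name `doubling_time_months` is a hypothetical helper introduced here for illustration and does not come from the post.

```python
import math


def doubling_time_months(start_score, end_score, elapsed_months):
    """Months needed for one doubling, assuming steady exponential growth.

    Hypothetical helper for illustration only.
    """
    growth_factor = end_score / start_score   # 47 / 10 = 4.7 in the post
    doublings = math.log2(growth_factor)      # number of doublings in the period
    return elapsed_months / doublings


# Figures quoted in the post: score 10 -> 47 over 24 months.
local_ai = doubling_time_months(10, 47, 24)

# Moore's Law doubling period as stated in the post.
moores_law_months = 24

print(f"Doubling time: {local_ai:.2f} months")
print(f"Speed relative to Moore's Law: {moores_law_months / local_ai:.2f}x")
```

Under these assumptions the result lands a little under 11 months per doubling, which is consistent with the post's "more than twice as fast" comparison against a 24-month doubling period.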
Kelly Buchanan retweeted
Christian Szegedy @ChrSzegedy
Memory is the new oil
9
11
182
17.9K
Kelly Buchanan retweeted
Kazuki Irie @kzkirie
Great to see Neural Data Router (ICLR 2022) also covered here. By @robert_csordas ahead of time, now popular: improving reasoning & generalization with shared-layer / looped / universal transformers. Arguing _why_ a "recurrent loop" helps + intuitive algorithmic illustration.
hardmaru @hardmaru

Reproducing all of Schmidhuber’s papers (1990-2025) using an AI coding assistant. Cool project by @yaroslavvb! It even reproduced the “World Models” paper by me and @SchmidhuberAI with a toy env, with a full VAE + RNN world model implementation. Project: github.com/cybertronai/sc…

1
6
36
6K
Kelly Buchanan retweeted
rohan anil @_arohan_
There is no pre-training, post-training, or test-time training. There are only priors, updates, constraints, and compute budgets. There is only TRAINING. Last several years we shipped the org chart to fundamental optimization science.
22
35
538
67.3K