Kelly Buchanan

1K posts


@ekellbuch

Postdoctoral Fellow @Stanford with @HazyResearch and @Scott_linderman. Working on 🤖🧠 PhD @Columbia @ZuckermanBrain @GoogleAI

Palo Alto, CA · Joined July 2011

2.2K Following · 1.7K Followers

Pinned Tweet

Kelly Buchanan@ekellbuch·7 May

Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. 30% of the benchmark! But the rankings survived, absolute scores moved up to 12pp!


Kelly Buchanan retweeted

Jeff Dean@JeffDean·1d

1/ Today at #GoogleIO, we’re releasing Gemini 3.5, our latest family of models combining frontier intelligence with action. We’re starting by releasing 3.5 Flash, which is built to help you execute complex, long-horizon agentic workflows. Gemini 3.5 Flash is our strongest model for coding and agents yet. It outscores 3.1 Pro on agentic and coding benchmarks like Terminal-Bench and MCP Atlas, while running 4x faster than other frontier models. Used in Google Antigravity, 3.5 Flash is even further optimized to be up to 12x faster. It’s a powerful engine to deploy sub-agents that collaborate, run high-frequency iterative loops, and solve real-world problems at scale. Some highlights we’re excited about 🔽


Kelly Buchanan retweeted

Google@Google·1d

Gemini 3.5 Flash is built to help you execute complex, agentic workflows. 3.5 Flash rivals flagship models to deliver frontier performance for agents and coding, at the lightning speeds you expect from the Flash series.


Kelly Buchanan retweeted

Sasha Rush@srush_nlp·1d

Been working on text feedback / OPSD in Composer. Really interesting space, and much more to be explored.

Cursor@cursor_ai

We improved Composer by scaling training, generating more complex RL environments, and introducing new learning methods. For example, we use text feedback during RL to learn faster by assigning credit in rollouts spanning hundreds of thousands of tokens.


Kelly Buchanan retweeted

Omri Weinstein@WeinsteinOmri·5d

A milestone for Pearl Research Labs: our first major enterprise partnership is live with Together AI. @togethercompute’s inference platform is an ideal demonstration of @prlnet's value proposition — One of the world’s most advanced hyperscalers running AI workloads on Pearl’s 2-for-1 Cuda kernels, turning inference into ¶PRL coins, and reducing consumer LLM price per token. Excited for what we’ll build together.

Together AI@togethercompute

Introducing Gemma-4-31B-it-Pearl on Together AI, Pearl Research Labs’ instruction-tuned checkpoint of Gemma 4 31B powered by @prlnet Proof of Useful Work protocol. AI natives can now use this Pearl model as a serverless inference endpoint on Together AI, at a 25%+ discounted pricing.


Kelly Buchanan@ekellbuch·5d

@anh_ng8 It is now available in github.com/harbor-framewo… Thanks to @alexgshaw !


Anh Totti Nguyen@anh_ng8·9 May

@ekellbuch Hi @ekellbuch , thank you for the great work!!! I wanted to try it but I'm seeing tbench.ai/benchmarks/ter… (tasks not uploaded) and github.com/harbor-framewo… (PR not merged yet). Is it ready for the public to try or we should wait for now? :)


Kelly Buchanan@ekellbuch·5d

@MogicianTony thank you!


Hao Wang@MogicianTony·7 May

Great work in improving the quality of the benchmark!


Kelly Buchanan@ekellbuch·5d

@GolfonCBS @_sdbuchanan


Golf on CBS ⛳@GolfonCBS·12 May

How many golfers can Rory McIlroy identify just by watching their swing silhouette? The results are QUITE impressive.


Kelly Buchanan retweeted

Epoch AI@EpochAIResearch·12 May

We are conducting an AI-assisted review of FrontierMath: Tiers 1-4. This has flagged fatal errors in about a third of problems, and we believe most of these flags to be valid. We will release updated scores on a corrected dataset after completing a thorough human review.


Kelly Buchanan retweeted

Kevin Li@kevin_x_li·13 May

Introducing SWE-ZERO-12M-trajectories: the largest agentic trace dataset in the open, 5.7x larger than the previous largest. 112B tokens · 12M trajectories · 122K PRs · 3K repos · 16 languages huggingface.co/datasets/Alien…


Kelly Buchanan@ekellbuch·13 May

@SnorkelAI @BerkeleySky @harvey Thank you very much for the support @SnorkelAI !


Kelly Buchanan retweeted

Snorkel AI@SnorkelAI·12 May

We're humbled that @SnorkelAI co-authored Continual Learning Bench with @BerkeleySky and partnered with @harvey on LAB and @ekellbuch on TerminalBench 2.1. More coming soon!

vincent sunn chen@vincentsunnchen

last week was a fun week for benchmarks, which advanced the key axes for measuring frontier AI:
- Legal Agent Benchmark (LAB) (from @harvey) → environment complexity: 1200+ tasks covering realistic instructions and work products, with expert rubrics
- Continual Learning Bench (from @BerkeleySky & @SnorkelAI) → autonomy horizon: the first benchmark to capture ability of AI systems to learn from experience
- ProgramBench (from @Meta & @StanfordAILab) → output complexity: expanding the scope of tasks from patches to entire programs, with a 0% pass rate
- Bonus (from @ekellbuch & @terminalbench): TerminalBench 2.1 released with 28/89 tasks audited & fixed - showing that continuous quality control & task-level rigor are critical for enduring benchmarks!


Kelly Buchanan retweeted

Druv Pai@druv_pai·12 May

Pretty belated, but a great time to mention that I recently joined @thinkymachines! It’s been a pleasure to work with such kind and brilliant colleagues. We released a preview of real-time multimodal interaction models yesterday — check it out!

Thinking Machines@thinkymachines

People talk, listen, watch, think, and collaborate at the same time, in real time. We've designed an AI that works with people the same way. We share our approach, early results, and a quick look at our model in action. thinkingmachines.ai/blog/interacti…


Kelly Buchanan retweeted

Noam Brown@polynoamial·12 May

And, of course, they should be plotted with compute, latency, or cost on the x-axis.


Kelly Buchanan retweeted

Unitree@UnitreeRobotics·12 May

Unitree Unveils: GD01, A Manned Transformable Mecha, from $650,000 👏 The world's first production-ready manned mecha. It can transform. It's a civilian vehicle. It weighs ~500kg with you inside. Please everyone be sure to use the robot in a Friendly and Safe manner.


Kelly Buchanan retweeted

clem 🤗@ClementDelangue·12 May

We just crossed 1,000,000 public datasets on Hugging Face! That's petabytes of data available that millions of AI builders are downloading, analyzing, and training AI models on every day! What's interesting is that we see a clear acceleration since agents started to be good as the number of datasets doubled over the past 8 months (it took 4 years to reach the first 500k). It's becoming easier and faster to build, share and use your own datasets! Many are saying the next bottleneck for more people to build AI themselves (instead of relying on APIs) is better data so we're just getting started! Thanks everyone for your amazing contributions, we couldn't do it without you!


Kelly Buchanan retweeted

clem 🤗@ClementDelangue·11 May

Local open-weight AI on a laptop has been improving more than twice as fast as Moore's Law! Between May 2024 and May 2026, the most expensive MacBook Pro you could buy stayed at 128 GB of unified memory. The hardware ceiling barely moved. But the smartest open-weight model from @huggingface you could actually run on it went from a score of 10 (Llama 3 70B) to 47 (DeepSeek V4 Flash on @antirez's mixed-Q2 GGUF) on the @ArtificialAnlys Intelligence Index. That is 4.7× in 24 months, or a doubling of intelligence every 10.7 months. Moore's Law (transistor count) doubles every 24 months. Local open-weight AI on a laptop has been improving more than twice as fast as Moore's Law, on completely unchanged hardware.
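The doubling-time figure in the post above can be checked with a few lines of arithmetic. The sketch below assumes the index score grows exponentially between the two dates quoted in the post (10 to 47 over 24 months); the function name `doubling_time_months` is made up for illustration and is not from the post.

```python
import math


def doubling_time_months(start_score: float, end_score: float, elapsed_months: float) -> float:
    """Months needed for the score to double, assuming steady exponential growth."""
    growth_factor = end_score / start_score      # 47 / 10 = 4.7
    doublings = math.log2(growth_factor)         # number of doublings in the period
    return elapsed_months / doublings


# Scores and time span quoted in the post.
months = doubling_time_months(10, 47, 24)

# Moore's Law doubling period as stated in the post (24 months).
speedup_vs_moore = 24 / months

print(f"doubling time: {months:.2f} months")
print(f"speed relative to Moore's Law: {speedup_vs_moore:.2f}x")
```

Under that assumption the doubling time comes out just under 11 months, in line with the post's 10.7 figure, and the ratio to a 24-month doubling is a little over 2.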


Kelly Buchanan retweeted

Christian Szegedy@ChrSzegedy·11 May

Memory is the new oil


Kelly Buchanan@ekellbuch·9 May

@MillionInt @Lin_Manuel @ricky_martin @isabelamerced @celeste_oconn


Jerry Tworek@MillionInt·9 May

If I was @Lin_Manuel I’d feel an urge to do a certain thing

Daniel Green@dgrreen

The Sam Altman and @miramurati texts from the day he got fired from @OpenAI in 2023 just became evidence in the @elonmusk v. @sama trial. It felt like a meaningful moment in AI history, so I turned it into a musical. The lyrics are the texts.


Kelly Buchanan retweeted

Kazuki Irie@kzkirie·9 May

Great to see Neural Data Router (ICLR 2022) also covered here. By @robert_csordas ahead of time, now popular: improving reasoning & generalization with shared-layer / looped / universal transformers. Arguing _why_ a "recurrent loop" helps + intuitive algorithmic illustration.

hardmaru@hardmaru

Reproducing all of Schmidhuber’s papers (1990-2025) using an AI coding assistant. Cool project by @yaroslavvb! It even reproduced the “World Models” paper by me and @SchmidhuberAI with a toy env, with a full VAE + RNN world model implementation. Project: github.com/cybertronai/sc…


Kelly Buchanan retweeted

rohan anil@_arohan_·9 May

There is no pre-training, post-training, or test-time training. There are only priors, updates, constraints, and compute budgets. There is only TRAINING. Last several years we shipped the org chart to fundamental optimization science.
