Kelly Buchanan

943 posts

@ekellbuch

Postdoctoral Fellow @Stanford with @HazyResearch and @Scott_linderman. Working on 🤖🧠 PhD @Columbia @ZuckermanBrain @GoogleAI

Palo Alto, CA · Joined July 2011
2.2K Following · 1.3K Followers
Pinned tweet
Kelly Buchanan
Kelly Buchanan@ekellbuch·
LLMs can generate 100 answers, but which one is right? Check out our latest work closing the generation-verification gap by aggregating weak verifiers and distilling them into a compact 400M model. If this direction is exciting to you, we’d love to connect.
Jon Saad-Falcon@JonSaadFalcon

How can we close the generation-verification gap when LLMs produce correct answers but fail to select them? 🧵 Introducing Weaver: a framework that combines multiple weak verifiers (reward models + LM judges) to achieve o3-mini-level accuracy with much cheaper non-reasoning models like Llama 3.3 70B Instruct! 🧵(1 / N)
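The aggregation idea above can be sketched in a few lines. This is an illustrative toy, not the Weaver implementation: each weak verifier scores every candidate answer, scores are combined with per-verifier weights (assumed estimated offline), and the highest-scoring candidate is selected. The stand-in verifiers `length_ok` and `has_digit` are hypothetical.

```python
def select_answer(candidates, verifiers, weights):
    """Pick the candidate answer with the highest weighted verifier score.

    candidates: list of answer strings
    verifiers: functions mapping an answer -> score in [0, 1]
    weights: per-verifier reliability weights (illustrative here)
    """
    def aggregate(answer):
        return sum(w * v(answer) for v, w in zip(verifiers, weights))
    return max(candidates, key=aggregate)

# Toy stand-in verifiers (purely for illustration):
length_ok = lambda a: 1.0 if len(a) <= 4 else 0.0
has_digit = lambda a: 1.0 if any(c.isdigit() for c in a) else 0.0

best = select_answer(["42", "a long wrong answer"],
                     [length_ok, has_digit], [0.6, 0.4])
print(best)  # "42" (scores 0.6*1 + 0.4*1 = 1.0 vs 0.0)
```

The real framework's contribution is in how the verifier weights are learned without labels and then distilled into a small model; the selection step itself reduces to a weighted vote like this.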

2
15
62
11.6K
Kelly Buchanan retweeted
Ziran Yang
Ziran Yang@__zrrr__·
Introducing Goedel-Code-Prover 🌲 LLMs write code, but can they prove it correct? Not just pass tests, but construct machine-checkable proofs that a program works for ALL possible inputs. We built a system that does exactly this. Given a program and its specification in Lean 4, Goedel-Code-Prover automatically synthesizes formal proofs of correctness. Our 8B model achieves a 62% overall success rate across three benchmarks (Verina, Clever & AlgoVeri), a 2.6x improvement over the strongest baseline, surpassing both frontier LLMs (GPT/Gemini/Claude) and open-source theorem provers up to 84x larger (DeepSeek-Prover/Goedel-Prover/Kimina-Prover/BFS-Prover).
Ziran Yang tweet media
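For readers unfamiliar with the setting, here is a minimal hand-written illustration (not output from Goedel-Code-Prover) of what "a program and its specification in Lean 4, plus a machine-checked proof for all inputs" looks like:

```lean
-- A tiny program:
def double (n : Nat) : Nat := n + n

-- Its specification and a proof that it holds for EVERY input n,
-- checked by Lean's kernel rather than by running test cases.
theorem double_spec (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

The system described in the tweet automates exactly this last step, synthesizing the proof term for far less trivial programs and specifications.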
19
72
537
63.5K
Kelly Buchanan retweeted
Chroma
Chroma@trychroma·
Introducing Chroma Context-1, a 20B parameter search agent.
> pushes the Pareto frontier of agentic search
> order of magnitude faster
> order of magnitude cheaper
> Apache 2.0, open-source
137
394
4.1K
1M
Kelly Buchanan retweeted
Stuart Sul
Stuart Sul@stuart_sul·
Happy to share new ThunderKittens attention kernels for B300 GPUs -- faster than FA4! Check it out:
Nash Brown@nash_c_brown

Excited to share new ThunderKittens attention kernels that match or outperform Flash Attention 4 on Blackwell GPUs! Currently only supports QK192/V128 shapes, but more coming soon. Check out the code here: github.com/HazyResearch/T… Shoutout to the FA4 team for the algorithmic innovations and to @stuart_sul for the helpful discussions.

2
13
150
13.1K
Kelly Buchanan retweeted
Phillip Isola
Phillip Isola@phillip_isola·
@GuanyaShi I mostly see "algorithmic novelty" as a cost. That cost needs to be justified by a sufficiently surprising result (e.g., a new capability or insight). All the better if you get the same with zero change in method. My rough heuristic: value = log P(results) - log P(methods)
2
5
105
5.9K
Kelly Buchanan retweeted
Fireworks AI
Fireworks AI@FireworksAI_HQ·
We’re seeing lots of interest in how Cursor delivered Composer 2. One less obvious insight: you don't need to spend billions on a giant cluster to do reinforcement learning. With disaggregated sampling, we ran @Cursor_ai Composer 2 training across 3-4 clusters worldwide, unified into one capacity pool by Fireworks Virtual Cloud. Check out how we optimize cross-region 1TB+ model updates by 98%+ while keeping staleness under a few minutes: fireworks.ai/blog/frontier-…
Cursor@cursor_ai

We're releasing a technical report describing how Composer 2 was trained.

5
27
329
77.8K
Kelly Buchanan retweeted
Xavier Gonzalez
Xavier Gonzalez@xavierjgonzalez·
Parallelizing nonlinear RNNs is gaining traction! More efficient than transformers; more expressive than linear RNNs. My PhD thesis provides an intro guide to the math (Newton's method) behind the parallelization. Great as a quick-start if you want to explore this new field!
Xavier Gonzalez tweet media
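The core trick can be shown with a simplified fixed-point iteration (a Picard-style relative of the Newton iterations the thesis covers; this is an illustrative sketch, not the thesis's exact algorithm). The sequential recurrence h[t] = tanh(a*h[t-1] + x[t]) is treated as a fixed-point equation over the whole sequence, so every timestep is updated in parallel per iteration:

```python
import numpy as np

def parallel_rnn(x, a=0.5, iters=50):
    """Solve the nonlinear recurrence for all timesteps at once
    by iterating to a fixed point (converges exactly within
    len(x) iterations for this shifted update)."""
    h = np.zeros_like(x)                          # guess for all h[t]
    for _ in range(iters):
        h_prev = np.concatenate(([0.0], h[:-1]))  # shifted h[t-1]
        h = np.tanh(a * h_prev + x)               # update every t in parallel
    return h

def sequential_rnn(x, a=0.5):
    """Reference: the ordinary one-step-at-a-time recurrence."""
    h, out = 0.0, []
    for xt in x:
        h = np.tanh(a * h + xt)
        out.append(h)
    return np.array(out)

x = np.array([0.1, -0.2, 0.3, 0.05])
print(np.allclose(parallel_rnn(x), sequential_rnn(x)))  # True
```

Newton's method replaces this naive update with a linearized solve per iteration, which converges in far fewer steps and is what makes the approach competitive on long sequences.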
6
48
359
31.2K
Kelly Buchanan retweeted
Stephen Roller
Stephen Roller@stephenroller·
@srush_nlp I find people unfamiliar with scaling are shocked by this:
Stephen Roller tweet media
17
26
279
0
Yusan Lin
Yusan Lin@yusan_lin·
Today @mirrormirror_ai is launching the marketplace where fashion models license their likeness and brands get stunning AI-generated imagery featuring real people. Commercially licensed, model-approved. Try our platform: mirrormirrorai.com

As a fashion model I used to spend hours on fashion photoshoot sets. I later did my PhD in CS and became a Research Scientist on AI for fashion. I can see clearly that AI image generation is replacing a large portion of my old job. But brands that use AI recklessly have already paid the price. It damages reputations and hurts the bottom line. Putting real people at the core of AI-generated imagery isn't just about avoiding backlash. It's better business. That's what Mirror Mirror AI is built for.

Right now, Mirror Mirror AI houses agency-signed models who have graced the covers of Vogue and Harper's Bazaar. You can digitally book them using our fashion-centric AI software, get your campaign done in hours instead of weeks, and never have to fly anyone in. You purchase a license for commercial use upon approval, and the models get paid.

Mirror Mirror AI is also opening a global call for independent models from anywhere in the world to apply to be featured on the platform. Work with fashion brands internationally, choose the projects you take on, and earn from your own likeness on your own terms. Selected models will be announced at an exclusive event in New York during @Techweek_ this June. Apply for the open call: mirrormirrorai.com/open-call

A huge thank you to our incredible team for pouring their hearts into this launch, and to a16z @speedrun for believing in our vision from the start. We're just getting started.
112
64
832
201.7K
Kelly Buchanan retweeted
Christina Baek
Christina Baek@_christinabaek·
Models are typically specialized to new domains by finetuning on small, high-quality datasets. We find that repeating the same dataset 10–50× starting from pretraining leads to substantially better downstream performance, in some cases outperforming larger models. 🧵
Christina Baek tweet media
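The data recipe being described, repeating a small high-quality domain set many times within the training stream, reduces to a very simple mixing step. A minimal sketch (names and ratios are illustrative, not taken from the paper):

```python
import random

def build_mixture(pretrain_docs, domain_docs, repeats=20, seed=0):
    """Mix a small domain dataset, repeated `repeats` times,
    into the larger pretraining stream, then shuffle."""
    mix = list(pretrain_docs) + list(domain_docs) * repeats
    random.Random(seed).shuffle(mix)
    return mix

mix = build_mixture(["web_doc_1", "web_doc_2"], ["domain_doc"], repeats=20)
print(mix.count("domain_doc"))  # 20
```

The paper's finding is about when to apply this (starting from pretraining rather than as a final finetune) and how many repetitions (10-50x) pay off downstream.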
18
81
614
90.4K
Kelly Buchanan retweeted
Jon Saad-Falcon
Jon Saad-Falcon@JonSaadFalcon·
Personal AI should run on your personal devices. So, we built OpenJarvis: a personal AI that lives, learns, and works on-device. Try it today and top the OpenJarvis Leaderboard for a chance to win a Mac Mini! Collab w/ @Avanika15, John Hennessy, @HazyResearch, and @Azaliamirh. Details in thread.
Jon Saad-Falcon tweet media
36
92
319
98.9K
Kelly Buchanan retweeted
Zitong Yang
Zitong Yang@ZitongYang0·
This is only possible with @tyler_griggs_'s tool use library github.com/thinking-machi…

I am unfortunately late to the party, but I only recently realized how much of a paradigm shift multi-turn+tool-use is. I even wonder if it makes sense to rewrite the entire pretraining corpus into an agentic trajectory? This solves two problems: (1) removing the gap between pretraining and test distribution; (2) agentic turn change can function as a natural "glue" that puts related internet documents together in context -- an agent browsing one document at turn 7 influences its action/generation at turn 107 -- encoding the internet in a natural long-context format.

Also, a great time to share that I have joined @thinkymachines. Thanks @miramurati for teaching me the value of focus, @lilianweng for instilling in me the power of responsibility, and @johnschulman2 for showing me by example the free spirit of scientific exploration! We are hiring job-boards.greenhouse.io/thinkingmachin…
clare ❤️‍🔥@clarejtbirch

kind of a big deal but actual legend @ZitongYang0 has integrated @tinkerapi with @harborframework, so you can use Harbor on Tinker w ~no code change now 🤠🧡

5
7
116
30.2K
Kelly Buchanan retweeted
Omar Shaikh
Omar Shaikh@oshaikh13·
What’s the point of a “helpful assistant” if you have to always tell it what to do next? In a new paper, we introduce a reasoning model that predicts what you’ll do next over long contexts (LongNAP 💤). We trained it on 1,800 hours of computer use from 20 users. 🧵
16
81
292
98.3K
Kelly Buchanan retweeted
Sam Buchanan
Sam Buchanan@_sdbuchanan·
We've released an updated "v2.0" of our book on deep representation learning! We've reorganized and improved many sections for better pedagogical clarity, and added many new examples and applications throughout the book. Massive thanks are due to folks in the community who submitted feedback and corrections on the first version, including @sirbayes :-) 📕Read: ma-lab-berkeley.github.io/deep-represent… 🛠️Contribute: github.com/Ma-Lab-Berkele…
Kevin Patrick Murphy@sirbayes

I am delighted to see a new version of the book by @_sdbuchanan, @druv_pai , @pengwang2003 and @YiMaTweets . This is the best book on the foundations of deep representation learning! In this era of coding agents, the math is all you need to learn :) ma-lab-berkeley.github.io/deep-represent…

1
4
40
6.1K
Kelly Buchanan retweeted
Ken Liu
Ken Liu@kenziyuliu·
Can we build a blind, *unlinkable inference* layer where ChatGPT/Claude/Gemini can't tell which call came from which users, like a “VPN for AI inference”? Yes! Blog post below + we built it into open source infra/chat app and served >15k prompts at Stanford so far. How it helps with AI user privacy:

# The AI user privacy problem

If you ask AI to analyze your ChatGPT history today, it’s surprisingly easy to infer your demographics, health, immigration status, and political beliefs. Every prompt we send accumulates into an (identity-linked) profile that the AI lab controls completely and indefinitely. At a minimum this is a goldmine for ads (as we know now). A bigger issue is the concentration of power: AI labs can easily become (or be asked to become) a Cambridge Analytica, whistleblow your immigration status, or work with health insurance to adjust your premium if they so choose. This is a uniquely worse problem than search engines because your average query is now more revealing (not just keywords), interactive, and intelligence is now cheap. Despite this, most of us still want these remote models; they’re just too good and convenient! (This is aka the "privacy paradox".)

# Unlinkable inference as a user privacy architecture

The idea of unlinkable inference is to add privacy while preserving access to the remote models controlled by someone else. A “privacy wrapper” or “VPN for AI inference”, so to speak. Concretely, it’s a blind inference middle layer that:
(1) consists of decentralized proxies that anyone can operate;
(2) blindly authenticates requests (via blind signatures / RFC 9474, 9578) so requests are provably sandboxed from each other and from user identity;
(3) relays prompts over randomly chosen proxies that don’t see or log traffic (via client-side ephemeral keys or hosting in TEEs); and
(4) the provider simply sees a mixed pool of anonymous prompts from the proxies. No state, pseudonyms, or linkable metadata.

If you squint, an unlinkable inference layer is essentially a vendor for per-request, anonymous, ephemeral AI access credentials (for users or agents alike). It partitions your context so that user tracking is drastically harder. Obviously, unlinkability isn’t a silver bullet: the prompt itself still goes to the remote model and can leak privacy (so don't use our chat app for a therapy session!). It aims to combat *longitudinal tracking* as a major threat to user privacy, and its statistical power increases quickly by mixing more users and requests. Unlinkability can be applied at any granularity. For an AI chat app, you can unlinkably request a fresh ephemeral key for every session so tracking is virtually impossible.

# The Open Anonymity Project

We started this project with the belief that intelligence should be a truly public utility. Like water and electricity, providers should be compensated by usage, not by who you are or what you do with it. We think unlinkable inference is a first step towards this “intelligence neutrality”.

# Try it out! It’s quite practical

- Chat app “oa-chat”: chat.openanonymity.ai (<20 seconds to get going)
- Blog post that should be a fun read: openanonymity.ai/blog/unlinkabl…
- Project page: openanonymity.ai
- GitHub: github.com/OpenAnonymity
Ken Liu tweet media
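The request flow in steps (1)-(4) can be simulated in miniature. This toy deliberately replaces the real cryptography (blind signatures per RFC 9474) with opaque one-time tokens, purely to show what the provider can and cannot observe; all class and function names are hypothetical:

```python
import random
import uuid

class Proxy:
    def relay(self, token, prompt, provider):
        # The proxy forwards only a one-time token and the prompt:
        # no user id, no logs, so calls cannot be tied to a user.
        return provider.serve(token, prompt)

class Provider:
    def __init__(self):
        self.seen = []  # everything the model provider observes
    def serve(self, token, prompt):
        self.seen.append((token, prompt))
        return f"answer to: {prompt}"

def send_unlinkable(prompts, proxies, provider):
    for prompt in prompts:
        token = uuid.uuid4().hex        # fresh ephemeral credential per request
        proxy = random.choice(proxies)  # random proxy per request
        proxy.relay(token, prompt, provider)

provider = Provider()
send_unlinkable(["q1", "q2"], [Proxy(), Proxy()], provider)
# Every observed token is distinct, so the two requests look unrelated:
tokens = [t for t, _ in provider.seen]
print(len(set(tokens)) == len(tokens))  # True
```

In the real system the "token" is a blindly-issued signature the provider can verify but cannot link back to issuance, which is what makes the anonymity cryptographic rather than policy-based.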
62
157
828
374.5K
Kelly Buchanan retweeted
Together AI
Together AI@togethercompute·
Introducing the official Together MCP server! Use it in your favorite coding agent to build AI apps, fine-tune models, or spin up clusters faster.
Together AI tweet media
3
3
13
1.5K
Kelly Buchanan retweeted
nathan chen
nathan chen@nathancgy4·
There’s likely no better way to do scientific model architecture research than to climb the scaling ladder, as we’ve done for every improvement in Kimi. You rapidly test changes at a small model scale (e.g. 3b total, 800m active) that gives fast enough feedback time, then continue running the working ones on larger models step by step. This differentiates the changes that actually work vs. the ones that won’t scale, e.g. enabling some form of inductive bias which small models can’t learn. Now that this research mode is starting to get automated with LLMs, I see no other outcome than an LLM (or group of agents) coming up with real innovation one day very soon. Current frontier models already have good research taste. Scary yet amazing things are about to happen :)
Andrej Karpathy@karpathy

Three days ago I left autoresearch tuning nanochat for ~2 days on a depth=12 model. It found ~20 changes that improved the validation loss. I tested these changes yesterday and all of them were additive and transferred to larger (depth=24) models. Stacking up all of these changes, today I measured that the leaderboard's "Time to GPT-2" drops from 2.02 hours to 1.80 hours (~11% improvement); this will be the new leaderboard entry. So yes, these are real improvements and they make an actual difference. I am mildly surprised that my very first naive attempt already worked this well on top of what I thought was already a fairly manually well-tuned project.

This is a first for me because I am very used to doing the iterative optimization of neural network training manually. You come up with ideas, you implement them, you check if they work (better validation loss), you come up with new ideas based on that, you read some papers for inspiration, etc. This is the bread and butter of what I do daily, for 2 decades. Seeing the agent do this entire workflow end-to-end and all by itself as it worked through approx. 700 changes autonomously is wild. It really looked at the sequence of results of experiments and used that to plan the next ones. It's not novel, ground-breaking "research" (yet), but all the adjustments are "real", I didn't find them manually previously, and they stack up and actually improved nanochat. Among the bigger things, e.g.:
- It noticed an oversight that my parameterless QKnorm didn't have a scalar multiplier attached, so my attention was too diffuse. The agent found multipliers to sharpen it, pointing to future work.
- It found that the Value Embeddings really like regularization and I wasn't applying any (oops).
- It found that my banded attention was too conservative (I forgot to tune it).
- It found that AdamW betas were all messed up.
- It tuned the weight decay schedule.
- It tuned the network initialization.

This is on top of all the tuning I've already done over a good amount of time. The exact commit is here, from this "round 1" of autoresearch. I am going to kick off "round 2", and in parallel I am looking at how multiple agents can collaborate to unlock parallelism. github.com/karpathy/nanoc…

All LLM frontier labs will do this. It's the final boss battle. It's a lot more complex at scale of course -- you don't just have a single train.py file to tune. But doing it is "just engineering" and it's going to work. You spin up a swarm of agents, you have them collaborate to tune smaller models, you promote the most promising ideas to increasingly larger scales, and humans (optionally) contribute on the edges. And more generally, *any* metric you care about that is reasonably efficient to evaluate (or that has more efficient proxy metrics such as training a smaller network) can be autoresearched by an agent swarm. It's worth thinking about whether your problem falls into this bucket too.
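The workflow described in this thread, propose a change, run a cheap experiment, keep it only if the metric improves, then stack the winners, is a simple loop at its core. A hedged sketch (a toy quadratic stands in for validation loss; none of this is the actual nanochat autoresearch code):

```python
import random

def proxy_loss(cfg):
    """Toy stand-in for validation loss, minimized at lr=0.003, wd=0.1."""
    return (cfg["lr"] - 0.003) ** 2 + (cfg["wd"] - 0.1) ** 2

def autoresearch(cfg, rounds=200, seed=0):
    """Greedy propose-evaluate-keep loop over hyperparameters."""
    rng = random.Random(seed)
    best = proxy_loss(cfg)
    for _ in range(rounds):
        key = rng.choice(list(cfg))           # pick a knob to tweak
        trial = dict(cfg)
        trial[key] *= rng.uniform(0.5, 1.5)   # propose a small change
        loss = proxy_loss(trial)              # "run the experiment"
        if loss < best:                       # keep only improvements
            cfg, best = trial, loss
    return cfg, best

cfg, best = autoresearch({"lr": 0.01, "wd": 0.5})
print(best < proxy_loss({"lr": 0.01, "wd": 0.5}))  # True: loss improved
```

The hard parts at real scale, which the thread points at, are making the evaluation cheap (proxy metrics, small models), letting an agent plan informative next experiments rather than tweaking at random, and promoting winners up the scaling ladder.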

13
54
727
65.5K