Jon Saad-Falcon

389 posts

@JonSaadFalcon

CS PhD @hazyresearch @stanfordnlp @StanfordAILab

Palo Alto, CA · Joined January 2021
936 Following · 1.8K Followers
Pinned Tweet
Jon Saad-Falcon (@JonSaadFalcon)
Data centers dominate AI, but they're hitting physical limits. What if the future of AI isn't just bigger data centers, but local intelligence in our hands?

The viability of local AI depends on intelligence efficiency. To measure this, we propose intelligence per watt (IPW): intelligence delivered (capabilities) per unit of power consumed (efficiency).

Today's local LMs already handle 88.7% of single-turn chat and reasoning queries, with local IPW improving 5.3× in 2 years, driven by better models (3.2×) and better accelerators (1.7×).

As local IPW improves, a meaningful fraction of workloads can shift from centralized infrastructure to local compute, with IPW serving as the critical metric for tracking this transition. (1/N)
55
142
456
227.7K
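The IPW definition in the pinned tweet can be sketched as a simple ratio. This is a minimal illustration, not the authors' measurement code: using accuracy as the proxy for "intelligence delivered", the function name, and the sample numbers are all assumptions. Treating the model and accelerator gains as multiplicative is also an assumption; the two rounded factors multiply to about 5.4, close to the reported 5.3×, so the published figures are presumably rounded.

```python
def intelligence_per_watt(accuracy: float, avg_power_watts: float) -> float:
    """Capability delivered per unit of power drawn (hypothetical helper)."""
    if avg_power_watts <= 0:
        raise ValueError("avg_power_watts must be positive")
    return accuracy / avg_power_watts


# Hypothetical numbers for illustration only; they are not from the paper.
older = intelligence_per_watt(accuracy=0.40, avg_power_watts=100.0)
newer = intelligence_per_watt(accuracy=0.60, avg_power_watts=50.0)
improvement = newer / older  # ratio of the two IPW values

# Assumption: model and accelerator gains compose multiplicatively.
combined_gain = 3.2 * 1.7

print(f"IPW improvement: {improvement:.2f}x")
print(f"Combined gain from rounded factors: {combined_gain:.2f}x")
```

Because IPW is a ratio, a setup can improve it either by answering more queries correctly at the same power draw or by holding accuracy steady on a more efficient accelerator.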
Jon Saad-Falcon retweeted
Avanika Narayan (@Avanika15)
“the edge + cloud are going to come together” — the 🐐 @Benioff @Benioff knows what’s up 💪🏽. hybrid local-cloud inference is the way. @JonSaadFalcon and I have been working on this for a minute. links to research in comments 👇
Maddy A (@its_maddy_a)

“I think we are getting brainwashed.” @Benioff said this on @theallinpod. “We’re using $300M of @AnthropicAI this year… the vast majority of those tokens don’t need to go to Anthropic.” Some tasks need @claudeai . Some need @OpenAI . Most need smaller, cheaper, faster models like @ZeroGPU_AI @Benioff believes in what we do - @salesforcevc should take a look. zerogpu.ai

1
1
16
2.2K
Jon Saad-Falcon retweeted
Kelly Buchanan (@ekellbuch)
Very excited to release Terminal-Bench 2.1! Coding agents are among the most economically consequential deployments of LLMs to date. As agents improve, benchmark reliability matters more. We audited TB2.0 and found and corrected issues in 28/89 tasks. That's roughly 30% of the benchmark! The rankings survived, but absolute scores moved by up to 12pp!
27
74
768
84.2K
Jon Saad-Falcon retweeted
Parth Asawa (@pgasawa)
Today, we’re releasing Continual Learning Bench 1.0: the first realistic benchmark for measuring how AI systems can improve in online settings. Benchmarks today assume models are stateless. Each example is independent, and once a system finishes a task, it moves on as if nothing happened. But deployed AI systems should learn from experience. We tested 10+ frontier systems against novel, expert-validated tasks and found there’s still plenty of headroom for learning. (1/n)
42
154
1.1K
824.3K
Jon Saad-Falcon retweeted
Avanika Narayan (@Avanika15)
hyped to see computer systems 🐐's like @JeffDean, david patterson, @AzaliaMirh & others discussing how intelligence per watt (ipw) should be the north star metric for computer system design. links to event notes + ipw work w/ @JonSaadFalcon in comments below!
3
8
48
8.5K
Jon Saad-Falcon retweeted
Michael Y. Li (@michaelyli_)
Can a language model learn, end-to-end, what to keep in its own KV cache and what to throw away? Can it learn to forget while it learns to reason? Deep learning's central lesson: capability emerges from end-to-end optimization, not heuristics/strong inductive biases. But for efficiency, we rely heavily on hand-designed approaches. 🗑️ Introducing Neural Garbage Collection (NGC): we train a language model to jointly reason and manage its own KV cache, using reinforcement learning with outcome-based task reward alone. No SFT, no proxy objectives, no summarization in natural language. New paper with @jubayer_hamid, Emily Fox, and @noahdgoodman!
30
133
905
163.1K
Jon Saad-Falcon retweeted
Avanika Narayan (@Avanika15)
@OpenJarvisAI is now just one tweet away 😊. file issues, make prs & more, directly from your socials. s/o to the amazing @robbymanihani for the 🚢
Jon Saad-Falcon (@JonSaadFalcon)

Say hi to @OpenJarvisAI 👋 If you have issues, want to make a PR, or simply chat, just @OpenJarvisAI in a tweet! This account is itself an OpenJarvis instance: running 24/7 on an NVIDIA DGX Spark, triaging issues + PRs on the repo and serving as a personal assistant for the lab! For personal AI on personal devices, check out: github.com/open-jarvis/Op… x.com/JonSaadFalcon/…

1
2
5
768
Jon Saad-Falcon retweeted
Avanika Narayan (@Avanika15)
thrilled to see that intelligence per joule (ipj) has become a north star metric for hardware-software codesign. @JonSaadFalcon and i study ipj extensively in our latest paper. link in comments 👇
Reiner Pope (@reinerpope)

Intelligence per picojoule, with @itsclivetime and @dylan522p
(0:00) Intro
(1:22) What is codesign?
(2:49) Codesign example: Swish vs ReLU
(4:22) Are DeepSeek papers codesign?
(6:45) Predicting where ML research will go
(8:06) Should researchers hate your chips?
(9:34) Can you codesign too much?
(13:23) Picking the right grain size for specialization
(16:22) How much hardware flexibility for The Age of Research?
(20:05) Did reasoning and RL disrupt hardware roadmaps?
(23:09) Cerebras/Groq: unexpected wins on reasoning and RL
(25:34) Disaggregating MLP and attention
(29:06) The right metrics for quantization and codesign papers

3
2
23
6.1K
Jon Saad-Falcon retweeted
Dan Fu (@realDanFu)
📢 Super excited to announce Parcae! We've been thinking about scaling laws and the "right" way to get more FLOPs. Turns out layer looping - with the right parameterization - gives you a new axis to scale!

Parcae matches Transformers 2x their size (w/ the same data), and outperforms prior formulations of looped models. But - you need the right parameterization to get these gains against strong Transformer baselines.

Looped models are famously unstable to train, with tons of loss spikes and hyperparameter sensitivity. The main technical challenge with looped models is residual explosion - if you're passing the activations through the same layers over and over, some otherwise benign parameterizations cause huge instability.

Our key idea: we can think of the residual stream of a model as a time-varying dynamical system - the same fundamentals behind SSMs like Mamba and S4. Then a few modest modifications to classic Transformers (stable diagonalization of injection params, LN before embeddings) can stabilize the looped models. The resulting models are more stable to train, but also reach higher quality.

It's strong enough to start to derive new scaling laws. Classically - we know you need to scale parameters with data to be FLOP-optimal. With Parcae, we find a third axis - given fixed parameters, you additionally want to scale FLOPs by looping as you scale data.

Super excited to see how these ideas hold, and what we can do with looped models! Check out @hayden_prairie's great explainer thread below, and see links for our paper, blog, and models. Joint w/ @zacknovack and @BergKirkpatrick, and a fun collab between @togethercompute and my lab at @ucsd_cse. Enjoy!
Hayden Prairie (@hayden_prairie)

We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇

2
26
128
21.6K
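The "residual explosion" described in the Parcae announcement can be illustrated with a toy scalar recurrence. This is only a sketch of the dynamical-systems intuition, not the Parcae parameterization: the update rule `h <- decay * h + inject * x`, the function name, and all values are assumptions made for illustration. When the same update is applied repeatedly, the state stays bounded if `|decay| < 1` and grows without bound if `|decay| > 1`.

```python
def loop_residual(h0: float, x: float, decay: float, inject: float, n_loops: int) -> float:
    """Apply h <- decay * h + inject * x for n_loops iterations (scalar toy model)."""
    h = h0
    for _ in range(n_loops):
        h = decay * h + inject * x
    return h


# |decay| < 1: the state settles near the fixed point inject * x / (1 - decay).
stable = loop_residual(h0=1.0, x=1.0, decay=0.5, inject=1.0, n_loops=50)

# |decay| > 1: the same loop amplifies the state on every pass.
unstable = loop_residual(h0=1.0, x=1.0, decay=1.5, inject=1.0, n_loops=50)

print(f"stable state after 50 loops:   {stable:.4f}")
print(f"unstable state after 50 loops: {unstable:.3e}")
```

In this toy setting, keeping the per-loop multiplier inside the unit interval plays the role that a stable parameterization plays in a looped model: more loops add compute without letting activations blow up.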
Jon Saad-Falcon retweeted
Azalia Mirhoseini (@Azaliamirh)
Turns out we can get SOTA on agentic benchmarks with a simple test-time method! Excited to introduce LLM-as-a-Verifier.

Test-time scaling is effective, but picking the "winner" among many candidates is the bottleneck. We introduce a way to extract a cleaner signal from the model:
1️⃣ Ask the LLM to rank results on a scale of 1-k
2️⃣ Use the log-probs of those rank tokens to calculate an expected score

You can get a verification score in a single sampling pass per candidate pair.

Blog: llm-as-a-verifier.notion.site
Code: llm-as-a-verifier.github.io

Led by @jackyk02 and in collaboration with a great team: @shululi256, @pranav_atreya, @liu_yuejiang, @drmapavone, @istoica05
34
114
987
115.7K
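The second step in the tweet above, turning log-probs of rank tokens into an expected score, can be sketched as a probability-weighted average. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the log-probabilities for the rank tokens 1..k have already been obtained from the model, it renormalizes over those k tokens only, and the function name and sample values are hypothetical.

```python
import math


def expected_rank_score(rank_logprobs: dict[int, float]) -> float:
    """Expected rank under the model's distribution over rank tokens.

    rank_logprobs maps each rank (1..k) to the log-probability the model
    assigned to that rank's token. Probabilities are renormalized over the
    k rank tokens only (an assumption of this sketch).
    """
    if not rank_logprobs:
        raise ValueError("rank_logprobs must not be empty")
    # Subtract the max log-prob before exponentiating for numerical stability.
    max_lp = max(rank_logprobs.values())
    weights = {rank: math.exp(lp - max_lp) for rank, lp in rank_logprobs.items()}
    total = sum(weights.values())
    return sum(rank * w for rank, w in weights.items()) / total


# Hypothetical log-probs for a 1-3 scale: most mass on rank 3.
logprobs = {1: math.log(0.1), 2: math.log(0.2), 3: math.log(0.7)}
score = expected_rank_score(logprobs)
print(f"expected score: {score:.2f}")
```

Compared with taking only the single most likely rank token, the expectation keeps the model's uncertainty, so two candidates that both receive the same top rank can still be separated by how confidently it was assigned.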
Jon Saad-Falcon retweeted
Tarun Suresh (@TarunSures41845)
Great work with @hangoo_kang, @JonSaadFalcon, and @Azaliamirh on a new system for environment-specific LLM agent self-improvement that trains the agent on the underlying capabilities it lacks 🚀
Hangoo Kang (@hangoo_kang)

Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇

1
2
15
2.8K
Jon Saad-Falcon retweeted
Hangoo Kang (@hangoo_kang)
Introducing TRACE: an end-to-end system for environment-specific agent self-improvement🚀 Outperforms direct RL on the environment, GEPA, and synthetic data approaches on τ²-Bench and ToolSandBox📈 Collab w/ @TarunSures41845, @JonSaadFalcon, @Azaliamirh. Details in thread👇
4
29
155
14K
Jon Saad-Falcon retweeted
Tristan Thrush (@TristanThrush)
New paper! Want to precisely optimize synthetic training data to do practical or even wacky things? Dataset Policy Gradients get you there, letting you target any differentiable training or post-training metric. We embedded a QR code in GPT-2’s weights using only training data!
7
41
223
46.8K
Jon Saad-Falcon retweeted
Jacky Kwok (@jackyk02)
We release LLM-as-a-Verifier 🧠: A general-purpose verification framework that achieves SOTA 👑 on Terminal-Bench 2 (86.4%) and SWE-Bench Verified (77.8%) by scaling:
- scoring granularity
- repeated verification
- criteria decomposition

📄 Blog & Code: llm-as-a-verifier.notion.site
10
57
443
54.4K