Dan Fu
@realDanFu · 861 posts

VP, Kernels @togethercompute · Assistant Professor @ucsd_cse · Looking for talented kernel engineers and performance engineers!

Joined September 2019
229 Following · 7.4K Followers
Pinned Tweet
Dan Fu @realDanFu
Excited to share that I will be joining UCSD CSE as an assistant professor in January 2026! I'll be recruiting PhD students from the 2024 application pool - if you're interested in anything ML Sys/efficiency/etc please reach out & put my name on your application! Until then I'll be finishing up some requirements at Stanford (long story...) and hanging out at @togethercompute. Stay tuned for more!
47 replies · 40 reposts · 578 likes · 113.6K views
Dan Fu retweeted
Together AI @togethercompute
Congrats to the @cursor_ai team on Composer 2: a huge milestone for RL-trained models and a step forward for open-source coding intelligence. Together AI is proud to partner on this launch. Composer 2 is turning heads for its speed and quality, and we help power the Composer 2 Fast endpoint on the AI Native Cloud. Excited to keep building with the Cursor team.
Cursor @cursor_ai

Composer 2 is now available in Cursor.

3 replies · 8 reposts · 70 likes · 6.3K views
Dan Fu @realDanFu
Congrats @cursor_ai on launching Composer 2 today! It was really gratifying to work with them to get it ready for launch and to power the Composer 2 Fast endpoint on @togethercompute.

A couple of things make this one really cool for me:

It's really, really good. Seriously, try it!

It's a gratifying validation of kernel research we've been doing for years: training kernels written in ThunderKittens (by my old lab mate @stuart_sul), and inference kernels in our stack that make it so blazing fast.

Looking forward to what comes next!
Cursor @cursor_ai

Composer 2 is now available in Cursor.

2 replies · 11 reposts · 56 likes · 5.4K views
Dan Fu retweeted
Jon Saad-Falcon @JonSaadFalcon
Personal AI should run on your personal devices. So, we built OpenJarvis: a personal AI that lives, learns, and works on-device. Try it today and top the OpenJarvis Leaderboard for a chance to win a Mac Mini! Collab w/ @Avanika15, John Hennessy, @HazyResearch, and @Azaliamirh. Details in thread.
35 replies · 91 reposts · 307 likes · 92.6K views
Dan Fu retweeted
Together AI @togethercompute
Together Research has produced FlashAttention, ATLAS, ThunderKittens and more. This week at AI Native Conf: seven more releases, all coming to production soon. Thread → #ainativeconf #ainativecloud
1 reply · 14 reposts · 102 likes · 23.9K views
Nuliayuk @Nuliayuk
@realDanFu @togethercompute As a complete amateur, I’m seriously considering GEPA Optimize Anything along with thunderkittens/hipkittens to optimize kernels on my architecture and enable repeat training runs to dial in hyperparameter settings. Incredible how the benefits can compound so quickly.
1 reply · 1 repost · 2 likes · 279 views
Dan Fu @realDanFu
I'm really excited about using AI to write kernels and accelerate kernel development. At @togethercompute we use AI models extensively throughout the performance engineering pipeline. Super excited to see this new work coming from DoubleAI!
Amnon Shashua @AmnonShashua

DoubleAI's AI system just beat a decade of expert GPU engineering.

WarpSpeed just beat a decade of expert-engineered GPU kernels: every single one of them.

cuGraph is one of the most widely used GPU-accelerated libraries in the world. It spans dozens of graph algorithms, each written and continuously refined by some of the world's top performance engineers. @_doubleAI_'s WarpSpeed autonomously rewrote and re-optimized these kernels across three GPU architectures (A100, L4, A10G). Today, we released the hyper-optimized version on GitHub; install it with no change to your code.

The numbers:
- 3.6x average speedup over human experts
- 100% of kernels benefit from a speedup
- 55% see more than 2x improvement

But hasn't AI already achieved expert-level status, winning gold medals at IMO and outperforming top programmers on Codeforces? Not quite. Those wins share three hidden crutches: abundant training data, trivial validation, and short reasoning chains. Where all three hold, today's AI shines. Remove any one of them and it falls apart (as Shai Shalev-Shwartz wrote in his post).

GPU performance engineering breaks all three. Data is scarce. Correctness is hard to validate. And performance comes from a long chain of interacting choices: memory layout, warp behavior, caching, scheduling, graph structure. Even state-of-the-art agents like Claude Code, Codex, and Gemini CLI fail dramatically here, often producing incorrect implementations even when handed cuGraph's own test suite.

Scaling alone can't break this barrier. It took new algorithmic ideas: our Diligent framework for learning from extremely small datasets, our PAC-reasoning methodology for verification when ground truth isn't available, and novel agentic search structures for navigating deep decision chains.

This is the beginning of Artificial Expert Intelligence (AEI): not AGI, but something the world needs more, systems that reliably surpass human experts in the domains where expertise is rarest, slowest, and most valuable. If AI can surpass the world's best GPU engineers, which domain falls next?

For the full blog: doubleai.com/research/doubl…
cuGraph: docs.rapids.ai/api/cugraph/st…
Winning Gold at IMO 2025: arxiv.org/abs/2507.15855
Codeforces benchmarks: rdworldonline.com/openai-release…
@shai_s_shwartz post: x.com/shai_s_shwartz…
From Reasoning to Super-Intelligence: A Search-Theoretic Perspective: arxiv.org/abs/2507.15865
Artificial Expert Intelligence through PAC-reasoning: arxiv.org/abs/2412.02441

1 reply · 13 reposts · 113 likes · 15.6K views
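
The "correctness is hard to validate" point above is worth making concrete. A common baseline is differential testing: run the rewritten kernel against the trusted original on random inputs, compare within a floating-point tolerance, and time both. The sketch below is a minimal illustration of that idea in Python/NumPy, not DoubleAI's verifier; `reference_impl` and `candidate_impl` are hypothetical stand-ins.

```python
# Minimal differential-testing harness for a rewritten kernel.
# Illustrative sketch only; not WarpSpeed's actual verification method.
import time
import numpy as np

def reference_impl(a, b):
    """Trusted but possibly slow baseline (stands in for the original kernel)."""
    return a @ b

def candidate_impl(a, b):
    """The rewritten kernel under test (here the same op, purely for the demo)."""
    return np.matmul(a, b)

def check_and_bench(n_trials=20, size=256):
    rng = np.random.default_rng(0)
    ref_t = cand_t = 0.0
    for _ in range(n_trials):
        a = rng.standard_normal((size, size), dtype=np.float32)
        b = rng.standard_normal((size, size), dtype=np.float32)
        t0 = time.perf_counter()
        expected = reference_impl(a, b)
        ref_t += time.perf_counter() - t0
        t0 = time.perf_counter()
        got = candidate_impl(a, b)
        cand_t += time.perf_counter() - t0
        # Tolerance-based check: optimized floating-point kernels rarely
        # match the baseline bit-for-bit.
        if not np.allclose(got, expected, rtol=1e-5, atol=1e-5):
            raise AssertionError("candidate diverges from reference")
    print(f"agreed on {n_trials} random inputs; speedup ~{ref_t / cand_t:.2f}x")

if __name__ == "__main__":
    check_and_bench()
```

Note that agreement on random inputs only gives statistical confidence in correctness, which is exactly the gap the thread's PAC-reasoning methodology is aimed at.
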
Dan Fu retweeted
Tanishq Kumar @tanishqkumar07
I've been working on a new LLM inference algorithm. It's called Speculative Speculative Decoding (SSD) and it's up to 2x faster than the strongest inference engines in the world. Collab w/ @tri_dao @avnermay. Details in thread.
132 replies · 454 reposts · 4K likes · 598.3K views
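
For background: SSD's details are in the linked thread, but the name suggests it builds on classic speculative decoding, where a cheap draft model proposes several tokens and the expensive target model verifies them in a single batched forward pass. Below is a toy greedy version of the classic scheme, not SSD itself; `draft_next` and `target_next` are hypothetical stand-ins for small/large model calls.

```python
# Toy greedy speculative decoding (the classic scheme SSD builds on).
from typing import Callable, List

def speculative_decode(
    prompt: List[int],
    draft_next: Callable[[List[int]], int],   # cheap model: next-token guess
    target_next: Callable[[List[int]], int],  # expensive model: ground truth
    k: int = 4,
    max_new: int = 32,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new:
        # 1) Draft k tokens autoregressively with the cheap model.
        ctx = list(tokens)
        proposal = []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Verify against the target model. A real engine scores all k
        #    positions in ONE batched forward pass; we loop for clarity.
        for guess in proposal:
            truth = target_next(tokens)
            tokens.append(truth)     # the target's token is always the one kept,
            if truth != guess:       # so output matches target-only decoding
                break                # first mismatch invalidates the rest
            if len(tokens) >= len(prompt) + max_new:
                break
    return tokens
```

The output is identical to decoding with the target alone; the speedup comes from verifying k drafted positions in one target pass instead of k sequential passes.
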
Hao Zhang @haozhangml
Can't believe I get to say this: deeply honored to be named a 2026 Sloan Research Fellow: today.ucsd.edu/story/2026-slo…

Early faculty life is… "hyper-intense": teaching, advising, hiring, papers, grants, and trying to build a lab culture you'll still be proud of years later. There were many weeks where it felt like we were building the plane mid-flight, burning plenty of midnight oil along the way.

Over the past few years, I've been incredibly lucky to work with amazing students and collaborators on a chain of OSS projects: Vicuna → Chatbot Arena → vLLM → DistServe → LMGame → FastVideo, each one then pushed way further by people far beyond our lab. This award feels less like a finish line and more like fuel for the lab, for our students, and for the next set of systems we haven't built yet. A core principle of ours is building "open-source research that ships."

At the same time, it's hard not to feel a mix of excitement, uncertainty, and anxiety about where CS is heading. Coding agents are improving so fast that I am feeling the AGI firsthand. I have gone back to builder mode, more productive than ever, outside of my faculty admin work. I've watched friends and colleagues hit numbers that would've sounded like science fiction a year ago (e.g., 100+ commits/day).

So what does it mean to "do great computer science" when baseline productivity keeps jumping? For me, it makes "research that ships" more important, and even raises the bar. The leverage shifts toward taste and problem selection, principled system design, and translating ideas into reliable artifacts. We're excited to keep proving that through real systems people can use!

Deeply grateful to:
- My students and collaborators, for the ideas, execution, and drive.
- @HDSIUCSD, Dean @GuptaUcsd, and my @UCSanDiego colleagues, for building an environment where ambitious work can happen.
- @nvidia and @mbzuai (and other compute sponsors), for support that helped us move faster and turn ideas into real artifacts. Even as the interface changes, the need for efficient compute and solid infrastructure only grows.

Most of all: credit to the students at @haoailab. You're the reason any of this is worth doing. Keep building and shipping!
34 replies · 10 reposts · 185 likes · 16.1K views
Dan Fu retweeted
Mayee Chen @MayeeChen
Data mixing (determining sampling ratios across your training datasets) matters a lot for model quality. While building Olmo 3, we learned it's hard to set up a method that finds a strong mix, and hard to maintain that mix as datasets change throughout development. Introducing Olmix👇
13 replies · 70 reposts · 261 likes · 46.6K views
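
Concretely, a "mix" is just a weight vector over data sources that controls how often each one is sampled during training. The generic weighted-sampling sketch below illustrates the knob a mixing method tunes; it is not Olmix's method, and the source names and ratios are made up.

```python
# Generic data-mixing sketch: sample each training example from one of
# several sources according to a mixing-weight vector.
import random

def mixed_stream(datasets, weights, seed=0):
    """Yield (source, example) pairs, drawing each example from a source
    with probability proportional to its mixing weight."""
    rng = random.Random(seed)
    names = list(datasets)
    probs = [weights[n] for n in names]
    iters = {n: iter(datasets[n]) for n in names}
    while True:
        name = rng.choices(names, weights=probs, k=1)[0]
        try:
            yield name, next(iters[name])
        except StopIteration:
            return  # a source ran dry; real pipelines reshuffle or re-weight here

# Hypothetical ratios: 60% web, 30% code, 10% math. These are the knobs a
# mixing method has to choose, and keep re-choosing as datasets evolve.
for source, example in mixed_stream(
    {"web": ["w1", "w2", "w3"], "code": ["c1", "c2"], "math": ["m1"]},
    {"web": 0.6, "code": 0.3, "math": 0.1},
):
    print(source, example)
```
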
Flapping Airplanes @flappyairplanes
Announcing Flapping Airplanes! We’ve raised $180M from GV, Sequoia, and Index to assemble a new guard in AI: one that imagines a world where models can think at human level without ingesting half the internet.
340 replies · 259 reposts · 3.6K likes · 2.1M views
Dan Fu retweeted
Together AI @togethercompute
On The MAD Podcast, hosted by @mattturck, @realDanFu is joined by @Tim_Dettmers of @allen_ai to discuss the evolving understanding of AI, including compute, the pursuit of AGI, and AI's potential to change society and enhance productivity.
1 reply · 1 repost · 6 likes · 1.7K views
Dan Fu @realDanFu
Really enjoyed coming on the podcast to debate scaling and AGI with @Tim_Dettmers. Thanks so much for having us on, @mattturck!
Matt Turck @mattturck

The End of GPU Scaling? Compute & The Agent Era

My conversation with @Tim_Dettmers of @allen_ai and @realDanFu of @togethercompute about their blog posts on AGI and compute (links in replies) and agents in 2026.

00:00 – Intro
01:06 – Two essays, two frameworks on AGI
01:34 – Tim's background: quantization, QLoRA, efficient deep learning
02:25 – Dan's background: FlashAttention, kernels, alternative architectures
03:38 – Defining AGI: what does it mean in practice
08:20 – Tim's case: computation is physical, diminishing returns, memory movement
11:29 – "GPUs won't improve meaningfully": the core claim and why
16:16 – Dan's response: utilization headroom (MFU) + "models are lagging indicators"
22:50 – Pre-training vs post-training (and why product feedback matters)
25:30 – Convergence: usefulness + diffusion (where impact actually comes from)
29:50 – Multi-hardware future: NVIDIA, AMD, TPUs, Cerebras, inference chips
32:16 – Agents: did the "switch flip" yet?
33:19 – Dan: agents crossed the threshold (kernels as the "final boss")
34:51 – Tim: "use agents or be left behind" + beyond coding
36:58 – "90% of code and text should be written by agents" (how to do it responsibly)
39:11 – Practical automation for non-coders: what to build and how to start
43:52 – Dan: managing agents like junior teammates (tools, guardrails, leverage)
48:14 – Education and training: learning in an agent world
52:44 – What Tim is building next (open-source coding agent; private repo specialization)
54:44 – What Dan is building next (inference efficiency, cost, performance)
55:58 – Mega-kernels + Together ATLAS (speculative decoding + adaptive speedups)
58:19 – Predictions for 2026: small models, open-source, hardware, modalities
1:02:02 – Beyond transformers: state-space and architecture diversity
1:03:34 – Wrap

4 replies · 4 reposts · 35 likes · 6.6K views
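
One concrete way to read the 16:16 segment on "utilization headroom (MFU)": MFU is achieved FLOP/s divided by the hardware's peak FLOP/s, with training cost commonly approximated as 6 FLOPs per parameter per token. A back-of-the-envelope sketch follows; all numbers are illustrative assumptions, not figures from the episode.

```python
# Back-of-the-envelope Model FLOPs Utilization (MFU).
# MFU = achieved FLOP/s divided by the hardware's peak FLOP/s.

def training_mfu(params: float, tokens_per_sec: float, peak_flops: float) -> float:
    # Standard approximation: one training step costs ~6 FLOPs per parameter
    # per token (2 for the forward pass + 4 for the backward pass).
    achieved = 6 * params * tokens_per_sec
    return achieved / peak_flops

# Hypothetical example: a 7B-parameter model on 8 GPUs at ~1e15 FLOP/s peak
# each (roughly H100-class bf16 dense), processing 75k tokens/s cluster-wide.
mfu = training_mfu(params=7e9, tokens_per_sec=75_000, peak_flops=8 * 1e15)
print(f"MFU ≈ {mfu:.0%}")  # ≈ 39%: well below peak, i.e. real headroom left
```

The gap between typical MFU and 100% is the "utilization headroom" Dan's argument leans on: even if peak hardware FLOP/s stalls, better kernels can still raise delivered throughput.
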