Dan Fu

886 posts

@realDanFu

VP, Kernels @togethercompute. Assistant Professor @ucsd_cse. Looking for talented kernel engineers and performance engineers!

Joined September 2019
241 Following · 7.7K Followers
Pinned Tweet
Dan Fu (@realDanFu)
Excited to share that I will be joining UCSD CSE as an assistant professor in January 2026! I'll be recruiting PhD students from the 2024 application pool - if you're interested in anything ML Sys/efficiency/etc please reach out & put my name on your application! Until then I'll be finishing up some requirements at Stanford (long story...) and hanging out at @togethercompute. Stay tuned for more!
47
40
578
115.2K
Dan Fu retweeted
Hamza Elshafie (@hamzaelshafie)
New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels"

This post is my attempt to dissect ThunderKittens from the bottom up. I approached TK by asking what each abstraction is really buying us: which hardware detail it corresponds to, how it maps onto the underlying layouts the GPU actually wants, what boilerplate it removes, and which parts of the GPU programming model still remain visible to us as kernel authors.

The post walks through the tile abstractions TK provides: register, shared, and tensor memory tiles, global layouts, vector abstractions, warp/warpgroup compute, TMA, swizzling, Hopper WGMMA, Blackwell tcgen05, 2xSM MMA, tensor memory, Cluster Launch Control, TK’s pipeline templates, and static persistent tile scheduling.

At the end, I demonstrate TK’s lcf pipeline template by implementing a non-causal attention prefill kernel and benchmarking it against FlashAttention-2 and FlashAttention-3 on an H100 PCIe across different sequence lengths. The kernel beats FA2 across the sweep by ~1.55x on average, and closely tracks FA3, where FA3 is only ~1.05x-1.17x faster on the longer sequence lengths.

Blog link: hamzaelshafie.bearblog.dev/dissecting-thu…
Repo: github.com/HamzaElshafie/…

I also put an extensive list of resources at the end, which I found very useful for interested readers.

Please note: this is my own independent writeup. I’m not affiliated with @HazyResearch, and any mistakes in the post are mine. If you spot any please reach out!

1 / xx
3
42
354
37.1K
Dan Fu (@realDanFu)
✈️ Flying out to Bellevue for #MLSys2026! My students and collaborators are presenting two papers, and I'll be around through Wednesday afternoon. Come find me if you want to chat Parcae, looped models, kernels, kittens (Thunder-, Hip-, and more), OSS models, or anything else!
0
3
42
3K
Dan Fu (@realDanFu)
🎼2⃣5⃣
Quoting Together AI (@togethercompute):

Congrats to the @cursor_ai team on Composer 2.5 — a huge milestone for agentic coding models. Together AI, the AI Native Cloud, is proud to partner on this launch. Composer 2.5 is pushing the frontier for coding agents and turning heads for its speed and quality. Excited to keep building with the Cursor team!

1
0
10
927
Dan Fu retweeted
Together AI (@togethercompute)
Congrats to the @cursor_ai team on Composer 2.5 — a huge milestone for agentic coding models. Together AI, the AI Native Cloud, is proud to partner on this launch. Composer 2.5 is pushing the frontier for coding agents and turning heads for its speed and quality. Excited to keep building with the Cursor team!
Quoting Cursor (@cursor_ai):

Introducing Composer 2.5, our most powerful model yet. It's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we’re doubling the included usage of the model.

3
10
116
11.2K
Dan Fu (@realDanFu)
This is pretty cool - LLM inference that generates @prlnet coins during the forward pass, so you can subsidize inference cost. Excited to see how this changes inference tokenomics!
Quoting Omri Weinstein (@WeinsteinOmri):

A milestone for Pearl Research Labs: our first major enterprise partnership is live with Together AI. @togethercompute’s inference platform is an ideal demonstration of @prlnet's value proposition — One of the world’s most advanced hyperscalers running AI workloads on Pearl’s 2-for-1 Cuda kernels, turning inference into ¶PRL coins, and reducing consumer LLM price per token. Excited for what we’ll build together.

1
0
7
820
Dan Fu retweeted
Omri Weinstein (@WeinsteinOmri)
A milestone for Pearl Research Labs: our first major enterprise partnership is live with Together AI. @togethercompute’s inference platform is an ideal demonstration of @prlnet's value proposition — One of the world’s most advanced hyperscalers running AI workloads on Pearl’s 2-for-1 Cuda kernels, turning inference into ¶PRL coins, and reducing consumer LLM price per token. Excited for what we’ll build together.
Quoting Together AI (@togethercompute):

Introducing Gemma-4-31B-it-Pearl on Together AI, Pearl Research Labs’ instruction-tuned checkpoint of Gemma 4 31B powered by @prlnet Proof of Useful Work protocol. AI natives can now use this Pearl model as a serverless inference endpoint on Together AI, at a 25%+ discounted pricing.

7
10
84
13.8K
Dan Fu retweeted
Together AI (@togethercompute)
Introducing Gemma-4-31B-it-Pearl on Together AI, Pearl Research Labs’ instruction-tuned checkpoint of Gemma 4 31B powered by @prlnet Proof of Useful Work protocol. AI natives can now use this Pearl model as a serverless inference endpoint on Together AI, at a 25%+ discounted pricing.
11
17
96
96.1K
Rose Yu (@yuqirose)
A bit of a delayed career update: I have been promoted to full professor! It has been a tremendous journey since I started my faculty position in 2018. I want to thank everyone who has helped me along this process. I’m especially thankful to my students, mentors and collaborators for pushing the frontiers of AI+Science research together. I’m deeply grateful for the unconditional support from my family and friends. Now to the next chapter! 🍻
45
8
546
27.1K
Hao Zhang (@haozhangml)
Such an honor to share that our 2016 paper GeePS just received the EuroSys Test-of-Time Award 🫡🚀🏆 It was actually my first system paper (and obviously my first MLSys paper, too) in phd -- and arguably the first paper to systematically tackle GPU memory swapping for deep learning, right after AlexNet moved DL onto GPUs. It has been 10 years! The ideas are everywhere. A short thread on what we did and where it went 🧵
3
8
81
8.7K
Dan Fu retweeted
Together AI (@togethercompute)
Introducing DeepSeek V4 Pro, a long-context model with hybrid attention, three reasoning modes, and SOTA coding performance. AI natives can now use DeepSeek V4 Pro on Together AI and benefit from reliable inference for long-horizon coding and agentic workflows.
16
5
126
1M
Together AI (@togethercompute)
What if you could get 1.3B Transformer quality from a 770M model? That's not a compression result. It's a different architecture. Parcae, from @realDanFu (Together AI's VP of Kernels) and his lab at UCSD, passes activations through the same layers multiple times — stably, for the first time.
7
27
187
21.2K
Dan Fu retweeted
Marktechpost AI (@Marktechpost)
UCSD and Together AI Research Introduces Parcae: A Stable Architecture for Looped Language Models That Achieves the Quality of a Transformer Twice the Size

The main idea is to recast the looped forward pass as a nonlinear time-variant dynamical system over the residual stream. By analyzing the linearized form of this system, the research team shows that prior injection methods (addition and concatenation-with-projection) produce marginally stable or unconstrained parameterizations of the state transition matrix Ā. Parcae fixes this by constraining Ā via discretization of a negative diagonal parameterization, guaranteeing ρ(Ā) < 1 at all times.

Two additional training fixes accompany the architectural change: a normalization layer on the prelude output to prevent late-stage loss spikes, and a per-sequence depth sampling algorithm that corrects a distributional mismatch bug in prior recurrence sampling methods.

On results:
→ Parcae reduces validation perplexity by up to 6.3% over parameter- and data-matched RDMs at 350M scale
→ A 770M Parcae model matches the Core benchmark quality of a 1.3B standard Transformer
→ At 1.3B parameters, Parcae outperforms the parameter-matched Transformer by 2.99 points on Core and 1.18 points on Core-Extended

On scaling laws:
→ Compute-optimal training scales mean recurrence µ_rec and tokens D in tandem following power laws (µ_rec ∝ C^0.40, D ∝ C^0.78)
→ Test-time looping follows a saturating exponential decay: gains plateau near the training recurrence depth µ_rec, setting a hard ceiling on inference-time scaling
→ A unified law predicts held-out model loss within 0.85–1.31% average error

Full analysis: marktechpost.com/2026/04/16/ucs…
Paper: arxiv.org/pdf/2604.12946
Technical details: together.ai/blog/parcae
Models: huggingface.co/collections/Sa…

@togethercompute @UCSD @hayden_prairie @zacknovack @BergKirkpatrick @realDanFu
4
16
85
378.9K
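The stability claim in the Marktechpost summary above (a negative diagonal parameterization, discretized so that ρ(Ā) < 1) can be illustrated with a small sketch. The summary does not give Parcae's exact formula, so this assumes the zero-order-hold style form common in S4/Mamba-like SSMs, Ā = exp(Δ·A) with A = -exp(a_log) and Δ = softplus(δ). The names `softplus`, `stable_diagonal`, and `loop` are hypothetical helpers for illustration, not part of any released Parcae code.

```python
import math
import random


def softplus(x: float) -> float:
    """Numerically stable softplus; the result is always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def stable_diagonal(a_log, delta_raw):
    """Diagonal of A_bar = exp(delta * A), with A = -exp(a_log) < 0 and delta > 0.

    ASSUMPTION: this ZOH-style discretization is borrowed from S4/Mamba-style
    SSMs; the feed only says 'discretization of a negative diagonal
    parameterization' without giving the formula.
    """
    return [math.exp(-softplus(d) * math.exp(a)) for a, d in zip(a_log, delta_raw)]


def loop(a_bar: float, injection: float, steps: int) -> float:
    """Scalar linearized loop h <- a_bar * h + injection, starting from h = 0."""
    h = 0.0
    for _ in range(steps):
        h = a_bar * h + injection
    return h


# Whatever the unconstrained parameters are, every entry lands in (0, 1),
# so the spectral radius of the diagonal matrix is below 1.
random.seed(0)
a_log = [random.uniform(-3.0, 3.0) for _ in range(8)]
delta_raw = [random.uniform(-3.0, 3.0) for _ in range(8)]
a_bar = stable_diagonal(a_log, delta_raw)
rho = max(abs(v) for v in a_bar)
print("spectral radius:", rho)

# A marginally stable transition (a_bar = 1) grows with every loop,
# while a contractive one (a_bar < 1) settles toward injection / (1 - a_bar).
print("a_bar = 1.0 after 100 loops:", loop(1.0, 1.0, 100))
print("a_bar = 0.9 after 100 loops:", loop(0.9, 1.0, 100))
```

The design point is that the constraint holds by construction: no clipping or regularization is needed, because the exponential of a strictly negative number is always strictly between 0 and 1.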
Dan Fu retweeted
Mitko Vasilev (@iotcoi)
In 2026 we can train a looped LLM on one GPU at home.

They were AI's cold fusion: always 5 years away, theoretically perfect, exploding in training.

Parcae fixes it with physics (SSM dynamics), matches 2× bigger Transformers.

Unlocks 3rd scaling law: infinite depth, finite memory.
6
5
140
7.7K
Beidi Chen (@BeidiChen)
Check out the great work~
Quoting Dan Fu (@realDanFu):

📢 Super excited to announce Parcae! We've been thinking about scaling laws and the "right" way to get more FLOPs. Turns out layer looping - with the right parameterization - gives you a new axis to scale! Parcae matches Transformers 2x their size (w/ the same data), and outperforms prior formulations of looped models. But - you need the right parameterization to get these gains against strong Transformer baselines. Looped models are famously unstable to train, with tons of loss spikes and hyperparameter sensitivity. The main technical challenge with looped models is residual explosion - if you're passing the activations through the same layers over and over, some otherwise benign parameterizations cause huge instability. Our key idea: we can think of the residual stream of a model as a time-varying dynamical system - the same fundamentals behind SSMs like Mamba and S4. Then a few modest modifications to classic Transformers (stable diagonalization of injection params, LN before embeddings) can stabilize the looped models. The resulting models are more stable to train, but also reach higher quality. It's strong enough to start to derive new scaling laws. Classically - we know you need to scale parameters with data to be FLOP-optimal. With Parcae, we find a third axis - given fixed parameters, you additionally want to scale FLOPs by looping as you scale data. Super excited to see how these ideas hold, and what we can do with looped models! Check out @hayden_prairie's great explainer thread below, and see links for our paper, blog, and models. Joint w/ @zacknovack and @BergKirkpatrick, and a fun collab between @togethercompute and my lab at @ucsd_cse. Enjoy!

1
0
10
4.5K
Dan Fu (@realDanFu)
📢 Super excited to announce Parcae! We've been thinking about scaling laws and the "right" way to get more FLOPs. Turns out layer looping - with the right parameterization - gives you a new axis to scale!

Parcae matches Transformers 2x their size (w/ the same data), and outperforms prior formulations of looped models.

But - you need the right parameterization to get these gains against strong Transformer baselines. Looped models are famously unstable to train, with tons of loss spikes and hyperparameter sensitivity.

The main technical challenge with looped models is residual explosion - if you're passing the activations through the same layers over and over, some otherwise benign parameterizations cause huge instability.

Our key idea: we can think of the residual stream of a model as a time-varying dynamical system - the same fundamentals behind SSMs like Mamba and S4. Then a few modest modifications to classic Transformers (stable diagonalization of injection params, LN before embeddings) can stabilize the looped models.

The resulting models are more stable to train, but also reach higher quality. It's strong enough to start to derive new scaling laws. Classically - we know you need to scale parameters with data to be FLOP-optimal. With Parcae, we find a third axis - given fixed parameters, you additionally want to scale FLOPs by looping as you scale data.

Super excited to see how these ideas hold, and what we can do with looped models! Check out @hayden_prairie's great explainer thread below, and see links for our paper, blog, and models.

Joint w/ @zacknovack and @BergKirkpatrick, and a fun collab between @togethercompute and my lab at @ucsd_cse. Enjoy!
Quoting Hayden Prairie (@hayden_prairie):

We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇

2
26
128
21.6K
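The advice above that data and looping should grow in tandem can be made concrete with the exponents reported in the Marktechpost summary earlier in this feed (µ_rec ∝ C^0.40, D ∝ C^0.78). The sketch below only computes ratios between two compute budgets, since the feed gives no proportionality constants. The function name `scale_factors` is a hypothetical helper, and treating the exponents as exact is an assumption.

```python
# Exponents as quoted in the feed; constants of proportionality are not given,
# so only relative changes between two compute budgets can be computed.
REC_EXPONENT = 0.40    # mean recurrence: mu_rec ∝ C^0.40
TOKEN_EXPONENT = 0.78  # training tokens: D ∝ C^0.78


def scale_factors(compute_multiplier: float) -> tuple[float, float]:
    """Return (recurrence multiplier, token multiplier) for a compute multiplier."""
    if compute_multiplier <= 0:
        raise ValueError("compute_multiplier must be positive")
    return (
        compute_multiplier ** REC_EXPONENT,
        compute_multiplier ** TOKEN_EXPONENT,
    )


rec_x, tok_x = scale_factors(10.0)
print(f"10x compute: about {rec_x:.2f}x mean recurrence, about {tok_x:.2f}x tokens")
```

Under these exponents, a 10x larger compute budget corresponds to roughly 2.5x more looping and roughly 6x more training tokens, so both grow, but data grows faster.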