Dan Fu

Dan Fu reposted

Hamza Elshafie@hamzaelshafie·4d

New in-depth blog post: "Dissecting ThunderKittens: Anatomy of a Compact DSL for High-Performance AI Kernels"

This post is my attempt to dissect ThunderKittens from the bottom up. I approached TK by asking what each abstraction is really buying us: which hardware detail it corresponds to, how it maps onto the underlying layouts the GPU actually wants, what boilerplate it removes, and which parts of the GPU programming model still remain visible to us as kernel authors.

The post walks through the tile abstractions TK provides: register, shared, and tensor memory tiles, global layouts, vector abstractions, warp/warpgroup compute, TMA, swizzling, Hopper WGMMA, Blackwell tcgen05, 2xSM MMA, tensor memory, Cluster Launch Control, TK’s pipeline templates, and static persistent tile scheduling.

At the end, I demonstrate TK’s lcf pipeline template by implementing a non-causal attention prefill kernel and benchmarking it against FlashAttention-2 and FlashAttention-3 on an H100 PCIe across different sequence lengths. The kernel beats FA2 across the sweep by ~1.55x on average, and closely tracks FA3, where FA3 is only ~1.05x-1.17x faster on the longer sequence lengths.

Blog link: hamzaelshafie.bearblog.dev/dissecting-thu…
Repo: github.com/HamzaElshafie/…

I also put an extensive list of resources at the end, which I found very useful for interested readers.

Please note: this is my own independent writeup. I’m not affiliated with @HazyResearch, and any mistakes in the post are mine. If you spot any please reach out!

1 / xx

Dan Fu@realDanFu·19 May

✈️ Flying out to Bellevue for #MLSys2026! My students and collaborators are presenting two papers, and I'll be around through Wednesday afternoon. Come find me if you want to chat Parcae, looped models, kernels, kittens (Thunder-, Hip-, and more), OSS models, or anything else!

Dan Fu@realDanFu·19 May

🎼2⃣5⃣

Quoting Together AI (@togethercompute): Congrats to the @cursor_ai team on Composer 2.5, a huge milestone for agentic coding models. Together AI, the AI Native Cloud, is proud to partner on this launch. Composer 2.5 is pushing the frontier for coding agents and turning heads for its speed and quality. Excited to keep building with the Cursor team!

Dan Fu reposted

Together AI@togethercompute·19 May

Congrats to the @cursor_ai team on Composer 2.5 — a huge milestone for agentic coding models. Together AI, the AI Native Cloud, is proud to partner on this launch. Composer 2.5 is pushing the frontier for coding agents and turning heads for its speed and quality. Excited to keep building with the Cursor team!

Quoting Cursor (@cursor_ai): Introducing Composer 2.5, our most powerful model yet. It's more intelligent, better at sustained work on long-running tasks, and more reliable at following complex instructions. For the next week, we’re doubling the included usage of the model.

Dan Fu@realDanFu·15 May

This is pretty cool - LLM inference that generates @prlnet coins during the forward pass, so you can subsidize inference cost. Excited to see how this changes inference tokenomics!

Quoting Omri Weinstein (@WeinsteinOmri): A milestone for Pearl Research Labs: our first major enterprise partnership is live with Together AI. @togethercompute’s inference platform is an ideal demonstration of @prlnet's value proposition: One of the world’s most advanced hyperscalers running AI workloads on Pearl’s 2-for-1 Cuda kernels, turning inference into ¶PRL coins, and reducing consumer LLM price per token. Excited for what we’ll build together.

Dan Fu reposted

Omri Weinstein@WeinsteinOmri·15 May

A milestone for Pearl Research Labs: our first major enterprise partnership is live with Together AI. @togethercompute’s inference platform is an ideal demonstration of @prlnet's value proposition — One of the world’s most advanced hyperscalers running AI workloads on Pearl’s 2-for-1 Cuda kernels, turning inference into ¶PRL coins, and reducing consumer LLM price per token. Excited for what we’ll build together.

Quoting Together AI (@togethercompute): Introducing Gemma-4-31B-it-Pearl on Together AI, Pearl Research Labs’ instruction-tuned checkpoint of Gemma 4 31B powered by @prlnet Proof of Useful Work protocol. AI natives can now use this Pearl model as a serverless inference endpoint on Together AI, at a 25%+ discounted pricing.

Dan Fu reposted

Together AI@togethercompute·15 May

Introducing Gemma-4-31B-it-Pearl on Together AI, Pearl Research Labs’ instruction-tuned checkpoint of Gemma 4 31B powered by @prlnet Proof of Useful Work protocol. AI natives can now use this Pearl model as a serverless inference endpoint on Together AI, at a 25%+ discounted pricing.

Dan Fu@realDanFu·15 May

@yuqirose Congrats!!

Rose Yu@yuqirose·15 May

A bit of a delayed career update: I have been promoted to full professor! It has been a tremendous journey since I started my faculty position in 2018. I want to thank everyone who has helped me along this process. I’m especially thankful to my students, mentors and collaborators for pushing the frontiers of AI+Science research together. I’m deeply grateful for the unconditional support from my family and friends. Now to the next chapter! 🍻

Dan Fu@realDanFu·1 May

@haozhangml Congrats @haozhangml!! Well-deserved!

Hao Zhang@haozhangml·1 May

Such an honor to share that our 2016 paper GeePS just received the EuroSys Test-of-Time Award 🫡🚀🏆 It was actually my first system paper (and obviously my first MLSys paper, too) in phd -- and arguably the first paper to systematically tackle GPU memory swapping for deep learning, right after AlexNet moved DL onto GPUs. It has been 10 years! The ideas are everywhere. A short thread on what we did and where it went 🧵

Dan Fu reposted

Together AI@togethercompute·30 Apr

Join us Tue 5/5: #DeepSeek-V4's hybrid attention + sparse MoE reduces KV cache up to 90%, enabling 1M-token context. We'll cover why that makes it great for agentic workflows, what it took to serve at scale, and how to build with it. Hear from @realDanFu @JueWANG26088228 @ZainHasan6 and @zhyncs42 → togetherai.link/ds-v4-x

Dan Fu@realDanFu·27 Apr

If you're at #ICLR2026 and interested in Parcae - I'm giving a keynote (via Zoom) at the Latent and Implicit Thinking Workshop at 1:30 local time today! @hayden_prairie will be at the workshop all day and presenting Parcae at the poster sessions - stop by!

Quoting Hayden Prairie (@hayden_prairie): We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇

Dan Fu@realDanFu·24 Apr

4⃣4⃣4⃣4⃣

Quoting Together AI (@togethercompute): Introducing DeepSeek V4 Pro, a long-context model with hybrid attention, three reasoning modes, and SOTA coding performance. AI natives can now use DeepSeek V4 Pro on Together AI and benefit from reliable inference for long-horizon coding and agentic workflows.

Dan Fu reposted

Together AI@togethercompute·24 Apr

Introducing DeepSeek V4 Pro, a long-context model with hybrid attention, three reasoning modes, and SOTA coding performance. AI natives can now use DeepSeek V4 Pro on Together AI and benefit from reliable inference for long-horizon coding and agentic workflows.

Dan Fu@realDanFu·18 Apr

@winglian @togethercompute The end-to-end Core numbers, not PPL

Wing Lian (caseus)@winglian·18 Apr

@togethercompute @realDanFu But looking at the reported metrics, the looped 770M model isn’t really close to the 1.3B model.

Together AI@togethercompute·15 Apr

What if you could get 1.3B Transformer quality from a 770M model? That's not a compression result. It's a different architecture. Parcae, from @realDanFu (Together AI's VP of Kernels) and his lab at UCSD, passes activations through the same layers multiple times — stably, for the first time.

Dan Fu reposted

Albert Gu@_albertgu·17 Apr

a dynamical systems point of view, which looks like an SSM applied along the residual stream, informs more principled ways to scale looped architectures

Quoting Hayden Prairie (@hayden_prairie): We’ve been thinking a lot about scaling laws, wondering if there is a more effective way to scale FLOPs without increasing parameters. Turns out the answer is YES – by looping blocks of layers during training. We find that predictable scaling laws exist for layer looping, allowing us to use looping to achieve the quality of a Transformer twice the size. Our scaling laws suggest that for a fixed parameter budget, data and looping should be increased in tandem! 🧵👇

Dan Fu reposted

Marktechpost AI@Marktechpost·16 Apr

UCSD and Together AI Research Introduces Parcae: A Stable Architecture for Looped Language Models That Achieves the Quality of a Transformer Twice the Size

The main idea is to recast the looped forward pass as a nonlinear time-variant dynamical system over the residual stream. By analyzing the linearized form of this system, the research team shows that prior injection methods (addition and concatenation-with-projection) produce marginally stable or unconstrained parameterizations of the state transition matrix Ā. Parcae fixes this by constraining Ā via discretization of a negative diagonal parameterization, guaranteeing ρ(Ā) < 1 at all times.

Two additional training fixes accompany the architectural change: a normalization layer on the prelude output to prevent late-stage loss spikes, and a per-sequence depth sampling algorithm that corrects a distributional mismatch bug in prior recurrence sampling methods.

On results:
→ Parcae reduces validation perplexity by up to 6.3% over parameter- and data-matched RDMs at 350M scale
→ A 770M Parcae model matches the Core benchmark quality of a 1.3B standard Transformer
→ At 1.3B parameters, Parcae outperforms the parameter-matched Transformer by 2.99 points on Core and 1.18 points on Core-Extended

On scaling laws:
→ Compute-optimal training scales mean recurrence µ_rec and tokens D in tandem following power laws (µ_rec ∝ C^0.40, D ∝ C^0.78)
→ Test-time looping follows a saturating exponential decay: gains plateau near the training recurrence depth µ_rec, setting a hard ceiling on inference-time scaling
→ A unified law predicts held-out model loss within 0.85–1.31% average error

Full analysis: marktechpost.com/2026/04/16/ucs…
Paper: arxiv.org/pdf/2604.12946
Technical details: together.ai/blog/parcae
Models: huggingface.co/collections/Sa…

@togethercompute @UCSD @hayden_prairie @zacknovack @BergKirkpatrick @realDanFu
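The stability constraint in the summary above (a negative diagonal parameterization, discretized so that ρ(Ā) < 1) can be illustrated with a small sketch. This is not Parcae's code: it assumes an exponential, zero-order-hold style discretization Ā = exp(Δ·A) of the kind used in S4/Mamba-style SSMs, which the post does not spell out, and the names `softplus`, `stable_diag_transition`, and `loop_state` are hypothetical.

```python
import math

def softplus(x):
    """Smooth map from any real number to a strictly positive one."""
    return math.log1p(math.exp(x))

def stable_diag_transition(a_log, delta_raw):
    """Diagonal of A_bar = exp(delta * A), with A = -exp(a_log) < 0 and delta > 0.

    Assumption: exponential discretization; the actual rule used by Parcae
    may differ. Both inputs are unconstrained (learnable) real numbers.
    """
    a_bar = []
    for a, d in zip(a_log, delta_raw):
        a_cont = -math.exp(a)                   # strictly negative continuous-time entry
        delta = softplus(d)                     # strictly positive step size
        a_bar.append(math.exp(delta * a_cont))  # entry lies in (0, 1)
    return a_bar

def loop_state(a_bar, injection, num_loops):
    """Iterate h <- a_bar * h + injection elementwise (a linearized looped update)."""
    h = [0.0] * len(a_bar)
    for _ in range(num_loops):
        h = [ab * hi + e for ab, hi, e in zip(a_bar, h, injection)]
    return h

a_bar = stable_diag_transition([-2.0, 0.0, 1.5], [-1.0, 0.0, 2.0])
spectral_radius = max(abs(x) for x in a_bar)  # for a diagonal matrix: largest |entry|
h = loop_state(a_bar, [1.0, 1.0, 1.0], num_loops=1000)
bound = [1.0 / (1.0 - ab) for ab in a_bar]    # geometric-series limit of the state
print(f"spectral radius = {spectral_radius:.4f}")
```

Because every diagonal entry stays strictly between 0 and 1 whatever values the unconstrained parameters take, the repeated update converges toward a bounded state instead of growing with the number of loops, which is the property the summary attributes to the constrained Ā.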

Dan Fu reposted

Mitko Vasilev@iotcoi·15 Apr

In 2026 we can train a looped LLM on one GPU at home They were AI's cold fusion- always 5 years away, theoretically perfect, exploding in training Parcae fixes it with physics (SSM dynamics), matches 2× bigger Transformers Unlocks 3rd scaling law: infinite depth, finite memory

Dan Fu@realDanFu·15 Apr

@BeidiChen Thanks Beidi!

📢 Super excited to announce Parcae!

We've been thinking about scaling laws and the "right" way to get more FLOPs. Turns out layer looping - with the right parameterization - gives you a new axis to scale! Parcae matches Transformers 2x their size (w/ the same data), and outperforms prior formulations of looped models.

But - you need the right parameterization to get these gains against strong Transformer baselines. Looped models are famously unstable to train, with tons of loss spikes and hyperparameter sensitivity.

The main technical challenge with looped models is residual explosion - if you're passing the activations through the same layers over and over, some otherwise benign parameterizations cause huge instability.

Our key idea: we can think of the residual stream of a model as a time-varying dynamical system - the same fundamentals behind SSMs like Mamba and S4. Then a few modest modifications to classic Transformers (stable diagonalization of injection params, LN before embeddings) can stabilize the looped models.

The resulting models are more stable to train, but also reach higher quality. It's strong enough to start to derive new scaling laws. Classically - we know you need to scale parameters with data to be FLOP-optimal. With Parcae, we find a third axis - given fixed parameters, you additionally want to scale FLOPs by looping as you scale data.

Super excited to see how these ideas hold, and what we can do with looped models! Check out @hayden_prairie's great explainer thread below, and see links for our paper, blog, and models. Joint w/ @zacknovack and @BergKirkpatrick, and a fun collab between @togethercompute and my lab at @ucsd_cse. Enjoy!
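To make the "scale looping as you scale data" point concrete, here is a rough arithmetic sketch using the power-law exponents quoted in the Marktechpost summary earlier in this feed (µ_rec ∝ C^0.40, D ∝ C^0.78). The proportionality constants are not given in the feed, so only ratios are computed, and the function name `scaled_budget` is hypothetical.

```python
def scaled_budget(compute_ratio, rec_exponent=0.40, token_exponent=0.78):
    """Multipliers for mean recurrence and training tokens when compute grows.

    Assumption: both quantities follow pure power laws in compute C with the
    exponents quoted in the feed; constants cancel because only ratios are used.
    """
    rec_mult = compute_ratio ** rec_exponent
    token_mult = compute_ratio ** token_exponent
    return rec_mult, token_mult

rec_mult, token_mult = scaled_budget(10.0)
print(f"10x compute -> {rec_mult:.2f}x mean recurrence, {token_mult:.2f}x tokens")
```

Under these assumptions a tenfold compute increase at fixed parameters would go to both more loops and more data, with data growing faster than recurrence depth.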

Beidi Chen@BeidiChen·15 Apr

Check out the great work~

Dan Fu@realDanFu
