Han Guo

3.4K posts

@HanGuo97

PhD Student @MIT_CSAIL | Past: @togethercompute @LTIatCMU @MITIBMLab @UNCNLP, @SFResearch, @BaiduResearch | Machine Learning, NLP.

Joined August 2016
4.4K Following · 3.8K Followers
Han Guo retweeted
Tri Dao @tri_dao
Nonlinear RNNs seem to do something genuinely different from attention and linear RNNs/SSMs. By themselves they already do quite well with the right parametrization, but just one nonlinear RNN layer substantially improves transformer-mamba/deltanet hybrids!
Mayank Mishra @MayankMish98

Introducing M²RNN: Non-Linear RNNs with Matrix-Valued States for Scalable Language Modeling

We bring back non-linear recurrence to language modeling and show it's been held back by small state sizes, not by non-linearity itself.

📄 Paper: arxiv.org/abs/2603.14360
💻 Code: github.com/open-lm-engine…
🤗 Models: huggingface.co/collections/op…

3 replies · 28 reposts · 229 likes · 19.7K views
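The tweet above argues that nonlinear recurrence was held back by small state sizes, not by nonlinearity. As a hedged toy sketch (this is not the paper's actual parametrization: the function names, the tanh update, and the outer-product write are all assumptions of this illustration), a nonlinear RNN with a matrix-valued state can look like:

```python
import numpy as np

def mrnn_step(S, x, Wk, Wv, alpha=0.9):
    """One step of a toy nonlinear RNN with a matrix-valued state.

    S      : (d, d) matrix state (d*d numbers of memory instead of d)
    x      : (d,)   input token embedding
    Wk, Wv : (d, d) projections producing a key/value from the input

    The update passes through tanh, so the recurrence is genuinely
    nonlinear in S, unlike linear SSM/DeltaNet-style affine updates.
    """
    k, v = Wk @ x, Wv @ x
    return np.tanh(alpha * S + np.outer(v, k))

def mrnn_forward(xs, d, seed=0):
    rng = np.random.default_rng(seed)
    Wk = rng.standard_normal((d, d)) / np.sqrt(d)
    Wv = rng.standard_normal((d, d)) / np.sqrt(d)
    S = np.zeros((d, d))
    outs = []
    for x in xs:
        S = mrnn_step(S, x, Wk, Wv)
        outs.append(S @ x)  # read out by querying the matrix state
    return np.stack(outs)

d = 8
xs = np.random.default_rng(1).standard_normal((5, d))
ys = mrnn_forward(xs, d)  # one output vector per input token
```

The point of the matrix state is capacity: recurrent memory scales as d² rather than d, which is the "state size" axis the paper identifies as the real bottleneck.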
Han Guo retweeted
Jyo Pari @jyo_pari
Hard problems require more than bigger models; they require effective exploration at test time. 💡

@aviral_kumar2 will present new approaches for training LMs to scale test-time exploration, including solving IMO-level math problems. 🏅

🗓️ March 19, 4pm ET @scaleml
2 replies · 5 reposts · 94 likes · 8K views
Han Guo retweeted
Kimi.ai @Kimi_Moonshot
Introducing Attention Residuals: rethinking depth-wise aggregation.

Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers.

🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth.
🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale.
🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead.
🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains.

🔗 Full report: github.com/MoonshotAI/Att…
326 replies · 2K reposts · 13.4K likes · 4.8M views
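A minimal sketch of the core idea above, assuming nothing about Moonshot's actual implementation (the single-query form, the shapes, and the random projections here are illustrative stand-ins): instead of summing all previous layer outputs with fixed uniform weights, the current hidden state forms a query and attends over the stack of preceding layers' outputs.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attn_residual(layer_outputs, h, Wq, Wk):
    """Depth-wise attention over preceding layers' outputs.

    layer_outputs : list of (d,) vectors, one per preceding layer
    h             : (d,) current hidden state, which forms the query
    Wq, Wk        : (d, d) learned projections (random stand-ins here)

    A plain residual stream would return sum(layer_outputs) with fixed,
    uniform weights; here the mixing weights are input-dependent.
    """
    H = np.stack(layer_outputs)               # (L, d): one row per layer
    q = Wq @ h
    scores = (H @ Wk.T) @ q / np.sqrt(len(h))
    w = softmax(scores)                       # learned mixing over depth
    return w @ H                              # selective retrieval

rng = np.random.default_rng(0)
d = 16
Wq, Wk = rng.standard_normal((2, d, d)) / np.sqrt(d)
outs = [rng.standard_normal(d) for _ in range(4)]  # 4 preceding layers
h = rng.standard_normal(d)
agg = attn_residual(outs, h, Wq, Wk)  # replaces the uniform residual sum
```

Because the weights come from a softmax, the aggregate stays a convex combination of past representations, which is one way to see why dilution and unbounded hidden-state growth are mitigated.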
Han Guo retweeted
Zhijian Liu @zhijianliu_
DFlash⚡ meets OpenClaw🦞 = FlashClaw. Same Claw, >4X faster or cheaper. DFlash support for Qwen3.5 is live, outperforming native MTP by up to 2.3X. More to come! 🔥
12 replies · 37 reposts · 196 likes · 19.1K views
Han Guo retweeted
Yulu Gan @yule_gan
Simply adding Gaussian noise to LLMs (one step — no iterations, no learning rate, no gradients) and ensembling them can achieve performance comparable to or even better than standard GRPO/PPO on math reasoning, coding, writing, and chemistry tasks. We call this algorithm RandOpt.

To verify that this is not limited to specific models, we tested it on Qwen, Llama, OLMo3, and VLMs.

What's behind this? We find that in the Gaussian search neighborhood around pretrained LLMs, diverse task experts are densely distributed — a regime we term Neural Thickets.

Paper: arxiv.org/pdf/2603.12228
Code: github.com/sunrainyg/Rand…
Website: thickets.mit.edu
86 replies · 431 reposts · 3K likes · 666.1K views
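The recipe as described (one Gaussian step, no gradients, then ensemble) can be sketched on a toy objective. The function below is this editor's illustration, not the released RandOpt code, and averaging the top-k candidates is just one plausible reading of "ensembling":

```python
import numpy as np

def randopt_toy(theta0, score_fn, n_samples=256, sigma=0.5, k=8, seed=0):
    """One-step Gaussian search around 'pretrained' weights theta0:
    sample perturbations, score every candidate, average the top-k.
    No iterations, no learning rate, no gradients."""
    rng = np.random.default_rng(seed)
    cands = theta0 + sigma * rng.standard_normal((n_samples, theta0.size))
    scores = np.array([score_fn(c) for c in cands])
    top = cands[np.argsort(scores)[-k:]]  # best-scoring perturbations
    return top.mean(axis=0)               # "ensemble" by averaging

# Toy task standing in for an LLM benchmark: the score is higher the
# closer we are to a hidden optimum near the initial weights.
target = np.array([0.5, -0.3, 0.2])
score = lambda th: -np.sum((th - target) ** 2)
theta0 = np.zeros(3)                      # the "pretrained" model
theta = randopt_toy(theta0, score)
```

The "Neural Thickets" claim is exactly what makes this work in the toy: good solutions are dense in the Gaussian neighborhood of the starting point, so blind sampling finds them.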
Han Guo retweeted
Seungwook Han @seungwookh
Can language models learn useful priors without ever seeing language?

We pre-pre-train transformers on neural cellular automata — fully synthetic, zero language. This improves language modeling by up to 6%, speeds up convergence by 40%, and strengthens downstream reasoning. Surprisingly, it even beats pre-pre-training on natural text!

Blog: hanseungwook.github.io/blog/nca-pre-p…

(1/n)
48 replies · 258 reposts · 1.7K likes · 239.6K views
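To make "fully synthetic, zero language" concrete, here is a sketch of cellular-automaton data generation. The blog uses *neural* cellular automata; a classic elementary CA (Rule 110) is used below as a deliberately simpler stand-in, so every name and parameter here is this sketch's assumption, not the blog's setup:

```python
import numpy as np

def ca_sequences(rule=110, width=16, steps=8, n=4, seed=0):
    """Roll out an elementary cellular automaton and flatten each run
    into a binary token stream, i.e. language-free pretraining data."""
    table = [(rule >> i) & 1 for i in range(8)]  # rule as a lookup table
    rng = np.random.default_rng(seed)
    seqs = []
    for _ in range(n):
        row = rng.integers(0, 2, width)          # random initial state
        rows = [row]
        for _ in range(steps - 1):
            left, right = np.roll(row, 1), np.roll(row, -1)
            row = np.array([table[4 * a + 2 * b + c]
                            for a, b, c in zip(left, row, right)])
            rows.append(row)
        seqs.append(np.concatenate(rows))        # grid -> 1D token stream
    return np.stack(seqs)

data = ca_sequences()  # (4, 128) binary sequences for pre-pre-training
```

A transformer trained next-token-style on such streams has to learn local rule induction and copying, which is one intuition for why the learned priors could transfer to language.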
Han Guo retweeted
Xinghong (Shin) Fu @shinfxh
just got claude to explain attention matching and it made this interactive heatmap to show the relative importance of each layer/head! this might just be better than the diagrams in our own paper...
1 reply · 5 reposts · 57 likes · 2.9K views
Han Guo retweeted
Bryan Catanzaro @ctnzr
Announcing NVIDIA Nemotron 3 Super!

💚 120B-12A Hybrid SSM Latent MoE, designed for Blackwell
💚 36 on AAIndex v4
💚 Up to 2.2X faster than GPT-OSS-120B in FP4
💚 Open data, open recipe, open weights

Models, tech report, etc. here: research.nvidia.com/labs/nemotron/…

And yes, Ultra is coming!
62 replies · 205 reposts · 1.2K likes · 200.3K views
Han Guo retweeted
Zhijian Liu @zhijianliu_
ParoQuant just got a big upgrade 🚀

✅ Supports the new Qwen3.5 models
⚡ Now runs on MLX (fast local inference on Apple Silicon)
🧠 Preserves reasoning quality with 4-bit quantization

We also built an agent demo running locally on my 4-year-old M2 Max. Can't wait to upgrade to an M5 Max and see what kind of magic we can do. ✨
Zhijian Liu @zhijianliu_

Reasoning LLMs generate very long chains-of-thought, so even small quantization errors add up. With AWQ, Qwen3-4B drops 71.0 → 68.2 on MMLU-Pro (~4% relative loss). 😬

ParoQuant fixes this! It keeps only the critical rotation pairs and fuses everything into a single kernel. It recovers most of the lost reasoning accuracy with minimal overhead — so 4-bit models stay strong at reasoning. 💪💪

14 replies · 30 reposts · 221 likes · 42.1K views
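For context on where the error that ParoQuant fights comes from, here is a minimal symmetric round-to-nearest 4-bit quantizer. This is a generic baseline of this editor's making; ParoQuant's rotation pairs and AWQ's activation-aware scaling are refinements over this kind of baseline, and neither is implemented here:

```python
import numpy as np

def quant4_dequant(W):
    """Symmetric per-row (per-output-channel) 4-bit round-to-nearest
    quantization followed by dequantization. The gap between W and the
    result is the rounding error that accumulates over long
    chains-of-thought."""
    s = np.abs(W).max(axis=1, keepdims=True) / 7.0  # scale to int4 range
    q = np.clip(np.round(W / s), -8, 7)             # 16 integer levels
    return q * s

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))   # a toy weight matrix
W_hat = quant4_dequant(W)
err = np.abs(W - W_hat).mean()      # mean absolute rounding error
```

Each entry is off by at most half a quantization step, which is tiny per weight but compounds across layers and thousands of generated tokens, matching the MMLU-Pro drop described above.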
Han Guo retweeted
PyTorch @PyTorch
Building on the previous correctness-focused pipeline, KernelAgent can now integrate GPU hardware-performance signals into a closed-loop multi-agent workflow to guide the optimization of Triton kernels. Learn more: hubs.la/Q045Wsqq0 @KaimingCheng @marksaroufim
3 replies · 20 reposts · 90 likes · 22.5K views
Han Guo retweeted
Peter Hase @peterbhase
Can we train models to have more monitorable CoT? We introduce Counterfactual Simulation Training to improve CoT faithfulness/monitorability. CST produces models that admit to reward hacking and deferring too much to Stanford profs (@chrisgpotts told me this is very dangerous)
12 replies · 36 reposts · 208 likes · 20.6K views
Han Guo retweeted
Linlu Qiu @linluqiu
Check out the updated version of our paper and the new blog post! research.google/blog/teaching-…
Tal Linzen @tallinzen

New version of @linluqiu's heroic Google Student Research project, with a lot more experiments! I think it's a nice demonstration of why LLM fine-tuning works so well: you fine-tune the models to adapt to users by having them mimic the optimal Bayesian way to adapt, and they generalize this ability to other contexts.

2 replies · 5 reposts · 41 likes · 6.1K views
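The quoted thread describes fine-tuning models to mimic "the optimal Bayesian way to adapt". In the simplest setting that phrase has an exact form, the Beta-Bernoulli posterior update sketched below; the coin-flip setup is this sketch's choice, not necessarily the paper's task:

```python
from fractions import Fraction

def bayes_adapt(observations, a=1, b=1):
    """Optimal Bayesian adaptation to a stream of 0/1 user feedback:
    start from a Beta(a, b) prior on the unknown rate, fold in each
    observation, and return the posterior-predictive P(next = 1).
    A model trained to 'adapt like a Bayesian' should mimic this value."""
    for x in observations:
        a, b = a + x, b + (1 - x)
    return Fraction(a, a + b)  # posterior mean under Beta(a, b)

p = bayes_adapt([1, 1, 0, 1])  # → Fraction(2, 3)
```

The appeal of such a target is that it is well defined for any prefix of observations, so a sequence model can be supervised to match it step by step.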
Han Guo retweeted
Itamar Pres @PresItamar
New paper: It's time to optimize for 🔁 self-consistency 🔁

We've pushed LLMs to the limits of available data, yet failures like sycophancy and factual inconsistency persist. We argue these stem from the same assumption: that behavior can be specified one I/O pair at a time. 🧵
16 replies · 55 reposts · 424 likes · 70.6K views
Han Guo retweeted
Ted Zadouri @tedzadouri
Asymmetric hardware scaling is here. Blackwell tensor cores are now so fast that exp2 and shared memory are the wall. FlashAttention-4 changes the algorithm & pipeline so that softmax & SMEM bandwidth no longer dictate speed. Attention reaches ~1600 TFLOPs, pretty much at matmul speed!

Joint work w/ Markus Hoehnerbach, Jay Shah (@ultraproduct), Timmy Liu, Vijay Thakkar (@__tensorcore__), Tri Dao (@tri_dao)

1/
6 replies · 132 reposts · 780 likes · 219.4K views
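Why does exp2 show up as the wall? GPU special-function units expose a fast base-2 exponential, so attention kernels compute softmax via exp2 with a log2(e) rescale. The tiny sketch below shows only that identity in NumPy; it says nothing about FlashAttention-4's actual pipeline or scheduling:

```python
import numpy as np

LOG2E = 1.4426950408889634  # log2(e)

def softmax_exp2(x):
    """Row softmax via base-2 exponentials: exp(z) == exp2(z * log2(e)).
    Subtracting the row max first is the usual numerical-stability step
    that flash-attention-style kernels also perform."""
    m = x.max(axis=-1, keepdims=True)
    e = np.exp2((x - m) * LOG2E)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[1.0, 2.0, 3.0]])
probs = softmax_exp2(x)  # matches an exp-based softmax
```

Folding the log2(e) factor into the attention scale is free, so the kernel never needs a base-e exponential at all; the tweet's point is that even this cheap exp2 path has become a bottleneck relative to Blackwell's matmul throughput.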