Will Bui

2

164

Flapping Airplanes@flappyairplanes·2d

(4/5) One thing we’ve built is a “kittens” virtual machine that takes over the whole GPU and allows new kinds of co-optimization. We can go past the traditional sequential kernel model – for example, fusing entire training runs into a single kernel and even weirder stuff.

English

27

55

664

230.8K

Flapping Airplanes@flappyairplanes·2d

(1/5) Great to be at @sequoia to give a sneak peek of one of our research directions! TL;DR one path to data-efficiency may be to “abuse GPUs like they’ve never been abused before”

English

13

66

950

143.3K

Will Bui@will_ea·2d

@m_sirovatka as a vllm contributor, i approve this post

English

2

83

Matej Sirovatka@m_sirovatka·3d

I have now officially became one of the vLLM contributors, after long months of hard work. It has been a long and hard journey, I would like to thank my family, friends and my company for supporting me along the way. nah jk my 3 line PR just got merged 🫡

English

16

2

315

10.8K

Will Bui@will_ea·3d

@retr0sushi_ Be around those with better taste than you

English

0

5

770

himanshu@retr0sushi_·4d

how does one develop a taste for research? explain to me like i am a guy entering college with unlimited energy and enthusiasm

English

37

1

175

17.7K

Will Bui@will_ea·4d

@anneouyang @Norapom04 cattention

English

4

378

Anne Ouyang@anneouyang·4d

TIL the Huawei Ascend linear algebra kernels library is called "CATLASS" 🐱

English

8

11

149

12.4K

Will Bui@will_ea·4d

@danofer @LLMenjoyer Great to know!!!

English

2

220

Dan Ofer (Was @ICML,@Worldcon )@danofer·4d

@will_ea @LLMenjoyer Dayum. (My implementation really got about ~6-8 sec/it, and hence my dropping it in favour of going to 1.6 it/sec). On a tiny mlm.

English

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

0

1

275

Will Bui@will_ea·5d

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

Kimi.ai@Kimi_Moonshot

English

19

82

771

72.9K

Will Bui@will_ea·4d

@Birchlabs @LLMenjoyer okay tbh jvp support isn't planned at the moment but PRs accepted! :)

English

0

1

72

Birchlabs@Birchlabs·4d

@will_ea @LLMenjoyer I am just a guy I don’t have a budget

English

0

1

72

Will Bui@will_ea·4d

@Birchlabs @LLMenjoyer if you hire me to do so, yes.

English

0

2

60

Birchlabs@Birchlabs·4d

@will_ea @LLMenjoyer will you add JVP support?

English

0

1

79

Will Bui@will_ea·4d

@Birchlabs @LLMenjoyer The benchmarking script and output here verified gradient correctness github.com/catswe/flash-a…. I also did some private internal tests verifying output correctness. More thorough tests will be added in the future.

English

0

3

122

Birchlabs@Birchlabs·4d

@will_ea @LLMenjoyer the kind of test I'm asking for is something like what I did in this gist.github.com/Birch-san/3e0b… it also shows you how to add (and test) JVP support to your triton attention kernel, which is crucial for supporting MeanFlow models.

English

0

3

327

Will Bui@will_ea·4d

@slLuxia @LLMenjoyer Gotcha. I’m keeping that in mind for future development

English

2

62

Luxia 🔮@slLuxia·4d

@will_ea @LLMenjoyer yeah it's orchestration level more than the raw ops; but it is important because that's double the layer routing overhead in practice and where the chunk of the slowdown is reported from if i remember right in the original paper (and why they do blocks vs all layers)

English

0

1

41

Will Bui@will_ea·4d

@slLuxia @LLMenjoyer The package exposes lower-level phase 1/phase 2 ops that users can use to compose the routing however they want. I believe what you’re referring to is the experimental API/example, which is still rough around the edges and mainly meant for research/prototyping.

English

0

2

72

Luxia 🔮@slLuxia·4d

@will_ea @LLMenjoyer hmm; it seems like you drop some routing; in the original paper they do 2x per block for pre-MLP & pre-attention but your repo does each block as a single layer. can see in fig 2 and section 2 of the moonshot paper. is this intended?

English

0

1

72

Will Bui@will_ea·4d

@slLuxia @LLMenjoyer Yes it allows arbitrary blocks.

English

3

701

Luxia 🔮@slLuxia·4d

@will_ea @LLMenjoyer should be agnostic to block size/allow arbitrary blocks yeah? i find maintaining predicted interlayer circuitry gives the best perf for attnres

English

0

1

956

Will Bui@will_ea·4d

@ricklamers indeed. im personally very excited to contribute to the ai infra space

English

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

6

61

Rick Lamers@ricklamers·4d

software stack matters, a lot

Will Bui@will_ea

English

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

3

852

Will Bui@will_ea·5d

@Kimi_Moonshot I loved this paper so much I built the first public two-phase kernels for AttnRes ;) x.com/will_ea/status…

Will Bui@will_ea

English

2

45

Kimi.ai@Kimi_Moonshot·16 Mar

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

337

2.1K

13.6K

5M

Will Bui@will_ea·5d

@LLMenjoyer this is actually pretty fire now that i think more about it...

English

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

4

80

llm_enjoyer@LLMenjoyer·5d

this mfer is dropping speedful attention residuals kernels today i tried to tell bro to use the good diagrams, but bro didn't listen...... sad!

Will Bui@will_ea

English

47

3.2K

Will Bui@will_ea·5d

Check it out here: github.com/catswe/flash-a…

English

2

25

2.1K

Will Bui@will_ea·6d

@MainzOnX @A_K_Nain i love writing kernels... unless i'm racing against a deadline

English

0

1

29

Adam Mainz@MainzOnX·6d

@A_K_Nain Yeah no one tells you that it’s actually not the fun part 😂 I’m writing backwards ops for 25% of my time last week and probably next week it’s no fun

English

3

0

16

871

Aakash Kumar Nain@A_K_Nain·6d

I hate writing kernels. We shouldn't be doing this at least frequently. I hope the next gen compilers are much better and smart.

English

5

0

42

3.6K

Will Bui@will_ea·1 May

@blelbach I spent a couple weeks just writing kernels

English

66

Bryce, the CUDA Colonel@blelbach·30 Nis

I have not worked hours this long since the start of my career. So much to do these days. So many possibilities that are now unlocked.

English

4

0

47

4.1K

Will Bui@will_ea·28 Nis

@henrylhtsang I mean you need a decent amount of prerequisites knowledge to start with any substantial kernel work. The actual kernels may not seem like much but to get to that point requires narrowing down a bunch of ways to do it optimally that having experience would help.

English

195

henry tsang@henrylhtsang·28 Nis

one impression of working a new grad: they can seem... slow? like slow to finish a task and turn around in general is that what my TL thought of me when I first started

English

4

0

18

2.5K

Will Bui@will_ea·28 Nis

@Alexintosh @ritakozlov How is this AI slop, Alex?

English