Will (@will_ea) - Twitter Profili | Zamantika Mersobahis Locabet

Sabitlenmiş Tweet

Will@will_ea·20h

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

Kimi.ai@Kimi_Moonshot

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

16

61

571

49.8K

Will@will_ea·8h

@Birchlabs @LLMenjoyer okay tbh jvp support isn't planned at the moment but PRs accepted! :)

English

1

0

30

Birchlabs@Birchlabs·9h

@will_ea @LLMenjoyer I am just a guy I don’t have a budget

English

1

0

41

Will@will_ea·20h

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

Kimi.ai@Kimi_Moonshot

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

16

61

571

49.8K

Will@will_ea·9h

@Birchlabs @LLMenjoyer if you hire me to do so, yes.

English

1

0

1

27

Birchlabs@Birchlabs·9h

@will_ea @LLMenjoyer will you add JVP support?

English

1

0

47

Will@will_ea·9h

@Birchlabs @LLMenjoyer The benchmarking script and output here verified gradient correctness github.com/catswe/flash-a…. I also did some private internal tests verifying output correctness. More thorough tests will be added in the future.

English

2

0

56

Birchlabs@Birchlabs·9h

@will_ea @LLMenjoyer the kind of test I'm asking for is something like what I did in this gist.github.com/Birch-san/3e0b… it also shows you how to add (and test) JVP support to your triton attention kernel, which is crucial for supporting MeanFlow models.

English

1

0

2

161

Will@will_ea·10h

@slLuxia @LLMenjoyer Gotcha. I’m keeping that in mind for future development

English

0

2

16

Luxia 🔮@slLuxia·10h

@will_ea @LLMenjoyer yeah it's orchestration level more than the raw ops; but it is important because that's double the layer routing overhead in practice and where the chunk of the slowdown is reported from if i remember right in the original paper (and why they do blocks vs all layers)

English

1

0

1

20

Will@will_ea·10h

@slLuxia @LLMenjoyer The package exposes lower-level phase 1/phase 2 ops that users can use to compose the routing however they want. I believe what you’re referring to is the experimental API/example, which is still rough around the edges and mainly meant for research/prototyping.

English

1

0

2

24

Luxia 🔮@slLuxia·13h

@will_ea @LLMenjoyer hmm; it seems like you drop some routing; in the original paper they do 2x per block for pre-MLP & pre-attention but your repo does each block as a single layer. can see in fig 2 and section 2 of the moonshot paper. is this intended?

English

1

0

1

45

Will@will_ea·10h

@slLuxia @LLMenjoyer Yes it allows arbitrary blocks.

English

0

3

383

Luxia 🔮@slLuxia·13h

@will_ea @LLMenjoyer should be agnostic to block size/allow arbitrary blocks yeah? i find maintaining predicted interlayer circuitry gives the best perf for attnres

English

2

0

1

610

Will@will_ea·14h

@ricklamers indeed. im personally very excited to contribute to the ai infra space

English

0

4

45

Rick Lamers@ricklamers·15h

software stack matters, a lot

Will@will_ea

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

English

1

3

684

Will@will_ea·19h

@Kimi_Moonshot I loved this paper so much I built the first public two-phase kernels for AttnRes ;) x.com/will_ea/status…

Will@will_ea

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

English

0

1

28

Kimi.ai@Kimi_Moonshot·16 Mar

Introducing 𝑨𝒕𝒕𝒆𝒏𝒕𝒊𝒐𝒏 𝑹𝒆𝒔𝒊𝒅𝒖𝒂𝒍𝒔: Rethinking depth-wise aggregation. Residual connections have long relied on fixed, uniform accumulation. Inspired by the duality of time and depth, we introduce Attention Residuals, replacing standard depth-wise recurrence with learned, input-dependent attention over preceding layers. 🔹 Enables networks to selectively retrieve past representations, naturally mitigating dilution and hidden-state growth. 🔹 Introduces Block AttnRes, partitioning layers into compressed blocks to make cross-layer attention practical at scale. 🔹 Serves as an efficient drop-in replacement, demonstrating a 1.25x compute advantage with negligible (<2%) inference latency overhead. 🔹 Validated on the Kimi Linear architecture (48B total, 3B activated parameters), delivering consistent downstream performance gains. 🔗Full report: github.com/MoonshotAI/Att…

English

337

2.1K

13.6K

5M

Will@will_ea·19h

@LLMenjoyer this is actually pretty fire now that i think more about it...

English

0

3

66

llm_enjoyer@LLMenjoyer·20h

this mfer is dropping speedful attention residuals kernels today i tried to tell bro to use the good diagrams, but bro didn't listen...... sad!

Will@will_ea

27x faster Attention Residuals!!! 🚀 We implemented Block AttnRes as a pip-installable package. !pip install flash-attn-res No annoying kernel nonsense. No compile/autograd plumbing. Call it like a regular PyTorch op. It just works. Methodology: 🔹 fused triton kernels 🔹 batched attention over residual blocks 🔹 online-softmax merge 🔹 flash attention-style split-KV reduction Thanks @LLMenjoyer and @cartesia for the support and guidance✌️

English

1

2

43

2.9K

Will@will_ea·20h

Check it out here: github.com/catswe/flash-a…

English

0

1

15

1.4K

Will@will_ea·1d

@MainzOnX @A_K_Nain i love writing kernels... unless i'm racing against a deadline

English

2

0

1

25

Adam Mainz@MainzOnX·2d

@A_K_Nain Yeah no one tells you that it’s actually not the fun part 😂 I’m writing backwards ops for 25% of my time last week and probably next week it’s no fun

English

3

0

16

865

Aakash Kumar Nain@A_K_Nain·2d

I hate writing kernels. We shouldn't be doing this at least frequently. I hope the next gen compilers are much better and smart.

English

5

0

42

3.6K

Will@will_ea·3d

@blelbach I spent a couple weeks just writing kernels

English

0

62

Bryce, the CUDA Colonel@blelbach·4d

I have not worked hours this long since the start of my career. So much to do these days. So many possibilities that are now unlocked.

English

4

0

47

4.1K

Will@will_ea·5d

@henrylhtsang I mean you need a decent amount of prerequisites knowledge to start with any substantial kernel work. The actual kernels may not seem like much but to get to that point requires narrowing down a bunch of ways to do it optimally that having experience would help.

English

0

191

henry tsang@henrylhtsang·6d

one impression of working a new grad: they can seem... slow? like slow to finish a task and turn around in general is that what my TL thought of me when I first started

English

4

0

18

2.5K

Will@will_ea·5d

@Alexintosh @ritakozlov How is this AI slop, Alex?

English

0

2

86

alexintosh@Alexintosh·5d

@ritakozlov ai slop spotted

Nederlands

3

0

6

758

rita kozlov 🐀@ritakozlov·5d

i've picked up the pen so many times to write about being a woman in tech and every time i chicken out because there's this catch-22: to talk about being a woman in tech, you need to have credibility. and once you start talking about it as a woman, you lose said credibility so i'm going to mortgage some of my credibility to get this off my chest, as someone who has both had a pretty successful career in tech, and leads a team with a lot of women on it: every woman you work with has had the most insane shit happen to her — on an almost daily basis. shit that makes you look at the camera and go "how did i end up here". from wild remarks about appearance to stalking and trauma dumping, and just constant dismissal from so many directions (employees, customers...). shit that you never tell anyone because they wouldn't believe you... i recently learned that like 97% of my followers on here are men. so my challenge to you is just to sit with that for a moment. you don't need to do anything about it (other than try not to be that person). but you should be aware that that's what every woman you work with deals with

English

61

124

1.2K

75K

Will@will_ea·26 Nis

@valigo May as well use Claude/GPT/Gemini to do the resume screening for you. Insane world.

English

0

43

Valentin Ignatev@valigo·25 Nis

If you still don't hate recruiters enough, or if you are still not convinced that hr is the most incompetent, harmful, and entitled industry in tech, just watch this video. But TLDR is "I have 20 seconds to find 5 keywords in your CV, but nobody taught me how to use Ctrl+F"

English

57

98

2.4K

115K

Will@will_ea·26 Nis

GPT 5.5 is goated at kernels. Far better than Opus 4.7. The AI-generated kernels thesis is increasingly looking viable. Btw, I love what you guys are building @makora_ai

English

0

4

173

Will@will_ea·26 Nis

@LLMenjoyer i aspire to be a memer just like you

English

0

2

786

llm_enjoyer@LLMenjoyer·25 Nis

reading the deepseek v4 report be like