Reuben Stern
@ReubenConducts

30 posts

Research Scientist at Colfax International. Conductor, bassoonist, mathematician, wannabe powerlifter. They/them

Boston, MA · Joined May 2019
74 Following · 85 Followers
Reuben Stern @ReubenConducts
We have a new blog post about cluster launch control (CLC) on NVIDIA Blackwell GPUs! CLC is a powerful tool for dynamically scheduling work across the GPU, both within and across kernels. research.colfax-intl.com/dynamic-persis…
1 reply · 7 reposts · 52 likes · 2.9K views
Reuben Stern @ReubenConducts
My colleagues Jack Carlisle and Jay Shah gave a fantastic lecture for @GPU_MODE yesterday on our categorical foundations for CuTe layout algebra! They were joined by Cris Cecka, the inventor of CuTe, and @marksaroufim as moderators. Bravi tutti! youtu.be/MVh_guNbWMA?si…
2 replies · 9 reposts · 58 likes · 5.3K views
Reuben Stern @ReubenConducts
@charles_irl pmpp and the rising sea are the only two books anyone could ever need!
0 replies · 0 reposts · 1 like · 103 views
Charles 🎉 Frye @charles_irl
I am the very model of a modern engineer-general, I have citations technical, go-to-market, and spiritual, I know the emperors of Rome and algebras categorical, I'm well-acquainted too with databases relational, I've written CUDA kernels memory and compute-maxxical...
25 replies · 17 reposts · 515 likes · 30.5K views
driss guessous @drisspg
Same question but O(new operators needed). Obviously we all want flex linear attention, but what else? SOL quant kernels - do we need to add some common fusions for them in eager? What about jitted GEMM+epilogue without full compile machinery?
Edward Z. Yang @ezyang

Give me your O(num operators) PyTorch improvement ideas that you are interested in. Historical examples: making every kernel deterministic / support zero size. Not done: every kernel in your favorite DSL / batch invariant / masked / padded / device side size

1 reply · 2 reposts · 22 likes · 4.1K views
Reuben Stern @ReubenConducts
@drisspg the answer of course is "more than you think; not as much as you want"
1 reply · 0 reposts · 0 likes · 62 views
driss guessous @drisspg
“You aren’t a researcher” ohh yeah? I’m researching the limits of how much butter and bread the human body can consume. We are not the same.
3 replies · 0 reposts · 48 likes · 5.1K views
Reuben Stern retweeted
Ant Ling @AntLingAGI
🚀 Linear Attention is unlocking million-token context windows by dropping computational complexity from O(N^2) to O(N), but software is increasingly bottlenecking the hardware. Meet cuLA (CUDA Linear Attention): hand-written kernels using CuTe DSL & CUTLASS C++ to extract maximum performance on NVIDIA GPUs. A drop-in replacement for FLA designed to push hardware to its absolute limits.
6 replies · 49 reposts · 388 likes · 91.1K views
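To make the O(N^2) → O(N) claim concrete, here is a toy PyTorch reference for (non-causal) linear attention, assuming the common ELU+1 feature map; this is a sketch of the idea only, not cuLA's kernels. The causal variant replaces the single key/value summary with a prefix scan over the per-position updates, which is what keeps the recurrence O(N).

import torch

def linear_attention(q, k, v, eps=1e-6):
    # Toy single-head reference (sketch, not cuLA): softmax(QK^T)V is
    # replaced by phi(Q) (phi(K)^T V), so the N x N attention matrix is
    # never materialized; cost is O(N d^2) instead of O(N^2 d).
    # q, k, v: (N, d). The ELU+1 feature map is one common, assumed choice.
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    q, k = phi(q), phi(k)
    kv = k.transpose(0, 1) @ v     # (d, d) summary of all key/value pairs
    z = k.sum(dim=0)               # (d,) normalizer
    return (q @ kv) / (q @ z).unsqueeze(-1).clamp_min(eps)

q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = linear_attention(q, k, v)    # (1024, 64); no 1024 x 1024 matrix formed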
Reuben Stern retweeted
Cursor @cursor_ai
Thank you to the companies and open-source communities behind Kimi K2.5, Ray, ThunderKittens, PyTorch, and more. We'd also like to thank Fireworks and Colfax for their collaboration and partnership.
9 replies · 8 reposts · 298 likes · 73.5K views
Reuben Stern retweeted
PyTorch @PyTorch
PyTorch 2.11 is now available, featuring 2,723 commits from 432 contributors since PyTorch 2.10. This release prioritizes performance scaling for distributed training and next-generation hardware architectures. Highlights include a FlashAttention-4 backend for FlexAttention on Hopper and Blackwell GPUs, Differentiable Collectives for distributed training, and performance optimizations for Intel GPUs via XPU Graph. This release also delivers comprehensive operator expansion for Apple Silicon (MPS) and RNN/LSTM GPU export support. 🖇️ Read the PyTorch 2.11 release blog and release notes: pytorch.org/blog/pytorch-2… #PyTorch #OpenSource #AIInfrastructure
13 replies · 85 reposts · 622 likes · 58.7K views
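For anyone who hasn't tried FlexAttention: the user-facing call does not change with the new backend; per the release notes, dispatch to FlashAttention-4 happens under the hood on Hopper and Blackwell GPUs. A minimal sketch with a causal score_mod, using the existing torch.nn.attention.flex_attention API:

import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions by sending their scores to -inf.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

B, H, N, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.float16) for _ in range(3))

flex = torch.compile(flex_attention)   # compiling fuses score_mod into the kernel
out = flex(q, k, v, score_mod=causal)  # (B, H, N, D)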
Reuben Stern retweeted
Reuben Stern @ReubenConducts
@drisspg @gaunernst It's pretty close, but FA-4 is in particular missing some of the inference optimizations that FA-3 has, such as CUDA-graphability via dynamic scheduling metadata. Coming soon, though!
0 replies · 0 reposts · 0 likes · 61 views
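For context on "CUDA-graphability": at inference, capturing the attention call into a CUDA graph removes per-step launch overhead. Below is a generic capture-and-replay sketch with static shapes using standard torch.cuda.graph; it illustrates the payoff, not FA-3's internal scheduling-metadata mechanism.

import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1, 64, device="cuda", dtype=torch.float16)    # single decode query
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

F.scaled_dot_product_attention(q, k, v)   # warm-up before capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                 # record the kernel launches once
    out = F.scaled_dot_product_attention(q, k, v)

q.copy_(torch.randn_like(q))   # update inputs in place (shapes stay static)...
g.replay()                     # ...then replay without launch overhead; result lands in `out`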
driss guessous @drisspg
@gaunernst FWIW it's already there, but that being said, I would argue they are at different levels of code maturity / feature coverage.
1 reply · 0 reposts · 4 likes · 243 views
driss guessous @drisspg
While all the hype has been around FA4, there are still a lot of users of FA3, and hopefully this will help make that easier :) And similarly, you can use this with SDPA: github.com/pytorch/pytorc…
angel @liangel02

In collaboration with Tri Dao and xformers, PyTorch has uploaded official Flash Attention 3 wheels to download.pytorch.org/whl/flash-attn…. These wheels support various CUDA versions (12.6+, 13), CPUs (x86, ARM) and OS (Linux, Windows), and are compatible w/ Python 3.10+, torch 2.9+.

1 reply · 1 repost · 34 likes · 3.6K views
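For reference, pinning SDPA to the flash backend uses the existing torch.nn.attention context manager; whether dispatch actually reaches the new FA3 wheels depends on your install and hardware, so treat this as a sketch:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

B, H, N, D = 2, 8, 2048, 64
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Restrict SDPA dispatch to the flash backend for this block.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)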
Reuben Stern retweeted
Belinda Li @belindazli
New blog post on introspection for interpretability, and why I think training models to self-explain is a promising frontier for interpretability research:
8 replies · 37 reposts · 242 likes · 21.6K views
Reuben Stern retweeted
Hieu Pham @hyhieu226
Friends at @colfaxintl released excellent work on the mathematical foundation of CuTe layouts. CuTe layouts are central to the modern programming models on NVIDIA GPUs. You can (almost) ditch C++, but you cannot ditch CuTe. In fact, you can (almost) ditch C++ because of a thing called the Python CuTe DSL. And to use that, you must know CuTe layouts. Despite their central role, CuTe layouts are highly unintuitive. I think you can only make sense of CuTe layouts with some mathematical guarantees about their peculiar behaviors. Colfax friends provided that 👇 It's amazing that the best things are free.
2 replies · 15 reposts · 169 likes · 15.6K views
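For the unfamiliar, the object in question is small enough to state here. This is the standard shape/stride definition of a layout as a function on indices, sketched for orientation; the categorical machinery in the Colfax work builds on it:

\[
  L = S{:}D, \quad S = (s_0,\dots,s_{k-1}),\ D = (d_0,\dots,d_{k-1}),
  \qquad
  L(x_0,\dots,x_{k-1}) = \sum_{i=0}^{k-1} x_i\, d_i, \quad 0 \le x_i < s_i,
\]
where a flat index $n$ is first unpacked colexicographically,
$x_i = \left\lfloor n \big/ \prod_{j<i} s_j \right\rfloor \bmod s_i$.
For example, the layout $(4,2){:}(2,1)$ sends $n = 0,1,\dots,7$ to $0,2,4,6,1,3,5,7$.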