Reuben Stern
@ReubenConducts

30 posts

Research Scientist at Colfax International. Conductor, bassoonist, mathematician, wannabe powerlifter. They/them

Boston, MA · Joined May 2019
74 Following · 85 Followers
Reuben Stern @ReubenConducts
We have a new blog post about cluster launch control (CLC) on NVIDIA Blackwell GPUs! CLC is a powerful tool for dynamically scheduling work across the GPU, both within and across kernels. research.colfax-intl.com/dynamic-persis…
1 reply · 7 reposts · 52 likes · 2.9K views
Reuben Stern @ReubenConducts
My colleagues Jack Carlisle and Jay Shah gave a fantastic lecture for @GPU_MODE yesterday on our categorical foundations for CuTe layout algebra! They were joined by Cris Cecka, the inventor of CuTe, and @marksaroufim as moderators. Bravi tutti! youtu.be/MVh_guNbWMA?si…
2 replies · 9 reposts · 58 likes · 5.3K views
Reuben Stern @ReubenConducts
@charles_irl pmpp and the rising sea are the only two books anyone could ever need!
0 replies · 0 reposts · 1 like · 103 views
Charles 🎉 Frye @charles_irl
I am the very model of a modern engineer-general, I have citations technical, go-to-market, and spiritual, I know the emperors of Rome and algebras categorical, I'm well-acquainted too with databases relational, I've written CUDA kernels memory and compute-maxxical...
25 replies · 17 reposts · 515 likes · 30.5K views
driss guessous @drisspg
Same question but O(new operators needed). Obviously we all want flex linear attention, but what else? SOL quant kernels - do we need to add some common fusions for them in eager? What about jitted GEMM+epilogue without full compile machinery?
Edward Z. Yang @ezyang

Give me your O(num operators) PyTorch improvement ideas that you are interested in. Historical examples: making every kernel deterministic / support zero size. Not done: every kernel in your favorite DSL / batch invariant / masked / padded / device side size

1 reply · 2 reposts · 22 likes · 4.1K views
Reuben Stern @ReubenConducts
@drisspg the answer of course is "more than you think; not as much as you want"
1 reply · 0 reposts · 0 likes · 62 views
driss guessous @drisspg
“You aren’t a researcher” ohh yeah? I’m researching the limits of how much butter and bread the human body can consume. We are not the same.
3 replies · 0 reposts · 48 likes · 5.1K views
Reuben Stern retweeted
Ant Ling @AntLingAGI
🚀 Linear Attention is unlocking million-token context windows by dropping computational complexity from O(N^2) to O(N), but software is increasingly bottlenecking the hardware. Meet cuLA (CUDA Linear Attention): hand-written kernels using CuTe DSL & CUTLASS C++ to extract maximum performance on NVIDIA GPUs. A drop-in replacement for FLA designed to push hardware to its absolute limits.
6 replies · 49 reposts · 388 likes · 91.1K views
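To make the O(N^2) → O(N) claim concrete, here is a toy PyTorch reference for (non-causal) linear attention, assuming the common ELU+1 feature map; this is a sketch of the idea only, not cuLA's kernels. The causal variant replaces the single key/value summary with a prefix scan over the per-position updates, which is what keeps the recurrence O(N).

import torch

def linear_attention(q, k, v, eps=1e-6):
    # Toy single-head reference (sketch, not cuLA): softmax(QK^T)V is
    # replaced by phi(Q) (phi(K)^T V), so the N x N attention matrix is
    # never materialized; cost is O(N d^2) instead of O(N^2 d).
    # q, k, v: (N, d). The ELU+1 feature map is one common, assumed choice.
    phi = lambda x: torch.nn.functional.elu(x) + 1.0
    q, k = phi(q), phi(k)
    kv = k.transpose(0, 1) @ v     # (d, d) summary of all key/value pairs
    z = k.sum(dim=0)               # (d,) normalizer
    return (q @ kv) / (q @ z).unsqueeze(-1).clamp_min(eps)

q, k, v = (torch.randn(1024, 64) for _ in range(3))
out = linear_attention(q, k, v)    # (1024, 64); no 1024 x 1024 matrix formed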
Reuben Stern retweeted
Cursor @cursor_ai
Thank you to the companies and open-source communities behind Kimi K2.5, Ray, ThunderKittens, PyTorch, and more. We'd also like to thank Fireworks and Colfax for their collaboration and partnership.
9 replies · 8 reposts · 298 likes · 73.5K views
Reuben Stern retweeted
PyTorch @PyTorch
PyTorch 2.11 is now available, featuring 2,723 commits from 432 contributors since PyTorch 2.10. This release prioritizes performance scaling for distributed training and next-generation hardware architectures. Highlights include a FlashAttention-4 backend for FlexAttention on Hopper and Blackwell GPUs, Differentiable Collectives for distributed training, and performance optimizations for Intel GPUs via XPU Graph. This release also delivers comprehensive operator expansion for Apple Silicon (MPS) and RNN/LSTM GPU export support. 🖇️ Read the PyTorch 2.11 release blog and release notes: pytorch.org/blog/pytorch-2… #PyTorch #OpenSource #AIInfrastructure
13 replies · 85 reposts · 622 likes · 58.7K views
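For anyone who hasn't tried FlexAttention: the user-facing call does not change with the new backend; per the release notes, dispatch to FlashAttention-4 happens under the hood on Hopper and Blackwell GPUs. A minimal sketch with a causal score_mod, using the existing torch.nn.attention.flex_attention API:

import torch
from torch.nn.attention.flex_attention import flex_attention

def causal(score, b, h, q_idx, kv_idx):
    # Mask out future positions by sending their scores to -inf.
    return torch.where(q_idx >= kv_idx, score, -float("inf"))

B, H, N, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.float16) for _ in range(3))

flex = torch.compile(flex_attention)   # compiling fuses score_mod into the kernel
out = flex(q, k, v, score_mod=causal)  # (B, H, N, D)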
Reuben Stern retweeted
Reuben Stern @ReubenConducts
@drisspg @gaunernst It's pretty close, but FA-4 is in particular missing some of the inference optimizations that FA-3 has, such as CUDA-graphability via dynamic scheduling metadata. Coming soon, though!
0 replies · 0 reposts · 0 likes · 61 views
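For context on "CUDA-graphability": at inference, capturing the attention call into a CUDA graph removes per-step launch overhead. Below is a generic capture-and-replay sketch with static shapes using standard torch.cuda.graph; it illustrates the payoff, not FA-3's internal scheduling-metadata mechanism.

import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 1, 64, device="cuda", dtype=torch.float16)    # single decode query
k = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
v = torch.randn_like(k)

F.scaled_dot_product_attention(q, k, v)   # warm-up before capture

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                 # record the kernel launches once
    out = F.scaled_dot_product_attention(q, k, v)

q.copy_(torch.randn_like(q))   # update inputs in place (shapes stay static)...
g.replay()                     # ...then replay without launch overhead; result lands in `out`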
driss guessous @drisspg
@gaunernst FWIW it's already there, but that being said, I would argue they are at different levels of code maturity / feature coverage.
1 reply · 0 reposts · 4 likes · 243 views
driss guessous @drisspg
While all the hype has been around FA4, there are still a lot of users of FA3, and hopefully this will help make that easier :) And similarly, you can use this with SDPA: github.com/pytorch/pytorc…
angel @liangel02

In collaboration with Tri Dao and xformers, PyTorch has uploaded official Flash Attention 3 wheels to download.pytorch.org/whl/flash-attn…. These wheels support various CUDA versions (12.6+, 13), CPUs (x86, ARM) and OS (Linux, Windows), and are compatible w/ Python 3.10+, torch 2.9+.

1 reply · 1 repost · 34 likes · 3.6K views
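For reference, pinning SDPA to the flash backend uses the existing torch.nn.attention context manager; whether dispatch actually reaches the new FA3 wheels depends on your install and hardware, so treat this as a sketch:

import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

B, H, N, D = 2, 8, 2048, 64
q, k, v = (torch.randn(B, H, N, D, device="cuda", dtype=torch.bfloat16) for _ in range(3))

# Restrict SDPA dispatch to the flash backend for this block.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)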
Reuben Stern retweeted
Belinda Li @belindazli
New blog post on introspection for interpretability, and why I think training models to self-explain is a promising frontier for interpretability research:
8 replies · 37 reposts · 242 likes · 21.6K views
Reuben Stern retweeted
Hieu Pham @hyhieu226
Friends at @colfaxintl released excellent work on the mathematical foundation of CuTe layouts. CuTe layouts are central to the modern programming models on NVIDIA GPUs. You can (almost) ditch C++, but you cannot ditch CuTe. In fact, you can (almost) ditch C++ because of a thing called the Python CuTe DSL. And to use that, you must know CuTe layouts. Despite their central role, CuTe layouts are highly unintuitive. I think you can only make sense of CuTe layouts with some mathematical guarantees about their peculiar behaviors. Colfax friends provided that 👇 It's amazing that the best things are free.
2 replies · 15 reposts · 169 likes · 15.6K views
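For the unfamiliar, the object in question is small enough to state here. This is the standard shape/stride definition of a layout as a function on indices, sketched for orientation; the categorical machinery in the Colfax work builds on it:

\[
  L = S{:}D, \quad S = (s_0,\dots,s_{k-1}),\ D = (d_0,\dots,d_{k-1}),
  \qquad
  L(x_0,\dots,x_{k-1}) = \sum_{i=0}^{k-1} x_i\, d_i, \quad 0 \le x_i < s_i,
\]
where a flat index $n$ is first unpacked colexicographically,
$x_i = \left\lfloor n \big/ \prod_{j<i} s_j \right\rfloor \bmod s_i$.
For example, the layout $(4,2){:}(2,1)$ sends $n = 0,1,\dots,7$ to $0,2,4,6,1,3,5,7$.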