r0

536 posts

@romitjain_

0049 I like organizing matrices

Joined February 2010
470 Following · 201 Followers

r0 @romitjain_:
Wrote down some learnings from how torch.compile is integrated into @vllm_project: r0m1t.com/torch-compile-… Give it a read if you're interested in what makes vLLM a highly efficient inference engine

r0 retweeted
finbarr @finbarrtimbers:
Working in ML starts out as a math problem and very rapidly becomes a distributed systems problem

Ishaan @auto_grad_:
well it's still an issue. even if we take batch 1, token 1 and pass it, one expert will receive (1, h_dim) and the rest would receive 0 tokens. now, which expert gets the token changes with the data. even if you make the shape static, we still have the data-dependent flow, so essentially we are back to square one
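
A minimal sketch of the shape problem described above (the sizes and names here are made up for illustration; this is not any particular MoE implementation). Even with a fixed input shape, the per-expert token count depends on the input values and is only known at runtime:

```python
import torch

h_dim, n_experts = 16, 4
router = torch.nn.Linear(h_dim, n_experts)

def dispatch(x):  # x: (n_tokens, h_dim)
    expert_idx = router(x).argmax(dim=-1)      # data-dependent routing decision
    # the slice for expert e has shape (k_e, h_dim); k_e is known only at runtime
    return [x[expert_idx == e] for e in range(n_experts)]

x = torch.randn(1, h_dim)                      # batch 1, token 1
print([t.shape[0] for t in dispatch(x)])       # e.g. [0, 1, 0, 0]: which expert
                                               # gets the token changes with the data
```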

Ishaan @auto_grad_:
since this has popped up, let me make you guys aware of why compiling MoE models is a nightmare task:

1. torch Dynamo (which creates the forward fx graph): this hits data-dependent flow. see, torch.compile excels when we have static shapes, and MoEs are the complete opposite of static. when creating the forward graph from the python bytecode, dynamo faces a problem because each expert gets a different number of tokens (assigned by the router), which is dependent on the data. now we have two ways to handle this: (a) treat the output shape as dynamic, and the problem here is that each expert has a different symbolic shape (which dynamo uses), so compile time scales poorly; (b) graph break, i.e. if dynamo can't handle something, it breaks the graph into sections, and an MoE with N experts causes a lot of graph breaks.

2. AOTAutograd (the backward pass and graph): aotautograd does functionalization, which essentially means rewriting in-place operations into pure functional ones. in both the forward and backward pass of an MoE, the compiler has to prove that the writes don't conflict, track aliasing carefully, and rewrite the whole thing into a series of index_put / scatter_add ops on fresh tensors (the graphs fucking explode).

3. Inductor: when inductor receives this enormous graph full of dynamic shapes, it fuses elementwise ops with their producers, but it's conservative around ops with data-dependent shapes. the scatter/gather operations act as fusion barriers, which means inductor cannot fuse across them because it can't reason about which output element corresponds to which input element until runtime.

4. fucking recompilations: these will necessarily be the end of me. if you get past the above stages (if you have that much patience), then MoE models tend to recompile repeatedly during training because the actual token-to-expert distribution changes batch by batch.

sorry for the rant though
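
A hedged sketch of the Dynamo point in the rant: a toy per-expert loop with data-dependent control flow and in-place writes, run through torch._dynamo.explain to count graph breaks. This assumes a recent PyTorch (2.1+) where torch._dynamo.explain(fn)(*args) returns a report with a graph_break_count field; the toy model is illustrative, not a real MoE.

```python
import torch

h_dim, n_experts = 16, 4
router = torch.nn.Linear(h_dim, n_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(h_dim, h_dim) for _ in range(n_experts))

def moe_forward(x):
    out = torch.zeros_like(x)
    expert_idx = router(x).argmax(dim=-1)
    for e, expert in enumerate(experts):
        mask = expert_idx == e
        if mask.any():                   # data-dependent branch: Dynamo cannot trace
            out[mask] = expert(x[mask])  # through it; also an in-place write that
    return out                           # AOTAutograd must functionalize

report = torch._dynamo.explain(moe_forward)(torch.randn(8, h_dim))
print(report.graph_break_count)          # several breaks even for this 4-expert toy
```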

Quoting maharshi @maharshii:

Recently, I have been diving deeper into torch.compile internals, especially for inference-related graph optimizations with custom kernels, and below are my findings/learnings. Note that this is still a very high-level overview with lots of moving parts hidden behind the scenes.

From what I understand, the entire process of torch.compile can be broken down into 5 stages:

1) Torch Dynamo: responsible for tracing the python bytecode and returning an fx GraphModule (gm). We can call it a "functional graph module".
2) Pre-grad: this stage runs fx passes (both built-in and custom) on the gm before passing it to the AOT autograd stage.
3) AOT autograd: this stage decomposes the unstable IR from the dynamo + pre-grad stages into ATen operations. ATen is what pytorch uses behind the scenes: ops like aten.addmm, aten.mul, and so on. It also builds and runs fx passes on the "joint" forward+backward graph if required (not necessary for inference only).
4) Post-grad: this stage applies fx passes on the partitioned forward and backward graphs that come from the AOT autograd stage.
5) Inductor codegen: this stage is still kind of a black box for me, but I think it fuses ops in the graph, does autotuning, code generation, and so on.

Within all these stages, we can ask the torch.compile backend/inductor to apply our own fx passes using the "pre_grad_custom_pass", "joint_custom_pre_pass", "joint_custom_post_pass", "post_grad_custom_pre_pass", and "post_grad_custom_post_pass" hooks present in inductor's config. Here, we can edit the graph nodes (add, remove, update) to get custom fusions that inductor may not do for us. Try thinking of an example as an exercise :)

From a practical standpoint, if we wanted to have our own fx passes related to inference, the post-grad pre passes (after AOT autograd, and before Inductor decomposition) are the best place to do it. At this point, we still have higher-level ATen ops intact in the graph as nodes. For example, aten.addmm is not yet decomposed into aten.mm + aten.add here.

To apply custom fx passes, we have two options:
> Pattern replacement: Inductor provides a nice helper that lets us define a pattern function and a replacement function using aten ops or custom torch ops. One caveat is that the pattern must be robust; with even a small change, inductor won't replace it.
> Node surgery: we can search and edit the nodes in the graph directly. This is much more robust, but it can get pretty hard for complex patterns.

This is not documented much, but by leveraging what torch.compile/inductor provides, one can do as much graph optimization as they want. Another thing inductor lets us do is register custom lowerings for built-in aten ops, but that is a whole other topic to discuss, maybe next time.
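
A hedged sketch of the "node surgery" route through the post-grad hook named in the post. torch._inductor.config.post_grad_custom_pre_pass exists in recent PyTorch as a callable taking an fx.Graph, but it is not a stable public API; the pass below only inspects nodes, where a real pass would rewrite them.

```python
import torch
import torch._inductor.config as inductor_config

def post_grad_pre_pass(graph: torch.fx.Graph) -> None:
    # runs after AOT autograd and before inductor lowering, so aten.addmm is
    # still a single node here, not yet decomposed into aten.mm + aten.add
    for node in graph.nodes:
        if node.op == "call_function":
            print(node.target)  # e.g. aten.addmm.default

inductor_config.post_grad_custom_pre_pass = post_grad_pre_pass

@torch.compile
def f(x, w, b):
    return torch.addmm(b, x, w)

f(torch.randn(4, 8), torch.randn(8, 8), torch.randn(8))
```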


r0 @romitjain_:
@Standard_Kernel My dumb guess is that the optimizations that this system was able to surface are the ones that are already explored but not emitted with the current configs of the DSL. Is that possible?

r0 @romitjain_:
@Standard_Kernel One thing you also note in the blog is that you may not be running with the best configurations of each DSL. Do you think that, if each DSL were actually run with its best configs, this system might not be able to find any new optimizations?

Standard Kernel Co. @Standard_Kernel:
We built a system combining program analysis and LLMs to transform and optimize PTX. By operating at this shared layer across DSLs (e.g. Triton, TileLang, ThunderKittens, CUTLASS), our system learns the best ideas from each and generates kernels that outperform all of them (1/5)

Jino Rohit @jino_rohit:
i'm making a decision to switch to blackwell instead of hopper since the 5090s are more affordable. i was learning WGMMA and renting an h100 was getting too expensive :( what are some affordable options to rent among @vast_ai @modal etc

r0 retweeted
kaio ken @kaiokendev1:
too many software engineers who do not actually enjoy solving business problems with code; they just want to get paid to solve programming puzzles

r0 @romitjain_:
@vikhyatk Sending this to claude code. It needs to hurry tf up

vik @vikhyatk:
the standard pace is for chumps

Thien Tran @gaunernst:
@romitjain_ @elliotarledge I would say it's the opposite from my experience 😂. I couldn't get good perf from TMA and warp specialization in triton with a non-standard kernel (a lot of the time slower than non-TMA). But CUDA JustWorks ™️ (it does take more effort)

Thien Tran @gaunernst:
Went back to the "megakernel" mini-project yesterday. Reimplemented MLP gemv in CUDA C++ and it blows Triton out of the water on H200 (or Triton on H200 is just bad for this kernel 🤔. Or I'm just bad with Triton 🤡)
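
For context, a plain Triton gemv baseline of the kind such a CUDA kernel gets compared against. This is a generic sketch with illustrative block sizes, not the kernel in the screenshot; it assumes float32 inputs on a GPU.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemv_kernel(W, X, Y, M, N, stride_m,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # each program computes BLOCK_M entries of y = W @ x
    rows = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
    acc = tl.zeros((BLOCK_M,), dtype=tl.float32)
    for n0 in range(0, N, BLOCK_N):
        cols = n0 + tl.arange(0, BLOCK_N)
        w = tl.load(W + rows[:, None] * stride_m + cols[None, :],
                    mask=(rows[:, None] < M) & (cols[None, :] < N), other=0.0)
        x = tl.load(X + cols, mask=cols < N, other=0.0)
        acc += tl.sum(w * x[None, :], axis=1)
    tl.store(Y + rows, acc, mask=rows < M)

def gemv(W: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    M, N = W.shape
    y = torch.empty(M, device=W.device, dtype=torch.float32)
    gemv_kernel[(triton.cdiv(M, 64),)](W, x, y, M, N, W.stride(0),
                                       BLOCK_M=64, BLOCK_N=128)
    return y
```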

r0 @romitjain_:
@gaunernst @elliotarledge Will add another thought: getting good perf for any kernel that uses hardware-specific tricks (TMA, warp specialization, etc.) is hard in CUDA vs in Triton. Triton does better lowering for these hardware features, which is tough to do yourself in CUDA for a novice like me

Thien Tran @gaunernst:
@elliotarledge Reimplementing in CUDA C++ might not always be faster btw LOL. And debugging is harder. So having a triton baseline is always good. That's why i took so much time on this compared to everyone else jumping to CUDA C++ directly. Not necessarily better, just my style...

r0 @romitjain_:
@abhi9u Your articles are a delight to read. I spend a lot of time reading and taking notes.

Abhinav Upadhyay @abhi9u:
It's been a while since I published a new article. I will try to compensate for that. I have an almost encyclopedic article on virtual memory, currently in the editing stage. It's around 12,000 words and covers everything that you should know about virtual memory (without going inside the kernel internals). Despite the length, it is very digestible and an easy read. It should be out by the end of the coming week.

r0 @romitjain_:
@vikhyatk Waiting for you to drop “why”

r0 @romitjain_:
@gaunernst @vikhyatk This brings back the days when I was naive enough to consider buying consumer blackwell, thinking it would help me with datacenter blackwell

Thien Tran @gaunernst:
@vikhyatk Yea, should have been different names. Trying to confuse ppl lol

vik @vikhyatk:
consumer "blackwell" doesn't have tcgen05?! why the fuck are they calling it blackwell

Harveen Singh Chadha @HarveenChadha:
Disappointed that the article says nothing about the OCR validation part. OCR-ing 27k arxiv papers with a VLM will inevitably introduce repeated token errors and hallucinations. At scale, the quality check of OCR output is a bigger challenge than the OCR itself.

Quoting clem 🤗 @ClementDelangue:

We just OCR'd 27,000 arxiv papers into Markdown using an open 5B model, 16 parallel HF Jobs on L40S GPUs, and a mounted bucket.
Total cost: $850
Total time: ~29 hours
Jobs that crashed: 0
This now powers "Chat with your paper" on hf.co/papers
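
One cheap heuristic for the repeated-token failure mode raised above: score each page by how often a word n-gram is immediately repeated, and send high scorers for review. The threshold here is made up for illustration, and this says nothing about how the HF run was actually validated.

```python
def repeated_ngram_score(text: str, n: int = 3) -> float:
    # fraction of positions where a word n-gram is immediately repeated
    words = text.split()
    if len(words) < 2 * n:
        return 0.0
    spots = range(len(words) - 2 * n + 1)
    hits = sum(words[i:i + n] == words[i + n:i + 2 * n] for i in spots)
    return hits / len(spots)

def flag_for_review(markdown: str, threshold: float = 0.05) -> bool:
    # degenerate decoder loops repeat with some short period, so check a few n
    return any(repeated_ngram_score(markdown, n) > threshold for n in (1, 2, 3))
```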


Sumanth @sumanthd17:
Took a leap this week and turned my love for baking into a small online preorder bakery. Wrapped up the first stint with plenty of lessons, especially about how difficult things get when you try to scale (should've been obvious after all the pretraining 😪). This side quest is just getting started. Ofc I started it on April 1st, just in case things go sideways 😂😂

r0 retweeted
michael.trbo @michael_trbo:
so @sakshambatraa and I are working on re-inventing groq's LPU from scratch. this last week we implemented the VXM, the LPU's arithmetic unit. here's what we learned