r0

536 posts

@romitjain_

0049 I like organizing matrices

Joined February 2010
470 Following · 201 Followers

r0 @romitjain_:
Wrote down some learnings from how torch.compile is integrated into @vllm_project: r0m1t.com/torch-compile-… Give it a read if you're interested in what makes vLLM a highly efficient inference engine

r0 retweeted
finbarr @finbarrtimbers:
Working in ML starts out as a math problem and very rapidly becomes a distributed systems problem

Ishaan @auto_grad_:
well it's still an issue. even if we take batch 1, token 1 and pass it, one expert will receive (1, h_dim) and the rest would receive 0 tokens. now, which expert gets the token changes with the data. even if you make the shape static, we still have the data-dependent flow, so essentially we are back to square one
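
A minimal sketch of the shape problem described above (the sizes and names here are made up for illustration; this is not any particular MoE implementation). Even with a fixed input shape, the per-expert token count depends on the input values and is only known at runtime:

```python
import torch

h_dim, n_experts = 16, 4
router = torch.nn.Linear(h_dim, n_experts)

def dispatch(x):  # x: (n_tokens, h_dim)
    expert_idx = router(x).argmax(dim=-1)      # data-dependent routing decision
    # the slice for expert e has shape (k_e, h_dim); k_e is known only at runtime
    return [x[expert_idx == e] for e in range(n_experts)]

x = torch.randn(1, h_dim)                      # batch 1, token 1
print([t.shape[0] for t in dispatch(x)])       # e.g. [0, 1, 0, 0]: which expert
                                               # gets the token changes with the data
```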

Ishaan @auto_grad_:
since this has popped up, let me make you guys aware of why compiling MoE models is a nightmare task:

1. torch Dynamo (which creates the forward fx graph): this hits data-dependent flow. see, torch.compile excels when we have static shapes, and MoEs are the complete opposite of static. when creating the forward graph from the python bytecode, dynamo faces a problem because each expert gets a different number of tokens (assigned by the router), which is dependent on the data. now we have two ways to handle this: (a) treat the output shape as dynamic, and the problem here is that each expert has a different symbolic shape (which dynamo uses), so compile time scales poorly; (b) graph break, i.e. if dynamo can't handle something, it breaks the graph into sections, and an MoE with N experts causes a lot of graph breaks.

2. AOTAutograd (the backward pass and graph): aotautograd does functionalization, which essentially means rewriting in-place operations into pure functional ones. in both the forward and backward pass of an MoE, the compiler has to prove that the writes don't conflict, track aliasing carefully, and rewrite the whole thing into a series of index_put / scatter_add ops on fresh tensors (the graphs fucking explode).

3. Inductor: when inductor receives this enormous graph full of dynamic shapes, it fuses elementwise ops with their producers, but it's conservative around ops with data-dependent shapes. the scatter/gather operations act as fusion barriers, which means inductor cannot fuse across them because it can't reason about which output element corresponds to which input element until runtime.

4. fucking recompilations: these will necessarily be the end of me. if you get past the above stages (if you have that much patience), then MoE models tend to recompile repeatedly during training because the actual token-to-expert distribution changes batch by batch.

sorry for the rant though
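
A hedged sketch of the Dynamo point in the rant: a toy per-expert loop with data-dependent control flow and in-place writes, run through torch._dynamo.explain to count graph breaks. This assumes a recent PyTorch (2.1+) where torch._dynamo.explain(fn)(*args) returns a report with a graph_break_count field; the toy model is illustrative, not a real MoE.

```python
import torch

h_dim, n_experts = 16, 4
router = torch.nn.Linear(h_dim, n_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(h_dim, h_dim) for _ in range(n_experts))

def moe_forward(x):
    out = torch.zeros_like(x)
    expert_idx = router(x).argmax(dim=-1)
    for e, expert in enumerate(experts):
        mask = expert_idx == e
        if mask.any():                   # data-dependent branch: Dynamo cannot trace
            out[mask] = expert(x[mask])  # through it; also an in-place write that
    return out                           # AOTAutograd must functionalize

report = torch._dynamo.explain(moe_forward)(torch.randn(8, h_dim))
print(report.graph_break_count)          # several breaks even for this 4-expert toy
```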

Quoting maharshi @maharshii:

Recently, I have been diving deeper into torch.compile internals, especially for inference-related graph optimizations with custom kernels, and below are my findings/learnings. Note that this is still a very high-level overview with lots of moving parts hidden behind the scenes.

From what I understand, the entire process of torch.compile can be broken down into 5 stages:

1) Torch Dynamo: responsible for tracing the python bytecode and returning an fx GraphModule (gm). We can call it a "functional graph module".
2) Pre-grad: this stage runs fx passes (both built-in and custom) on the gm before passing it to the AOT autograd stage.
3) AOT autograd: this stage decomposes the unstable IR from the dynamo + pre-grad stages into ATen operations. ATen is what pytorch uses behind the scenes: ops like aten.addmm, aten.mul, and so on. It also builds and runs fx passes on the "joint" forward+backward graph if required (not necessary for inference only).
4) Post-grad: this stage applies fx passes on the partitioned forward and backward graphs that come from the AOT autograd stage.
5) Inductor codegen: this stage is still kind of a black box for me, but I think it fuses ops in the graph, does autotuning, code generation, and so on.

Within all these stages, we can ask the torch.compile backend/inductor to apply our own fx passes using the "pre_grad_custom_pass", "joint_custom_pre_pass", "joint_custom_post_pass", "post_grad_custom_pre_pass", and "post_grad_custom_post_pass" hooks present in inductor's config. Here, we can edit the graph nodes (add, remove, update) to get custom fusions that inductor may not do for us. Try thinking of an example as an exercise :)

From a practical standpoint, if we wanted to have our own fx passes related to inference, the post-grad pre passes (after AOT autograd, and before Inductor decomposition) are the best place to do it. At this point, we still have higher-level ATen ops intact in the graph as nodes. For example, aten.addmm is not yet decomposed into aten.mm + aten.add here.

To apply custom fx passes, we have two options:
> Pattern replacement: Inductor provides a nice helper that lets us define a pattern function and a replacement function using aten ops or custom torch ops. One caveat is that the pattern must be robust; with even a small change, inductor won't replace it.
> Node surgery: we can search and edit the nodes in the graph directly. This is much more robust, but it can get pretty hard for complex patterns.

This is not documented much, but by leveraging what torch.compile/inductor provides, one can do as much graph optimization as they want. Another thing inductor lets us do is register custom lowerings for built-in aten ops, but that is a whole other topic to discuss, maybe next time.
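
A hedged sketch of the "node surgery" route through the post-grad hook named in the post. torch._inductor.config.post_grad_custom_pre_pass exists in recent PyTorch as a callable taking an fx.Graph, but it is not a stable public API; the pass below only inspects nodes, where a real pass would rewrite them.

```python
import torch
import torch._inductor.config as inductor_config

def post_grad_pre_pass(graph: torch.fx.Graph) -> None:
    # runs after AOT autograd and before inductor lowering, so aten.addmm is
    # still a single node here, not yet decomposed into aten.mm + aten.add
    for node in graph.nodes:
        if node.op == "call_function":
            print(node.target)  # e.g. aten.addmm.default

inductor_config.post_grad_custom_pre_pass = post_grad_pre_pass

@torch.compile
def f(x, w, b):
    return torch.addmm(b, x, w)

f(torch.randn(4, 8), torch.randn(8, 8), torch.randn(8))
```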


r0 @romitjain_:
@Standard_Kernel My dumb guess is that the optimizations that this system was able to surface are the ones that are already explored but not emitted with the current configs of the DSL. Is that possible?

r0 @romitjain_:
@Standard_Kernel One thing you also note in the blog is that you may not be running with the best configurations of each DSL. Do you think that, if each DSL were actually run with its best configs, this system might not be able to find any new optimizations?

Standard Kernel Co. @Standard_Kernel:
We built a system combining program analysis and LLMs to transform and optimize PTX. By operating at this shared layer across DSLs (e.g. Triton, TileLang, ThunderKittens, CUTLASS), our system learns the best ideas from each and generates kernels that outperform all of them (1/5)

Jino Rohit @jino_rohit:
i'm making a decision to switch to blackwell instead of hopper since the 5090s are more affordable. i was learning WGMMA and renting an h100 was getting too expensive :( what are some affordable options to rent among @vast_ai @modal etc

r0 retweeted
kaio ken @kaiokendev1:
too many software engineers who do not actually enjoy solving business problems with code; they just want to get paid to solve programming puzzles

r0 @romitjain_:
@vikhyatk Sending this to claude code. It needs to hurry tf up

vik @vikhyatk:
the standard pace is for chumps

Thien Tran @gaunernst:
@romitjain_ @elliotarledge I would say it's the opposite from my experience 😂. I couldn't get good perf from TMA and warp specialization in triton with a non-standard kernel (a lot of the time slower than non-TMA). But CUDA JustWorks ™️ (it does take more effort)

Thien Tran @gaunernst:
Went back to the "megakernel" mini-project yesterday. Reimplemented MLP gemv in CUDA C++ and it blows Triton out of the water on H200 (or Triton on H200 is just bad for this kernel 🤔. Or I'm just bad with Triton 🤡)
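
For context, a plain Triton gemv baseline of the kind such a CUDA kernel gets compared against. This is a generic sketch with illustrative block sizes, not the kernel in the screenshot; it assumes float32 inputs on a GPU.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def gemv_kernel(W, X, Y, M, N, stride_m,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # each program computes BLOCK_M entries of y = W @ x
    rows = tl.program_id(0) * BLOCK_M + tl.arange(0, BLOCK_M)
    acc = tl.zeros((BLOCK_M,), dtype=tl.float32)
    for n0 in range(0, N, BLOCK_N):
        cols = n0 + tl.arange(0, BLOCK_N)
        w = tl.load(W + rows[:, None] * stride_m + cols[None, :],
                    mask=(rows[:, None] < M) & (cols[None, :] < N), other=0.0)
        x = tl.load(X + cols, mask=cols < N, other=0.0)
        acc += tl.sum(w * x[None, :], axis=1)
    tl.store(Y + rows, acc, mask=rows < M)

def gemv(W: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    M, N = W.shape
    y = torch.empty(M, device=W.device, dtype=torch.float32)
    gemv_kernel[(triton.cdiv(M, 64),)](W, x, y, M, N, W.stride(0),
                                       BLOCK_M=64, BLOCK_N=128)
    return y
```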

r0 @romitjain_:
@gaunernst @elliotarledge Will add another thought: getting good perf for any kernel that uses hardware-specific tricks (TMA, warp specialization, etc.) is hard in CUDA vs in Triton. Triton does better lowering for these hardware features, which is tough to do yourself in CUDA for a novice like me

Thien Tran @gaunernst:
@elliotarledge Reimplementing in CUDA C++ might not always be faster btw LOL. And debugging is harder. So having a triton baseline is always good. That's why i took so much time on this compared to everyone else jumping to CUDA C++ directly. Not necessarily better, just my style...

r0 @romitjain_:
@abhi9u Your articles are a delight to read. I spend a lot of time reading and taking notes.

Abhinav Upadhyay @abhi9u:
It's been a while since I published a new article. I will try to compensate for that. I have an almost encyclopedic article on virtual memory, currently in the editing stage. It's around 12,000 words and covers everything that you should know about virtual memory (without going inside the kernel internals). Despite the length, it is very digestible and an easy read. It should be out by the end of the coming week.

r0 @romitjain_:
@vikhyatk Waiting for you to drop “why”

r0 @romitjain_:
@gaunernst @vikhyatk This brings back the days when I was naive enough to consider buying consumer blackwell, thinking it would help me with datacenter blackwell

Thien Tran @gaunernst:
@vikhyatk Yea, should have been different names. Trying to confuse ppl lol

vik @vikhyatk:
consumer "blackwell" doesn't have tcgen05?! why the fuck are they calling it blackwell

Harveen Singh Chadha @HarveenChadha:
Disappointed that the article says nothing about the OCR validation part. OCR-ing 27k arxiv papers with a VLM will inevitably introduce repeated token errors and hallucinations. At scale, the quality check of OCR output is a bigger challenge than the OCR itself.

Quoting clem 🤗 @ClementDelangue:

We just OCR'd 27,000 arxiv papers into Markdown using an open 5B model, 16 parallel HF Jobs on L40S GPUs, and a mounted bucket.
Total cost: $850
Total time: ~29 hours
Jobs that crashed: 0
This now powers "Chat with your paper" on hf.co/papers
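
One cheap heuristic for the repeated-token failure mode raised above: score each page by how often a word n-gram is immediately repeated, and send high scorers for review. The threshold here is made up for illustration, and this says nothing about how the HF run was actually validated.

```python
def repeated_ngram_score(text: str, n: int = 3) -> float:
    # fraction of positions where a word n-gram is immediately repeated
    words = text.split()
    if len(words) < 2 * n:
        return 0.0
    spots = range(len(words) - 2 * n + 1)
    hits = sum(words[i:i + n] == words[i + n:i + 2 * n] for i in spots)
    return hits / len(spots)

def flag_for_review(markdown: str, threshold: float = 0.05) -> bool:
    # degenerate decoder loops repeat with some short period, so check a few n
    return any(repeated_ngram_score(markdown, n) > threshold for n in (1, 2, 3))
```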


Sumanth @sumanthd17:
Took a leap this week and turned my love for baking into a small online preorder bakery. Wrapped up the first stint with plenty of lessons, especially about how difficult things get when you try to scale (should've been obvious after all the pretraining 😪). This side quest is just getting started. Ofc I started it on April 1st, just in case things go sideways 😂😂

r0 retweeted
michael.trbo @michael_trbo:
so @sakshambatraa and I are working on re-inventing groq's LPU from scratch. this last week we implemented the VXM, the LPU's arithmetic unit. here's what we learned