Darshan

768 posts


@neuronfitting

21 y/o making gpu's go brrrrr

Joined December 2024
411 Following · 57 Followers
Pinned Tweet
Darshan @neuronfitting
it seems optimization in cs is just doing data transfer/manipulation on chunks of data
0 replies · 0 retweets · 0 likes · 263 views
Darshan retweeted
roon @tszzl
it has only been three years since gpt4. people have joined college and not even graduated yet with a totally different world on the other side
106 replies · 97 retweets · 2.6K likes · 131.6K views
Darshan @neuronfitting
First completely unoptimized pass is hitting 16.16 tokens/sec. Now that the library bloat is gone, it's time to fire up the GPU profilers and write some custom kernels to optimize the bottlenecks.
1 reply · 0 retweets · 0 likes · 14 views
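What "fire up the GPU profilers" plausibly looks like in practice, as a minimal torch.profiler sketch (the layer and loop below are stand-ins, not the actual model):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Stand-in workload; a real run would profile the model's decode step.
layer = torch.nn.Linear(4096, 4096, device="cuda", dtype=torch.float16)
x = torch.randn(1, 4096, device="cuda", dtype=torch.float16)

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    with torch.no_grad():
        for _ in range(16):
            x = torch.relu(layer(x))
    torch.cuda.synchronize()  # make sure all launched kernels are accounted for

# Rank kernels by GPU time: the top rows are the bottlenecks worth
# replacing with custom kernels.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```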
Darshan @neuronfitting
Hugging Face is great for prototyping, but you don't truly understand a system until you build it from first principles. Just got a bare-metal PyTorch graph of Qwen 3.5 (9B) running on a Blackwell B200. No HF abstractions, just pure math and compiled CUDA kernels.
1 reply · 0 retweets · 0 likes · 43 views
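The post doesn't include code, but a forward pass "with no HF abstractions" might look like this minimal sketch: raw weight tensors, hand-written attention math, and torch.compile emitting the CUDA kernels (all sizes and names here are hypothetical, not the actual Qwen graph):

```python
import math
import torch

D, H = 1024, 16  # hypothetical hidden size and head count

w_qkv = torch.randn(3 * D, D, device="cuda", dtype=torch.float16) * 0.02
w_out = torch.randn(D, D, device="cuda", dtype=torch.float16) * 0.02

@torch.compile  # inductor compiles this graph down to fused CUDA kernels
def attn_block(x):  # x: (seq, D)
    q, k, v = (x @ w_qkv.T).chunk(3, dim=-1)
    split = lambda t: t.view(-1, H, D // H).transpose(0, 1)  # (H, seq, hd)
    scores = split(q) @ split(k).transpose(-1, -2) / math.sqrt(D // H)
    out = torch.softmax(scores, dim=-1) @ split(v)           # (H, seq, hd)
    return out.transpose(0, 1).reshape(-1, D) @ w_out.T      # (seq, D)

print(attn_block(torch.randn(128, D, device="cuda", dtype=torch.float16)).shape)
```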
Ankit Jxa @kingofknowwhere
The more I learn about CUDA, the more I get impressed by it. ZLUDA is a drop-in CUDA replacement for non-NVIDIA GPUs. And it's written in Rust. I have no experience writing Rust. I guess it's time to get rusted.
1 reply · 0 retweets · 13 likes · 614 views
Darshan @neuronfitting
@maharshii would love to help out on the writing part if you are busy, looks like a great learning opportunity!
0 replies · 0 retweets · 0 likes · 15 views
maharshi @maharshii
Recently, I have been diving deeper into torch.compile internals, especially for inference-related graph optimizations with custom kernels. Below are my findings/learnings. Note that this is still a very high-level overview with lots of moving parts hidden behind the scenes.

From what I understand, the entire torch.compile process can be broken down into 5 stages:

1) TorchDynamo: traces the Python bytecode and returns an fx GraphModule (gm). We can call it a "functional graph module".
2) Pre-grad: runs fx passes (both built-in and custom) on the gm before handing it to the AOT autograd stage.
3) AOT autograd: decomposes the unstable IR from the dynamo + pre-grad stages into ATen operations, which is what PyTorch uses behind the scenes (ops like aten.addmm, aten.mul, and so on). It also builds and runs fx passes on the "joint" forward+backward graph if required (not necessary for inference-only).
4) Post-grad: applies fx passes on the partitioned forward and backward graphs that come out of the AOT autograd stage.
5) Inductor codegen: still somewhat of a black box to me, but I think it fuses ops in the graph, does autotuning, code generation, and so on.

Within all these stages, we can ask the torch.compile backend/inductor to apply our own fx passes via "pre_grad_custom_pass", "joint_custom_pre_pass", "joint_custom_post_pass", "post_grad_custom_pre_pass", and "post_grad_custom_post_pass" in inductor's config. There we can edit the graph nodes (add, remove, update) to get custom fusions that inductor may not do for us. Try thinking of an example as an exercise :)

From a practical standpoint, if we want our own inference-related fx passes, the post-grad pre passes (after AOT autograd, before inductor's decompositions) are the best place to do it. At that point the higher-level ATen ops are still intact in the graph as nodes; for example, aten.addmm has not yet been decomposed into aten.mm + aten.add.

To apply custom fx passes, we have two options:
> Pattern replacement: inductor provides a nice helper that lets us define a pattern function and a replacement function using aten ops or custom torch ops. One caveat is that the match is exact; even a small change to the pattern and inductor won't replace it.
> Node surgery: we can search and edit the nodes in the graph directly. This is much more robust, but it can get pretty hard for complex patterns.

None of this is documented much, but by leveraging what torch.compile/inductor provides, you can do as much graph optimization as you want. Inductor also lets you register custom lowerings for built-in aten ops, but that is a whole other topic; maybe next time.
[attached media]
4 replies · 15 retweets · 225 likes · 14.8K views
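A minimal sketch of the "node surgery" route through the inductor config hooks named above (assumes PyTorch 2.x; these hooks live in torch._inductor.config, though exact behavior varies by version, and newer releases may want the pass wrapped in a CustomGraphPass for fx-graph-cache support):

```python
import torch
import torch._inductor.config as inductor_config

def find_addmm(graph: torch.fx.Graph) -> None:
    # Runs after AOT autograd and before inductor's own decompositions,
    # so higher-level ATen ops like aten.addmm are still intact nodes.
    for node in graph.nodes:
        if node.op == "call_function" and node.target == torch.ops.aten.addmm.default:
            print("candidate for custom fusion:", node.format_node())
            # node surgery would happen here: add/remove/update nodes
    graph.lint()  # sanity-check the graph after any edits

inductor_config.post_grad_custom_pre_pass = find_addmm

@torch.compile
def f(x, w, b):
    return torch.addmm(b, x, w)

f(torch.randn(4, 8), torch.randn(8, 8), torch.randn(8))
```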
Jino Rohit @jino_rohit
cuda, triton, cutlass, cute, tilelang, thunderkittens, mojo, helion. so which one do you even learn at this point?
45 replies · 4 retweets · 255 likes · 17.6K views
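For a taste of one entry on that list, the classic first Triton kernel, a masked vector add (illustrative only):

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    # Each program instance handles one BLOCK-sized chunk of the vectors.
    offs = tl.program_id(0) * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n  # guard the ragged tail
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
add_kernel[(triton.cdiv(x.numel(), 1024),)](x, y, out, x.numel(), BLOCK=1024)
assert torch.allclose(out, x + y)
```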
Darshan retweeted
Sam Altman @sama
"post-AGI, no one is going to work and the economy is going to collapse" "i am switching to polyphasic sleep because GPT-5.5 in codex is so good that i can't afford to be sleeping for such long stretches and miss out on working"
1.2K replies · 606 retweets · 11.2K likes · 1.6M views
Darshan retweeted
maharshi @maharshii
hiding behind jargon is easy when explaining something but the real skill is stripping it all away so anyone can understand. the people who can do this earn my deepest respect.
10 replies · 5 retweets · 277 likes · 4.2K views
Elliot Arledge @elliotarledge
gpt 5.5 is here!
[attached media]
2 replies · 2 retweets · 38 likes · 1.5K views
Darshan retweeted
Paras Chopra @paraschopra
AI bois be like:
[attached media]
124 replies · 549 retweets · 7.5K likes · 294.4K views
Darshan @neuronfitting
medium.com/@dcbaslani/my-2-cents-on-doing-hard-things-9af575ae867b
0 replies · 0 retweets · 0 likes · 13 views
Darshan @neuronfitting
For 4 years, I thought doing hard things was the ultimate goal. It builds your ego and your skills. But in the real world, intellectual complexity ≠ business value. I wrote about the painful realization of needing to swallow my pride and chase the low-hanging fruit. Link below
1 reply · 0 retweets · 1 like · 22 views
Darshan retweeted
roon @tszzl
say it with me now. experts are fake, smart generalists rule the world, everything is designed by people no smarter than you, and courage is in shorter supply than genius
116 replies · 1.4K retweets · 10.2K likes
Darshan retweeted
Darshan @neuronfitting
Global aphasia is a condition where people lose almost all ability to use or understand language due to major damage in their brain's language network. Even without language, many of these patients can still solve complex reasoning problems, do math, understand cause and effect, and plan actions.
0 replies · 1 retweet · 2 likes · 60 views
Darshan retweeted
simp 4 satoshi @iamgingertrash
Consider a 5D observer looking at you, a 4D being. It would see the shape of your lifetime all at once: everywhere in space you were, are, and will be, in one grand arc. They could mold this shape, like you, a 4D being, could reshape a stick figure on paper (2D+T=3D).
[attached media]
79 replies · 37 retweets · 705 likes · 35.5K views
tender @tenderizzation
when hiding communication latency goes wrong (sound on)
5 replies · 2 retweets · 82 likes · 6.2K views
Darshan @neuronfitting
the beautiful, beautiful art of profiling kernels
[attached media]
0 replies · 0 retweets · 2 likes · 40 views