Sayak Paul

6.7K posts

@RisingSayak

ML at Hugging Face 🤗

Earth · Joined May 2012

125 Following · 23.3K Followers
Sayak Paul@RisingSayak·
The Claude for OSS gift has been incredibly rewarding so far. Our team has used it for:

* Integrating new models into Diffusers (with model integration and parity skills)
* Reproducible workflows to catch and fix issues pertaining to CPU overhead and sync issues (this requires extensive profiling)
* Better tooling for convenience across the library (docs, tests, etc.)

Upcoming is a pipeline-optimization skill that is hardware-aware (respects available VRAM and RAM) and uses the foundational optimization blocks provided by Diffusers and other libs.

As you can (more than) imagine, all of this requires quite a bit of prior experience to keep Claude steered and the costs under control (yes, we want to be mindful of the capacity). @claudeai, we're immensely grateful!
Sayak Paul@RisingSayak·
@twlvone @ariG23498 @PyTorch But you still need to understand the GPU programming model to code in Triton. You still define blocks and threads, reason about how warps come into play, etc.
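A minimal Triton kernel makes that point concrete: you pick a block size, compute per-program offsets yourself, and mask the ragged tail. The sketch below is a standard vector add, not code from this thread; it assumes the `triton` package and a CUDA GPU, and is guarded so it simply skips without them.

```python
# Minimal Triton vector add. Assumes `torch` and `triton` are installed and
# a CUDA GPU is present; otherwise the sketch is skipped.
try:
    import torch
    import triton
    import triton.language as tl
    HAVE_TRITON = torch.cuda.is_available()
except ImportError:
    HAVE_TRITON = False

if HAVE_TRITON:
    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)                # which "block" this program is
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements                # guard the ragged tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 256),)                  # one program per block of 256
    add_kernel[grid](x, y, out, n, BLOCK=256)
    assert torch.allclose(out, x + y)
```

Even in this "Python" kernel, the grid/block decomposition and masking are exactly the GPU programming model concepts you'd otherwise meet in CUDA C.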
Twlvone@twlvone·
@ariG23498 @PyTorch most users never see this layer but it's where real acceleration lives -- Triton was huge because it lets you write custom kernels in Python without dropping to CUDA C. the gap between a naive and a fused, bandwidth-optimal kernel is often 10-100x throughput difference
Sayak Paul retweeted
Aritra 🤗@ariG23498·
When you run a @PyTorch model on a GPU, the actual work is executed through kernels. These are low-level, hardware-specific functions designed for GPUs (or other accelerators). If you profile a model, you'll see a sequence of kernel launches. Between these launches, the GPU can sit idle, waiting for the next operation. A key optimization goal is therefore to minimize gaps between kernel executions and keep the GPU fully utilized.

One common approach is `torch.compile`, which fuses multiple operations into fewer kernels, reducing overhead and improving utilization. Another approach is to write custom kernels tailored to specific workloads (e.g., optimized attention or fused ops). However, this comes with significant challenges:

> requires deep expertise in kernel writing
> installation hell
> integration with the model is non-trivial

To address this, @huggingface introduces the `kernels` library. With it, one can:

> build custom kernels (with the help of a template)
> upload them to the Hub (like models or datasets)
> integrate them into models with ease

Let's take a look at how the transformers team uses the kernels library to integrate kernels into already existing models. (more in the thread)
Sayak Paul@RisingSayak·
That's the tweet. Find the PR, test the code, and enjoy 🧨
Sayak Paul tweet media
interplato@interplato·
@RisingSayak Hey Sayak - where can I find some way to establish equivalence between paid and open source models? I’d like to swap out paid models for open ones
Sayak Paul@RisingSayak·
If you want to grow the open-source community in your region, you might want to apply! As long as the focus is on open models and open tooling, we're happy to consider positions! Check out more at huggingface2.notion.site/Hugging-Face-B…
Sayak Paul tweet media
Sayak Paul@RisingSayak·
@real_redp @ariG23498 @PyTorch Valid point but you need the kernel first and then worry about its launch overhead, no? With "reduce-overhead" or "max-autotune" from `torch.compile`, this should be easy.
Sayak Paul tweet media
red plait@real_redp·
@ariG23498 @PyTorch I don't know if your fashionable torch supports it, but the CUDA graph API can reduce the delays of a repeating sequence of kernels
Sayak Paul@RisingSayak·
@miguelamda @ariG23498 @PyTorch Oh that's fire! Would love to know if we can work together to package and ship the kernels you have written so that they can be usable directly through `get_kernel()` from the kernels lib.
Miguel Á. Martínez-del-Amor
@ariG23498 @PyTorch Cool stuff! I'm on the other side, I have a lot of experience writing kernels, but integrating them in pytorch is a mess (e.g. input data type? Debugging both Python and CUDA C++ is a nightmare....)
Ilyas Moutawwakil@IlysMoutawwakil·
@ariG23498 @PyTorch Gotta love the kernels library! I was able to quickly create and productionize my fp8 MoE kernels and integrate them into the transformers lib with no friction! The PR removed more LoC than it added btw haha
Ilyas Moutawwakil tweet media
Paras Madan@ParasMadan9·
@RisingSayak Would love to do that in Delhi. Already have a community of 250k+ AI builders to distribute to.
Sayak Paul@RisingSayak·
@ariG23498 @PyTorch True. However, the magic of `get_kernel()` is worth pointing out. Behind the scenes, it's resolving the build variant appropriate for your environment, downloading it, and then preparing it as a local module ready to be used and cause brrr.
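A sketch of that flow, based on the `kernels` library's own example (`kernels-community/activation` is its demo repo on the Hub; treat the exact function name as illustrative). It's wrapped in a function here because actually running it needs `pip install kernels`, network access, and a CUDA GPU.

```python
def run_hub_gelu():
    # Requires `pip install kernels`, network access, and a CUDA GPU.
    import torch
    from kernels import get_kernel

    # get_kernel resolves the build variant matching this environment
    # (torch/CUDA versions), downloads it from the Hub, and loads it as a
    # ready-to-use local Python module -- no local compilation step.
    activation = get_kernel("kernels-community/activation")

    x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)
    activation.gelu_fast(y, x)  # fused GELU written into y
    return y
```

The notable design choice is that prebuilt binaries live on the Hub like model weights, which is what sidesteps the "installation hell" of compiling kernels locally.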
Sayak Paul@RisingSayak·
@himanshustwts GAN was coded in a night, and by a drunk Ian Goodfellow at that.
Sayak Paul@RisingSayak·
@tomcocobrico @claudeai We can also look into adding support for CuteDSL in `kernels`. If you could open an issue on our repo, that would be golden!
Sayak Paul@RisingSayak·
Many things that seemed like eternities away feel possible with the advent of things like @claudeai! For example, I never had the courage to do any kernel work, let alone look into its source code. But I can now at least take specific help from Claude and be effective even when studying things. Of course, you cannot start from an empty slate here -- you still need some basic knowledge about computer architecture, processors, GPU programming model, etc. But even that can be structured these days. I am still learning to best use these powerful tools and I am absolutely enjoying the process.