Sayak Paul

6.7K posts

@RisingSayak

ML at Hugging Face 🤗

Earth · Joined May 2012

125 Following · 23.3K Followers
Sayak Paul@RisingSayak·
The Claude for OSS gift has been incredibly rewarding so far. Our team has used it for:

* Integrating new models into Diffusers (with model integration and parity skills)
* Reproducible workflows to catch and fix issues pertaining to CPU overhead and sync issues (this requires extensive profiling)
* Better tooling for convenience across the library (docs, tests, etc.)

Upcoming is a pipeline-optimization skill that is hardware-aware (respects available VRAM and RAM) and uses the foundational optimization blocks provided by Diffusers and other libs.

As you can (more than) imagine, all of this requires quite a bit of prior experience to keep Claude steered and the costs under control (yes, we want to be mindful of the capacity). @claudeai, we're immensely grateful!
Sayak Paul@RisingSayak·
@twlvone @ariG23498 @PyTorch But you still need to understand the GPU programming model to code in Triton. You still define blocks and threads, reason about how warps come into play, etc.
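A minimal Triton kernel makes that point concrete: you pick a block size, compute per-program offsets yourself, and mask the ragged tail. The sketch below is a standard vector add, not code from this thread; it assumes the `triton` package and a CUDA GPU, and is guarded so it simply skips without them.

```python
# Minimal Triton vector add. Assumes `torch` and `triton` are installed and
# a CUDA GPU is present; otherwise the sketch is skipped.
try:
    import torch
    import triton
    import triton.language as tl
    HAVE_TRITON = torch.cuda.is_available()
except ImportError:
    HAVE_TRITON = False

if HAVE_TRITON:
    @triton.jit
    def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
        pid = tl.program_id(axis=0)                # which "block" this program is
        offsets = pid * BLOCK + tl.arange(0, BLOCK)
        mask = offsets < n_elements                # guard the ragged tail
        x = tl.load(x_ptr + offsets, mask=mask)
        y = tl.load(y_ptr + offsets, mask=mask)
        tl.store(out_ptr + offsets, x + y, mask=mask)

    n = 1024
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    out = torch.empty_like(x)
    grid = (triton.cdiv(n, 256),)                  # one program per block of 256
    add_kernel[grid](x, y, out, n, BLOCK=256)
    assert torch.allclose(out, x + y)
```

Even in this "Python" kernel, the grid/block decomposition and masking are exactly the GPU programming model concepts you'd otherwise meet in CUDA C.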
Twlvone@twlvone·
@ariG23498 @PyTorch most users never see this layer but it's where real acceleration lives -- Triton was huge because it lets you write custom kernels in Python without dropping to CUDA C. the gap between a naive and a fused, bandwidth-optimal kernel is often 10-100x throughput difference
Sayak Paul retweeted
Aritra 🤗@ariG23498·
When you run a @PyTorch model on a GPU, the actual work is executed through kernels. These are low-level, hardware-specific functions designed for GPUs (or other accelerators). If you profile a model, you'll see a sequence of kernel launches. Between these launches, the GPU can sit idle, waiting for the next operation. A key optimization goal is therefore to minimize gaps between kernel executions and keep the GPU fully utilized.

One common approach is `torch.compile`, which fuses multiple operations into fewer kernels, reducing overhead and improving utilization. Another approach is to write custom kernels tailored to specific workloads (e.g., optimized attention or fused ops). However, this comes with significant challenges:

> requires deep expertise in kernel writing
> installation hell
> integration with the model is non-trivial

To address this, @huggingface introduces the `kernels` library. With it, one can:

> build custom kernels (with the help of a template)
> upload them to the Hub (like models or datasets)
> integrate them into models with ease

Let's take a look at how the transformers team uses the kernels library to integrate kernels into already existing models. (more in the thread)
Sayak Paul@RisingSayak·
That's the tweet. Find the PR, test the code, and enjoy 🧨
Sayak Paul tweet media
interplato@interplato·
@RisingSayak Hey Sayak - where can I find some way to establish equivalence between paid and open source models? I’d like to swap out paid models for open ones
Sayak Paul@RisingSayak·
If you want to grow the open-source community in your region, you might want to apply! As long as the focus is on open models and open tooling, we're happy to consider positions! Check out more at huggingface2.notion.site/Hugging-Face-B…
Sayak Paul tweet media
Sayak Paul@RisingSayak·
@real_redp @ariG23498 @PyTorch Valid point but you need the kernel first and then worry about its launch overhead, no? With "reduce-overhead" or "max-autotune" from `torch.compile`, this should be easy.
Sayak Paul tweet media
red plait@real_redp·
@ariG23498 @PyTorch I don't know if your fashionable torch supports it, but the CUDA graph API can reduce the delays of a repeating sequence of kernels
Sayak Paul@RisingSayak·
@miguelamda @ariG23498 @PyTorch Oh that's fire! Would love to know if we can work together to package and ship the kernels you have written so that they can be usable directly through `get_kernel()` from the kernels lib.
Miguel Á. Martínez-del-Amor
@ariG23498 @PyTorch Cool stuff! I'm on the other side, I have a lot of experience writing kernels, but integrating them in pytorch is a mess (e.g. input data type? Debugging both Python and CUDA C++ is a nightmare....)
Ilyas Moutawwakil@IlysMoutawwakil·
@ariG23498 @PyTorch Gotta love the kernels library! I was able to quickly create and productionize my fp8 MoE kernels and integrate them into the transformers lib with no friction! The PR removed more LoC than it added btw haha
Ilyas Moutawwakil tweet media
Paras Madan@ParasMadan9·
@RisingSayak Would love to do that in Delhi. Already have a community of 250k+ AI builders to distribute to.
Sayak Paul@RisingSayak·
@ariG23498 @PyTorch True. However, the magic of `get_kernel()` is worth pointing out. Behind the scenes, it's resolving the build variant appropriate for your environment, downloading it, and then preparing it as a local module ready to be used and cause brrr.
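A sketch of that flow, based on the `kernels` library's own example (`kernels-community/activation` is its demo repo on the Hub; treat the exact function name as illustrative). It's wrapped in a function here because actually running it needs `pip install kernels`, network access, and a CUDA GPU.

```python
def run_hub_gelu():
    # Requires `pip install kernels`, network access, and a CUDA GPU.
    import torch
    from kernels import get_kernel

    # get_kernel resolves the build variant matching this environment
    # (torch/CUDA versions), downloads it from the Hub, and loads it as a
    # ready-to-use local Python module -- no local compilation step.
    activation = get_kernel("kernels-community/activation")

    x = torch.randn(10, 10, dtype=torch.float16, device="cuda")
    y = torch.empty_like(x)
    activation.gelu_fast(y, x)  # fused GELU written into y
    return y
```

The notable design choice is that prebuilt binaries live on the Hub like model weights, which is what sidesteps the "installation hell" of compiling kernels locally.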
Sayak Paul@RisingSayak·
@himanshustwts GAN was coded in a night, and by a drunk Ian Goodfellow at that.
Sayak Paul@RisingSayak·
@tomcocobrico @claudeai We can also look into adding support for CuteDSL in `kernels`. If you could open an issue on our repo, that would be golden!
Sayak Paul@RisingSayak·
Many things that seemed like eternities away feel possible with the advent of things like @claudeai! For example, I never had the courage to do any kernel work, let alone look into its source code. But I can now at least take specific help from Claude and be effective even when studying things. Of course, you cannot start from an empty slate here -- you still need some basic knowledge about computer architecture, processors, GPU programming model, etc. But even that can be structured these days. I am still learning to best use these powerful tools and I am absolutely enjoying the process.