driss guessous

553 posts


@drisspg

bytes and nuggets @pytorch https://t.co/gWVJmW741f

Joined December 2023

267 Following · 1.6K Followers

driss guessous@drisspg·4d

@kalomaze Hmmm, open an issue with what you got; would be fun to figure out what’s going on


kalomaze@kalomaze·4d

bf16 MoE routers b like


driss guessous@drisspg·4d

Honestly, the most fun part was that the lil intra-kernel profiling was actually helpful here and led me to try different ideas after looking at pipeline dependencies


driss guessous@drisspg·4d

Also forgot to post the receipts: github.com/drisspg/transf… I was claudxing here, playing around with ideas. M=1, pure memory bound -> do we really need tcgen? Not using cublaslt autotuning, and this is against pt nightly with 13.2 on B200. I'm sure folks can do much better :)
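For readers following the "M=1, pure memory bound" point, a rough roofline-style estimate shows why a single-row GEMM tends to be limited by memory bandwidth rather than math throughput. The sketch below is illustrative only: it assumes bf16 operands (2 bytes each), hypothetical N = K = 4096 shapes that are not taken from the thread, and that each matrix moves through memory exactly once. `gemm_arithmetic_intensity` is a made-up helper name, not part of PyTorch or cuBLAS.

```python
def gemm_arithmetic_intensity(m: int, n: int, k: int, bytes_per_element: int = 2) -> float:
    """FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumption: A, B and C each cross the memory bus exactly once
    (an idealized lower bound on traffic, ignoring caches and scales).
    """
    flops = 2 * m * n * k  # one multiply and one add per (m, n, k) triple
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    # Hypothetical shapes: sweep M while holding N = K = 4096 fixed.
    for m in (1, 16, 256, 4096):
        print(m, round(gemm_arithmetic_intensity(m, 4096, 4096), 2))
```

Under these assumptions the M=1 case does roughly one FLOP per byte moved, because reading the K×N weight matrix dominates the traffic. Intensity grows with M, which is where tensor-core instructions typically start to pay off.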


driss guessous@drisspg·6d

"ohh I beat cublas by 2% by implementing Hilbert curves" The trick to beating cublas is to go where they ain't looking, and pray the next toolkit hasn't caught back up (it will)


driss guessous@drisspg·4d

@remi_or_ yup mostly comes up in diffusion


Rémi Ouazan@remi_or_·4d

@drisspg Wow, I missed this! Will definitely try it out. For bidirectional, I admit I haven’t looked into it too much, not a lot of models run on it these days… thanks for the tips!


Rémi Ouazan@remi_or_·5d

There is no point running a model with SDPA attention. Running an LLM without flash is like nerfing your own hardware. 3 years ago, sure, it was hard to set up, but not anymore. Just look at the impact of flash on throughput... and I'm not even going to talk about CB... 🚀


driss guessous@drisspg·4d

@remi_or_ Makes sense, the API for sure is missing features needed for autoregressive decoding. We did add docs.pytorch.org/docs/2.13/nn.a… which is much more aligned, supports FA3/FA4 and cudnn. For dense bidirectional attention, SDPA is often quite hard to beat, I find. Regardless, happy coding :)


Rémi Ouazan@remi_or_·4d

@drisspg It means that it’s often a better idea to use flash attention (through the package `flash_attn` or `kernels`) than `torch.sdpa`! Although it can fall back to flash, using it explicitly is better IMO


driss guessous@drisspg·4d

@Kimi_Moonshot


Kimi.ai@Kimi_Moonshot·4d


driss guessous@drisspg·5d

I would like for someone to train a model whose entire chain of thought is gibberish and yet it still scales with test-time compute

Ant Ling@AntLingAGI

🏆 High-Quality CoT Our model doesn't just get answers right, it does so beautifully 🌺 Comprehensibility: structured CoT ⚡ Token-Efficient: Solves AIME using < half the tokens of baselines 🌟 Reproducible: 100K of our traces distilled into Qwen-32B beat 800K of DeepSeek-R1


driss guessous@drisspg·5d

@giffmana Indeed, I have been quite impressed by its improvements


Lucas Beyer (bl16)@giffmana·5d

@drisspg Even the chatbot is not good daily use, I A/B test with chatgpt a lot and all that's missing is the nice auto memory from past chats and a bunch of UI features.


driss guessous@drisspg·5d

Not a shill, meta.ai kinda rips for image gen


driss guessous@drisspg·5d

@_seemethere nvm I hate it lol


eli@_seemethere·5d

@drisspg Determined not a shill.


driss guessous@drisspg·6d

@tenderizzation without


tender@tenderizzation·6d

@drisspg is this with or without cublas autotuning


driss guessous@drisspg·6d

@gaunernst perhapppsssssssss


Thien Tran@gaunernst·6d

@drisspg Is this for small M?


driss guessous@drisspg·6d

@elliotarledge yeah, for mxfp8 and nvfp4 we route scaled_mm to cublaslt


Elliot Arledge@elliotarledge·6d

@drisspg cublas and cublas-lt right?


driss guessous@drisspg·6d

@yacineMTB my partner in crime


kache@yacineMTB·6d

@drisspg i love sol


driss guessous@drisspg·6d

@GoonGarrett I think it just gives it time to ferment slowly, and build a more "complex flavor" not to sound too douchey. The one I let go for like 4 days deff tasted the most unique


Garrett Goon@GoonGarrett·6d

@drisspg What does the overnight refrigeration add? Surprised by that step


driss guessous@drisspg·Jul 12

On a more important note: drisspg.github.io/nuggets/Pizza-…


driss guessous@drisspg·Jul 13

@Birchlabs Cool I’ll take a look


Birchlabs@Birchlabs·Jul 13

@drisspg we use coreweave torch-extras container images github.com/coreweave/ml-c… it installs FA4 [cu13] from source like so: github.com/coreweave/ml-c… (#L843)


driss guessous@drisspg·Jul 11

If PyTorch added cutedsl as a required runtime dependency for our CUDA wheels, how much would this mess you up? Would >= X.y for latest X.y at time of pt release work?


driss guessous@drisspg·Jul 13

@tenderizzation Never attribute to malice that which is adequately explained by stupidity. To my thought


tender@tenderizzation·Jul 13

@drisspg it can’t be a “i’m breaking up with you next week” ahh relationship without the next week part


driss guessous@drisspg·Jul 13

It’s interesting they keep putting an end date on these extension promises. I wonder if they have a big pretrain planned and have to keep pushing it back

Claude@claudeai

We're extending Claude Fable 5 access on all paid plans, as well as keeping Claude Code’s weekly rate limits 50% higher, through July 19.

