Lotfi Slim
@EpiSlim
250 posts

AI Developer Technology Engineer @Nvidia
London, England · Joined February 2017
1.2K Following · 181 Followers
Lotfi Slim retweeted
Fleetwood @fleetwood___:
CuTe algebra is extremely elegant. I can't believe we've all been writhing around on the floor like Terence Tao to do tile indexing.
[image attached]
5 replies · 66 reposts · 620 likes · 79.3K views
Lotfi Slim retweeted
tetsuo.cpp (no slop) @tetsuo_cpp:
Oh, you're writing CUDA kernels? Everyone's on Triton now. Just kidding, we're all on Mojo. We're using cuTile. We're using ROCm. We have an in-house DSL compiler targeting the NVGPU MLIR dialect but wait, Tile IR just dropped so we're going to target that instead. Our PM is on TileLang. The team lead was on CuTe but now she's back to handwriting PTX. If you're not on Pallas, you're ngmi. Our intern is building on TT-Metalium for our Wormholes. Our CFO approved an order for some big chungus wafer-scale chips so now we're porting our kernels to CSL. Our CTO is working on a kernel-less graph compiler so we won't need to write kernels anymore. Our CEO thinks we're talking about the Linux kernel. We're building Claude for dogs.
67 replies · 179 reposts · 2.8K likes · 183.7K views
Lotfi Slim retweeted
bilal @bilaltwovec:
lots of fun quick tpu/performance engineering interview questions in here :D jax-ml.github.io/scaling-book/
4 replies · 31 reposts · 355 likes · 41.8K views
Lotfi Slim retweeted
Vijay @__tensorcore__:
🔥🚨 CUTLASS Blackwell is here 🚨🔥 The 3.8 release is loaded with support for new features of Blackwell, even an attention kernel 👀 Go check it out here: github.com/nvidia/cutlass Can't wait to see what y'all end up cooking with this over the next few months and years 💚
[image attached]
6 replies · 29 reposts · 123 likes · 12.5K views
Lotfi Slim retweeted
Omar Sanseviero @osanseviero:
BREAKING NEWS The Royal Swedish Academy of Sciences has decided to award the 2024 #NobelPrize in Literature to the Attention Is All You Need authors. Their work has made thousands cry, laugh, or rich and made GPUs go brrr
[two images attached]
31 replies · 168 reposts · 2K likes · 161.7K views
Lotfi Slim retweeted
Michael Goin @mgoin_:
Come learn today at 2pm ET why we use @NVIDIAAI CUTLASS for high-throughput inference kernels in vLLM! If you are interested in peak quantized GEMM performance on GPUs, this is the talk to attend and ask questions
Quoting Red Hat AI @RedHat_AI:
🚨 vLLM Office Hours continue on Thursday, September 5th, at 2PM ET / 11AM PT! Tyler Smith (@tms_jr), vLLM Committer & Technical Director at Neural Magic, will dive deep into using NVIDIA CUTLASS for high-performance INT8 & FP8 vLLM inference. Sign up: neuralmagic.com/community-offi…
0 replies · 5 reposts · 31 likes · 2.3K views
Lotfi Slim retweeted
Hieu Pham @hyhieu226:
pytorch.org/blog/int4-deco… I highly recommend this study to build intuitions for GPU programming. One of the difficulties with GPU programming is that it's hard to test a hypothesis about a certain optimization. Say we expect an optimization to improve the final speed, but we don't observe any improvement. We would start to wonder whether our intuition about the optimization is wrong, or whether it's a bug in our code. That is why studies like this one should be cherished. Apart from providing a good INT4 KV cache attention kernel, the study provides the improvement analysis for each of their 10 optimizations on top of the previous ones. The analysis comes with annotated screenshots from NCU, with clear pointers to where to look. I learned a lot from this study.
1 reply · 75 reposts · 446 likes · 33K views
Lotfi Slim @EpiSlim:
@XLR This is what I call a good start to the year!
0 replies · 0 reposts · 6 likes · 109 views
Lotfi Slim retweeted
Sumanth Hegde @sumanthrh:
I've done a deep dive into distributed training and efficient fine-tuning of LLMs. I get into the messy internals of DeepSpeed ZeRO and FSDP, summarize practical guidelines and highlight gotchas with multi-GPU training. sumanthrh.com/post/distribut… Do read, should be fun!
12 replies · 143 reposts · 890 likes · 210.6K views
Lotfi Slim @EpiSlim:
@koalalesque Nonetheless, for all its merits, I won't start learning CSS anytime soon 😆
0 replies · 0 reposts · 2 likes · 102 views
Lotfi Slim @EpiSlim:
Completely blown away by template programming in C++ being Turing complete.
1 reply · 0 reposts · 2 likes · 281 views
Lotfi Slim retweeted
Inflection AI @inflectionAI:
We’re proud to have helped set a new standard for Generative AI by building a cluster with 3,584 @NVIDIA H100 GPUs in the debut MLPerf benchmark. Our system completed the massive GPT-3-based training benchmark in under eleven minutes. 🚀 Read more: blogs.nvidia.com/blog/2023/06/2…
7 replies · 32 reposts · 232 likes · 75.7K views
Lotfi Slim retweeted
John Carmack @ID_AA_Carmack:
H100 GPUs are very fast! For those unfamiliar with GPU matrix multiplies, the jaggies in the graph relate to packing occupancy, and are not noise. You can’t just divide theoretical teraflops by your problem size and get accurate times.
[image attached]
23 replies · 58 reposts · 761 likes · 242.6K views