A few new CUDA hacker friends joined the effort and now llm.c is only 2X slower than PyTorch (fp32, forward pass) compared to 4 days ago, when it was at 4.2X slower 📈
The biggest improvements were:
- turn on TF32 (NVIDIA TensorFloat-32) instead of FP32 for matmuls. This is a new math mode in GPUs starting with Ampere. This is a very nice, ~free optimization that sacrifices a little bit of precision for a large increase in performance, by running the matmuls on tensor cores, while chopping off the mantissa to only 10 bits (the least significant 13 of fp32's 23 mantissa bits get lost). So the inputs, outputs and internal accumulates remain in fp32, but the multiplies are lower precision. Equivalent to PyTorch `torch.set_float32_matmul_precision('high')`
- call the cuBLASLt API instead of cuBLAS for the SGEMM (fp32 matrix multiply), as this allows you to also fuse the bias into the matmul and removes the need for a separate add_bias kernel, which caused a silly round trip to global memory for one addition.
- a more efficient attention kernel that uses 1) cooperative_groups reductions that look much cleaner and I only just learned about (they are not covered by the CUDA PMPP book...), 2) the online softmax algorithm used in flash attention, 3) fused attention scaling factor multiply, 4) "built in" autoregressive mask bounds.
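To make the TF32 point concrete, here is a minimal CPU-side sketch in C of what "chopping off the mantissa" means. The names `tf32_truncate` and `dot_tf32` are hypothetical and are not from llm.c, and plain truncation is an assumption taken from the description above; real tensor cores may round differently.

```c
#include <stdint.h>
#include <string.h>

// Emulate TF32 input precision by zeroing the 13 least significant
// mantissa bits of an fp32 value (23-bit mantissa -> 10-bit mantissa).
// Assumption: plain truncation; actual hardware rounding may differ.
float tf32_truncate(float x) {
    uint32_t bits;
    memcpy(&bits, &x, sizeof bits);  // reinterpret the float as raw bits
    bits &= 0xFFFFE000u;             // keep sign, 8 exponent bits, top 10 mantissa bits
    memcpy(&x, &bits, sizeof x);
    return x;
}

// Dot product with TF32-like multiplies: inputs are truncated before each
// multiply, while the accumulator stays in full fp32.
float dot_tf32(const float *a, const float *b, int n) {
    float acc = 0.0f;
    for (int i = 0; i < n; i++) {
        acc += tf32_truncate(a[i]) * tf32_truncate(b[i]);
    }
    return acc;
}
```

Comparing `dot_tf32` against a plain fp32 dot product on the same inputs gives a rough feel for how much precision the mode gives up.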
(big thanks to ademeure, ngc92, lancerts on GitHub for writing / helping with these kernels!)
Finally, ChatGPT created this amazing chart to illustrate our progress. 4 days ago we were 4.2X slower, today we are 2X slower. So we are going to beat PyTorch imminently 😂
Now (personally) going to focus on the backward pass, so we have the full training loop in CUDA.
#Aérien: @CorsairFr has partnered with Guadeloupean chef Jimmy Bibrac for its business-class meals on flights departing from the #Antilles
🍽️🛫 A way to showcase "Creole cuisine" (#gastronomie) and the "Antillean terroir"
➡️bit.ly/3AJ5neO
Old joke about agnostic technologists building artificial super intelligence to find out if there’s a God.
They finally finish & ask the question.
AI replies: “There is now, mfs!!”
The international tsunami preparedness exercise #CaribeWave2023 is taking place today across the Caribbean starting at 10 a.m., with the scenario being the collapse of a flank of Montagne Pelée causing a destructive wave.
But do you know what to do in the event of a tsunami?
🇫🇷 FLASH - "If the French were really angry, I would not have been re-elected a year ago," says Emmanuel #Macron. (France 2) #Macron13h #reformedesretraites