
Day 24 of llm.c: we now do multi-GPU training, in bfloat16, with flash attention, directly in ~3,000 lines of C/CUDA, and it is FAST! 🚀 We're running ~7% faster than PyTorch nightly, with no asterisks, i.e. this baseline includes all the modern & standard bells and whistles: mixed precision training, torch.compile, flash attention, and a manually padded vocab. (Previous comparisons came with asterisks like *only inference or *only fp32.) Compared to the current PyTorch stable release 2.3.0, llm.c is actually ~46% faster.

My point in these comparisons is just to say "llm.c is fast", not to cast any shade on PyTorch. It's really amazing that PyTorch trains this fast in a fully generic way, with the ability to cook up ~arbitrary neural networks and run them on a ton of platforms. I see the goals, pros, and cons of the two projects as different, even complementary. I actually started llm.c with my upcoming education videos in mind, to explain what PyTorch does for you under the hood.

How we got here over the last ~1.5 weeks - added:
✅ mixed precision training (bfloat16)
✅ many kernel optimizations, including e.g. a FusedClassifier that (unlike the current torch.compile) does not materialize the normalized logits (rough sketch below)
✅ flash attention (right now from cuDNN)
✅ a Packed128 data structure that forces the compiler to emit 128-bit load (LDG.128) and store (STG.128) instructions on the A100 (rough sketch below)

It's now also possible to train multi-GPU - added:
✅ first version of multi-GPU training with MPI+NCCL (rough sketch below)
✅ profiling of the full training run with NVIDIA Nsight Compute
✅ a PR for stage 1 of ZeRO (optimizer state sharding), merging imminently (rough sketch below)

We're still at "only" ~3,000 lines of C/CUDA. It's getting a bit less simple, but that's still quite a bit better than ~3 million. We also split off the fp32 code base into its own file, which will be pure CUDA kernels only (no cuBLAS, cuDNN, etc.), and which I think would make a really nice endpoint of a CUDA course: you start with the gpt2.c pure CPU implementation and see how fast you can make it on the GPU by the end of the course, with kernels only and no dependencies.

Our goal now is to create a reliable, clean, tested, minimal, hardened, and sufficiently optimized LLM stack that reproduces the GPT-2 miniseries across all model sizes, from 124M to 1.6B, directly in C/CUDA. A lot more detail in "State of the Union [May 3, 2024]": github.com/karpathy/llm.c…
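For the FusedClassifier, the idea is that the final cross-entropy layer never needs the full normalized probability tensor: one fused kernel can compute the per-position loss and write the gradient dlogits = softmax(logits) - one_hot(target) straight back over the logits buffer. Here is a simplified, unoptimized sketch of that idea (not the actual llm.c kernel; launch with a power-of-two block size, and the 1/(B*T) scaling of the mean loss is omitted):

```cuda
// one block per (b, t) position; overwrites logits with dlogits, so the
// normalized probabilities are never stored as a separate (B*T*V) tensor
__global__ void fused_classifier_sketch(float* logits, float* losses,
                                        const int* targets, int V) {
    int row = blockIdx.x;
    float* x = logits + (long)row * V;
    int target = targets[row];
    __shared__ float red[1024];   // assumes blockDim.x <= 1024, power of two

    // pass 1: max over the vocab
    float maxval = -INFINITY;
    for (int i = threadIdx.x; i < V; i += blockDim.x) maxval = fmaxf(maxval, x[i]);
    red[threadIdx.x] = maxval;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s) red[threadIdx.x] = fmaxf(red[threadIdx.x], red[threadIdx.x + s]);
        __syncthreads();
    }
    maxval = red[0];

    // pass 2: sum of exp(x - max)
    float sum = 0.0f;
    for (int i = threadIdx.x; i < V; i += blockDim.x) sum += expf(x[i] - maxval);
    __syncthreads();              // done reading red[0] before reusing shared memory
    red[threadIdx.x] = sum;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s) red[threadIdx.x] += red[threadIdx.x + s];
        __syncthreads();
    }
    sum = red[0];

    // loss = logsumexp - logit[target]; then the gradient overwrites logits in place
    if (threadIdx.x == 0) losses[row] = logf(sum) + maxval - x[target];
    __syncthreads();              // read x[target] before any thread overwrites it
    for (int i = threadIdx.x; i < V; i += blockDim.x) {
        float p = expf(x[i] - maxval) / sum;
        x[i] = p - (i == target ? 1.0f : 0.0f);
    }
}
```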

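The Packed128 trick is essentially a 16-byte aligned struct, so that one memory access covers 128 bits instead of several narrower transactions. A minimal sketch of the idea (illustrative, not the exact llm.c definition):

```cuda
// 128 bits per access: 8 elements for __nv_bfloat16, 4 for float
#include <cuda_bf16.h>

template<class T>
struct alignas(16) Packed128 {
    static constexpr int size = 16 / sizeof(T);
    T payload[size];

    __device__ T& operator[](int i) { return payload[i]; }

    __device__ static Packed128 load(const T* address) {
        // the aligned 16-byte copy compiles to a single 128-bit load
        return *reinterpret_cast<const Packed128*>(address);
    }
    __device__ void store(T* address) const {
        *reinterpret_cast<Packed128*>(address) = *this;
    }
};

// example use: a vectorized elementwise copy
// (assumes n is a multiple of Packed128<T>::size)
template<class T>
__global__ void copy128(T* dst, const T* src, long n) {
    long i = (blockIdx.x * (long)blockDim.x + threadIdx.x) * Packed128<T>::size;
    if (i < n) {
        Packed128<T> v = Packed128<T>::load(src + i);
        v.store(dst + i);
    }
}
```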

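For multi-GPU, the pattern is plain data parallelism: MPI launches one process per GPU and bootstraps NCCL, then gradients are averaged with an all-reduce before each optimizer step. A minimal sketch of that setup, assuming one process per GPU on a single node (not the exact llm.c code):

```c
#include <mpi.h>
#include <nccl.h>
#include <cuda_runtime.h>

int main(int argc, char** argv) {
    int rank, nranks;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);
    cudaSetDevice(rank);   // one process per GPU on a single node

    // rank 0 creates the NCCL unique id, everyone else receives it over MPI
    ncclUniqueId id;
    if (rank == 0) ncclGetUniqueId(&id);
    MPI_Bcast(&id, sizeof(id), MPI_BYTE, 0, MPI_COMM_WORLD);
    ncclComm_t comm;
    ncclCommInitRank(&comm, nranks, id, rank);

    // placeholder gradient buffer; in training this is the model's flattened grads
    size_t num_params = 1024 * 1024;
    float* grads;
    cudaMalloc(&grads, num_params * sizeof(float));

    // each rank runs forward/backward on its own slice of the batch, then the
    // gradients are averaged across all GPUs in place before the optimizer step
    ncclAllReduce(grads, grads, num_params, ncclFloat, ncclAvg, comm, 0);
    cudaDeviceSynchronize();

    cudaFree(grads);
    ncclCommDestroy(comm);
    MPI_Finalize();
    return 0;
}
```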

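The ZeRO stage 1 idea is to shard the optimizer state (AdamW's m and v) across GPUs rather than replicating it, so optimizer memory shrinks by roughly the number of ranks. A rough sketch of one optimizer step under that scheme, again assuming NCCL with one process per GPU; `zero1_step` and `adamw_update` are illustrative names, not the actual llm.c functions:

```c
#include <nccl.h>
#include <cuda_runtime.h>

// hypothetical helper, stands in for the real AdamW kernel launch
void adamw_update(float* params, const float* grads, float* m, float* v,
                  size_t n, int step);

// one optimizer step with ZeRO-1 sharding
void zero1_step(float* params, float* grads, float* m_shard, float* v_shard,
                size_t num_params, int rank, int nranks, int step,
                ncclComm_t comm, cudaStream_t stream) {
    size_t shard = num_params / nranks;   // assume it divides evenly
    size_t offset = (size_t)rank * shard;

    // 1) reduce-scatter: each rank receives the averaged gradients for its slice
    ncclReduceScatter(grads, grads + offset, shard, ncclFloat, ncclAvg, comm, stream);

    // 2) AdamW on this rank's slice only; m/v exist only for this slice
    adamw_update(params + offset, grads + offset, m_shard, v_shard, shard, step);

    // 3) all-gather so every rank again holds the full, updated parameters
    ncclAllGather(params + offset, params, shard, ncclFloat, comm, stream);
}
```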


















