
VHPC
@VHPCworkshop
Workshop on Virtualization in High-Performance Cloud Computing. News and views on #HPC #virtualization #VHPC #VMs #containers #cloud #ai



[Press release] Announcement of an innovative EUV lithography technology for advanced semiconductor manufacturing that dramatically improves energy efficiency. Defying conventional wisdom, the new EUV lithography system is built from just four reflective mirrors | Nihon no Kenkyu.com research-er.jp/articles/view/… Oh no, ASML is done for!!!

Google Cloud accidentally deleted a company's entire cloud environment (UniSuper, an Australian pension fund managing ~$80B). The company had backups in another region, but GCP deleted those too. Fortunately, it had yet more backups with another provider. theguardian.com/australia-news…

A few new CUDA hacker friends joined the effort, and llm.c is now only 2X slower than PyTorch (fp32, forward pass), compared to 4 days ago when it was 4.2X slower 📈 The biggest improvements were:

- Turn on TF32 (NVIDIA TensorFloat-32) instead of FP32 for matmuls. This is a new math mode in GPUs starting with Ampere. It is a very nice, ~free optimization that sacrifices a little precision for a large increase in performance by running the matmuls on tensor cores while chopping the mantissa down to 10 bits (the 13 least significant bits of the fp32 mantissa get lost). The inputs, outputs, and internal accumulates all remain in fp32; only the multiplies are lower precision. Equivalent to PyTorch `torch.set_float32_matmul_precision('high')`.
- Call the cuBLASLt API instead of cuBLAS for the SGEMM (fp32 matrix multiply), as this also lets you fuse the bias into the matmul, removing the need for a separate add_bias kernel that caused a silly round trip to global memory for one addition.
- A more efficient attention kernel that uses 1) cooperative_groups reductions, which look much cleaner and which I only just learned about (they are not covered in the CUDA PMPP book...), 2) the online softmax algorithm used in flash attention, 3) a fused attention scaling-factor multiply, and 4) "built-in" autoregressive mask bounds. (Big thanks to ademeure, ngc92, and lancerts on GitHub for writing / helping with these kernels!)

Finally, ChatGPT created this amazing chart to illustrate our progress. 4 days ago we were 4.6X slower; today we are 2X slower. So we are going to beat PyTorch imminently 😂 Now (personally) going to focus on the backward pass, so we have the full training loop in CUDA.
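The TF32 truncation mentioned above can be sketched in plain Python (an illustration only, not how the tensor cores actually operate): zeroing the 13 low mantissa bits of an fp32 value leaves the 10-bit TF32 mantissa, while sign and exponent are untouched.

```python
import struct

def tf32_round(x: float) -> float:
    """Simulate TF32 truncation of an fp32 value.

    fp32 has 23 mantissa bits; TF32 keeps the top 10, so we zero
    the 13 least significant mantissa bits of the bit pattern.
    """
    # reinterpret the float as its 32-bit pattern
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # clear the 13 low mantissa bits (sign and exponent untouched)
    bits &= ~((1 << 13) - 1)
    return struct.unpack("<f", struct.pack("<I", bits))[0]
```

Values representable in 10 mantissa bits (like 1.5) pass through unchanged; tiny perturbations below the TF32 resolution get truncated away, which is the "little bit of precision" the tweet says is sacrificed.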
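The online softmax trick used in the attention kernel can be sketched in Python (a scalar, single-pass illustration of the streaming max/denominator update from the flash-attention literature; the real kernel does this per row in CUDA):

```python
import math

def online_softmax(xs):
    """One-pass softmax: keep a running max m and running denominator s,
    rescaling s whenever the max grows, so no second pass over the data
    is needed to find the max for numerical stability."""
    m, s = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # rescale the accumulated sum into the new max's frame, add this term
        s = s * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / s for x in xs]
```

The payoff in a fused kernel is that the max and the denominator are computed in the same streaming pass as the scores themselves, avoiding extra trips through the data.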
