
I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture, and BPE tokenizer from scratch. The framework features: - Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput - Automatic WebGPU fallback for non-NVIDIA devices - TypeScript API with Rust compute backend - One npm install to get started, prebuilt binaries for every platform Try out the model for yourself: mni-ml.github.io/demos/transfor… Built with @_reesechong. Check out the repos and blog if you want to learn more. Shoutout to @modal for the compute credits allowing me to train on 2 A100 GPUs without going broke cc @sundeep @GavinSherry

















