Muhammad Kone
136 posts

@ProgrammingProg
NYC software engineer



What? Pre-training? No, no, no, no. No pre-training. Why would you do pre-training?! If you do pre-training, people will ask "HOW MANY TOKENS?" And it will never be enough. The model that was the frontier breakthrough becomes "just distillation from ChatGPT." But if you just do SFT, some RL on benchmarks? You're efficient. You're doing *reasoning*. It's not about capabilities, it's about the benchmark score. And who tops the leaderboard? Labs doing RLFT on a Chinese frontier base.



I trained a 12M parameter LLM on my own ML framework using a Rust backend and CUDA kernels for flash attention, AdamW, and more. Wrote the full transformer architecture and BPE tokenizer from scratch.

The framework features:
- Custom CUDA kernels (Flash Attention, fused LayerNorm, fused GELU) for 3x increased throughput
- Automatic WebGPU fallback for non-NVIDIA devices
- TypeScript API with Rust compute backend
- One npm install to get started, prebuilt binaries for every platform

Try out the model for yourself: mni-ml.github.io/demos/transfor…

Built with @_reesechong. Check out the repos and blog if you want to learn more. Shoutout to @modal for the compute credits allowing me to train on 2 A100 GPUs without going broke. cc @sundeep @GavinSherry
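For readers unfamiliar with "fused GELU": fusing means one GPU kernel reads the activations once and applies the nonlinearity in place, instead of launching several separate elementwise ops. The post's actual kernels are CUDA and aren't shown here; as a minimal conceptual sketch, this Python snippet only shows the math such a kernel computes, comparing the exact erf-based GELU with the cheaper tanh approximation commonly used on GPUs.

```python
import math

def gelu_exact(x: float) -> float:
    # GELU(x) = x * Phi(x), where Phi is the standard normal CDF
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x: float) -> float:
    # Tanh approximation: avoids erf, so it's cheaper in a fused kernel
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(f"x={x:+.1f}  exact={gelu_exact(x):.6f}  tanh={gelu_tanh(x):.6f}")
```

The two variants agree to roughly three decimal places over typical activation ranges, which is why the approximation is a popular drop-in for hand-written kernels.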





protip for stanford undergrads: beware the classes with guest speaker lineups that read like AI coachella. you’re basically paying $5k to listen to a live podcast series.






