
Happy llama3.1 day to those who celebrate
ishan

@0xishand
inference (dynamo + sglang) @nvidia | prev. @brevdev (acq.) @agora_io (acq.), @columbia | my views ≠ employer views


"Your job won't be taken by AI, it will be taken by Meek Mill using AI" - Jensen Huang

You can just do things in prime-rl: like teaching GLM5 to answer math in <2000 tokens, using 16 nodes for training and 12 nodes for inference in a 2P4D configuration, all with just uv run rl @ rl.toml ( @samsja19 told me I should tweet more things)

1/4 LLMs solve research-grade math problems but struggle with basic calculations. We bridge this gap by turning them into computers. We built a computer INSIDE a transformer that can run programs for millions of steps in seconds, solving even the hardest Sudokus with 100% accuracy


Useful for modding/reverse engineering Claude Code: CC is not open source, but the installed npm package contains a single minified JS file whose logic is readable to Claudes, who are very clever and know how this kinda stuff works.


Today I’m sharing a new research paper that explores a new idea in mixture-of-experts architectures called “DynaMoE”. DynaMoE is a Mixture-of-Experts framework where:
- the number of active experts per token is dynamic.
- the total number of experts can be scheduled differently across layers.
From my findings, the best model uses a descending expert schedule, where the early layers have the most experts and the final layer has the least (1 expert). This removes the rigid Top-K routing used in most MoE models and improves parameter efficiency and training stability. Paper: arxiv.org/abs/2603.01697
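The two ideas (a dynamic number of active experts per token, and a descending per-layer expert schedule) can be sketched in plain Python. The threshold-based router and the linear schedule below are illustrative assumptions on my part, not the paper's exact formulation:

```python
import math

def descending_expert_schedule(num_layers, max_experts):
    """Linearly decrease the expert count from max_experts at the
    first layer down to a single expert at the final layer."""
    last = num_layers - 1
    return [round(max_experts - (max_experts - 1) * l / last)
            for l in range(num_layers)]

def dynamic_route(router_logits, threshold=0.2):
    """Pick a *dynamic* number of experts per token: every expert whose
    softmax routing probability clears `threshold` (at least one)."""
    m = max(router_logits)                      # stabilize the softmax
    exps = [math.exp(x - m) for x in router_logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    chosen = [i for i, p in enumerate(probs) if p >= threshold]
    # Fall back to the single best expert if none clears the bar.
    return chosen or [max(range(len(probs)), key=probs.__getitem__)]

print(descending_expert_schedule(4, 8))      # [8, 6, 3, 1]
print(dynamic_route([2.0, 1.0, 0.1, 0.1]))   # [0, 1]
```

A confident token activates few experts while an ambiguous one activates many, which is the intuition behind dropping fixed Top-K.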



I guarantee that any industry expert, with a little time and effort, can make a better (or at least more focused) skill than the default Anthropic ones. This is not an insult to Anthropic; it's just a reminder that specialist experts know more about their jobs than AI labs do.



🚀 Our new blog: 1.53X over GB200 - Deploying DeepSeek on GB300 NVL72, with 226 TPS/GPU on long-context inference!

Together with @nvidia, we have achieved new milestones on GB300 NVL72 for 128K/8K long-context serving:
⚡ 226 TPS/GPU peak throughput (1.53X vs GB200)
🧠 1.87X TPS/User gain with MTP under matched throughput
💾 1.6X higher decode batch size via GB300's 288GB HBM3e
⏱ 8.6s TTFT for 128K prefill with dynamic chunked PP
🔧 1.35X faster FMHA kernel via 2x SFU softmax throughput on Blackwell Ultra

Powered by: PD disaggregation + Wide-EP + chunked PP + MTP overlap scheduling + FP8 attention, orchestrated with NVIDIA Dynamo @NVIDIAAIDev
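As a quick sanity check on the headline figures (my arithmetic, not from the blog): the 1.53X claim implies a GB200 baseline of roughly 148 TPS/GPU, and multiplying the per-GPU peak across the 72 GPUs in one NVL72 rack gives the aggregate number, assuming the per-GPU figure already reflects rack-scale serving:

```python
# Back-of-envelope check of the headline figures (illustrative arithmetic only).
gb300_tps_per_gpu = 226          # reported peak throughput on GB300 NVL72
speedup_vs_gb200 = 1.53

# Implied GB200 baseline from the 1.53X claim.
gb200_tps_per_gpu = gb300_tps_per_gpu / speedup_vs_gb200
print(f"implied GB200 baseline: {gb200_tps_per_gpu:.1f} TPS/GPU")  # ~147.7

# Aggregate throughput across one 72-GPU NVL72 rack at the peak figure.
rack_tps = gb300_tps_per_gpu * 72
print(f"per-rack peak: {rack_tps} TPS")  # 16272
```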