On the systems side, we provide custom CUDA kernels for Ampere GPUs and vLLM integration (see below for OSS code) to fully leverage nested models:
- 3×–5.6× kernel speedups when memory-bound
- 1.5×–3.5× end-to-end speedups in vLLM
We’re releasing MatGPTQ (Matryoshka GPTQ), an accurate and efficient post-training quantization (PTQ) method that jointly optimizes a single model across multiple bit-widths, producing a sliceable checkpoint that can be deployed across diverse hardware and memory budgets. [1/4]
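For intuition on what a "sliceable" checkpoint can mean: in Matryoshka-style quantization, the int8 codes are nested so that their most-significant bits form valid lower-bit codes. This is a hedged sketch of that idea, an assumption about MatGPTQ's layout rather than its actual implementation:

```python
import numpy as np

# Toy illustration of MSB-nested ("Matryoshka") integer codes:
# if the int8 codes are trained so their top bits are themselves good
# int4 codes, the int4 model is obtained by dropping the 4 low bits --
# no re-quantization pass needed. (Assumed layout, not MatGPTQ's code.)
w8 = np.array([-112, -37, 0, 53, 121], dtype=np.int8)  # int8 weight codes
w4 = w8 >> 4  # arithmetic shift keeps the sign; codes now lie in [-8, 7]
print(w4)     # → [-7 -3  0  3  7]
```

The same checkpoint thus serves both the 8-bit and 4-bit deployments, which is what lets one artifact cover diverse memory budgets.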
Credit goes to Ionut Modoranu, Philip Zmushko, Erik Schultheis and Mher Safaryan (and Denis Kuznedelev for the front image suggestion).
📄 Paper: arxiv.org/abs/2602.02016
💻 Code: github.com/IST-DASLab/DASH
DASH updates preconditioners _at every step_ while preserving throughput:
⚡ DASH runs 4.83× faster per step than Distributed Shampoo
📉 DASH with Power-Iteration scaling hits the best loss across configs, outperforming EVD
🧠 Slightly lower memory usage (72 vs. 76 GB/GPU)
We're releasing DASH (Distributed Accelerated Shampoo), an improved implementation of the Shampoo optimizer that achieves up to 4.83× faster optimizer steps, while matching or improving final model quality.
[1/6]
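For readers new to Shampoo: here is a compact single-GPU sketch of the preconditioning step that DASH accelerates (illustrative only, not the distributed implementation; `eps` and shapes are assumptions). The expensive part is the inverse fourth root, computed here via an eigendecomposition (EVD); DASH's Power-Iteration scaling replaces exactly this step with a cheaper approximation.

```python
import numpy as np

def inv_fourth_root(m, eps=1e-8):
    """m^(-1/4) for a symmetric PSD matrix via EVD (the costly step)."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * (vals + eps) ** -0.25) @ vecs.T

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 3))   # gradient of a toy 4x3 weight matrix
L = g @ g.T                   # left Kronecker factor, accumulated each step
R = g.T @ g                   # right Kronecker factor
update = inv_fourth_root(L) @ g @ inv_fourth_root(R)  # preconditioned grad
```

In a real optimizer, `L` and `R` are running accumulators across steps; updating the preconditioners at every step (rather than every k steps) is what DASH makes affordable.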
Accepted to ICLR 2026 (@iclr_conf)!
See you in Rio🇧🇷 I’d love to connect with folks working on efficient ML!
TL;DR: LLMs can be ~99% sparse without catastrophic collapse → ~2.5× faster inference + ~4.6× compression (2.9GB vs 14GB dense 7B).
OpenReview: openreview.net/forum?id=ek6dQ…
[1/5]
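Back-of-the-envelope on where compression like this can come from: a toy bitmask-plus-values layout for unstructured sparsity. Purely illustrative; the paper's 2.9 GB figure will also reflect layers kept dense, the actual index format, and metadata, so this toy format lands at a different number.

```python
# Toy storage cost for a ~99%-sparse 7B model: a 1-bit presence mask
# per weight plus packed 16-bit values for the nonzeros.
# (Hypothetical format, not the paper's; numbers are illustrative.)
n = 7_000_000_000                 # 7B parameters
density = 0.01                    # ~99% sparse
dense_gb = n * 2 / 1e9            # BF16 dense checkpoint: 14.0 GB
mask_gb = n / 8 / 1e9             # 1 bit per weight: 0.875 GB
vals_gb = n * density * 2 / 1e9   # BF16 nonzeros: 0.14 GB
sparse_gb = mask_gb + vals_gb
print(dense_gb, sparse_gb)        # → 14.0 1.015
```

Even this crude layout shows why extreme sparsity pays off in memory; the inference speedup additionally depends on kernels that skip the zeros.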
Quartet II is validated on models up to 1.9B params / 38B tokens using the Nanochat pipeline. We are looking forward to scaling it further!
📄 Paper: arxiv.org/abs/2601.22813
💻 Code: github.com/IST-DASLab/Qua…
Credit goes to @black_samorez, Erik Schultheis, and Rush Tabesh.
On the systems side, we have custom CUDA kernels for Blackwell GPUs that achieve up to 4.2× speedup over BF16, and 2.4× higher throughput in real 1B-parameter training. The key is a new post-hoc range-alignment trick that avoids costly double tensor loads during re-quantization.
Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4.
Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration!
[1/4]
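For context on the number format: NVFP4 stores FP4 E2M1 elements with one scale per 16-element block. A hedged numpy sketch of that quantization step (real NVFP4 keeps block scales in FP8 E4M3 plus a per-tensor FP32 scale; plain float scales are used here for clarity, and this is not the Quartet II code):

```python
import numpy as np

# FP4 E2M1 representable magnitudes, and the symmetric grid they induce.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])  # one shared zero

def quantize_nvfp4(x, block=16):
    """Round each 16-element block to the scaled E2M1 grid (toy version)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0  # map block max to 6
    scale = np.where(scale == 0, 1.0, scale)
    idx = np.abs(x[..., None] / scale[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scale).reshape(-1)

w = np.array([0.1, -2.0, 0.74, 3.0] * 4)  # one block of 16 values
print(quantize_nvfp4(w))                  # 0.1 → 0.0, 0.74 → 0.75, etc.
```

The tiny per-block granularity is what keeps 4-bit training accurate enough to approach FP8/FP16 quality, while the narrow element format preserves hardware acceleration.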
Reviewer: hallucinates a baseline that doesn't exist.
Meta-reviewer: cites the hallucination as the paper's fatal flaw.
Decision: reject.
@iclr_conf is amazing
Our FP4 QAT research (Quartet, NeurIPS'25) suggests that the quality gap shrinks with larger models and longer runs, while speedups compound, making low-bit training a strong alternative on modern GPUs:
arxiv.org/abs/2505.14669
Releasing QuTLASS v0.2: fast, end-to-end quantization-aware training (QAT) with kernel support and applications!
1. Nanochat-QAT: a fully-quantized extension of @karpathy 's nanochat
2. General QAT recipe with MXFP4 forward/MXFP8 backward GEMMs
3. Transformers/vLLM integrations
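The core trick behind QAT recipes like this is the straight-through estimator (STE): the forward pass sees quantized weights, while the backward pass treats the rounding as identity. A minimal numpy sketch, using a uniform quantizer as a stand-in for the MXFP4 grid (all names and hyperparameters here are illustrative, not QuTLASS's):

```python
import numpy as np

def fake_quant(w, levels=16):
    """Symmetric uniform quantizer standing in for the MXFP4 grid.
    Assumes w is not all-zero (true in this toy example)."""
    s = np.abs(w).max() / (levels // 2 - 1)
    return np.round(w / s) * s

rng = np.random.default_rng(0)
w = rng.normal(size=4)                        # full-precision master weights
x, y = rng.normal(size=(8, 4)), rng.normal(size=8)
loss = lambda v: np.mean((x @ fake_quant(v) - y) ** 2)
start = loss(w)
for _ in range(100):
    wq = fake_quant(w)                        # forward uses quantized weights
    grad = x.T @ (x @ wq - y) / len(y)        # STE: d(wq)/d(w) taken as identity
    w -= 0.1 * grad                           # update the master weights
end = loss(w)
```

A real recipe applies the same idea inside the GEMMs (MXFP4 forward, MXFP8 backward) with hardware-accelerated kernels; the sketch only shows why training still converges despite the rounding.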