Dan Alistarh

164 posts

@DAlistarh

Professor at IST Austria

Vienna · Joined May 2022
292 Following · 1.7K Followers
Dan Alistarh @DAlistarh
On the systems side, we provide custom CUDA kernels for Ampere GPUs and vLLM integration (see below for OSS code) to fully leverage nested models:
- 3x-5.6x kernel speedups when memory-bound
- 1.5x-3.5x end-to-end speedups in vLLM
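A minimal usage sketch of what serving one slice of such a nested checkpoint with vLLM could look like. The model path and the assumption that a slice is exported as a standalone GPTQ-style checkpoint are mine; only the vLLM calls themselves (LLM, SamplingParams, generate) are the real API.

```python
# Hypothetical sketch: serving a 2-bit slice of a nested (Matryoshka) GPTQ
# checkpoint with vLLM. Model path and export format are assumptions.
from vllm import LLM, SamplingParams

llm = LLM(
    model="my-org/llama-matgptq-2bit-slice",   # hypothetical sliced export
    quantization="gptq",                        # vLLM's existing GPTQ loader
)
outputs = llm.generate(
    ["Nested models let one checkpoint serve many memory budgets."],
    SamplingParams(max_tokens=32),
)
print(outputs[0].outputs[0].text)
```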
Dan Alistarh @DAlistarh
We’re releasing MatGPTQ (Matryoshka GPTQ), an accurate and efficient post-training quantization (PTQ) method that jointly optimizes a single model across multiple bit-widths, producing a sliceable checkpoint that can be deployed across diverse hardware and memory budgets. [1/4]
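A toy numpy sketch of the nested ("Matryoshka") idea behind a sliceable checkpoint: the top bits of a higher-precision integer code double as a lower-precision code. The bit-widths, scales, and rounding below are my own simplifications for illustration, not the MatGPTQ algorithm.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.standard_normal(8).astype(np.float32)

# 4-bit symmetric quantization: integer codes in [0, 15] with zero point 8.
scale = np.abs(w).max() / 7.0
q4 = np.clip(np.round(w / scale) + 8, 0, 15).astype(np.int64)

# "Slice" to 2 bits by keeping only the top two bits of each 4-bit code.
q2 = q4 >> 2                          # codes in [0, 3]

w4 = (q4 - 8) * scale                 # dequantized 4-bit weights
w2 = (4 * q2 + 1.5 - 8) * scale       # 2-bit slice, dequantized at bin centers

print("4-bit max error:", np.abs(w - w4).max())
print("2-bit max error:", np.abs(w - w2).max())
```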
Dan Alistarh @DAlistarh
DASH updates preconditioners _at every step_ while preserving throughput:
⚡ DASH runs 4.83× faster per step than Distributed Shampoo
📉 DASH with Power-Iteration scaling hits the best loss across configs, outperforming EVD
🧠 Slightly better memory utilization (72 vs 76 GB/GPU)
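For readers unfamiliar with the EVD vs power-iteration trade-off, here is a generic power-iteration routine for the largest eigenvalue of a Shampoo-style statistic, the kind of quantity a power-iteration scaling scheme estimates instead of running a full eigendecomposition. Illustrative only, not the DASH code.

```python
# Illustrative only: power iteration for the largest eigenvalue of a
# Shampoo-style statistic L = G G^T, avoiding a full eigendecomposition (EVD).
import torch

def top_eigenvalue(L: torch.Tensor, iters: int = 20) -> torch.Tensor:
    """Estimate lambda_max of a symmetric PSD matrix via power iteration."""
    v = torch.randn(L.shape[0], dtype=L.dtype, device=L.device)
    v = v / v.norm()
    for _ in range(iters):
        v = L @ v
        v = v / v.norm()
    return v @ (L @ v)  # Rayleigh quotient at the converged vector

G = torch.randn(256, 512)                      # stand-in gradient matrix
L = G @ G.T                                    # left Shampoo statistic
print(float(top_eigenvalue(L)), float(torch.linalg.eigvalsh(L)[-1]))
```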
Dan Alistarh @DAlistarh
We're releasing DASH (Distributed Accelerated Shampoo), an improved implementation of the Shampoo optimizer that achieves up to 4.83× faster optimizer steps, while matching or improving final model quality. [1/6]
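For context, a textbook single-step sketch of the standard Shampoo update (Gupta et al., 2018) for a 2-D parameter: accumulate left/right statistics and precondition with their inverse fourth roots. This is the generic recipe, not the DASH implementation; the dense eigendecomposition used for the inverse roots is typically the expensive part.

```python
# Textbook single-step Shampoo sketch; generic illustration, not DASH.
import torch

def shampoo_step(W, G, L, R, lr=0.1, eps=1e-6):
    L += G @ G.T                                   # left statistic,  (m, m)
    R += G.T @ G                                   # right statistic, (n, n)

    def inv_root(M, p):
        vals, vecs = torch.linalg.eigh(M + eps * torch.eye(M.shape[0]))
        return vecs @ torch.diag(vals.clamp_min(eps) ** (-1.0 / p)) @ vecs.T

    W -= lr * inv_root(L, 4) @ G @ inv_root(R, 4)  # precondition: L^{-1/4} G R^{-1/4}
    return W, L, R

m, n = 16, 32
W, G = torch.randn(m, n), torch.randn(m, n)
L, R = torch.zeros(m, m), torch.zeros(n, n)
W, L, R = shampoo_step(W, G, L, R)
```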
Dan Alistarh retweeted
Kwanhee Lee @kwanhee_l
Accepted to ICLR 2026 (@iclr_conf)! See you in Rio 🇧🇷. I’d love to connect w/ efficient-ML folks!
TL;DR: LLMs can be ~99% sparse without catastrophic collapse → ~2.5× faster inference + ~4.6× compression (2.9 GB vs 14 GB dense 7B).
OpenReview: openreview.net/forum?id=ek6dQ… [1/5]
Dan Alistarh @DAlistarh
On the systems side, we have custom CUDA kernels for Blackwell GPUs that achieve up to 4.2× speedup over BF16, and 2.4× higher throughput in real 1B-parameter training. The key is a new post hoc range alignment trick that avoids costly double tensor loads during re-quantization.
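The details of the range alignment trick are not spelled out here; as background, a generic two-pass quantization baseline shows why re-quantization normally reads the tensor twice (once to find the range, once to apply the scale), which is presumably the "double tensor load" being avoided. This is not the released kernel.

```python
# Generic two-pass quantization baseline (not the released kernel).
import torch

def two_pass_quantize(x: torch.Tensor, levels: int = 15):
    amax = x.abs().max()                              # pass 1: range scan over the tensor
    scale = amax / (levels // 2) + 1e-12
    q = torch.round(x / scale).clamp(-(levels // 2), levels // 2)  # pass 2: apply the scale
    return q.to(torch.int8), scale

x = torch.randn(1 << 20)
q, scale = two_pass_quantize(x)
```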
Dan Alistarh @DAlistarh
Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4. Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration! [1/4]
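A sketch of what block-scaled 4-bit floating-point (NVFP4-style) quantization looks like: values are rounded to the E2M1 grid {0, ±0.5, ±1, ±1.5, ±2, ±3, ±4, ±6} with one scale per 16-element block. The plain float scale and nearest-point rounding here are simplifying assumptions for illustration, not the Quartet II recipe.

```python
import torch

E2M1 = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_nvfp4_block(x: torch.Tensor) -> torch.Tensor:
    """x has shape (..., 16): one scale per trailing block of 16 values."""
    scale = (x.abs().amax(dim=-1, keepdim=True) / 6.0).clamp_min(1e-12)
    y = x / scale
    idx = (y.abs().unsqueeze(-1) - E2M1).abs().argmin(dim=-1)   # nearest grid point
    return torch.sign(y) * E2M1[idx] * scale

x = torch.randn(4, 16)
print("mean abs quantization error:", float((x - quantize_nvfp4_block(x)).abs().mean()))
```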
Dan Alistarh retweeted
Eldar Kurtić @_EldarKurtic
Reviewer: hallucinates a baseline that doesn't exist. Meta-reviewer: cites the hallucination as the paper's fatal flaw. Decision: reject. @iclr_conf is amazing
Dan Alistarh @DAlistarh
Our FP4 QAT research (Quartet, NeurIPS'25) suggests that the quality gap shrinks with larger models and longer runs, while speedups compound—making low-bit training a strong alternative on modern GPUs: arxiv.org/abs/2505.14669
Dan Alistarh @DAlistarh
Releasing QuTLASS v0.2: fast, end-to-end quantization-aware training (QAT) with kernel support and applications!
1. Nanochat-QAT: a fully-quantized extension of @karpathy's nanochat
2. General QAT recipe with MXFP4 forward/MXFP8 backward GEMMs
3. Transformers/vLLM integrations
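A minimal sketch of the QAT pattern such a recipe builds on: fake-quantize in the forward pass and pass gradients straight through (STE) in the backward pass. The 4-bit integer fake_quant below is a generic stand-in, not the QuTLASS MXFP4/MXFP8 kernels.

```python
import torch

def fake_quant(x: torch.Tensor, levels: int = 15) -> torch.Tensor:
    scale = x.abs().amax() / (levels // 2) + 1e-8
    q = torch.round(x / scale).clamp(-(levels // 2), levels // 2) * scale
    return x + (q - x).detach()        # forward sees q, backward sees identity (STE)

w = torch.randn(64, 64, requires_grad=True)
x = torch.randn(8, 64)
loss = (x @ fake_quant(w).T).pow(2).mean()
loss.backward()                         # gradient reaches w through the STE
print("mean |grad|:", float(w.grad.abs().mean()))
```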