Dan Alistarh

181 posts

@DAlistarh

Professor at IST Austria

Vienna · Joined May 2022
300 Following · 1.8K Followers
Dan Alistarh @DAlistarh
🚀 Scalar structure means GSQ can use highly optimized low-precision GEMM kernels. Using vLLM + the excellent Humming kernels (github.com/inclusionAI/hu…) on L40S GPUs, 2-bit GSQ-quantized Llama-3.1-70B hits up to 6.2× throughput vs. BF16!
Dan Alistarh @DAlistarh
Weight-only quantization powers local LLM runtimes like llama.cpp or Ollama. But SOTA quantized accuracy requires complex kernels that are notoriously hard to implement. Can we get SOTA accuracy and keep things simple? Our new GSQ (Gumbel-Softmax Quantization) method says yes. 🧵
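For readers new to the term, here is a minimal sketch of plain scalar weight-only quantization: each weight in a group is rounded independently to one of 2^bits evenly spaced levels. This is a generic round-to-nearest baseline, not GSQ itself (the post does not describe how GSQ picks its levels), and the function names `quantize_group` and `dequantize_group` are hypothetical.

```python
def quantize_group(weights, bits):
    """Round-to-nearest min-max quantization of one weight group.

    Returns (codes, scale, offset). This is a generic baseline used
    for illustration only; it is NOT the GSQ method from the post.
    """
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    # Guard against a constant group, where hi == lo.
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = []
    for w in weights:
        code = round((w - lo) / scale)
        # Clamp to the valid code range to absorb float error.
        codes.append(max(0, min(levels, code)))
    return codes, scale, lo


def dequantize_group(codes, scale, offset):
    """Map integer codes back to approximate floating-point weights."""
    return [offset + c * scale for c in codes]


group = [-1.0, -0.5, 0.5, 1.0]
codes, scale, offset = quantize_group(group, bits=2)
restored = dequantize_group(codes, scale, offset)
```

Because every weight maps to its own integer code, the inner loop of a matrix multiply only needs an integer lookup plus one scale and offset per group, which is the kind of structure standard low-precision GEMM kernels expect.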
Dan Alistarh reposted
Eldar Kurtić @_EldarKurtic
TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:
Dan Alistarh reposted
Jiale Chen @JialeChenEdu
🚀 Our #ICLR2026 paper: "The Geometry of LLM Quantization: GPTQ as Babai's Nearest Plane Algorithm". We show GPTQ is exactly Babai's nearest plane algorithm, giving a geometric view of LLM quantization and inspiring improved PTQ methods. Efficient GPTQ Triton kernels included!
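As background, the textbook form of Babai's nearest plane algorithm can be sketched in a few lines. Given a lattice basis and a target vector, it walks the basis from last to first, rounding the target's coordinate along each Gram-Schmidt direction and subtracting the chosen multiple of the basis vector. This is the classical algorithm only; the post does not say which lattice GPTQ corresponds to, so that mapping is not shown here, and the helper names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def gram_schmidt(basis):
    """Return unnormalized Gram-Schmidt vectors for the basis rows."""
    ortho = []
    for b in basis:
        v = [float(x) for x in b]
        for u in ortho:
            mu = dot(v, u) / dot(u, u)
            v = [vi - mu * ui for vi, ui in zip(v, u)]
        ortho.append(v)
    return ortho


def babai_nearest_plane(basis, target):
    """Approximate the lattice point closest to `target`.

    `basis` is a list of linearly independent row vectors.
    Returns (integer coefficients, lattice point).
    """
    ortho = gram_schmidt(basis)
    residual = [float(x) for x in target]
    coeffs = [0] * len(basis)
    # Process basis vectors from last to first.
    for j in reversed(range(len(basis))):
        c = round(dot(residual, ortho[j]) / dot(ortho[j], ortho[j]))
        coeffs[j] = c
        residual = [r - c * b for r, b in zip(residual, basis[j])]
    dim = len(target)
    point = [sum(c * b[i] for c, b in zip(coeffs, basis)) for i in range(dim)]
    return coeffs, point


coeffs, point = babai_nearest_plane([[2, 1], [1, 3]], [4, -3])
```

With an orthogonal basis the procedure reduces to independent rounding of each coordinate; with a skewed basis, each rounding decision is corrected for by the later (earlier-indexed) steps.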
Dan Alistarh @DAlistarh
Speedrunning GPT-2 is now routine thanks to @karpathy. But can we speedrun GPT-3 175B? We attempted to match its accuracy on a <$10K budget; while we didn't quite reach it, our first results show that quality data, engineering, and native FP4 can get close. Details in 🧵
Dan Alistarh @DAlistarh
Credit goes to Erik Schultheis, Matin Ansaripour @matin_asp, Andrei Panferov @black_samorez, and George Vlassis @gvlassis98. Thanks to @verdacloud (particularly Paul Chang) for compute support, and Jen Iofinova for safety testing. This work was supported by FWF BilAI and SwissAI.
Dan Alistarh @DAlistarh
On the systems side, we provide custom CUDA kernels for Ampere GPUs and vLLM integration (see below for OSS code) to fully leverage nested models:
- 3x-5.6x kernel speedups when memory-bound
- 1.5x-3.5x end-to-end speedups in vLLM
Dan Alistarh @DAlistarh
We’re releasing MatGPTQ (Matryoshka GPTQ), an accurate and efficient post-training quantization (PTQ) method that jointly optimizes a single model across multiple bit-widths, producing a sliceable checkpoint that can be deployed across diverse hardware and memory budgets. [1/4]
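To make "sliceable" concrete, here is a minimal sketch of one common way to nest integer codes: store each weight as a wide unsigned code and obtain a lower bit-width by keeping only its most significant bits. Whether MatGPTQ slices exactly this way is an assumption, since the post does not specify the scheme, and the names `slice_code` and `expand_code` are hypothetical.

```python
def slice_code(code, src_bits, dst_bits):
    """Keep the dst_bits most significant bits of an unsigned code.

    Assumed nesting scheme for illustration; the post does not state
    how MatGPTQ derives its lower bit-width codes.
    """
    if not 0 < dst_bits <= src_bits:
        raise ValueError("need 0 < dst_bits <= src_bits")
    if not 0 <= code < 2 ** src_bits:
        raise ValueError("code out of range for src_bits")
    return code >> (src_bits - dst_bits)


def expand_code(code, src_bits, dst_bits):
    """Place a dst_bits code back on the src_bits grid (low bits zeroed)."""
    return code << (src_bits - dst_bits)


parent = 0b10110110  # one 8-bit code from the stored checkpoint
code4 = slice_code(parent, 8, 4)  # 4-bit view of the same weight
code2 = slice_code(parent, 8, 2)  # 2-bit view of the same weight
```

Under this scheme a single stored checkpoint serves every bit-width: slicing the 4-bit view again yields the same 2-bit code as slicing the parent directly, so a deployment can pick its width at load time.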