On the systems side, we provide custom CUDA kernels for Ampere GPUs and vLLM integration (see below for OSS code) to fully leverage nested models:
- 3×–5.6× kernel speedups when memory-bound
- 1.5×–3.5× end-to-end speedups in vLLM
We’re releasing MatGPTQ (Matryoshka GPTQ), an accurate and efficient post-training quantization (PTQ) method that jointly optimizes a single model across multiple bit-widths, producing a sliceable checkpoint that can be deployed across diverse hardware and memory budgets. [1/4]
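For intuition on what a "sliceable" checkpoint can mean: in Matryoshka-style quantization, the int8 codes are nested so that their most-significant bits form valid lower-bit codes. This is a hedged sketch of that idea, an assumption about MatGPTQ's layout rather than its actual implementation:

```python
import numpy as np

# Toy illustration of MSB-nested ("Matryoshka") integer codes:
# if the int8 codes are trained so their top bits are themselves good
# int4 codes, the int4 model is obtained by dropping the 4 low bits --
# no re-quantization pass needed. (Assumed layout, not MatGPTQ's code.)
w8 = np.array([-112, -37, 0, 53, 121], dtype=np.int8)  # int8 weight codes
w4 = w8 >> 4  # arithmetic shift keeps the sign; codes now lie in [-8, 7]
print(w4)     # → [-7 -3  0  3  7]
```

The same checkpoint thus serves both the 8-bit and 4-bit deployments, which is what lets one artifact cover diverse memory budgets.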
Credit goes to Ionut Modoranu, Philip Zmushko, Erik Schultheis and Mher Safaryan (and Denis Kuznedelev for the front image suggestion).
📄 Paper: arxiv.org/abs/2602.02016
💻 Code: github.com/IST-DASLab/DASH
DASH updates preconditioners _at every step_ while preserving throughput:
⚡ DASH runs 4.83× faster per step than Distributed Shampoo
📉 DASH with Power-Iteration scaling hits the best loss across configs, outperforming EVD
🧠 Slightly lower memory usage (72 vs. 76 GB/GPU)
We're releasing DASH (Distributed Accelerated Shampoo), an improved implementation of the Shampoo optimizer that achieves up to 4.83× faster optimizer steps, while matching or improving final model quality.
[1/6]
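For readers new to Shampoo: here is a compact single-GPU sketch of the preconditioning step that DASH accelerates (illustrative only, not the distributed implementation; `eps` and shapes are assumptions). The expensive part is the inverse fourth root, computed here via an eigendecomposition (EVD); DASH's Power-Iteration scaling replaces exactly this step with a cheaper approximation.

```python
import numpy as np

def inv_fourth_root(m, eps=1e-8):
    """m^(-1/4) for a symmetric PSD matrix via EVD (the costly step)."""
    vals, vecs = np.linalg.eigh(m)
    return (vecs * (vals + eps) ** -0.25) @ vecs.T

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 3))   # gradient of a toy 4x3 weight matrix
L = g @ g.T                   # left Kronecker factor, accumulated each step
R = g.T @ g                   # right Kronecker factor
update = inv_fourth_root(L) @ g @ inv_fourth_root(R)  # preconditioned grad
```

In a real optimizer, `L` and `R` are running accumulators across steps; updating the preconditioners at every step (rather than every k steps) is what DASH makes affordable.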
Accepted to ICLR 2026 (@iclr_conf)!
See you in Rio🇧🇷 I’d love to connect with folks working on efficient ML!
TL;DR: LLMs can be ~99% sparse without catastrophic collapse → ~2.5× faster inference + ~4.6× compression (2.9GB vs 14GB dense 7B).
OpenReview: openreview.net/forum?id=ek6dQ…
[1/5]
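Back-of-the-envelope on where compression like this can come from: a toy bitmask-plus-values layout for unstructured sparsity. Purely illustrative; the paper's 2.9 GB figure will also reflect layers kept dense, the actual index format, and metadata, so this toy format lands at a different number.

```python
# Toy storage cost for a ~99%-sparse 7B model: a 1-bit presence mask
# per weight plus packed 16-bit values for the nonzeros.
# (Hypothetical format, not the paper's; numbers are illustrative.)
n = 7_000_000_000                 # 7B parameters
density = 0.01                    # ~99% sparse
dense_gb = n * 2 / 1e9            # BF16 dense checkpoint: 14.0 GB
mask_gb = n / 8 / 1e9             # 1 bit per weight: 0.875 GB
vals_gb = n * density * 2 / 1e9   # BF16 nonzeros: 0.14 GB
sparse_gb = mask_gb + vals_gb
print(dense_gb, sparse_gb)        # → 14.0 1.015
```

Even this crude layout shows why extreme sparsity pays off in memory; the inference speedup additionally depends on kernels that skip the zeros.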
Quartet II is validated on models up to 1.9B params / 38B tokens using the Nanochat pipeline. We are looking forward to scaling it further!
📄 Paper: arxiv.org/abs/2601.22813
💻 Code: github.com/IST-DASLab/Qua…
Credit goes to @black_samorez, Erik Schultheis, and Rush Tabesh.
On the systems side, we have custom CUDA kernels for Blackwell GPUs that achieve up to 4.2× speedup over BF16, and 2.4× higher throughput in real 1B-parameter training. The key is a new post-hoc range-alignment trick that avoids costly double tensor loads during re-quantization.
Happy to release Quartet II, a new method that pushes the frontier of 4-bit LLM training in NVFP4.
Fully-quantized pre-training in NVFP4 can now match FP8/FP16 quality much more closely, while maintaining full hardware acceleration!
[1/4]
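For context on the number format: NVFP4 stores FP4 E2M1 elements with one scale per 16-element block. A hedged numpy sketch of that quantization step (real NVFP4 keeps block scales in FP8 E4M3 plus a per-tensor FP32 scale; plain float scales are used here for clarity, and this is not the Quartet II code):

```python
import numpy as np

# FP4 E2M1 representable magnitudes, and the symmetric grid they induce.
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])  # one shared zero

def quantize_nvfp4(x, block=16):
    """Round each 16-element block to the scaled E2M1 grid (toy version)."""
    x = x.reshape(-1, block)
    scale = np.abs(x).max(axis=1, keepdims=True) / 6.0  # map block max to 6
    scale = np.where(scale == 0, 1.0, scale)
    idx = np.abs(x[..., None] / scale[..., None] - GRID).argmin(axis=-1)
    return (GRID[idx] * scale).reshape(-1)

w = np.array([0.1, -2.0, 0.74, 3.0] * 4)  # one block of 16 values
print(quantize_nvfp4(w))                  # 0.1 → 0.0, 0.74 → 0.75, etc.
```

The tiny per-block granularity is what keeps 4-bit training accurate enough to approach FP8/FP16 quality, while the narrow element format preserves hardware acceleration.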
Reviewer: hallucinates a baseline that doesn't exist.
Meta-reviewer: cites the hallucination as the paper's fatal flaw.
Decision: reject.
@iclr_conf is amazing
Our FP4 QAT research (Quartet, NeurIPS'25) suggests that the quality gap shrinks with larger models and longer runs, while speedups compound, making low-bit training a strong alternative on modern GPUs:
arxiv.org/abs/2505.14669
Releasing QuTLASS v0.2: fast, end-to-end quantization-aware training (QAT) with kernel support and applications!
1. Nanochat-QAT: a fully-quantized extension of @karpathy 's nanochat
2. General QAT recipe with MXFP4 forward/MXFP8 backward GEMMs
3. Transformers/vLLM integrations
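The core trick behind QAT recipes like this is the straight-through estimator (STE): the forward pass sees quantized weights, while the backward pass treats the rounding as identity. A minimal numpy sketch, using a uniform quantizer as a stand-in for the MXFP4 grid (all names and hyperparameters here are illustrative, not QuTLASS's):

```python
import numpy as np

def fake_quant(w, levels=16):
    """Symmetric uniform quantizer standing in for the MXFP4 grid.
    Assumes w is not all-zero (true in this toy example)."""
    s = np.abs(w).max() / (levels // 2 - 1)
    return np.round(w / s) * s

rng = np.random.default_rng(0)
w = rng.normal(size=4)                        # full-precision master weights
x, y = rng.normal(size=(8, 4)), rng.normal(size=8)
loss = lambda v: np.mean((x @ fake_quant(v) - y) ** 2)
start = loss(w)
for _ in range(100):
    wq = fake_quant(w)                        # forward uses quantized weights
    grad = x.T @ (x @ wq - y) / len(y)        # STE: d(wq)/d(w) taken as identity
    w -= 0.1 * grad                           # update the master weights
end = loss(w)
```

A real recipe applies the same idea inside the GEMMs (MXFP4 forward, MXFP8 backward) with hardware-accelerated kernels; the sketch only shows why training still converges despite the rounding.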