Markus Nagel
@mnagel87
42 posts
Joined May 2020
109 Following · 298 Followers
Markus Nagel @mnagel87
I am excited to host and present at our #CVPR2025 tutorial on Power-efficient Neural Networks Using Low-precision Data Types and Quantization, together with @TiRune (Meta) and Thomas Pfeil (Recogni).
📅 Thursday Jun 12, 13:00
🏠 CVPR, room 205B
ℹ️ power-efficient-nn.github.io
0 replies · 0 reposts · 5 likes · 108 views
Markus Nagel retweeted
Andrii Skliar 🇺🇦 @avskliar
Proud to present our work on optimizing Mixture of Experts models for on-device generation speed: arxiv.org/pdf/2412.00099

We introduce cache-aware routing, which boosts the memory efficiency of commonly used MoEs and improves generation throughput by 2x, all without retraining. Perfect for real-world, memory-constrained devices.

This is joint work with the wonderful team here at Qualcomm: @tivaro @r_lepert @BabakEht Todor Boinovski @mnagel87 @martvanbaalen and Paul Whatmough
1 reply · 4 reposts · 7 likes · 351 views
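To make the cache-aware routing idea concrete, here is a minimal sketch, not the paper's algorithm: top-k expert selection is biased toward experts already resident in an on-device cache, trading a little routing fidelity for fewer expensive expert loads from slow memory. The function name and the `bonus` parameter are illustrative assumptions.

```python
import numpy as np

def cache_aware_topk(router_logits, cached, k=2, bonus=1.0):
    """Toy cache-aware routing: boost the score of experts that are
    already resident in the device cache so top-k selection prefers
    them, reducing expensive expert fetches from slow memory."""
    scores = router_logits.copy()
    scores[list(cached)] += bonus           # favour cache-resident experts
    return np.argsort(scores)[-k:][::-1]    # indices of the k best experts

# Usage: 8 experts, experts 1 and 5 already cached.
rng = np.random.default_rng(0)
print(cache_aware_topk(rng.normal(size=8), cached=[1, 5], k=2))
```

A larger `bonus` raises the cache hit rate at the cost of routing quality, which is the basic knob any scheme in this family has to tune.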
Markus Nagel @mnagel87
Interested in boosting quantized LLM performance with QAT? Check out our latest work on Low-Rank Quantization-Aware Training (LR-QAT), which can train 7B LLMs on a single consumer-grade GPU with just 24GB of memory. New work with @yell1337 and @delchia: arxiv.org/abs/2406.06385
[image]
0 replies · 8 reposts · 31 likes · 2K views
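A minimal sketch of the low-rank QAT idea as suggested by the abstract, with illustrative names and placement: the base weight is frozen in integer form, a trainable low-rank term is added inside the quantizer, and rounding is trained through a straight-through estimator. The paper's exact formulation and scaling may differ.

```python
import torch
import torch.nn as nn

class LowRankQATLinear(nn.Module):
    """Sketch of low-rank QAT: a frozen integer base weight plus a
    trainable low-rank correction A @ B applied inside the quantizer,
    trained with a straight-through estimator (STE). Hypothetical
    module; the paper's exact placement and scaling may differ."""
    def __init__(self, weight, rank=8, bits=4):
        super().__init__()
        self.qmin, self.qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
        self.scale = weight.abs().max() / self.qmax
        # Base weight stored once, frozen, already on the integer grid.
        self.register_buffer(
            "w_int",
            torch.clamp(torch.round(weight / self.scale), self.qmin, self.qmax),
        )
        out_f, in_f = weight.shape
        self.A = nn.Parameter(torch.zeros(out_f, rank))       # zero init: starts
        self.B = nn.Parameter(torch.randn(rank, in_f) * 0.01)  # at plain PTQ

    def forward(self, x):
        w = self.w_int + self.A @ self.B         # low-rank update, integer domain
        w = w + (torch.round(w) - w).detach()    # STE: round fwd, identity bwd
        w = torch.clamp(w, self.qmin, self.qmax) * self.scale
        return x @ w.t()
```

Only the small A and B matrices need gradients and optimizer state, which is where the single-GPU memory savings would come from.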
Markus Nagel retweeted
Tanishq Mathew Abraham, Ph.D. @iScienceLuvr
Sparse High Rank Adapters
abs: arxiv.org/abs/2406.13175

"In this paper, we propose Sparse High Rank Adapters (SHiRA), a new paradigm which incurs no inference overhead, enables rapid switching, and significantly reduces concept-loss. Specifically, SHiRA can be trained by directly tuning only 1-2% of the base model weights while leaving others unchanged."
[image]
4 replies · 26 reposts · 142 likes · 12.5K views
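The quoted idea, tuning a fixed 1-2% subset of base weights, can be sketched as a masked delta on a frozen weight. The random mask and helper names below are illustrative assumptions, not the paper's exact recipe (the paper also studies other mask choices).

```python
import torch
import torch.nn as nn

def make_shira_adapter(weight, fraction=0.02, seed=0):
    """Pick a fixed random mask covering ~`fraction` of the weight
    entries; only those positions will ever receive updates."""
    g = torch.Generator().manual_seed(seed)
    mask = (torch.rand(weight.shape, generator=g) < fraction).float()
    delta = nn.Parameter(torch.zeros_like(weight))  # trainable sparse update
    return mask, delta

def shira_forward(x, weight, mask, delta):
    # Gradients reach `delta` only through `mask * delta`, so unmasked
    # entries stay zero and the adapter is a genuinely sparse tensor
    # that can be stored and swapped cheaply.
    return x @ (weight + mask * delta).t()
```

Because the adapter is added to the base weights rather than bolted on as extra layers, it can be fused at deployment time, which is where the "no inference overhead" claim comes from.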
Markus Nagel retweeted
AK @_akhaliq
Qualcomm presents GPTVQ: The Blessing of Dimensionality for LLM Quantization

We show that the size versus accuracy trade-off of neural network quantization can be significantly improved by increasing the quantization dimensionality. We propose GPTVQ, a new fast method for post-training vector quantization (VQ) that scales well to Large Language Models (LLMs). Our method interleaves quantization of one or more columns with updates to the remaining unquantized weights, using information from the Hessian of the per-layer output reconstruction MSE. Quantization codebooks are initialized using an efficient data-aware version of the EM algorithm. The codebooks are then updated, and further compressed by using integer quantization and SVD-based compression.

GPTVQ establishes a new state of the art in the size vs accuracy trade-off on a wide range of LLMs such as Llama-v2 and Mistral. Furthermore, our method is efficient: on a single H100 it takes between 3 and 11 hours to process a Llama-v2-70B model, depending on the quantization setting. Lastly, with on-device timings for VQ decompression on a mobile CPU we show that VQ leads to improved latency compared to using a 4-bit integer format.
[image]
5 replies · 62 reposts · 376 likes · 55.5K views
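As background for the abstract above, here is a bare-bones sketch of post-training vector quantization itself: plain k-means over weight sub-vectors. GPTVQ's actual contributions (Hessian-aware interleaved updates, EM initialization, codebook compression) sit on top of this and are not shown.

```python
import numpy as np

def vq_compress(W, dim=2, k=256, iters=10, seed=0):
    """Bare-bones vector quantization: split W into `dim`-dimensional
    sub-vectors and fit a k-entry codebook with plain k-means.
    Assumes W.size is divisible by `dim`; not memory-optimized."""
    vecs = W.reshape(-1, dim)
    rng = np.random.default_rng(seed)
    codebook = vecs[rng.choice(len(vecs), size=k, replace=False)]
    for _ in range(iters):
        # Assign every sub-vector to its nearest codebook entry.
        dists = ((vecs[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(1)
        # Move each used centroid to the mean of its assigned vectors.
        for c in np.unique(idx):
            codebook[c] = vecs[idx == c].mean(0)
    return idx.astype(np.uint16), codebook

def vq_decompress(idx, codebook, shape):
    return codebook[idx].reshape(shape)
```

Stored this way, each group of `dim` weights costs log2(k)/dim index bits (4 bits per weight for dim=2, k=256) plus the shared codebook, which is how higher quantization dimensionality buys a better size/accuracy trade-off.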
Markus Nagel retweeted
The TWIML AI Podcast @twimlai
The approach of removing outliers, as emphasized in Markus’ Quantizable Transformers paper, is not a quantization method but rather an approach to address and eliminate the root cause of activation quantization issues. Catch @mnagel87’s episode at buff.ly/3TL8Vrs.
0 replies · 2 reposts · 9 likes · 590 views
Markus Nagel @mnagel87
If you are around and want to connect, please send me a PM. You can also find me at our poster sessions or the Qualcomm booth.

Poster sessions:
- Pruning vs Quantization: Tuesday 17:15-19:15
- Quantizable Transformers: Thursday 17:00-19:00
0 replies · 0 reposts · 0 likes · 151 views
Markus Nagel @mnagel87
This week I’m at #NeurIPS2023 to present our recent model efficiency research:
1) Pruning vs Quantization: Which is Better?, w/ A Kuzmin, @martvanbaalen, A Behboodi, @TiRune
2) Quantizable Transformers: Removing Outliers by Helping Attention Heads do Nothing, w/ @yell1337, @TiRune
1 reply · 1 repost · 7 likes · 517 views
Markus Nagel @mnagel87
ResQ: Residual Quantization for Video Perception
Davide Abati, Haitam Ben Yahia, Markus Nagel, Amirhossein Habibian
arxiv.org/abs/2308.09511
Friday 6th @ 10:30 AM-12:30 PM (room nord, poster 102)
0 replies · 0 reposts · 2 likes · 218 views
Markus Nagel @mnagel87
This week I'm in Paris at #ICCV2023 to present some of our recent work on model efficiency and quantization. Please join me for our talks and posters or at the Qualcomm booth. (1/4, schedule follows)
2 replies · 1 repost · 10 likes · 1.2K views
Markus Nagel @mnagel87
@Tracing47202686 @yell1337 @TiRune Unlike with clipped softmax, to achieve an exact zero in the output of softmax1 for a (partial) no-update, the input must be -infinity. However, after @EvMill's blog post we experimented with softmax1 and found it competitive in practice with our proposed approaches.
0 replies · 1 repost · 12 likes · 4K views
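A small numeric check of the point above, using illustrative stretch parameters for the clipped softmax (zeta=1, gamma=-0.1 here; the paper tunes these): clipping a stretched softmax reaches an exact zero at finite inputs, while softmax1 only approaches zero as inputs go to -infinity.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def clipped_softmax(x, zeta=1.0, gamma=-0.1):
    # Stretch the softmax output below 0 (gamma < 0), then clip:
    # exact zeros become reachable at finite inputs.
    return np.clip((zeta - gamma) * softmax(x) + gamma, 0.0, 1.0)

def softmax1(x):
    e = np.exp(x)
    return e / (1.0 + e.sum())

x = np.array([4.0, 4.0, -4.0])
print(clipped_softmax(x))  # last entry is exactly 0.0
print(softmax1(x))         # last entry is tiny but strictly positive
```

The exact zero at finite inputs is what lets an attention head decline an update without pushing activations to extreme magnitudes.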
Markus Nagel @mnagel87
TL;DR: Transformers learn strong activation outliers, making them difficult to quantize. We study their root cause and relate outliers to a no-op and partial-update behavior. Our proposed clipped softmax and gated attention avoid outliers and make transformers easily quantizable.
0 replies · 1 repost · 6 likes · 342 views
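To illustrate the gated-attention half of the TL;DR, a minimal sketch with illustrative shapes and gate placement (the paper's module may differ): a learned sigmoid gate lets a head scale its output toward zero, a conditional no-op, without saturating the softmax, which is the mechanism that produces the outliers.

```python
import torch
import torch.nn as nn

class GatedAttentionHead(nn.Module):
    """Sketch of gated attention: a learned sigmoid gate lets a head
    scale its output toward zero (a conditional no-op) without forcing
    the softmax itself to saturate. Shapes and gate placement are
    illustrative, not the paper's exact module."""
    def __init__(self, d_model, d_head):
        super().__init__()
        self.q = nn.Linear(d_model, d_head)
        self.k = nn.Linear(d_model, d_head)
        self.v = nn.Linear(d_model, d_head)
        self.gate = nn.Linear(d_model, 1)    # per-token scalar gate

    def forward(self, x):                    # x: (batch, seq, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        att = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        return torch.sigmoid(self.gate(x)) * (att @ v)  # gate near 0 => no-op
```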
Markus Nagel @mnagel87
TL;DR: We compare pruning and quantization analytically and empirically at various compression levels: on weight distributions, per layer, and for full neural networks with fine-tuning. Our results show that in most cases quantization outperforms pruning.
0 replies · 1 repost · 12 likes · 395 views
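A toy version of the comparison, not the paper's protocol: quantize a Gaussian weight tensor to 4 bits versus magnitude-prune it to the same nominal 4x footprint, then compare reconstruction error. The absmax scaling and the 25% density are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=100_000).astype(np.float32)

# 4-bit uniform quantization with absmax scaling: 4x smaller than fp16.
scale = np.abs(w).max() / 7
w_quant = np.clip(np.round(w / scale), -8, 7) * scale

# Magnitude pruning to 25% density: the same nominal 4x footprint
# (ignoring the index overhead a real sparse format would add).
thresh = np.quantile(np.abs(w), 0.75)
w_prune = np.where(np.abs(w) > thresh, w, 0.0)

print("quantization MSE:", np.mean((w - w_quant) ** 2))
print("pruning MSE:     ", np.mean((w - w_prune) ** 2))
```

On this toy tensor quantization comes out well ahead, consistent with the tweet's conclusion, and accounting for sparse-index overhead would only make the comparison harsher for pruning.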