
As AI models and LLMs grow ever more powerful, efficient inference becomes harder and harder to achieve, and compression techniques like quantization become essential.
If you want to experiment with quantized models yourself, you can follow the code in the video and check out
huggingface.co/RedHatAI
to pick any model you'd like!
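As a rough sketch of what running one of these checkpoints looks like, here is how vLLM's offline API can load and query a quantized model. The model ID below is an assumption for illustration; substitute any quantized model from the RedHatAI page:

```python
# Minimal sketch: loading a quantized checkpoint with vLLM's offline API.
# The model ID is a placeholder -- pick any quantized model from
# huggingface.co/RedHatAI and substitute it here.
from vllm import LLM, SamplingParams

llm = LLM(model="RedHatAI/gemma-3-27b-it-quantized.w4a16")  # hypothetical ID
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain weight quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```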
Red Hat AI (@RedHat_AI):
What compression looks like on @vllm_project. Same Gemma 3 27B. Red Hat AI's quantized version runs at nearly 2x tokens/sec, half the memory, 99%+ accuracy retained. Open source. Quantized with LLM Compressor. Links in comments. 🙏 @_soyr_ for the 2-minute demo.
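For reference, producing a checkpoint like this with LLM Compressor follows a recipe-based one-shot flow. A minimal sketch is below; the base model ID and the FP8 dynamic scheme are assumptions, not the exact recipe used in the demo:

```python
# Minimal sketch of an LLM Compressor one-shot quantization run.
# Model ID and quantization scheme are assumptions; the demo's exact
# recipe may differ.
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model_id = "google/gemma-3-27b-it"  # assumed base model
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Quantize all Linear layers to FP8 with dynamic activation scales,
# keeping the lm_head in full precision.
recipe = QuantizationModifier(targets="Linear", scheme="FP8_DYNAMIC", ignore=["lm_head"])
oneshot(model=model, recipe=recipe)

# Save a checkpoint that vLLM can load directly.
model.save_pretrained("gemma-3-27b-it-FP8-dynamic")
tokenizer.save_pretrained("gemma-3-27b-it-FP8-dynamic")
```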

