Lila Rest
@LilaRest

npm install life 🌷

37 posts · Joined June 2021
191 Following · 2.5K Followers

Pinned Tweet
Lila Rest @LilaRest
Introducing 𝐆𝐞𝐦𝐦𝐚 𝟒 𝟑𝟏𝐁 𝐓𝐮𝐫𝐛𝐨 ⚡️

It runs on a 𝘴𝘪𝘯𝘨𝘭𝘦 RTX 5090 at 51 tok/s (single) and 1244 tok/s (batched), and prefills at up to 15359 tok/s.

It's 𝟔𝟖% 𝐬𝐦𝐚𝐥𝐥𝐞𝐫 in GPU memory and ~𝟐.𝟓𝐱 𝐟𝐚𝐬𝐭𝐞𝐫 than the base model, and retains nearly 𝐢𝐝𝐞𝐧𝐭𝐢𝐜𝐚𝐥 𝐪𝐮𝐚𝐥𝐢𝐭𝐲 on benchmarks (1-3% loss).

Turbo is a derivative of the NVFP4 quant that NVIDIA released a few days ago. It fully leverages NVIDIA Blackwell FP4 tensor cores for ~𝟐× 𝐡𝐢𝐠𝐡𝐞𝐫 𝐜𝐨𝐧𝐜𝐮𝐫𝐫𝐞𝐧𝐭 𝐭𝐡𝐫𝐨𝐮𝐠𝐡𝐩𝐮𝐭 𝐭𝐡𝐚𝐧 𝐨𝐭𝐡𝐞𝐫 𝐪𝐮𝐚𝐧𝐭𝐬.

I'm using it for hard classification tasks: on internal benchmarks it showed 𝐒𝐨𝐧𝐧𝐞𝐭-𝟒.𝟓-𝐥𝐞𝐯𝐞𝐥 𝐢𝐧𝐭𝐞𝐥𝐥𝐢𝐠𝐞𝐧𝐜𝐞 (scored well above Haiku 4.5) at 1/600𝘵𝘩 of the cost. A single RTX 5090 scales up to 18 req/s at 1000in/20out 🥵.

Model card and benchmark in comments 👇 I'd love to hear your use cases.
10 replies · 7 reposts · 69 likes · 11.3K views
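For anyone who wants to reproduce the decode and prefill numbers above, here's a minimal serving sketch with vLLM. The repo id is a placeholder (the real link is truncated later in the thread), and the quantization backend name is an assumption about how vLLM loads ModelOpt NVFP4 checkpoints; check the model card for the exact invocation.

```python
# Minimal sketch: serving an NVFP4 checkpoint with vLLM on a Blackwell GPU.
# "LilaRest/gemma-4-31b-turbo" is a placeholder repo id, and
# quantization="modelopt_fp4" is an assumption about the backend this
# checkpoint uses (vLLM can often auto-detect it from the model config).
from vllm import LLM, SamplingParams

llm = LLM(
    model="LilaRest/gemma-4-31b-turbo",  # hypothetical repo id
    quantization="modelopt_fp4",         # assumed ModelOpt NVFP4 backend
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=20)  # 1000in/20out-style workload
outputs = llm.generate(["Classify the following ticket: ..."], params)
print(outputs[0].outputs[0].text)
```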
Lila Rest @LilaRest
@jaymos @ClementDelangue @sundarpichai @abidlabs @huggingface Thanks! The primary use case for Turbo was maximizing text throughput (most use cases mainly need text). There's an open issue about it on the model's page, give it an upvote! If enough people show interest I'll ship a multimodal variant.
0 replies · 0 reposts · 0 likes · 10 views
Sundar Pichai @sundarpichai
Lots of love for Gemma 4! Team just told me it’s already had 10M+ downloads since last week’s launch. Gemma models have now been downloaded 500M+ times! Excited to see what you all are creating 👀
213 replies · 293 reposts · 5.8K likes · 334.6K views
Lila Rest @LilaRest
@_maxime_db @huggingface @outsource_ Hi Maxime, no script, but the changes I applied are documented in the "Approach" section of the Hugging Face model card. It's quite straightforward, but shoot me a DM if you need help with it!
0 replies · 0 reposts · 1 like · 58 views
Lila Rest @LilaRest
I'd say both: different archs, different use cases. The base models are raw, and most of the time it's just inefficient to use them directly. I shipped Gemma 4 31B Turbo because I needed something that runs extremely fast on the Blackwell architecture, especially the RTX 5090. huggingface.co/LilaRest/gemma…
0 replies · 0 reposts · 0 likes · 32 views
Lila Rest @LilaRest
NVIDIA's NVFP4 quant only quantized the MLP layers and left attention in BF16, so it's still 31GB. I quantized the attention layers too and stripped the unused vision/audio encoders. Gets it down to 18.5GB with 1-3% quality loss. Same kernel path, same FP4 tensor core support, just smaller and faster :)
1 reply · 0 reposts · 1 like · 14 views
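A rough sketch of that recipe, assuming NVIDIA's TensorRT Model Optimizer (nvidia-modelopt) post-training quantization API. The base-model id and the calibration loop are placeholders, and the exact config change that extends NVFP4 coverage to the attention projections is documented in the model card rather than reproduced here.

```python
# Hedged sketch: post-training NVFP4 quantization with nvidia-modelopt.
# Placeholder model id and toy calibration loop; the specific config tweak
# that also quantizes the attention projections lives in the model card.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-31b-it"  # hypothetical base-model id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Run a small, representative calibration set through the model so the
    # quantizers can collect activation statistics.
    batch = tok(["example calibration text ..."], return_tensors="pt")
    m(**batch)

# NVFP4_DEFAULT_CFG targets the linear layers; the Turbo recipe additionally
# covers the attention projections that NVIDIA's checkpoint left in BF16.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
mtq.print_quant_summary(model)  # inspect which modules ended up quantized
```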
Benjamin Marie @bnjmn_marie
@LilaRest Interesting! What's the difference with a normal NVFP4? I'm not sure I get it
1 reply · 0 reposts · 1 like · 23 views
Benjamin Marie @bnjmn_marie
Running now:
- Evaluation of quantized Gemma 4 models
- Evaluation of turboquant with GGUF models
- My own quantization of Gemma 4

Renting:
- 1 B200 (RunPod)
- 1 RTX 5090 (RunPod)
- 1 GH200 (Lambda through PrimeIntellect)

Total cost: $174/day

Hesitating to be unreasonable and add one more B200 to speed up everything 😅
7 replies · 1 repost · 85 likes · 6K views
Hugging Models @HuggingModels
Meet Gemma-4-31B-IT-NVFP4. This isn't just another large language model. It's a highly optimized, quantized version of the Gemma-4-31B instruction-tuned model, designed for efficient text generation. The community is buzzing because it delivers top-tier performance in a more accessible package.
3 replies · 5 reposts · 54 likes · 4.9K views
Lila Rest @LilaRest
GGUF wouldn't really make sense here, the whole point of Turbo is the modelopt/CUTLASS kernel path that hits Blackwell FP4 tensor cores. GGUF can't use that, so you'd lose the speed advantage. For LM Studio you're better off with one of the existing GGUF quants of Gemma 4 on HF :)
0 replies · 0 reposts · 0 likes · 282 views
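One way to sanity-check the hardware side of this: the Turbo speedup only exists on GPUs with FP4 tensor cores, which you can verify from the reported compute capability. A small sketch follows; the capability numbers are the ones Blackwell parts report, with the RTX 5090 at 12.0.

```python
# Check whether the GPU can even hit the Blackwell FP4 tensor-core path.
# RTX 5090 reports compute capability 12.0 (sm_120); data-center Blackwell
# reports 10.x; anything below 10 has no FP4 tensor cores.
import torch

major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()
print(f"{name}: compute capability {major}.{minor}")
if major < 10:
    print("No FP4 tensor cores here; the CUTLASS FP4 kernels won't apply.")
```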
Stranger @StrangerFTruth
@LilaRest @huggingface @outsource_ Hi, thank you for uploading the model to Hugging Face! I was wondering if you're planning to release a GGUF version (or any quantized format) that would be compatible with LM Studio for easy local running? It would be really convenient for many users. Thanks in advance!
1 reply · 0 reposts · 1 like · 304 views
Lila Rest @LilaRest
Quantized the attention layers that NVIDIA left in BF16, stripped the vision/audio encoders, and kept everything on the modelopt/CUTLASS kernel path so it actually hits Blackwell's FP4 tensor cores. That's the key difference vs other quants at the same size. Full details in the model card.
0 replies · 0 reposts · 2 likes · 272 views
Can Vardar @icanvardar
i need a cofounder
134 replies · 4 reposts · 193 likes · 15.2K views
Eric ⚡️ Building...
🚀 NEW GEMMA 4 31B TURBO DROPPED

Runs on a SINGLE RTX 5090:
⚡️ 18.5 GB VRAM only (68% smaller)
🧠 51 tok/s single decode
💻 1,244 tok/s batched
🤖 15,359 tok/s prefill ← yes, fifteen thousand
🚨 2.5× faster than base model with basically zero quality loss.

It hits Sonnet-4.5 level on hard classification tasks… at 1/600th the cost.

Local models are shipping faster than we can test 👇🏻

🔥 HF: huggingface.co/LilaRest/gemma…
82 replies · 176 reposts · 2.3K likes · 164.3K views
TommyGPT @jailbreakersAI
@LilaRest @outsource_ 2.5x!? 😱 If accurate, then jesus christ. Similar (or perhaps even bigger) gains on latency / first token? That would make the MoE tradeoff equation redundant... (first token on MoE is usually a step change faster, which keeps it relevant for latency-sensitive use cases)
1 reply · 0 reposts · 1 like · 22 views
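To put a number on the first-token question: at the prefill rate quoted upthread, time-to-first-token is roughly prompt length divided by prefill throughput. A back-of-envelope sketch, ignoring tokenization, scheduling, and sampling overhead:

```python
# Back-of-envelope TTFT implied by the prefill rate quoted in the thread.
prefill_tok_s = 15_359  # prefill throughput from the announcement
prompt_tokens = 1_000   # the 1000in/20out workload mentioned upthread
ttft_ms = prompt_tokens / prefill_tok_s * 1000
print(f"~{ttft_ms:.0f} ms to first token")  # ~65 ms, before any overhead
```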
TommyGPT @jailbreakersAI
@outsource_ I already run Gemma 4 31B on a single 5090, it is fantastic - what is the obvious dumb thing I'm missing, please?
2 replies · 0 reposts · 2 likes · 647 views
Lila Rest @LilaRest
@AVaclav73853 @outsource_ Can probably push that a bit further on RTX 5090, but if you need long context, yeah you'd need a PRO 6000 or higher.
1 reply · 0 reposts · 1 like · 35 views
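For context on why long context pushes past a 5090: the KV cache grows linearly with context length on top of the ~18.5 GB of weights. A back-of-envelope sketch follows; the layer/head/dim numbers are illustrative assumptions, not Gemma 4's published config.

```python
# Rough KV-cache sizing for long context on a 32 GB RTX 5090.
# Architecture numbers are illustrative assumptions, not Gemma 4's config.
layers, kv_heads, head_dim = 48, 8, 128  # hypothetical architecture
bytes_per_elem = 2                       # BF16 KV cache
ctx_tokens = 32_000

# 2x accounts for storing both the K and V tensors per layer.
kv_gib = 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_tokens / 2**30
print(f"~{kv_gib:.1f} GiB of KV cache at {ctx_tokens:,} tokens")  # ~5.9 GiB
# Weights (~18.5 GB) + KV cache + activations fill a 32 GB card quickly,
# hence the RTX PRO 6000 suggestion for long-context workloads.
```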