Lila Rest
@LilaRest

npm install life 🌷

37 posts · Joined June 2021
191 Following · 2.5K Followers

Pinned Tweet
Lila Rest @LilaRest
Introducing 𝐆𝐞𝐦𝐦𝐚 𝟒 𝟑𝟏𝐁 𝐓𝐮𝐫𝐛𝐨 ⚡️

It runs on a 𝘴𝘪𝘯𝘨𝘭𝘦 RTX 5090 at 51 tok/s (single) and 1244 tok/s (batched), and prefills at up to 15359 tok/s.

It's 𝟔𝟖% 𝐬𝐦𝐚𝐥𝐥𝐞𝐫 in GPU memory and ~𝟐.𝟓𝐱 𝐟𝐚𝐬𝐭𝐞𝐫 than the base model, and retains nearly 𝐢𝐝𝐞𝐧𝐭𝐢𝐜𝐚𝐥 𝐪𝐮𝐚𝐥𝐢𝐭𝐲 on benchmarks (1-3% loss).

Turbo is a derivative of the NVFP4 quant that NVIDIA released a few days ago. It fully leverages NVIDIA Blackwell FP4 tensor cores for ~𝟐× 𝐡𝐢𝐠𝐡𝐞𝐫 𝐜𝐨𝐧𝐜𝐮𝐫𝐫𝐞𝐧𝐭 𝐭𝐡𝐫𝐨𝐮𝐠𝐡𝐩𝐮𝐭 𝐭𝐡𝐚𝐧 𝐨𝐭𝐡𝐞𝐫 𝐪𝐮𝐚𝐧𝐭𝐬.

I'm using it for hard classification tasks: on internal benchmarks it showed 𝐒𝐨𝐧𝐧𝐞𝐭-𝟒.𝟓-𝐥𝐞𝐯𝐞𝐥 𝐢𝐧𝐭𝐞𝐥𝐥𝐢𝐠𝐞𝐧𝐜𝐞 (scored well above Haiku 4.5) at 1/600𝘵𝘩 of the cost. A single RTX 5090 scales up to 18 req/s at 1000in/20out 🥵.

Model card and benchmark in comments 👇 I'd love to hear your use cases.
10 replies · 7 reposts · 69 likes · 11.3K views
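For anyone who wants to reproduce the decode and prefill numbers above, here's a minimal serving sketch with vLLM. The repo id is a placeholder (the real link is truncated later in the thread), and the quantization backend name is an assumption about how vLLM loads ModelOpt NVFP4 checkpoints; check the model card for the exact invocation.

```python
# Minimal sketch: serving an NVFP4 checkpoint with vLLM on a Blackwell GPU.
# "LilaRest/gemma-4-31b-turbo" is a placeholder repo id, and
# quantization="modelopt_fp4" is an assumption about the backend this
# checkpoint uses (vLLM can often auto-detect it from the model config).
from vllm import LLM, SamplingParams

llm = LLM(
    model="LilaRest/gemma-4-31b-turbo",  # hypothetical repo id
    quantization="modelopt_fp4",         # assumed ModelOpt NVFP4 backend
    max_model_len=8192,
)

params = SamplingParams(temperature=0.0, max_tokens=20)  # 1000in/20out-style workload
outputs = llm.generate(["Classify the following ticket: ..."], params)
print(outputs[0].outputs[0].text)
```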
Lila Rest @LilaRest
@jaymos @ClementDelangue @sundarpichai @abidlabs @huggingface Thanks! The primary use case for Turbo was maximizing text throughput (most use cases mainly need text). There's an open issue about it on the model's page, give it an upvote! If enough people show interest I'll ship a multimodal variant.
0 replies · 0 reposts · 0 likes · 10 views
Sundar Pichai @sundarpichai
Lots of love for Gemma 4! Team just told me it’s already had 10M+ downloads since last week’s launch. Gemma models have now been downloaded 500M+ times! Excited to see what you all are creating 👀
213 replies · 293 reposts · 5.8K likes · 334.6K views
Lila Rest @LilaRest
@_maxime_db @huggingface @outsource_ Hi Maxime, no script, but the changes I applied are documented in the "Approach" section of the Hugging Face model card. It's quite straightforward, but shoot me a DM if you need help with it!
0 replies · 0 reposts · 1 like · 58 views
Lila Rest @LilaRest
I'd say both: different archs, different use cases. The base models are raw, and most of the time it's just inefficient to use them directly. I shipped Gemma 4 31B Turbo because I needed something that runs extremely fast on the Blackwell architecture, especially the RTX 5090. huggingface.co/LilaRest/gemma…
0 replies · 0 reposts · 0 likes · 32 views
Lila Rest @LilaRest
NVIDIA's NVFP4 quant only quantized the MLP layers and left attention in BF16, so it's still 31GB. I quantized the attention layers too and stripped the unused vision/audio encoders. Gets it down to 18.5GB with 1-3% quality loss. Same kernel path, same FP4 tensor core support, just smaller and faster :)
1 reply · 0 reposts · 1 like · 14 views
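A rough sketch of that recipe, assuming NVIDIA's TensorRT Model Optimizer (nvidia-modelopt) post-training quantization API. The base-model id and the calibration loop are placeholders, and the exact config change that extends NVFP4 coverage to the attention projections is documented in the model card rather than reproduced here.

```python
# Hedged sketch: post-training NVFP4 quantization with nvidia-modelopt.
# Placeholder model id and toy calibration loop; the specific config tweak
# that also quantizes the attention projections lives in the model card.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "google/gemma-4-31b-it"  # hypothetical base-model id
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(model_id)

def calibrate(m):
    # Run a small, representative calibration set through the model so the
    # quantizers can collect activation statistics.
    batch = tok(["example calibration text ..."], return_tensors="pt")
    m(**batch)

# NVFP4_DEFAULT_CFG targets the linear layers; the Turbo recipe additionally
# covers the attention projections that NVIDIA's checkpoint left in BF16.
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop=calibrate)
mtq.print_quant_summary(model)  # inspect which modules ended up quantized
```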
Benjamin Marie @bnjmn_marie
@LilaRest Interesting! What's the difference with a normal NVFP4? I'm not sure I get it
1 reply · 0 reposts · 1 like · 23 views
Benjamin Marie @bnjmn_marie
Running now:
- Evaluation of quantized Gemma 4 models
- Evaluation of turboquant with GGUF models
- My own quantization of Gemma 4

Renting:
- 1 B200 (RunPod)
- 1 RTX 5090 (RunPod)
- 1 GH200 (Lambda through PrimeIntellect)

Total cost: $174/day

Hesitating to be unreasonable and add one more B200 to speed up everything 😅
7 replies · 1 repost · 85 likes · 6K views
Hugging Models @HuggingModels
Meet Gemma-4-31B-IT-NVFP4. This isn't just another large language model. It's a highly optimized, quantized version of the Gemma-4-31B instruction-tuned model, designed for efficient text generation. The community is buzzing because it delivers top-tier performance in a more accessible package.
3 replies · 5 reposts · 54 likes · 4.9K views
Lila Rest @LilaRest
GGUF wouldn't really make sense here, the whole point of Turbo is the modelopt/CUTLASS kernel path that hits Blackwell FP4 tensor cores. GGUF can't use that, so you'd lose the speed advantage. For LM Studio you're better off with one of the existing GGUF quants of Gemma 4 on HF :)
0 replies · 0 reposts · 0 likes · 282 views
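One way to sanity-check the hardware side of this: the Turbo speedup only exists on GPUs with FP4 tensor cores, which you can verify from the reported compute capability. A small sketch follows; the capability numbers are the ones Blackwell parts report, with the RTX 5090 at 12.0.

```python
# Check whether the GPU can even hit the Blackwell FP4 tensor-core path.
# RTX 5090 reports compute capability 12.0 (sm_120); data-center Blackwell
# reports 10.x; anything below 10 has no FP4 tensor cores.
import torch

major, minor = torch.cuda.get_device_capability()
name = torch.cuda.get_device_name()
print(f"{name}: compute capability {major}.{minor}")
if major < 10:
    print("No FP4 tensor cores here; the CUTLASS FP4 kernels won't apply.")
```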
Stranger @StrangerFTruth
@LilaRest @huggingface @outsource_ Hi, thank you for uploading the model to Hugging Face! I was wondering if you're planning to release a GGUF version (or any quantized format) that would be compatible with LM Studio for easy local running? It would be really convenient for many users. Thanks in advance!
1 reply · 0 reposts · 1 like · 304 views
Lila Rest @LilaRest
Quantized the attention layers that NVIDIA left in BF16, stripped the vision/audio encoders, and kept everything on the modelopt/CUTLASS kernel path so it actually hits Blackwell's FP4 tensor cores. That's the key difference vs other quants at the same size. Full details in the model card.
0 replies · 0 reposts · 2 likes · 272 views
Can Vardar @icanvardar
i need a cofounder
134 replies · 4 reposts · 193 likes · 15.2K views
Eric ⚡️ Building...
🚀 NEW GEMMA 4 31B TURBO DROPPED

Runs on a SINGLE RTX 5090:
⚡️ 18.5 GB VRAM only (68% smaller)
🧠 51 tok/s single decode
💻 1,244 tok/s batched
🤖 15,359 tok/s prefill ← yes, fifteen thousand
🚨 2.5× faster than base model with basically zero quality loss.

It hits Sonnet-4.5 level on hard classification tasks… at 1/600th the cost.

Local models are shipping faster than we can test 👇🏻

🔥 HF: huggingface.co/LilaRest/gemma…
82 replies · 176 reposts · 2.3K likes · 164.3K views
TommyGPT @jailbreakersAI
@LilaRest @outsource_ 2.5x!? 😱 If accurate, then jesus christ. Similar (or perhaps even bigger) gains on latency / first token? That would make the MoE tradeoff equation redundant... (first token on MoE is usually a step change faster, which keeps it relevant for latency-sensitive use cases)
1 reply · 0 reposts · 1 like · 22 views
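To put a number on the first-token question: at the prefill rate quoted upthread, time-to-first-token is roughly prompt length divided by prefill throughput. A back-of-envelope sketch, ignoring tokenization, scheduling, and sampling overhead:

```python
# Back-of-envelope TTFT implied by the prefill rate quoted in the thread.
prefill_tok_s = 15_359  # prefill throughput from the announcement
prompt_tokens = 1_000   # the 1000in/20out workload mentioned upthread
ttft_ms = prompt_tokens / prefill_tok_s * 1000
print(f"~{ttft_ms:.0f} ms to first token")  # ~65 ms, before any overhead
```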
TommyGPT @jailbreakersAI
@outsource_ I already run Gemma 4 31B on a single 5090, it is fantastic - what is the obvious dumb thing I'm missing, please?
2 replies · 0 reposts · 2 likes · 647 views
Lila Rest @LilaRest
@AVaclav73853 @outsource_ Can probably push that a bit further on RTX 5090, but if you need long context, yeah you'd need a PRO 6000 or higher.
1 reply · 0 reposts · 1 like · 35 views
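For context on why long context pushes past a 5090: the KV cache grows linearly with context length on top of the ~18.5 GB of weights. A back-of-envelope sketch follows; the layer/head/dim numbers are illustrative assumptions, not Gemma 4's published config.

```python
# Rough KV-cache sizing for long context on a 32 GB RTX 5090.
# Architecture numbers are illustrative assumptions, not Gemma 4's config.
layers, kv_heads, head_dim = 48, 8, 128  # hypothetical architecture
bytes_per_elem = 2                       # BF16 KV cache
ctx_tokens = 32_000

# 2x accounts for storing both the K and V tensors per layer.
kv_gib = 2 * layers * kv_heads * head_dim * bytes_per_elem * ctx_tokens / 2**30
print(f"~{kv_gib:.1f} GiB of KV cache at {ctx_tokens:,} tokens")  # ~5.9 GiB
# Weights (~18.5 GB) + KV cache + activations fill a 32 GB card quickly,
# hence the RTX PRO 6000 suggestion for long-context workloads.
```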