mrciffa @davideciffa

146 posts

Working on inference and computers. Engineer, researcher & founder.

Milano, Lombardia · Joined March 2021
185 Following · 526 Followers
mrciffa retweeted
Sandro @pupposandro
TQ3_0 (TurboQuant) KV cache just landed in Lucebox Hub. 22% less VRAM than Q4_0, same decode speed. 262K context on a single RTX 3090 with 1024 MiB to spare. Qwen3.5-27B, Q4_K_M target, DFlash speculative decode. TurboQuant 3.5 bpv with FWHT rotation, CUDA kernels end-to-end, flash-attention plugged in for both K and V. Prefill pays ~12% for the rotation, decode pays nothing. Huge thanks to @dusterbloom for providing this to the community. Repo as usual in the first comment ⬇️
14 replies · 10 reposts · 111 likes · 4.8K views
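For intuition on why a Walsh-Hadamard rotation helps low-bit KV quantization: rotating before quantizing spreads per-channel outliers across the whole block before the scale is chosen. A minimal NumPy sketch of that idea, not the actual TQ3_0 kernels; `quantize_blocks` is a hypothetical stand-in for Q4_0-style per-block quantization:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a
    power of two). With the 1/sqrt(n) scale it is orthogonal and self-inverse."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_blocks(x, bits=4, block=32):
    """Symmetric per-block round-trip quantization (Q4_0-style stand-in)."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0
    return (np.round(xb / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
k = rng.normal(size=(64, 128)).astype(np.float32)
k[:, 5] *= 25.0  # one outlier channel, typical of K projections

direct = quantize_blocks(k)
rotated = fwht(quantize_blocks(fwht(k)))  # quantize in the rotated basis

print("rmse direct :", np.sqrt(np.mean((k - direct) ** 2)))
print("rmse rotated:", np.sqrt(np.mean((k - rotated) ** 2)))
```

The rotated path reconstructs with noticeably lower error because the outlier channel no longer inflates one block's quantization scale, which is the same reason rotation costs a little prefill compute but nothing at decode once fused into the kernels.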
mrciffa retweeted
Ivan Fioravanti ᯅ @ivanfioravanti
My 3090 TI is becoming relevant now 😎
Sudo su @sudoingX
this guy just cracked 134 tok/s on qwen 3.5-27b dense and 73 on new qwen 3.6-27b on a single 3090. open source moves at godspeed in 2026. weights ship in the evening, dynamic ggufs land by midnight, fused kernel + speculative decoding stack runs the new model 12 hours after release. his dflash + ddtree stack loads qwen 3.6 as-is because the architecture string matches 3.5. zero retraining of the draft model, zero waiting for upstream support. the same hand tuned consumer hardware kernel work that pushed 3.5 to 134 tok/s already eats 3.6 at 73, with a regression he is openly flagging because the draft model needs a dedicated pass for 3.6. this is the lane almost nobody is working on. major labs are stuck shipping framework abstractions optimized for h100 fleets. @pupposandro is hand tuning kernels for the silicon actual builders own. 3090 has 24 gigs of vram, mature cuda support, and almost zero kernel level optimization coming out of the big shops. it is the most underrated research platform in consumer ai right now. i am running an honest q4_k_m baseline on llama.cpp now to set the dense floor without tricks. then sandro's stack runs on the same gpu, same model, same prompt. generic inference vs hand tuned kernels with speculative decoding. that delta is where the next 5 years of consumer ai live. receipts incoming.

1 reply · 4 reposts · 42 likes · 4.3K views
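A baseline in the spirit of the plan above can be timed with llama-cpp-python. A rough sketch, where the GGUF path and prompt are placeholders to swap for your own; note the reported tok/s includes prefill, so use a short prompt (or subtract prefill time) when setting a pure decode floor:

```python
import time
from llama_cpp import Llama

# Plain autoregressive Q4_K_M: no draft model, no custom kernels.
llm = Llama(model_path="qwen3.6-27b-q4_k_m.gguf",  # placeholder path
            n_gpu_layers=-1, n_ctx=4096, verbose=False)

t0 = time.perf_counter()
out = llm("Write a binary search in Python.", max_tokens=256, temperature=0.0)
dt = time.perf_counter() - t0

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {dt:.1f} s -> {n / dt:.1f} tok/s (prefill included)")
```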
mrciffa retweeted
Sandro @pupposandro
Qwen3.6-27B at 35 tok/s on a GB10 DGX. Almost 3× faster than vLLM+DFlash, 9× vs vLLM bf16. Luce DFlash is now available on Blackwell consumer GPUs. 5090 and GB10 owners, you've been asking. OpenAI-compatible tool calling works out of the box, so it drops straight into OpenCode, Hermes, Cline, whatever you run. Huge thanks to the incredible @superoo7 for shipping this to the community. Repo in the first comment.
31 replies · 19 reposts · 226 likes · 34.4K views
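Because the server speaks the OpenAI protocol, any standard client should work against it. A sketch using the openai Python package, where the base URL, model name, and tool definition are all placeholders rather than anything Luce DFlash ships:

```python
from openai import OpenAI

# Point the client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # illustrative tool, not part of Luce DFlash
        "description": "Read a file from the local workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # whatever name the server advertises
    messages=[{"role": "user", "content": "Open README.md and summarize it."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```

This is the same wire format agent frontends like OpenCode or Cline emit, which is why they can be pointed at the local endpoint unchanged.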
mrciffa @davideciffa
Huge thanks to @superoo7 for testing Luce DFlash on a GB10 DGX: Qwen3.6-27B at 35 tok/s, almost 3× vs vLLM+DFlash and 9× vs vLLM bf16. Plus OpenAI tool-calling, so you can drop it into OpenCode / Hermes / Cline, etc.
3 replies · 3 reposts · 13 likes · 1.3K views
mrciffa retweeted
Loktar 🇺🇸 @loktar00
This is insane!
Sudo su @sudoingX
[quoted post: the same @sudoingX thread quoted above]
1 reply · 1 repost · 29 likes · 2.5K views
mrciffa @davideciffa
Get 2× throughput on Qwen3.6-27B GGUFs on your local machine for free with Luce DFlash ✈️
0 replies · 1 repost · 7 likes · 512 views
mrciffa @davideciffa
@__tinygrad__ Why did you tell me I wasn't comparing apples to apples and then go and do the same lol. Btw our kernel can go up to 433 tok/s without the power limit 🏎️
1 reply · 0 reposts · 6 likes · 1.3K views
mrciffa @davideciffa
@__tinygrad__ I didn't mean to upset you ahah. Just saying that with cheaper hardware you can get better results
1 reply · 0 reposts · 2 likes · 254 views
the tiny corp @__tinygrad__
@davideciffa On a 3090, not an M3 Max! That's like saying I love Lance Armstrong, but with my Ferrari I can go 211 mph 🏎️
1 reply · 0 reposts · 121 likes · 4.1K views
mrciffa retweeted
Sandro @pupposandro
The new Qwen3.6-27B now runs on Luce DFlash. Up to 2× throughput on a single RTX 3090. Qwen3.6-27B ships the same Qwen35 architecture string and identical layer/head dims as 3.5, so the existing DFlash draft + DDTree stack loads it as-is. Throughput is lower than on 3.5. Looking forward to the updated version from the DFlash team so we can implement it as well! Repo in the first comment ⬇️
34 replies · 41 reposts · 472 likes · 118.4K views
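Why the draft carries over: in greedy draft-and-verify, any draft whose tokenizer and interface match can propose tokens, and the target's own argmax decides what is kept, so output quality is unchanged and only the acceptance rate (hence throughput) suffers until the draft gets its dedicated pass. A toy sketch of the chain-shaped version; the real stack verifies a DDTree of candidates and scores all drafted positions in one batched target pass:

```python
import numpy as np

def speculative_decode(target_logits, draft_logits, ids, k=4, n_new=32):
    """Greedy draft-and-verify. target_logits/draft_logits map a token list
    to next-token logits. Output matches plain greedy target decoding."""
    out = list(ids)
    while len(out) < len(ids) + n_new:
        # 1. Draft k tokens with the cheap model.
        ctx, drafted = list(out), []
        for _ in range(k):
            t = int(np.argmax(draft_logits(ctx)))
            drafted.append(t)
            ctx.append(t)
        # 2. Verify left to right with the target; the first disagreement
        #    invalidates the rest of the draft. (A real engine does this
        #    in one batched forward pass, which is where the speedup is.)
        for t in drafted:
            best = int(np.argmax(target_logits(out)))
            out.append(best)
            if best != t:
                break
    return out[:len(ids) + n_new]

# Toy bigram "models" over a 16-token vocab; the draft is a noisy copy.
rng = np.random.default_rng(0)
table = rng.normal(size=(16, 16))
target = lambda ctx: table[ctx[-1] % 16]
draft = lambda ctx: table[ctx[-1] % 16] + 0.1 * rng.normal(size=16)
print(speculative_decode(target, draft, [1, 2, 3]))
```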
mrciffa @davideciffa
@alexocheema @pupposandro Yes! Currently we support only CUDA machines; we're starting on an AMD implementation rn
1 reply · 0 reposts · 8 likes · 2.1K views
mrciffa @davideciffa
@_Suresh2 Prompt processing doesn’t get a dramatic speed-up compared to the decoding phase
0 replies · 0 reposts · 1 like · 1.1K views
Suresh @_Suresh2
@davideciffa 415 tok/s is wild, how bad is prompt processing with the megakernel?
1 reply · 0 reposts · 3 likes · 1.6K views
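Rough arithmetic behind that answer, with purely illustrative numbers: megakernels and speculative decoding mostly shrink the decode term, so for long generations the win is large, and prefill becomes the next bottleneck.

```python
# Illustrative numbers, not measurements: 8K-token prompt, 512 new tokens.
prefill_tok, prefill_tps, decode_tok = 8_000, 2_500, 512

for decode_tps in (100, 400):  # hypothetical baseline vs tuned decode speed
    total = prefill_tok / prefill_tps + decode_tok / decode_tps
    print(f"decode {decode_tps} tok/s -> {total:.1f} s end-to-end")
# 4x faster decode cuts ~8.3 s to ~4.5 s here; the 3.2 s prefill is untouched.
```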
mrciffa @davideciffa
@Leik0w0 We've made it, check our Lucebox Hub repo
1 reply · 0 reposts · 1 like · 1.1K views
Léo @Leik0w0
@davideciffa Did someone make this one already? I just replied to their post asking about megakernels.
1 reply · 0 reposts · 1 like · 1.4K views
mrciffa @davideciffa
@Danmoreng Great 🚀 how much did you get?
1 reply · 0 reposts · 0 likes · 1.3K views
mrciffa retweeted
Loktar 🇺🇸 @loktar00
Opus 4.7 at $25 per million output tokens for 87.6 SWE-bench Verified... Qwen3.6 in the same class on a single 24 GB card you already own. 😎
6 replies · 6 reposts · 120 likes · 7K views
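A back-of-envelope take on that comparison, where every local number is an assumption (board power while decoding, throughput, electricity price); only the $25/Mtok API price comes from the post:

```python
watts, tok_s, usd_per_kwh = 300, 134, 0.30   # all assumed, see note above
tok_per_kwh = tok_s * 3600 / (watts / 1000)  # tokens generated per kWh
local_usd_per_mtok = usd_per_kwh / tok_per_kwh * 1e6
print(f"local electricity: ${local_usd_per_mtok:.2f} per Mtok vs $25.00 API")
```

Under those assumptions local output tokens cost pennies per million; the real trade is hardware cost, benchmark parity, and your time.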
mrciffa retweeted
Qwen @Alibaba_Qwen
🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇

What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours

Smaller model. Bigger results. Community's favorite. ❤️ We can't wait to see what you build with Qwen3.6-27B! 👀

🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…
502 replies · 1.7K reposts · 12.3K likes · 3.5M views
mrciffa retweeted
Geek Lite @QingQ77
Consumer GPUs actually have plenty of hardware headroom; general-purpose frameworks just waste most of it on overhead. Lucebox unlocks that headroom with hand-written kernels, letting a 2020 RTX 3090 reach an energy efficiency close to Apple's latest silicon. github.com/Luce-Org/luceb…

Lucebox is a project for hand-tuning LLM inference on consumer GPUs, with two public results so far. The Megakernel targets Qwen3.5-0.8B, a hybrid DeltaNet/Attention model, and fuses compute previously spread across ~100 CUDA kernel launches into a single dispatch: on an RTX 3090 it reaches 37,800 tok/s prefill and 413 tok/s decode at 1.87 tok/J, on par with the Apple M5 Max. Capping power from 350W to 220W costs only 5% of the speed while improving efficiency by nearly half.

DFlash, meanwhile, is the first time the speculative-decoding GGUF route has run on a single card: Qwen3.5-27B quantized to Q4_K_M plus a BF16 draft hits 129.5 tok/s on HumanEval on an RTX 3090, 3.43× pure autoregressive decoding, with 128K context fitting in under 24 GB.

The hard part was the VRAM constraint: the target model, the draft model, and the DDTree verification tree's intermediate state all have to fit in 24 GB at once, which forced a rewrite of the GGUF loader on top of ggml plus three custom CUDA tree-operation kernels. All the code is MIT-licensed and well documented, and the benchmarks can be reproduced directly.
5 replies · 31 reposts · 175 likes · 16.5K views
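The efficiency figures in that post are internally consistent and easy to check: tok/J is just tok/s divided by watts, and the 350W speed can be backed out from the "only 5% slower" claim.

```python
# tok/J = (tok/s) / W. The 350 W speed is inferred as 413 / 0.95.
for watts, tok_s in ((350, 413 / 0.95), (220, 413)):
    print(f"{watts} W: {tok_s:.0f} tok/s -> {tok_s / watts:.2f} tok/J")
# 220 W gives ~1.88 tok/J, roughly 1.5x the 350 W figure, matching the post.
```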