mrciffa @davideciffa

146 posts

Working on inference and computers. Engineer, researcher & founder.

Milano, Lombardia · Joined March 2021
185 Following · 526 Followers
mrciffa retweeted
Sandro @pupposandro
TQ3_0 (TurboQuant) KV cache just landed in Lucebox Hub. 22% less VRAM than Q4_0, same decode speed. 262K context on a single RTX 3090 with 1024 MiB to spare. Qwen3.5-27B, Q4_K_M target, DFlash speculative decode. TurboQuant 3.5 bpv with FWHT rotation, CUDA kernels end-to-end, flash-attention plugged in for both K and V. Prefill pays ~12% for the rotation, decode pays nothing. Huge thanks to @dusterbloom for providing this to the community. Repo as usual in the first comment ⬇️
14 replies · 10 reposts · 111 likes · 4.8K views
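For intuition on why a Walsh-Hadamard rotation helps low-bit KV quantization: rotating before quantizing spreads per-channel outliers across the whole block before the scale is chosen. A minimal NumPy sketch of that idea, not the actual TQ3_0 kernels; `quantize_blocks` is a hypothetical stand-in for Q4_0-style per-block quantization:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform along the last axis (length must be a
    power of two). With the 1/sqrt(n) scale it is orthogonal and self-inverse."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x / np.sqrt(n)

def quantize_blocks(x, bits=4, block=32):
    """Symmetric per-block round-trip quantization (Q4_0-style stand-in)."""
    xb = x.reshape(-1, block)
    scale = np.abs(xb).max(axis=1, keepdims=True) / (2 ** (bits - 1) - 1)
    scale[scale == 0] = 1.0
    return (np.round(xb / scale) * scale).reshape(x.shape)

rng = np.random.default_rng(0)
k = rng.normal(size=(64, 128)).astype(np.float32)
k[:, 5] *= 25.0  # one outlier channel, typical of K projections

direct = quantize_blocks(k)
rotated = fwht(quantize_blocks(fwht(k)))  # quantize in the rotated basis

print("rmse direct :", np.sqrt(np.mean((k - direct) ** 2)))
print("rmse rotated:", np.sqrt(np.mean((k - rotated) ** 2)))
```

The rotated path reconstructs with noticeably lower error because the outlier channel no longer inflates one block's quantization scale, which is the same reason rotation costs a little prefill compute but nothing at decode once fused into the kernels.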
mrciffa retweeted
Ivan Fioravanti ᯅ @ivanfioravanti
My 3090 TI is becoming relevant now 😎
Sudo su @sudoingX
this guy just cracked 134 tok/s on qwen 3.5-27b dense and 73 on new qwen 3.6-27b on a single 3090. open source moves at godspeed in 2026. weights ship in the evening, dynamic ggufs land by midnight, fused kernel + speculative decoding stack runs the new model 12 hours after release. his dflash + ddtree stack loads qwen 3.6 as-is because the architecture string matches 3.5. zero retraining of the draft model, zero waiting for upstream support. the same hand tuned consumer hardware kernel work that pushed 3.5 to 134 tok/s already eats 3.6 at 73, with a regression he is openly flagging because the draft model needs a dedicated pass for 3.6. this is the lane almost nobody is working on. major labs are stuck shipping framework abstractions optimized for h100 fleets. @pupposandro is hand tuning kernels for the silicon actual builders own. 3090 has 24 gigs of vram, mature cuda support, and almost zero kernel level optimization coming out of the big shops. it is the most underrated research platform in consumer ai right now. i am running an honest q4_k_m baseline on llama.cpp now to set the dense floor without tricks. then sandro's stack runs on the same gpu, same model, same prompt. generic inference vs hand tuned kernels with speculative decoding. that delta is where the next 5 years of consumer ai live. receipts incoming.

1 reply · 4 reposts · 42 likes · 4.3K views
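A baseline in the spirit of the plan above can be timed with llama-cpp-python. A rough sketch, where the GGUF path and prompt are placeholders to swap for your own; note the reported tok/s includes prefill, so use a short prompt (or subtract prefill time) when setting a pure decode floor:

```python
import time
from llama_cpp import Llama

# Plain autoregressive Q4_K_M: no draft model, no custom kernels.
llm = Llama(model_path="qwen3.6-27b-q4_k_m.gguf",  # placeholder path
            n_gpu_layers=-1, n_ctx=4096, verbose=False)

t0 = time.perf_counter()
out = llm("Write a binary search in Python.", max_tokens=256, temperature=0.0)
dt = time.perf_counter() - t0

n = out["usage"]["completion_tokens"]
print(f"{n} tokens in {dt:.1f} s -> {n / dt:.1f} tok/s (prefill included)")
```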
mrciffa retweeted
Sandro @pupposandro
Qwen3.6-27B at 35 tok/s on a GB10 DGX. Almost 3× faster than vLLM+DFlash, 9× vs vLLM bf16. Luce DFlash is now available on Blackwell consumer GPUs. 5090 and GB10 owners, you've been asking. OpenAI-compatible tool calling works out of the box, so it drops straight into OpenCode, Hermes, Cline, whatever you run. Huge thanks to the incredible @superoo7 for shipping this to the community. Repo in the first comment.
31 replies · 19 reposts · 226 likes · 34.4K views
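Because the server speaks the OpenAI protocol, any standard client should work against it. A sketch using the openai Python package, where the base URL, model name, and tool definition are all placeholders rather than anything Luce DFlash ships:

```python
from openai import OpenAI

# Point the client at the local server instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "read_file",  # illustrative tool, not part of Luce DFlash
        "description": "Read a file from the local workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

resp = client.chat.completions.create(
    model="qwen3.6-27b",  # whatever name the server advertises
    messages=[{"role": "user", "content": "Open README.md and summarize it."}],
    tools=tools,
)
print(resp.choices[0].message.tool_calls)
```

This is the same wire format agent frontends like OpenCode or Cline emit, which is why they can be pointed at the local endpoint unchanged.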
mrciffa @davideciffa
Huge thanks to @superoo7 for testing Luce DFlash on a GB10 DGX: Qwen3.6-27B at 35 tok/s, almost 3× vs vLLM+DFlash and 9× vs vLLM bf16. Plus OpenAI tool-calling, so you can drop it into OpenCode / Hermes / Cline, etc.
3 replies · 3 reposts · 13 likes · 1.3K views
mrciffa retweeted
Loktar 🇺🇸 @loktar00
This is insane!
Sudo su @sudoingX
[quoted post: the same @sudoingX thread quoted above]
1 reply · 1 repost · 29 likes · 2.5K views
mrciffa @davideciffa
Get 2× throughput on Qwen3.6-27B GGUFs on your local machine for free with Luce DFlash ✈️
0 replies · 1 repost · 7 likes · 512 views
mrciffa @davideciffa
@__tinygrad__ Why did you tell me I wasn't comparing apples to apples and then go and do the same lol. Btw our kernel can go up to 433 tok/s without the power limit 🏎️
1 reply · 0 reposts · 6 likes · 1.3K views
mrciffa @davideciffa
@__tinygrad__ I didn't mean to upset you ahah. Just saying that with cheaper hardware you can get better results
1 reply · 0 reposts · 2 likes · 254 views
the tiny corp @__tinygrad__
@davideciffa On a 3090, not an M3 Max! That's like saying I love Lance Armstrong, but with my Ferrari I can go 211 mph 🏎️
1 reply · 0 reposts · 121 likes · 4.1K views
mrciffa retweeted
Sandro @pupposandro
The new Qwen3.6-27B now runs on Luce DFlash. Up to 2× throughput on a single RTX 3090. Qwen3.6-27B ships the same Qwen35 architecture string and identical layer/head dims as 3.5, so the existing DFlash draft + DDTree stack loads it as-is. Throughput is lower than on 3.5. Looking forward to the updated version from the DFlash team so we can implement it as well! Repo in the first comment ⬇️
34 replies · 41 reposts · 472 likes · 118.4K views
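Why the draft carries over: in greedy draft-and-verify, any draft whose tokenizer and interface match can propose tokens, and the target's own argmax decides what is kept, so output quality is unchanged and only the acceptance rate (hence throughput) suffers until the draft gets its dedicated pass. A toy sketch of the chain-shaped version; the real stack verifies a DDTree of candidates and scores all drafted positions in one batched target pass:

```python
import numpy as np

def speculative_decode(target_logits, draft_logits, ids, k=4, n_new=32):
    """Greedy draft-and-verify. target_logits/draft_logits map a token list
    to next-token logits. Output matches plain greedy target decoding."""
    out = list(ids)
    while len(out) < len(ids) + n_new:
        # 1. Draft k tokens with the cheap model.
        ctx, drafted = list(out), []
        for _ in range(k):
            t = int(np.argmax(draft_logits(ctx)))
            drafted.append(t)
            ctx.append(t)
        # 2. Verify left to right with the target; the first disagreement
        #    invalidates the rest of the draft. (A real engine does this
        #    in one batched forward pass, which is where the speedup is.)
        for t in drafted:
            best = int(np.argmax(target_logits(out)))
            out.append(best)
            if best != t:
                break
    return out[:len(ids) + n_new]

# Toy bigram "models" over a 16-token vocab; the draft is a noisy copy.
rng = np.random.default_rng(0)
table = rng.normal(size=(16, 16))
target = lambda ctx: table[ctx[-1] % 16]
draft = lambda ctx: table[ctx[-1] % 16] + 0.1 * rng.normal(size=16)
print(speculative_decode(target, draft, [1, 2, 3]))
```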
mrciffa @davideciffa
@alexocheema @pupposandro Yes! Currently we support only CUDA machines; we're starting on an AMD implementation rn
1 reply · 0 reposts · 8 likes · 2.1K views
mrciffa @davideciffa
@_Suresh2 Prompt processing doesn’t get a dramatic speed-up compared to the decoding phase
0 replies · 0 reposts · 1 like · 1.1K views
Suresh @_Suresh2
@davideciffa 415 tok/s is wild, how bad is prompt processing with the megakernel?
1 reply · 0 reposts · 3 likes · 1.6K views
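Rough arithmetic behind that answer, with purely illustrative numbers: megakernels and speculative decoding mostly shrink the decode term, so for long generations the win is large, and prefill becomes the next bottleneck.

```python
# Illustrative numbers, not measurements: 8K-token prompt, 512 new tokens.
prefill_tok, prefill_tps, decode_tok = 8_000, 2_500, 512

for decode_tps in (100, 400):  # hypothetical baseline vs tuned decode speed
    total = prefill_tok / prefill_tps + decode_tok / decode_tps
    print(f"decode {decode_tps} tok/s -> {total:.1f} s end-to-end")
# 4x faster decode cuts ~8.3 s to ~4.5 s here; the 3.2 s prefill is untouched.
```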
mrciffa @davideciffa
@Leik0w0 We've made it, check our Lucebox Hub repo
1 reply · 0 reposts · 1 like · 1.1K views
Léo @Leik0w0
@davideciffa Did someone make this one already? I just replied to their post asking about megakernels.
1 reply · 0 reposts · 1 like · 1.4K views
mrciffa @davideciffa
@Danmoreng Great 🚀 how much did you get?
1 reply · 0 reposts · 0 likes · 1.3K views
mrciffa retweeted
Loktar 🇺🇸 @loktar00
Opus 4.7 at $25 per million output tokens for 87.6 SWE-bench Verified... Qwen3.6 in the same class on a single 24 GB card you already own. 😎
6 replies · 6 reposts · 120 likes · 7K views
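A back-of-envelope take on that comparison, where every local number is an assumption (board power while decoding, throughput, electricity price); only the $25/Mtok API price comes from the post:

```python
watts, tok_s, usd_per_kwh = 300, 134, 0.30   # all assumed, see note above
tok_per_kwh = tok_s * 3600 / (watts / 1000)  # tokens generated per kWh
local_usd_per_mtok = usd_per_kwh / tok_per_kwh * 1e6
print(f"local electricity: ${local_usd_per_mtok:.2f} per Mtok vs $25.00 API")
```

Under those assumptions local output tokens cost pennies per million; the real trade is hardware cost, benchmark parity, and your time.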
mrciffa retweeted
Qwen @Alibaba_Qwen
🚀 Meet Qwen3.6-27B, our latest dense, open-source model, packing flagship-level coding power! Yes, 27B, and Qwen3.6-27B punches way above its weight. 👇

What's new:
🧠 Outstanding agentic coding — surpasses Qwen3.5-397B-A17B across all major coding benchmarks
💡 Strong reasoning across text & multimodal tasks
🔄 Supports thinking & non-thinking modes
✅ Apache 2.0 — fully open, fully yours

Smaller model. Bigger results. Community's favorite. ❤️ We can't wait to see what you build with Qwen3.6-27B! 👀

🔗👇
Blog: qwen.ai/blog?id=qwen3.…
Qwen Studio: chat.qwen.ai/?models=qwen3.…
Github: github.com/QwenLM/Qwen3.6
Hugging Face: huggingface.co/Qwen/Qwen3.6-2… huggingface.co/Qwen/Qwen3.6-2…
ModelScope: modelscope.cn/models/Qwen/Qw… modelscope.cn/models/Qwen/Qw…
502 replies · 1.7K reposts · 12.3K likes · 3.5M views
mrciffa retweeted
Geek Lite @QingQ77
Consumer GPUs actually have plenty of hardware headroom; general-purpose frameworks just waste most of it on overhead. Lucebox unlocks that headroom with hand-written kernels, letting a 2020 RTX 3090 reach an energy efficiency close to Apple's latest silicon. github.com/Luce-Org/luceb…

Lucebox is a project for hand-tuning LLM inference on consumer GPUs, with two public results so far. The Megakernel targets Qwen3.5-0.8B, a hybrid DeltaNet/Attention model, and fuses compute previously spread across ~100 CUDA kernel launches into a single dispatch: on an RTX 3090 it reaches 37,800 tok/s prefill and 413 tok/s decode at 1.87 tok/J, on par with the Apple M5 Max. Capping power from 350W to 220W costs only 5% of the speed while improving efficiency by nearly half.

DFlash, meanwhile, is the first time the speculative-decoding GGUF route has run on a single card: Qwen3.5-27B quantized to Q4_K_M plus a BF16 draft hits 129.5 tok/s on HumanEval on an RTX 3090, 3.43× pure autoregressive decoding, with 128K context fitting in under 24 GB.

The hard part was the VRAM constraint: the target model, the draft model, and the DDTree verification tree's intermediate state all have to fit in 24 GB at once, which forced a rewrite of the GGUF loader on top of ggml plus three custom CUDA tree-operation kernels. All the code is MIT-licensed and well documented, and the benchmarks can be reproduced directly.
5 replies · 31 reposts · 175 likes · 16.5K views
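The efficiency figures in that post are internally consistent and easy to check: tok/J is just tok/s divided by watts, and the 350W speed can be backed out from the "only 5% slower" claim.

```python
# tok/J = (tok/s) / W. The 350 W speed is inferred as 413 / 0.95.
for watts, tok_s in ((350, 413 / 0.95), (220, 413)):
    print(f"{watts} W: {tok_s:.0f} tok/s -> {tok_s / watts:.2f} tok/J")
# 220 W gives ~1.88 tok/J, roughly 1.5x the 350 W figure, matching the post.
```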