CJ

17 posts

CJ

@maxweicj

Optimized for LLM

Shenzhen, China · Joined January 2010
49 Following · 32 Followers
CJ @maxweicj
@rumgewieselt @UnslothAI Looks like pretty usable decoding speed; what about prefill? MTP has always slowed down my prefill considerably.
0 replies · 0 reposts · 0 likes · 50 views
Daniel Moll @rumgewieselt
3× GTX 1080 Ti (Pascal, 2017, 33 GB VRAM), llama.cpp -> mtp-clean build (PR #22673) 😍 @UnslothAI UD 2.0 MTP GGUFs are now available ...
# Qwen 3.6 27B Dense MTP (row-split): 29.9 t/s @ 196K context -> with UD 2.0 = 30.9 t/s @ 196K context
# Qwen 3.6 35B A3B MoE MTP (layer-split): 67 t/s @ 229K context -> with UD 2.0 = 71.3 t/s @ 229K context -> with UD 2.0 = 77.3 t/s @ 64K context
config: --cache-type-k q4_0 --cache-type-v q4_0 --spec-type mtp --spec-draft-n-max 2 -np 1 --ctx-checkpoints 0 --reasoning off
6 replies · 1 repost · 37 likes · 3.7K views
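
For reference, a minimal sketch of what the full llama-server invocation could look like on that mtp-clean build. Only the flags quoted in the post come from the source; the model filename, context size, -ngl and --split-mode values are illustrative assumptions:

./llama-server -m Qwen3.6-27B-UD-Q4_K_XL.gguf -c 196608 -ngl 99 --split-mode row \
  --cache-type-k q4_0 --cache-type-v q4_0 \
  --spec-type mtp --spec-draft-n-max 2 -np 1 --ctx-checkpoints 0 --reasoning off
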
CJ reposted
Sandro @pupposandro
2.5x faster than llama.cpp on Strix Halo. We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, 128 GiB unified memory). Qwen3.6-27B Q4_K_M, end-to-end on the same silicon:
▸ Decode: 26.85 tok/s, 2.23x faster (DFlash + DDTree, budget 22)
▸ Prefill 16K: 20.2s, 3.05x faster (PFlash)
▸ Wall clock, 16K prompt + 1K gen: 58s vs 147s
~100 GiB still free in the box. 122B and 139B MoE class is next. Massive thanks to @smpurkis0 for the contribution 🙏
Sandro @pupposandro · x.com/i/article/2054…
29 replies · 42 reposts · 350 likes · 37.4K views
Sandro @pupposandro
Should I get a 256 GB VRAM, 8× Nvidia V100 SXM2 server for ~$1800? I'm so tempted. Pros: amazing value for the price; can fit big models; almost the same bandwidth as a 3090 (900 GB/s). Cons: V100s are very old (2017, Volta), no bf16 support.
63 replies · 9 reposts · 348 likes · 49.2K views
CJ reposted
mrciffa @davideciffa
You can now benchmark Lucebox Speculative Inference on CUDA/HIP mixed backends, thanks to @maxweicj ! Full AMD HIP server support coming soon 🏎️
3 replies · 2 reposts · 36 likes · 3K views
CJ @maxweicj
@rumgewieselt 33 GB VRAM should be enough for 27B with plain KV quant, much faster than TQ
0 replies · 0 reposts · 0 likes · 12 views
Daniel Moll @rumgewieselt
Now it's getting crazy ... 3× 1080 Ti (Pascal, 33 GB VRAM), Qwen 3.6 27B MTP with 196K TurboQuant: ~28-30 t/s consistently
18 replies · 9 reposts · 217 likes · 15.4K views
Daniel Moll @rumgewieselt
@maxweicj @davideciffa Have tested it but am getting just 13 tok/s. On Pascal the row split works best in llama.cpp, and I now reach 26 tok/s with MTP.
1 reply · 0 reposts · 0 likes · 36 views
CJ reposted
mrciffa @davideciffa
Huge thanks to our contributor github.com/weicj for integrating dual-GPU split support for Luce DFlash! Now you can run the draft model on one GPU and the target model on the other (--target-gpu / --draft-gpu params). This is the inception of our vision for heterogeneous hardware speculative inference 🏎️
11 replies · 8 reposts · 125 likes · 14.9K views
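
A minimal sketch of how that dual-GPU split might be invoked. The script name, model arguments, and device indices are hypothetical; only the --target-gpu / --draft-gpu flag names come from the post:

# hypothetical Lucebox DFlash run: draft model on GPU 0, target model on GPU 1
python bench_dflash.py --target-model Qwen3.6-27B --draft-model Qwen3.6-1B --draft-gpu 0 --target-gpu 1
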
CJ @maxweicj
@_Suresh2 @davideciffa Copying is not a bottleneck for the layer-split pipeline path here. For a 27B model with a 1:1 split, only up to a 3.5 GB F32 activation buffer at a 256K full prompt gets transferred, and only once, which is far below PCIe bandwidth (especially with P2P enabled).
0 replies · 0 reposts · 1 like · 20 views
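
Rough arithmetic behind that 3.5 GB figure, assuming a hidden size of about 3584 for the 27B model (the hidden size is an assumption, not from the thread): 256K tokens × 3584 dims × 4 bytes (F32) ≈ 3.5 GiB handed across the split boundary once. At a conservative ~25 GB/s of effective PCIe 4.0 x16 bandwidth, that single copy takes on the order of 0.15 s, and with P2P enabled it goes GPU-to-GPU without staging through host memory.
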
Suresh @_Suresh2
@davideciffa @maxweicj Layer split on small cards: the 3x is about GPU time, not wall-clock time once all the copying is included.
2 replies · 0 reposts · 0 likes · 51 views
CJ reposted
mrciffa @davideciffa
Now, thanks to @maxweicj, Luce DFlash works across multiple GPUs with layer split! This enables usage on multiple small GPUs with a 3x speedup compared to autoregressive decoding. 🏎️ github.com/Luce-Org/luceb…
4 replies · 4 reposts · 48 likes · 5.9K views
Prince Canuma @Prince_Canuma
mlx-vlm v0.5.0 is here 🚀 This is the largest release ever 🙌🏽
→ Continuous batching server + KV cache quantization
→ MTP and DFlash speculative decoding (single, batch, server)
→ Distributed inference: Qwen3.5, Kimi K2.5 & K2.6
→ Prompt caching w/ warm-disk persistence
→ Gemma 4 video (multi-video) + MTP drafter @googlegemma
→ New models: Youtu-VL, Nemotron 3 Nano Omni, SAM 3D Body
→ Server: json_schema response_format, thinking mode flag
Huge thanks to all 21 contributors and in particular the 18 new contributors, welcome aboard 🚢
Get started today:
> uv pip install -U mlx-vlm
Leave us a star ⭐️ github.com/Blaizzy/mlx-vlm
38 replies · 55 reposts · 472 likes · 42.8K views
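
For reference, a minimal quick-start sketch based on the mlx-vlm CLI from earlier releases; the model repo, image path, and prompt are illustrative, and exact flags may differ in v0.5.0:

uv pip install -U mlx-vlm
python -m mlx_vlm.generate --model mlx-community/Qwen2-VL-2B-Instruct-4bit \
  --max-tokens 100 --prompt "Describe this image." --image ./example.jpg
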
CJ @maxweicj
@BitcoinComfy Haven’t tried it out yet, but yes I’ll do it later
0 replies · 0 reposts · 0 likes · 29 views
CJ @maxweicj
King of <30B is incoming? #Ragent6 benchmarks whether local agent models can actually finish work: read evidence, edit files, run checks, stay safe, recover from errors, and reason through constraints. Run and test: github.com/weicj/Ragent6 #llm #localllm #qwen #gemma
1 reply · 0 reposts · 0 likes · 70 views
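
A hypothetical way to try it locally; the repo URL comes from the post, but the runner script and its flags below are placeholders, so check the repo README for the actual entry point:

git clone https://github.com/weicj/Ragent6
cd Ragent6
# hypothetical runner and flags; the real entry point is documented in the repo
python run_bench.py --model qwen3.6-27b --tasks all
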
CJ @maxweicj
@rumgewieselt @davideciffa Hey man, the DFlash GPU layer split harness is now merged into main. Given the previous target/draft split harness, I guess a workaround for you is GPU0 for the draft and GPU1/2 to split the target 1:1 (like 27B Q4). Your trying it out and feedback will be much appreciated ❤️
1 reply · 0 reposts · 2 likes · 44 views
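
A sketch of that workaround on the bench harness. The script name and the multi-GPU value syntax are assumptions; only the flag names and the GPU0-draft / GPU1+2-target idea come from the thread:

# hypothetical: draft on GPU 0, target split 1:1 across GPUs 1 and 2
python bench_dflash.py --draft-gpu 0 --target-gpu 1,2 --target-model Qwen3.6-27B-Q4
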
Daniel Moll @rumgewieselt
@davideciffa I am on 3x 1080 Ti with Pascal ... does anyone see any chance of support for this? This would be a game changer for cheap GPUs ... For now I'm getting 20 t/s with llama.cpp and Qwen 27B at 64K context ... but now I am in the game and fighting for the old GPUs :D
2 replies · 0 reposts · 4 likes · 393 views
CJ @maxweicj
@David_M_Roth @davideciffa Hey buddy, GPU split is currently still restricted to a bench harness for testing across various hardware environments. Your trying it out and feedback will be very helpful for us to fix bugs and strengthen the project before GPU split is officially integrated into the server path ❤️
0 replies · 0 reposts · 1 like · 14 views
David Roth 🇺🇸 @David_M_Roth
@davideciffa @maxweicj But server.py is broken for dual GPU (3090)😞 We need OpenAPI working out of the box—cleaner separation of concerns, broader integration, easier testing. 🚀 Looks promising though!
2 replies · 0 reposts · 2 likes · 104 views
CJ @maxweicj
@davideciffa Thank you for merging this PR, and hopefully this patch will open up more possibilities for P/DFlash 🤗
0 replies · 0 reposts · 2 likes · 72 views