2.5x faster than llama.cpp on Strix Halo.
We just shipped DFlash + PFlash for the AMD Ryzen AI MAX+ 395 iGPU (gfx1151, 128 GiB unified memory).
Qwen3.6-27B Q4_K_M, end-to-end on the same silicon:
▸ Decode: 26.85 tok/s, 2.23x faster (DFlash + DDTree, budget 22)
▸ Prefill 16K: 20.2s, 3.05x faster (PFlash)
▸ Wall clock, 16K prompt + 1K gen: 58s vs 147s
~100 GiB still free in the box. 122B and 139B MoE-class models are next.
Massive thanks to @smpurkis0 for the contribution 🙏
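The headline figures hang together arithmetically. Here is a quick back-of-the-envelope check in Python using only the numbers quoted in the post above (the 147 s baseline is the reported llama.cpp wall clock):

```python
# Sanity check of the reported wall clock, using only numbers quoted above.
prefill_s = 20.2            # 16K-token prefill with PFlash
decode_tok_s = 26.85        # decode throughput with DFlash + DDTree (budget 22)
gen_tokens = 1000           # 1K generated tokens

wall_clock_s = prefill_s + gen_tokens / decode_tok_s
print(f"estimated wall clock: {wall_clock_s:.1f} s")      # ~57.4 s vs the reported 58 s
print(f"speedup vs llama.cpp: {147 / wall_clock_s:.2f}x")  # ~2.6x, i.e. the ~2.5x headline
```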
Should I get a 256 GB VRAM, 8x NVIDIA V100 SXM2 server for ~$1800? I'm so tempted.
Pros: amazing value for the price; fits big models; almost the same bandwidth as a 3090 (900 GB/s). Cons: V100s are very old (2017, Volta) and have no bf16 support.
@maxweicj @davideciffa Have tested it but am getting just 13 tok/s. On Pascal, row-split mode works best in llama.cpp, and I now reach 26 tok/s with MTP.
Huge thanks to our contributor github.com/weicj for integrating dual-GPU split support for Luce DFlash! Now you can run the draft model on one GPU and the target model on another (--target-gpu / --draft-gpu params). This is the first step toward our vision of speculative inference on heterogeneous hardware 🏎️
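A minimal launch sketch of how the two flags could be used. Only --target-gpu and --draft-gpu come from the post above; the entry-point name and the model flags below are placeholders, so check the Luce DFlash README for the real invocation:

```python
import subprocess

# Hypothetical launch sketch: only --target-gpu and --draft-gpu are taken from
# the release note above; "luce-dflash" and the model flags are placeholders.
cmd = [
    "luce-dflash",                      # placeholder entry point
    "--model", "target-27b-q4.gguf",    # placeholder: large target model
    "--draft", "draft-1b-q4.gguf",      # placeholder: small draft model
    "--target-gpu", "0",                # verify target tokens on GPU 0
    "--draft-gpu", "1",                 # generate draft tokens on GPU 1
]
subprocess.run(cmd, check=True)
```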
@_Suresh2 @davideciffa Copying is not a bottleneck for the layer-split pipeline path here. For a 27B model with a 1:1 split, at most a ~3.5 GB F32 activation buffer is transferred for a 256K full prompt, and only once, which is far below what PCIe bandwidth can handle (especially with P2P enabled).
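To make the "copying is cheap" argument concrete, here is the rough arithmetic. The hidden size below is an assumption (it is not stated in the thread); the point is the formula tokens x hidden x 4 bytes for the F32 activation handed across the split boundary:

```python
# Rough size of the one-time activation transfer at the layer-split boundary.
seq_len = 256 * 1024        # 256K-token full prompt
hidden_dim = 3584           # assumed hidden size for a ~27B model (not from the thread)
bytes_f32 = 4

buffer_gb = seq_len * hidden_dim * bytes_f32 / 1e9
pcie4_x16_gbps = 32         # ~32 GB/s theoretical PCIe 4.0 x16, for scale

print(f"activation buffer: {buffer_gb:.1f} GB")                  # ~3.8 GB, same ballpark as the ~3.5 GB above
print(f"one-time transfer: {buffer_gb / pcie4_x16_gbps:.2f} s")  # ~0.1 s, negligible against total runtime
```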
Now, thanks to @maxweicj, Luce DFlash works across multiple GPUs with layer split!
This enables running on multiple small GPUs with a 3x speedup compared to autoregressive decoding. 🏎️
github.com/Luce-Org/luceb…
mlx-vlm v0.5.0 is here 🚀
This is the largest release ever 🙌🏽
→ Continuous batching server + KV cache quantization
→ MTP and DFlash speculative decoding (single, batch, server)
→ Distributed inference: Qwen3.5, Kimi K2.5 & K2.6
→ Prompt caching w/ warm-disk persistence
→ Gemma 4 video (multi-video) + MTP drafter @googlegemma
→ New models: Youtu-VL, Nemotron 3 Nano Omni, SAM 3D Body
→ Server: json_schema response_format, thinking mode flag
Huge thanks to all 21 contributors, and in particular to the 18 new contributors. Welcome aboard 🚢
Get started today:
> uv pip install -U mlx-vlm
Leave us a star ⭐️
github.com/Blaizzy/mlx-vlm
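For the new json_schema response_format, here is a rough request sketch. It assumes the batching server exposes an OpenAI-style chat completions endpoint on localhost; the model id, port, and route are placeholders, so see the mlx-vlm README for the actual launch command and API:

```python
import requests

# Sketch only: endpoint, port, and model id are assumptions, not the documented
# mlx-vlm API; the json_schema response_format follows the OpenAI-style
# structured-output convention mentioned in the release notes above.
payload = {
    "model": "placeholder-model-id",
    "messages": [{"role": "user", "content": "Describe the attached image as JSON."}],
    "response_format": {
        "type": "json_schema",
        "json_schema": {
            "name": "caption",
            "schema": {
                "type": "object",
                "properties": {"caption": {"type": "string"}},
                "required": ["caption"],
            },
        },
    },
}
resp = requests.post("http://localhost:8080/v1/chat/completions", json=payload)
print(resp.json())
```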
King of the <30B class incoming? #Ragent6 benchmarks whether local agent models can actually finish work: read evidence, edit files, run checks, stay safe, recover from errors, and reason through constraints.
Run and test: github.com/weicj/Ragent6 #llm #localllm #qwen #gemma
@rumgewieselt @davideciffa Hey man, the DFlash GPU layer-split harness is now merged into main. Given the earlier target/draft split harness, a workaround for you could be GPU0 for the draft and GPU1/GPU2 splitting the target 1:1 (e.g., 27B Q4). Your testing and feedback would be much appreciated ❤️
@davideciffa I am on 3x 1080 Ti (Pascal)... does anyone see any chance of support for this? It would be a game changer for cheap GPUs...
For now I'm getting 20 tok/s with llama.cpp and Qwen 27B at 64K context... but now I'm in the game and fighting for the old GPUs :D
@David_M_Roth @davideciffa Hey buddy, GPU split is currently limited to a bench harness for testing across various hardware environments. Your testing and feedback will be very helpful for fixing bugs and strengthening the project before GPU split is officially integrated into the server path ❤️
@davideciffa @maxweicj But server.py is broken for dual GPU (3090) 😞
We need OpenAPI working out of the box: cleaner separation of concerns, broader integration, easier testing. 🚀
Looks promising though!