Derek Colley
4.6K posts

@DerekColley_
Consulting Technology Lead, CTO & CIO Building https://t.co/rHR4F65LtV https://t.co/XGysrW0D1H





Introducing eve, an agent framework. Like Next.js, for agents. vercel.com/blog/introduci…



Theory: China encourages the release of open source models because they figure customers outside of China won't trust a model running in a Chinese datacenter anyway, so the best they can do is try to erode the margins of US frontier labs so they don't compound faster.







Context window test on my local AI. Chinese X99 board, Xeon 2680 v4, 128 GB used server RAM, used RX 580 GPU, 8 GB. All older tech. Running llama.cpp with Open WebUI. Model is Qwen3-30B-A3B-Element6-1M.Q4_K_M.gguf

Run Gemma 4 26b MTP on 8 GB VRAM GPUs at 25+ tokens/second. Flags included!

The local LLM space is moving at terminal velocity. Only 3 days ago Google released Gemma 4 26b a4b QAT quants: more efficient than before, they ran on 8 GB VRAM at 20 tok/sec. And now, just a few hours ago, mainline llama.cpp merged a massive update and we just shattered our own record. Decode throughput went up 25-40% on the same 8 GB VRAM setup! Before MTP: 20 tps -> After MTP: 28 tps!

llama.cpp just officially merged PR #23398 ("add Gemma4 MTP"), bringing native Multi-Token Prediction (MTP) support to Gemma 4 models. By running speculative drafting on the same 8 GB VRAM RTX 4060 setup, my decode throughput on a 64k context instantly leaped to a blistering 25–27 tokens/sec. That's a 25-30% increase with the same hardware.

Here is the architectural catch you need to know: unlike the Qwen 3.5 and 3.6 series, which bake the MTP heads directly into the base GGUF, the Gemma 4 MTP head is not built in. You must download a separate, specialized MTP drafter GGUF (the assistant model) to act as the speculator. (I've dropped the download link in the replies.)

Copy and try the exact flags: -m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf --spec-type draft-mtp --spec-draft-n-max 6 --spec-draft-p-min 0.7 --spec-draft-model gemma-4-26b-A4B-it-assistant-Q4_0.gguf -c 64000 -v

n-max 4 with p-min 0.7 is also worth checking out. Benchmark on your own setup and workflow.

If you have a single 8 GB VRAM NVIDIA RTX 4060, 3060, 3070, 2080, or 2070, grab the MTP drafter GGUF link in the comments and try it yourself. Check it out even if you have a smaller or a larger GPU, such as a single RTX 3090, 4090, 3060, or 2060. MTP works for all Gemma 4 sizes, such as Gemma 4 12b, Gemma 4 31b, etc., but remember to grab the correct MTP draft assistant model for each.

What are you benchmarking today?
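For anyone who wants to script the comparison rather than retype flags, here is a rough Python sketch. It only assembles the command line quoted in the post and computes the relative throughput gain; it does not launch anything. The flag names and file names are copied from the post as-is and not verified against any particular llama.cpp build. The binary name `llama-server` and the helper names `build_args` and `gain_pct` are assumptions made for illustration.

```python
import shlex


def build_args(n_max=6, p_min=0.7, ctx=64000, verbose=True,
               model="gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf",
               drafter="gemma-4-26b-A4B-it-assistant-Q4_0.gguf",
               binary="llama-server"):
    """Assemble the argument list from the post.

    Flags are taken verbatim from the post; `binary` is an assumption,
    since the post does not say which llama.cpp executable it used.
    """
    args = [
        binary,
        "-m", model,
        "--spec-type", "draft-mtp",
        "--spec-draft-n-max", str(n_max),
        "--spec-draft-p-min", str(p_min),
        "--spec-draft-model", drafter,
        "-c", str(ctx),
    ]
    if verbose:
        args.append("-v")
    return args


def gain_pct(before_tps, after_tps):
    """Relative decode throughput gain, in percent."""
    return (after_tps - before_tps) / before_tps * 100.0


if __name__ == "__main__":
    # The two drafting depths the post suggests comparing.
    for n_max in (6, 4):
        print(shlex.join(build_args(n_max=n_max)))

    # Figures quoted in the post: 20 tps before, 25 to 28 tps after.
    for after in (25, 27, 28):
        print(f"20 -> {after} tps: {gain_pct(20, after):.0f}% gain")
```

Plugging in your own before and after tokens/sec makes it easy to tell whether a given n-max and p-min pair actually helps on your hardware.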











