Chris

749 posts

Chris
@chrisdrit

Digging into new things! @FrameworkPuter @OmarchyLinux @Neovim, LLMs, Agents, AI, and loving it!

Earth · Joined November 2009
304 Following · 302 Followers

Pinned Tweet
Chris @chrisdrit
131 tokens/second with Gemma 4 MTP
2 replies · 0 reposts · 3 likes · 70 views
Chris @chrisdrit
@aijoey That’s super nice, with a good context size. What are you getting for tok/s so far?
0 replies · 0 reposts · 0 likes · 22 views
Joey @aijoey
morning dgx spark notes

wanted a better gemma 4 nvfp4 model to run locally, so i searched hugging face by actual deployment fit, not just model size

looked at:
- gemma 4 nvfp4
- gemma 4 26b a4b nvfp4
- gemma 4 31b nvfp4
- gb10 and dgx spark mentions
- vllm tags
- safetensor size
- downloads and likes
- whether there was a real serving recipe

the winner for my nvidia dgx spark was AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4 @SpaceTimeViking

why:
- it is 26b moe with around 4b active
- only about 16gb of weights
- has a matching dflash drafter
- has a dgx spark specific vllm container
- fits the gb10 memory profile nicely
- and it actually has a practical launch path instead of just a checkpoint

downloaded already, started the container, hit the openai compatible endpoint, and got a clean response back (see the sketch after this post)

running now on the dgx spark at 262k context

next step is benchmarking it against the heavier 31b nvfp4 options and seeing where the sweet spot is for latency vs quality
5 replies · 2 reposts · 35 likes · 2.3K views
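A minimal smoke test along the lines Joey describes, hitting a local vLLM OpenAI-compatible endpoint. The base URL, port, API key, and prompt are illustrative assumptions; the model tag is the one named in the post.

```python
# Smoke test against a local vLLM OpenAI-compatible endpoint.
# Base URL, port, API key, and prompt are assumptions for illustration.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's default serve address (assumed)
    api_key="not-needed-locally",         # vLLM ignores the key unless configured
)

resp = client.chat.completions.create(
    model="AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4",
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
    max_tokens=64,
)
print(resp.choices[0].message.content)
```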
Chris @chrisdrit
@ppressdev @mvanhorn Just printed my first CLI, for SEC EDGAR data. Printing Press is a nice project! Congrats on the hard work, excited to see my CLI added!
2 replies · 0 reposts · 1 like · 22 views
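For flavor, a hedged sketch of what a small SEC EDGAR CLI can look like. The argument names and output format are invented; only the data.sec.gov submissions endpoint and its required descriptive User-Agent header come from SEC's public API documentation. This is not Printing Press's generated CLI.

```python
# Minimal sketch of a CLI that lists a company's recent filings from SEC EDGAR.
import argparse
import requests

def recent_filings(cik: int, limit: int = 5) -> list[dict]:
    url = f"https://data.sec.gov/submissions/CIK{cik:010d}.json"
    # SEC requires a descriptive User-Agent identifying the requester.
    headers = {"User-Agent": "example-cli contact@example.com"}
    data = requests.get(url, headers=headers, timeout=30).json()
    recent = data["filings"]["recent"]
    return [
        {"form": f, "date": d, "accession": a}
        for f, d, a in zip(
            recent["form"], recent["filingDate"], recent["accessionNumber"]
        )
    ][:limit]

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="List recent SEC EDGAR filings")
    parser.add_argument("cik", type=int, help="company CIK, e.g. 320193 for Apple")
    parser.add_argument("--limit", type=int, default=5)
    args = parser.parse_args()
    for row in recent_filings(args.cik, args.limit):
        print(f'{row["date"]}  {row["form"]:<8} {row["accession"]}')
```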
Chris @chrisdrit
@aijoey That’s huge! Awesome win 👏 Curious: for the 90 t/s, what size context window are you running? For interactive, non-batch throughput this is nice.
1 reply · 0 reposts · 1 like · 336 views
Joey @aijoey
got Gemma 4 26B A4B uncensored running locally on the DGX Spark.

setup:
- NVIDIA GB10 / Blackwell
- 128GB unified memory
- NVFP4 quantized model
- vLLM-compatible OpenAI API
- DFlash speculative decoding
- local only, no cloud API

the interesting part: this is small enough to run comfortably on the Spark, but still capable enough for agentic workflows.

with the @SpaceTimeViking vLLM container + DFlash drafter, it’s hitting interactive speeds that feel usable for local coding / research agents, roughly ~90 tok/s range in smoke tests, depending on prompt and settings (rough measurement sketch below).

still caveating this heavily:
- batch throughput and single user latency are different games
- DFlash helps a lot for interactive decode
- high concurrency may favor non speculative serving
- GB10/SM121 still has some weird kernel edge cases

but this is exactly why i wanted local hardware. not just “run a model locally.” actually tune the stack:

model → quantization → kernels → serving → speculation → agent loop

local AI is becoming less about downloading weights and more about owning the whole inference system. that’s the fun part.
15 replies · 9 reposts · 114 likes · 10.1K views
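One way to get a rough single-stream decode tok/s number like the ~90 above: time a streamed completion from the first content chunk. Endpoint and prompt are assumptions; note the caveat in the comments about speculative decoding and chunk counting.

```python
# Rough single-stream decode tok/s over a streamed completion. Caveat: with
# speculative decoding a single streamed delta can carry several tokens, so
# counting chunks can undercount; the usage packet at the end of the stream
# is the more reliable source (see the later benchmark sketch).
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

stream = client.chat.completions.create(
    model="AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4",  # tag from the post
    messages=[{"role": "user", "content": "Write a binary search in Python."}],
    max_tokens=512,
    stream=True,
)

first, chunks = None, 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first is None:
            first = time.perf_counter()  # start timing at first content delta
        chunks += 1

if first is not None:
    elapsed = time.perf_counter() - first
    print(f"~{chunks / elapsed:.1f} tok/s decode (chunk-count approximation)")
```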
Chris @chrisdrit
@mr_r0b0t @GIGABYTEUSA Interesting, I’ve had similar results bypassing my window manager, Hyprland, and just going directly through a TTY.
1 reply · 0 reposts · 1 like · 13 views
mr-r0b0t @mr_r0b0t
@chrisdrit @GIGABYTEUSA The minute I stopped using GNOME it went from really good to excellent! Definitely need to use it headless whenever possible!
1 reply · 0 reposts · 1 like · 16 views
mr-r0b0t @mr_r0b0t
@morganlinton @NVIDIAAI @NousResearch @Teknium Z-lab doesn’t have a draft model for Nemotron, so in pure tok/s it lost out. Concurrency, however, is where it makes up for it. It ran up to 192 concurrent requests before I stopped it for fear of an OOM crash 😭 139.70 tok/s at c4 is stable and very workable!
2 replies · 0 reposts · 4 likes · 128 views
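A sketch of the kind of concurrency sweep described above: fire N simultaneous requests and report aggregate tok/s from the returned usage fields. Endpoint, prompt, and the model name are illustrative assumptions (the name below is a placeholder, not a verified tag).

```python
# Concurrency sweep against a local OpenAI-compatible server.
import asyncio
import time
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="local")
MODEL = "nemotron-3-nano-30b-a3b"  # placeholder name, not a verified tag

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": "Summarize TCP in 100 words."}],
        max_tokens=256,
    )
    return resp.usage.completion_tokens

async def main() -> None:
    for c in (1, 4, 16, 64):  # stop scaling before you OOM
        start = time.perf_counter()
        tokens = await asyncio.gather(*[one_request() for _ in range(c)])
        wall = time.perf_counter() - start
        print(f"c{c}: {sum(tokens) / wall:.1f} aggregate tok/s")

asyncio.run(main())
```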
mr-r0b0t @mr_r0b0t
Productive day for the @NVIDIAAI GB10! “r0b0t-dgx”, my @NousResearch Hermes agent, finished up 2 more benchmark suites, 3 total today (all NVFP4):

- Gemma4-31B + DFlash
- Qwen3.6-35B-A3B + DFlash
- Nemotron-3-Nano-30B-A3B

It just wrote the reports for the last 2 and emailed them to me 🤓
6 replies · 0 reposts · 32 likes · 1.4K views
Mass @MemoryReboot_
Spent today getting DFlash running on dual 3090 + Gemma 4 31B.

From the very beginning I took a wrong turn:
- AWQ 8bit + DFlash = 0.4% acceptance; the drafter was trained on a different quant
- pip install of the PR branch → trashed my venv

What worked: @malikwas1f club 3090 recipe, pre-patched docker container. Just docker compose up.

Results:
- 86 tok/s, accept rate 61.6%
- +33% over my MTP result (52 tok/s)

Couldn’t hit those 168 tok/s (hello PCIe x4 on the second card). Gonna try to get better numbers tomorrow.
6 replies · 4 reposts · 18 likes · 2.1K views
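The acceptance rates in these posts map to throughput via the standard speculative-decoding expectation: with per-token acceptance probability a and draft length k, the expected tokens emitted per target verify pass is (1 - a^(k+1)) / (1 - a), assuming independent accepts. The 61.6% and ~55% figures are from the posts above; draft length 6 is an assumption.

```python
# Back-of-envelope speculative decoding math.
def expected_tokens_per_step(a: float, k: int) -> float:
    # Expected tokens per verify pass, assuming i.i.d. per-token acceptance.
    return (1 - a ** (k + 1)) / (1 - a)

for a in (0.55, 0.616):
    print(f"acceptance {a:.1%}: ~{expected_tokens_per_step(a, k=6):.2f} tokens/step")
```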
Chris @chrisdrit
@Stellanhaglund Yeah, fp8 + MTP gave us 49% mean draft acceptance (3.45/6). No fp16 baseline (probably should’ve run one). My hunch is fp8 costs some acceptance vs fp16, but fp8 matmul kernels are ~2× faster than fp16, so the tok/s tradeoff still favors fp8.
0 replies · 0 reposts · 1 like · 44 views
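A toy model of that tradeoff. Everything here is an illustrative assumption except the 3.45-of-6 mean accepted drafts quoted in the reply: the fp16 acceptance and the exact 2x kernel speedup are guesses, so treat the output as the shape of the argument, not a measurement.

```python
# Toy fp8-vs-fp16 tradeoff: lower acceptance but faster verify steps.
def relative_toks(accepted_per_step: float, step_time: float) -> float:
    # throughput ∝ (accepted drafts + 1 token from the verify pass) / step time
    return (accepted_per_step + 1) / step_time

fp8 = relative_toks(accepted_per_step=3.45, step_time=0.5)   # ~2x faster matmuls
fp16 = relative_toks(accepted_per_step=4.00, step_time=1.0)  # assumed acceptance
print(f"fp8 ≈ {fp8 / fp16:.2f}x fp16 throughput under these assumptions")
```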
Michael Guo @Michaelzsguo
my local LLM community, give me one reason I shouldn’t place the order.
78 replies · 1 repost · 68 likes · 22K views
Chris @chrisdrit
Well... this is interesting!

Matt Van Horn @mvanhorn:
Introducing the Printing Press, a CLI factory and a CLI library. Built with @trevin. 🏭🖨📚

Most APIs suck for agents. Most MCPs suck for agents. Most official CLIs suck for agents. They waste tokens and time. @steipete started making his own because of this.

📚 A library of agent-native CLIs you install today (Linear, ESPN, Flight GOAT (Google Flights + Kayak nonstop), Contact Goat (LinkedIn + Happenstance + Deepline + more), and 30+ more)

🏭 A factory that prints new ones for any service: just type /printing-press

CLIs are fast, local, SQLite-backed. Work in Claude Code, Codex, OpenClaw, Hermes.

🌐 printingpress.dev
0 replies · 0 reposts · 0 likes · 27 views
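A hedged sketch of the "fast, local, SQLite-backed" idea: cache upstream API responses in SQLite so repeated agent calls skip the network entirely. The schema, TTL, and function names are invented for illustration; nothing here is Printing Press's actual implementation.

```python
# Local SQLite cache in front of an arbitrary fetch function.
import json
import sqlite3
import time

db = sqlite3.connect("cli_cache.db")
db.execute("CREATE TABLE IF NOT EXISTS cache (key TEXT PRIMARY KEY, value TEXT, ts REAL)")

def cached(key: str, fetch, ttl: float = 3600.0):
    row = db.execute("SELECT value, ts FROM cache WHERE key = ?", (key,)).fetchone()
    if row and time.time() - row[1] < ttl:
        return json.loads(row[0])  # cache hit: no network round-trip
    value = fetch()                # cache miss: call the upstream API
    db.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
               (key, json.dumps(value), time.time()))
    db.commit()
    return value
```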
Chris @chrisdrit
@aijoey I need to get a DGX Spark 😅
0 replies · 0 reposts · 2 likes · 188 views
Joey @aijoey
follow up receipt for the gemma 4 26b a4b nvfp4 + dflash demo i posted. same local vllm setup on dgx spark / gb10, but this time with a fixed-prompt streamed benchmark sweep:

- single stream decode avg: 112.6 tok/s
- 8 stream wall aggregate avg: 684.6 tok/s
- 75 measured requests, 0 errors
- token counts from final vllm usage packets (see the sketch below)

not an official benchmark, just a reproducible local run.
3 replies · 3 reposts · 17 likes · 6.1K views
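Pulling token counts from the final usage packet, as the sweep above does: the stream_options flag is part of the standard OpenAI streaming API, and vLLM's OpenAI-compatible server honors it. Endpoint and prompt are assumptions; the model tag is from the earlier post.

```python
# Read the usage packet that arrives at the end of a streamed completion.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

stream = client.chat.completions.create(
    model="AEON-7/Gemma-4-26B-A4B-it-Uncensored-NVFP4",
    messages=[{"role": "user", "content": "Explain NVFP4 in two sentences."}],
    max_tokens=256,
    stream=True,
    stream_options={"include_usage": True},  # final packet carries usage
)

usage = None
for chunk in stream:
    if chunk.usage is not None:  # only the last packet populates this field
        usage = chunk.usage

if usage:
    print(f"completion tokens: {usage.completion_tokens}")
```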
Joey @aijoey
captured live run on DGX Spark: Gemma 4 26B A4B NVFP4 + DFlash via vLLM hit ~82 decode tok/s on a codegen stream and ~69 decode tok/s on a simultaneous debug/patch stream. not a formal benchmark, just a real local streamed run.

Model: github.com/AEON-7/Gemma-4… @SpaceTimeViking
7 replies · 1 repost · 54 likes · 9.4K views
Chris @chrisdrit
@ai_hakase_ On a Mac M2 Max? That’s insane! Thanks for linking to the discussion.
0 replies · 0 reposts · 0 likes · 153 views
Ai-Hakase (ハカセ アイ) 🐾 X for the latest trending AI 🐾
[2.5x speedup] Qwen 3.6 27B × MTP makes local AI blazing fast!

Qwen 3.6 has achieved an astonishing 2.5x inference speedup! 🚀 With MTP (Multi-Token Prediction), it hits 28 tok/s even on a Mac M2 Max, enabling a blazing-fast coding experience that overturns conventional wisdom.

On top of that, 4-bit KV cache compression supports an ultra-long 262k context. You can analyze massive document sets at low cost without external APIs. A dramatic boost to business productivity! ✨ #Qwen #LocalLLM
2 replies · 5 reposts · 81 likes · 5.1K views
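Rough sizing behind the "4-bit KV cache at 262k context" claim. The formula is the standard one (2 tensors × layers × kv_heads × head_dim × seq_len × bytes per element); the layer and head counts below are placeholder assumptions, not actual Qwen 3.6 specs.

```python
# KV cache memory at a given context length, fp16 vs 4-bit.
def kv_cache_gib(seq_len, layers, kv_heads, head_dim, bytes_per_elem):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 2**30

shape = dict(seq_len=262_144, layers=48, kv_heads=8, head_dim=128)  # assumed
print(f"fp16 KV cache:  {kv_cache_gib(**shape, bytes_per_elem=2):.1f} GiB")
print(f"4-bit KV cache: {kv_cache_gib(**shape, bytes_per_elem=0.5):.1f} GiB")
```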
eBot Servers @eBotServers
@chrisdrit @MemoryReboot_ @modal That’s insane. Going to try Gemma and DFlash today on an RTX 6000 Pro. 🤤 27B hit 149 tps, so I can imagine how high we can go vs. the base models 🙏
2 replies · 0 reposts · 2 likes · 60 views
Mass @MemoryReboot_
Tested Google's new MTP drafter for Gemma 31B on dual 3090:

- MTP off: 31 tok/s
- MTP on: 52 tok/s (+68%), acceptance rate ~55%

Tried MTP=8 (officially recommended for 31B), got OOM.

For comparison, my Qwen 3.6 27B + MTP on the same dual 3090 hits 70 tok/s. Gemma 31B is bigger, so the gap makes sense.

DFlash test next, going to push it further.

Google for Developers @googledevs:
Gemma 4: Now up to 3x Faster. ⚡ Same quality, way more speed. Our new MTP drafters allow Gemma 4 to predict multiple tokens at once, effectively tripling your output speed without compromising intelligence.
4 replies · 3 reposts · 52 likes · 8.3K views
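Quick arithmetic check of the speedup quoted above (numbers from the post).

```python
# 31 -> 52 tok/s with MTP enabled.
mtp_off, mtp_on = 31, 52
print(f"MTP speedup: {mtp_on / mtp_off - 1:+.0%}")  # -> +68%
```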
Mass @MemoryReboot_
Hosting your own LLM is like growing your own food.

First it's expensive, takes a lot of time and mistakes, and everyone asks "why are you bothering, just buy it at the store."

Then you taste a tomato from your garden and realize that supermarket plastic wasn't even close.

Your own weights, your own context, your own rules: nobody nerfs the model or jacks up the token price.

And most importantly: when the internet is down or the store is closed, you still eat.
1 reply · 0 reposts · 4 likes · 142 views