Chris @chrisdrit
746 posts

Digging into new things! @FrameworkPuter @OmarchyLinux @Neovim, LLMs, Agents, AI and loving it!

Earth · Joined November 2009
299 Following · 301 Followers

Pinned Tweet
Chris @chrisdrit·
131 tokens / second w/Gemma 4 MTP
[image attached]

Chris @chrisdrit·
@ppressdev @mvanhorn just printed my first CLI for SEC Edgar data. Printing Press is a nice project! Congrats on the hard work, excited to see my CLI added!

Chris @chrisdrit·
@aijoey That's huge! Awesome win 👏 Curious, for the 90 t/s, what size context window are you running? For interactive, non-batch throughput this is nice.

Joey @aijoey·
got Gemma 4 26B A4B uncensored running locally on the DGX Spark.

setup:
- NVIDIA GB10 / Blackwell
- 128GB unified memory
- NVFP4 quantized model
- vLLM-compatible OpenAI API
- DFlash speculative decoding
- local only, no cloud API

the interesting part: this is small enough to run comfortably on the Spark, but still capable enough for agentic workflows. with the @SpaceTimeViking vLLM container + DFlash drafter, it's hitting interactive speeds that feel usable for local coding / research agents, roughly the ~90 tok/s range in smoke tests, depending on prompt and settings.

still caveating this heavily:
- batch throughput and single-user latency are different games
- DFlash helps a lot for interactive decode
- high concurrency may favor non-speculative serving
- GB10/SM121 still has some weird kernel edge cases

but this is exactly why i wanted local hardware. not just "run a model locally." actually tune the stack: model → quantization → kernels → serving → speculation → agent loop

local AI is becoming less about downloading weights and more about owning the whole inference system. that's the fun part.
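For anyone wanting to try the same shape of setup, a minimal sketch of pointing a script (or an agent loop) at a local vLLM-style OpenAI-compatible server might look like the following; the base URL, port, and served-model name are assumptions, not details from the post.

```python
# Minimal sketch: talk to a local vLLM-style OpenAI-compatible server.
# Base URL, port, and model name are assumptions; match them to however
# the server was actually launched.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # local server, no cloud API
    api_key="not-needed-locally",          # placeholder; local servers usually ignore it
)

stream = client.chat.completions.create(
    model="gemma-4-26b-a4b-nvfp4",         # hypothetical served-model name
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```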

Chris @chrisdrit·
@mr_r0b0t @GIGABYTEUSA Interesting, I’ve had similar results bypassing my window manager, Hyprland, and just going directly through a TTY

mr-r0b0t @mr_r0b0t·
@chrisdrit @GIGABYTEUSA The minute I stopped using GNOME it went from really good to excellent! Definitely need to use it headless whenever possible!

mr-r0b0t @mr_r0b0t·
@morganlinton @NVIDIAAI @NousResearch @Teknium Z-lab doesn't have a draft model for Nemotron, so in pure tok/s it lost out. Concurrency, however, is where it makes up for it. It ran up to 192 concurrent requests before I stopped it for fear of an OOM crash 😭 139.70 tok/s at c4 is stable and very workable!
[image attached]
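A concurrency sweep like the one described above can be driven with a short async client. This is only a rough sketch against a generic OpenAI-compatible endpoint; the URL, model name, prompt, and concurrency levels are placeholders rather than the actual benchmark harness.

```python
# Rough sketch of a concurrency sweep against a local OpenAI-compatible
# endpoint. URL, model name, prompt, and concurrency levels are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="local")
MODEL = "nemotron-3-nano-30b-a3b"          # hypothetical served-model name
PROMPT = "Write a short changelog entry for a bugfix release."

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def sweep(concurrency: int) -> float:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    return sum(tokens) / elapsed            # aggregate generated tok/s at this level

async def main() -> None:
    for c in (1, 4, 16, 64):
        print(f"c{c}: {await sweep(c):.1f} tok/s aggregate")

asyncio.run(main())
```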

mr-r0b0t @mr_r0b0t·
Productive day for the @NVIDIAAI GB10! "r0b0t-dgx", my @NousResearch Hermes agent, finished up 2 more benchmark suites, 3 total today (all NVFP4):
- Gemma4-31B + DFlash
- Qwen3.6-35B-A3B + DFlash
- Nemotron-3-Nano-30B-A3B
It just wrote the reports for the last 2 and emailed them to me 🤓
[image attached]

Mass @MemoryReboot_·
Spent today getting DFlash running on dual 3090 + Gemma 4 31B.

From the very beginning I took a wrong turn:
- AWQ 8-bit + DFlash = 0.4% acceptance, the drafter was trained on a different quant
- pip install of the PR branch → trashed my venv

What worked: @malikwas1f's club 3090 recipe, a pre-patched docker container. Just docker compose up.

Results:
- 86 tok/s, accept rate 61.6%, +33% over my MTP result (52 tok/s)

Couldn't hit those 168 tok/s (hello PCIe x4 on the second card). Gonna try to get better numbers tomorrow.
[image attached]

Chris @chrisdrit·
@Stellanhaglund Yeah, fp8 + MTP gave us 49% mean draft acceptance (3.45/6). No fp16 baseline (probably should've run one). My hunch is fp8 costs some acceptance vs fp16, but fp8 matmul kernels are ~2× faster than fp16 so the tok/s tradeoff still favors fp8.
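As a back-of-the-envelope check, a mean accepted draft length of 3.45 out of 6 implies roughly 4.45 tokens emitted per target verification pass under the usual speculative-decoding accounting (accepted drafts plus one verifier token). The sketch below is just that arithmetic, not a measurement, and it ignores the drafter's own cost.

```python
# Back-of-the-envelope only: how a mean accepted draft length maps to an
# upper-bound speedup, assuming each verification pass yields the accepted
# draft tokens plus one token from the target model, and ignoring the
# drafter's own cost. Numbers are taken from the post above.
draft_len = 6            # draft tokens proposed per verification pass
mean_accepted = 3.45     # mean accepted draft tokens per pass (of 6 proposed)

tokens_per_verify_pass = mean_accepted + 1.0
print(f"tokens emitted per target pass: {tokens_per_verify_pass:.2f}")
print(f"upper bound on decode speedup vs no speculation: {tokens_per_verify_pass:.2f}x")
# Measured tok/s will be lower once draft latency and kernel overheads are
# included, which is why the fp8-vs-fp16 call is judged on realized throughput.
```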

Michael Guo @Michaelzsguo·
my local LLM community, give me one reason I shouldn't place the order.
[image attached]

Chris @chrisdrit·
Well... this is interesting!
Matt Van Horn @mvanhorn

Introducing the Printing Press, a CLI-factory and a CLI-library. Built with @trevin. 🏭🖨📚

Most APIs suck for agents. Most MCPs suck for agents. Most official CLIs suck for agents. They waste tokens and time. @steipete started making his own because of this.

📚 A library of agent-native CLIs you can install today (Linear, ESPN, Flight GOAT (Google Flights + Kayak nonstop), Contact Goat (LinkedIn + Happenstance + Deepline more), 30+ more)
🏭 A factory that prints new ones for any service - just type /printing-press

CLIs are fast, local, SQLite-backed. Work in Claude Code, Codex, OpenClaw, Hermes.

🌐 printingpress.dev
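To make the "fast, local, SQLite-backed" idea concrete, here is a hypothetical sketch of what a tiny agent-friendly CLI in that spirit (like the SEC Edgar one mentioned upthread) could look like. The EDGAR URL, JSON fields, and caching scheme are assumptions for illustration only, not the Printing Press implementation.

```python
# Hypothetical sketch of a small agent-friendly CLI: one command, plain-text
# output, results cached in local SQLite. The EDGAR URL and JSON fields are
# assumptions for illustration only.
import argparse
import json
import sqlite3
import urllib.request

DB = sqlite3.connect("edgar_cache.db")
DB.execute("CREATE TABLE IF NOT EXISTS filings (cik TEXT PRIMARY KEY, body TEXT)")

def fetch_submissions(cik: str) -> str:
    row = DB.execute("SELECT body FROM filings WHERE cik = ?", (cik,)).fetchone()
    if row:
        return row[0]                                   # cache hit: fast, local, offline
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    req = urllib.request.Request(url, headers={"User-Agent": "example-cli contact@example.com"})
    body = urllib.request.urlopen(req).read().decode()
    DB.execute("INSERT OR REPLACE INTO filings VALUES (?, ?)", (cik, body))
    DB.commit()
    return body

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="List recent filings for a CIK")
    parser.add_argument("cik", help="SEC CIK number, e.g. 320193")
    args = parser.parse_args()
    recent = json.loads(fetch_submissions(args.cik))["filings"]["recent"]
    for date, form in zip(recent["filingDate"][:10], recent["form"][:10]):
        print(f"{date}  {form}")
```

The point of the shape is less the data source than the contract: plain arguments in, plain lines out, and a local cache so repeated agent calls cost no network round trips or extra tokens.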


Chris @chrisdrit·
@aijoey I need to get a DGX Spark 😅

Joey @aijoey·
follow up receipt for the gemma 4 26b a4b nvfp4 + dflash demo i posted. same local vllm setup on dgx spark / gb10, but this time with a fixed-prompt streamed benchmark sweep:

single stream decode avg: 112.6 tok/s
8 stream wall aggregate avg: 684.6 tok/s
75 measured requests, 0 errors

token counts from final vllm usage packets. not an official benchmark, just a reproducible local run.
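A per-request measurement in that style (timing the stream, taking token counts from the final usage packet) might look roughly like this. It assumes a server that honors OpenAI-style stream_options with include_usage; the URL and model name are placeholders.

```python
# Sketch: time one streamed request and take token counts from the final
# usage packet. Assumes an OpenAI-compatible server that supports
# stream_options={"include_usage": True}; URL and model name are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

first_token_at = None
usage = None
start = time.perf_counter()

stream = client.chat.completions.create(
    model="gemma-4-26b-a4b-nvfp4",
    messages=[{"role": "user", "content": "Explain KV caching in two paragraphs."}],
    max_tokens=512,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.perf_counter()            # time to first token
    if chunk.usage is not None:
        usage = chunk.usage                             # final packet carries token counts

decode_time = time.perf_counter() - (first_token_at or start)
print(f"{usage.completion_tokens} tokens, "
      f"{usage.completion_tokens / decode_time:.1f} decode tok/s")
```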

Joey @aijoey·
captured live run on DGX Spark: Gemma 4 26B A4B NVFP4 + DFlash via vLLM hit ~82 decode tok/s on a codegen stream and ~69 decode tok/s on a simultaneous debug/patch stream. not a formal benchmark, just a real local streamed run. Model: github.com/AEON-7/Gemma-4… @SpaceTimeViking

Chris @chrisdrit·
@ai_hakase_ On a Mac M2 Max? That's insane! Thanks for linking to the discussion.

ハカセ アイ (Ai-Hakase) 🐾 @ai_hakase_·
[2.5x speed] Qwen 3.6 27B × MTP makes local AI blazingly fast! Qwen 3.6 has achieved an astonishing "2.5x" inference speedup! 🚀 With MTP (Multi-Token Prediction), it hits 28 tok/s even on a Mac M2 Max, enabling a blazing-fast coding experience that overturns the old assumptions. On top of that, 4-bit KV cache compression supports ultra-long contexts of up to 262k, so you can analyze huge document sets at low cost without relying on external APIs. A dramatic boost to business productivity! ✨ #Qwen #LocalLLM
[image attached]
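The 4-bit KV cache claim is easy to sanity-check with rough arithmetic. The sketch below uses illustrative layer/head dimensions (not the actual Qwen 3.6 27B configuration) just to show why a quantized cache is what makes a 262k context plausible on a laptop-class machine.

```python
# Rough arithmetic only. Layer/head numbers below are illustrative
# placeholders, not the actual Qwen 3.6 27B configuration.
layers, kv_heads, head_dim = 48, 8, 128     # assumed GQA-style shape
ctx_tokens = 262_144                        # the 262k context from the post

def kv_cache_gib(bits_per_value: float) -> float:
    # K and V caches: one value per layer, kv head, head dim, and token.
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bits_per_value / 8 / 2**30

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {kv_cache_gib(bits):.0f} GiB of KV cache")
# For this assumed shape: fp16 ≈ 48 GiB vs 4-bit ≈ 12 GiB, i.e. the
# difference between not fitting next to the weights and fitting comfortably.
```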

eBot Servers @eBotServers·
@chrisdrit @MemoryReboot_ @modal That's insane. Going to try Gemma and DFlash today on the RTX 6000 Pro. 🤤 The 27B hit 149 tok/s, so I can imagine how high we can go vs the base models 🙏

Mass @MemoryReboot_·
Tested Google's new MTP drafter for Gemma 31B on dual 3090:

MTP off: 31 tok/s
MTP on: 52 tok/s (+68%), acceptance rate ~55%
Tried MTP=8 (officially recommended for 31B), got OOM

For comparison, my Qwen 3.6 27B + MTP on the same dual 3090 hits 70 tok/s. Gemma 31B is bigger, so the gap makes sense.

DFlash test next, going to push it further.
Google for Developers @googledevs

Gemma 4: Now up to 3x Faster. ⚡ Same quality, way more speed. Our new MTP drafters allow Gemma 4 to predict multiple tokens at once, effectively tripling your output speed without compromising intelligence.


Mass @MemoryReboot_·
Hosting your own LLM is like growing your own food.

First it's expensive, takes a lot of time and mistakes, and everyone asks "why are you bothering, just buy it at the store."

Then you taste a tomato from your garden and realize that supermarket plastic wasn't even close.

Your own weights, your own context, your own rules: nobody nerfs the model or jacks up the token price.

And most importantly: when the internet is down or the store is closed, you still eat.
[image attached]

Chris @chrisdrit·
This is sick @above_spec
AboveSpec @above_spec

RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used! 🤯

Special thanks to @MakJoris for sharing ik_llama.cpp with us!

Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵


Chris @chrisdrit·
Running on @modal (per-second billing) through Pi.dev. I had to build the nightly of vLLM to get the support, but it rocks!
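A rough sketch of that kind of setup, assuming Modal's App/Image/function API and using placeholder names throughout: install vLLM from its main branch into the image, then launch the OpenAI-compatible server inside a GPU function. In practice a source build of vLLM needs CUDA build dependencies and takes a while, so treat this as the shape of the thing rather than a working recipe.

```python
# Rough sketch only: a Modal app whose image installs vLLM from source and
# whose GPU function launches the OpenAI-compatible server. GPU type, model
# name, and port are placeholders; a real source build needs CUDA build deps.
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("git+https://github.com/vllm-project/vllm.git")  # "nightly" via main branch
)

app = modal.App("vllm-nightly-sketch", image=image)

@app.function(gpu="H100", timeout=60 * 60)
def serve() -> None:
    import subprocess
    # Launch the built-in OpenAI-compatible server; per-second billing stops when this exits.
    subprocess.run(
        ["vllm", "serve", "google/gemma-4-27b", "--port", "8000"],
        check=True,
    )
```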