Chris @chrisdrit
746 posts

Digging into new things! @FrameworkPuter @OmarchyLinux @Neovim, LLMs, Agents, AI and loving it!

Earth · Joined November 2009
299 Following · 301 Followers

Pinned Tweet
Chris @chrisdrit·
131 tokens / second w/Gemma 4 MTP
[image attached]

Chris @chrisdrit·
@ppressdev @mvanhorn just printed my first CLI for SEC Edgar data. Printing Press is a nice project! Congrats on the hard work, excited to see my CLI added!

Chris @chrisdrit·
@aijoey That's huge! Awesome win 👏 Curious, for the 90 t/s, what size context window are you running? For interactive, non-batch throughput this is nice.

Joey @aijoey·
got Gemma 4 26B A4B uncensored running locally on the DGX Spark.

setup:
- NVIDIA GB10 / Blackwell
- 128GB unified memory
- NVFP4 quantized model
- vLLM-compatible OpenAI API
- DFlash speculative decoding
- local only, no cloud API

the interesting part: this is small enough to run comfortably on the Spark, but still capable enough for agentic workflows. with the @SpaceTimeViking vLLM container + DFlash drafter, it's hitting interactive speeds that feel usable for local coding / research agents, roughly the ~90 tok/s range in smoke tests, depending on prompt and settings.

still caveating this heavily:
- batch throughput and single-user latency are different games
- DFlash helps a lot for interactive decode
- high concurrency may favor non-speculative serving
- GB10/SM121 still has some weird kernel edge cases

but this is exactly why i wanted local hardware. not just "run a model locally." actually tune the stack: model → quantization → kernels → serving → speculation → agent loop

local AI is becoming less about downloading weights and more about owning the whole inference system. that's the fun part.
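For anyone wanting to try the same shape of setup, a minimal sketch of pointing a script (or an agent loop) at a local vLLM-style OpenAI-compatible server might look like the following; the base URL, port, and served-model name are assumptions, not details from the post.

```python
# Minimal sketch: talk to a local vLLM-style OpenAI-compatible server.
# Base URL, port, and model name are assumptions; match them to however
# the server was actually launched.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",   # local server, no cloud API
    api_key="not-needed-locally",          # placeholder; local servers usually ignore it
)

stream = client.chat.completions.create(
    model="gemma-4-26b-a4b-nvfp4",         # hypothetical served-model name
    messages=[{"role": "user", "content": "Summarize this repo's build steps."}],
    max_tokens=256,
    stream=True,
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
```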

Chris @chrisdrit·
@mr_r0b0t @GIGABYTEUSA Interesting, I’ve had similar results bypassing my window manager, Hyprland, and just going directly through a TTY

mr-r0b0t @mr_r0b0t·
@chrisdrit @GIGABYTEUSA The minute I stopped using GNOME it went from really good to excellent! Definitely need to use it headless whenever possible!

mr-r0b0t @mr_r0b0t·
@morganlinton @NVIDIAAI @NousResearch @Teknium Z-lab doesn't have a draft model for Nemotron, so in pure tok/s it lost out. Concurrency, however, is where it makes up for it. It ran up to 192 concurrent requests before I stopped it for fear of an OOM crash 😭 139.70 tok/s at c4 is stable and very workable!
[image attached]
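A concurrency sweep like the one described above can be driven with a short async client. This is only a rough sketch against a generic OpenAI-compatible endpoint; the URL, model name, prompt, and concurrency levels are placeholders rather than the actual benchmark harness.

```python
# Rough sketch of a concurrency sweep against a local OpenAI-compatible
# endpoint. URL, model name, prompt, and concurrency levels are placeholders.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="local")
MODEL = "nemotron-3-nano-30b-a3b"          # hypothetical served-model name
PROMPT = "Write a short changelog entry for a bugfix release."

async def one_request() -> int:
    resp = await client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": PROMPT}],
        max_tokens=128,
    )
    return resp.usage.completion_tokens

async def sweep(concurrency: int) -> float:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request() for _ in range(concurrency)))
    elapsed = time.perf_counter() - start
    return sum(tokens) / elapsed            # aggregate generated tok/s at this level

async def main() -> None:
    for c in (1, 4, 16, 64):
        print(f"c{c}: {await sweep(c):.1f} tok/s aggregate")

asyncio.run(main())
```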

mr-r0b0t @mr_r0b0t·
Productive day for the @NVIDIAAI GB10! "r0b0t-dgx", my @NousResearch Hermes agent, finished up 2 more benchmark suites, 3 total today (all NVFP4):
- Gemma4-31B + DFlash
- Qwen3.6-35B-A3B + DFlash
- Nemotron-3-Nano-30B-A3B
It just wrote the reports for the last 2 and emailed them to me 🤓
[image attached]

Mass @MemoryReboot_·
Spent today getting DFlash running on dual 3090 + Gemma 4 31B.

From the very beginning I took a wrong turn:
- AWQ 8-bit + DFlash = 0.4% acceptance, the drafter was trained on a different quant
- pip install of the PR branch → trashed my venv

What worked: @malikwas1f's club 3090 recipe, a pre-patched docker container. Just docker compose up.

Results:
- 86 tok/s, accept rate 61.6%, +33% over my MTP result (52 tok/s)

Couldn't hit those 168 tok/s (hello PCIe x4 on the second card). Gonna try to get better numbers tomorrow.
[image attached]

Chris @chrisdrit·
@Stellanhaglund Yeah, fp8 + MTP gave us 49% mean draft acceptance (3.45/6). No fp16 baseline (probably should've run one). My hunch is fp8 costs some acceptance vs fp16, but fp8 matmul kernels are ~2× faster than fp16 so the tok/s tradeoff still favors fp8.
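As a back-of-the-envelope check, a mean accepted draft length of 3.45 out of 6 implies roughly 4.45 tokens emitted per target verification pass under the usual speculative-decoding accounting (accepted drafts plus one verifier token). The sketch below is just that arithmetic, not a measurement, and it ignores the drafter's own cost.

```python
# Back-of-the-envelope only: how a mean accepted draft length maps to an
# upper-bound speedup, assuming each verification pass yields the accepted
# draft tokens plus one token from the target model, and ignoring the
# drafter's own cost. Numbers are taken from the post above.
draft_len = 6            # draft tokens proposed per verification pass
mean_accepted = 3.45     # mean accepted draft tokens per pass (of 6 proposed)

tokens_per_verify_pass = mean_accepted + 1.0
print(f"tokens emitted per target pass: {tokens_per_verify_pass:.2f}")
print(f"upper bound on decode speedup vs no speculation: {tokens_per_verify_pass:.2f}x")
# Measured tok/s will be lower once draft latency and kernel overheads are
# included, which is why the fp8-vs-fp16 call is judged on realized throughput.
```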

Michael Guo @Michaelzsguo·
my local LLM community, give me one reason I shouldn't place the order.
[image attached]

Chris @chrisdrit·
Well... this is interesting!
Matt Van Horn @mvanhorn

Introducing the Printing Press, a CLI-factory and a CLI-library. Built with @trevin. 🏭🖨📚

Most APIs suck for agents. Most MCPs suck for agents. Most official CLIs suck for agents. They waste tokens and time. @steipete started making his own because of this.

📚 A library of agent-native CLIs you can install today (Linear, ESPN, Flight GOAT (Google Flights + Kayak nonstop), Contact Goat (LinkedIn + Happenstance + Deepline more), 30+ more)
🏭 A factory that prints new ones for any service - just type /printing-press

CLIs are fast, local, SQLite-backed. Work in Claude Code, Codex, OpenClaw, Hermes.

🌐 printingpress.dev
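To make the "fast, local, SQLite-backed" idea concrete, here is a hypothetical sketch of what a tiny agent-friendly CLI in that spirit (like the SEC Edgar one mentioned upthread) could look like. The EDGAR URL, JSON fields, and caching scheme are assumptions for illustration only, not the Printing Press implementation.

```python
# Hypothetical sketch of a small agent-friendly CLI: one command, plain-text
# output, results cached in local SQLite. The EDGAR URL and JSON fields are
# assumptions for illustration only.
import argparse
import json
import sqlite3
import urllib.request

DB = sqlite3.connect("edgar_cache.db")
DB.execute("CREATE TABLE IF NOT EXISTS filings (cik TEXT PRIMARY KEY, body TEXT)")

def fetch_submissions(cik: str) -> str:
    row = DB.execute("SELECT body FROM filings WHERE cik = ?", (cik,)).fetchone()
    if row:
        return row[0]                                   # cache hit: fast, local, offline
    url = f"https://data.sec.gov/submissions/CIK{int(cik):010d}.json"
    req = urllib.request.Request(url, headers={"User-Agent": "example-cli contact@example.com"})
    body = urllib.request.urlopen(req).read().decode()
    DB.execute("INSERT OR REPLACE INTO filings VALUES (?, ?)", (cik, body))
    DB.commit()
    return body

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="List recent filings for a CIK")
    parser.add_argument("cik", help="SEC CIK number, e.g. 320193")
    args = parser.parse_args()
    recent = json.loads(fetch_submissions(args.cik))["filings"]["recent"]
    for date, form in zip(recent["filingDate"][:10], recent["form"][:10]):
        print(f"{date}  {form}")
```

The point of the shape is less the data source than the contract: plain arguments in, plain lines out, and a local cache so repeated agent calls cost no network round trips or extra tokens.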


Chris @chrisdrit·
@aijoey I need to get a DGX Spark 😅

Joey @aijoey·
follow up receipt for the gemma 4 26b a4b nvfp4 + dflash demo i posted. same local vllm setup on dgx spark / gb10, but this time with a fixed-prompt streamed benchmark sweep:

single stream decode avg: 112.6 tok/s
8 stream wall aggregate avg: 684.6 tok/s
75 measured requests, 0 errors

token counts from final vllm usage packets. not an official benchmark, just a reproducible local run.
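A per-request measurement in that style (timing the stream, taking token counts from the final usage packet) might look roughly like this. It assumes a server that honors OpenAI-style stream_options with include_usage; the URL and model name are placeholders.

```python
# Sketch: time one streamed request and take token counts from the final
# usage packet. Assumes an OpenAI-compatible server that supports
# stream_options={"include_usage": True}; URL and model name are placeholders.
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="local")

first_token_at = None
usage = None
start = time.perf_counter()

stream = client.chat.completions.create(
    model="gemma-4-26b-a4b-nvfp4",
    messages=[{"role": "user", "content": "Explain KV caching in two paragraphs."}],
    max_tokens=512,
    stream=True,
    stream_options={"include_usage": True},
)

for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.perf_counter()            # time to first token
    if chunk.usage is not None:
        usage = chunk.usage                             # final packet carries token counts

decode_time = time.perf_counter() - (first_token_at or start)
print(f"{usage.completion_tokens} tokens, "
      f"{usage.completion_tokens / decode_time:.1f} decode tok/s")
```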

Joey @aijoey·
captured live run on DGX Spark: Gemma 4 26B A4B NVFP4 + DFlash via vLLM hit ~82 decode tok/s on a codegen stream and ~69 decode tok/s on a simultaneous debug/patch stream. not a formal benchmark, just a real local streamed run. Model: github.com/AEON-7/Gemma-4… @SpaceTimeViking

Chris @chrisdrit·
@ai_hakase_ On a Mac M2 Max? That's insane! Thanks for linking to the discussion.

ハカセ アイ (Ai-Hakase) 🐾 @ai_hakase_·
[2.5x speed] Qwen 3.6 27B × MTP makes local AI blazingly fast! Qwen 3.6 has achieved an astonishing "2.5x" inference speedup! 🚀 With MTP (Multi-Token Prediction), it hits 28 tok/s even on a Mac M2 Max, enabling a blazing-fast coding experience that overturns the old assumptions. On top of that, 4-bit KV cache compression supports ultra-long contexts of up to 262k, so you can analyze huge document sets at low cost without relying on external APIs. A dramatic boost to business productivity! ✨ #Qwen #LocalLLM
[image attached]
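The 4-bit KV cache claim is easy to sanity-check with rough arithmetic. The sketch below uses illustrative layer/head dimensions (not the actual Qwen 3.6 27B configuration) just to show why a quantized cache is what makes a 262k context plausible on a laptop-class machine.

```python
# Rough arithmetic only. Layer/head numbers below are illustrative
# placeholders, not the actual Qwen 3.6 27B configuration.
layers, kv_heads, head_dim = 48, 8, 128     # assumed GQA-style shape
ctx_tokens = 262_144                        # the 262k context from the post

def kv_cache_gib(bits_per_value: float) -> float:
    # K and V caches: one value per layer, kv head, head dim, and token.
    return 2 * layers * kv_heads * head_dim * ctx_tokens * bits_per_value / 8 / 2**30

for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {kv_cache_gib(bits):.0f} GiB of KV cache")
# For this assumed shape: fp16 ≈ 48 GiB vs 4-bit ≈ 12 GiB, i.e. the
# difference between not fitting next to the weights and fitting comfortably.
```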

eBot Servers @eBotServers·
@chrisdrit @MemoryReboot_ @modal That's insane. Going to try Gemma and DFlash today on the RTX 6000 Pro. 🤤 The 27B hit 149 tok/s, so I can imagine how high we can go vs the base models 🙏

Mass @MemoryReboot_·
Tested Google's new MTP drafter for Gemma 31B on dual 3090:

MTP off: 31 tok/s
MTP on: 52 tok/s (+68%), acceptance rate ~55%
Tried MTP=8 (officially recommended for 31B), got OOM

For comparison, my Qwen 3.6 27B + MTP on the same dual 3090 hits 70 tok/s. Gemma 31B is bigger, so the gap makes sense.

DFlash test next, going to push it further.
Google for Developers @googledevs

Gemma 4: Now up to 3x Faster. ⚡ Same quality, way more speed. Our new MTP drafters allow Gemma 4 to predict multiple tokens at once, effectively tripling your output speed without compromising intelligence.


Mass @MemoryReboot_·
Hosting your own LLM is like growing your own food.

First it's expensive, takes a lot of time and mistakes, and everyone asks "why are you bothering, just buy it at the store."

Then you taste a tomato from your garden and realize that supermarket plastic wasn't even close.

Your own weights, your own context, your own rules: nobody nerfs the model or jacks up the token price.

And most importantly: when the internet is down or the store is closed, you still eat.
[image attached]

Chris @chrisdrit·
This is sick @above_spec
AboveSpec @above_spec

RTX 5060 Ti 16GB. $429 GPU. Last night I got 128 t/s on Qwen3.6-35B using ik_llama.cpp's R4 quant format. Crushing performance. Faster than the 5070 Ti on mainline llama.cpp. Performance stays consistent from 0 to 139k context and no speculative decoding used! 🤯

Special thanks to @MakJoris for sharing ik_llama.cpp with us!

Today I wanted to know if it's actually *useful* at that speed. So I gave it a coding agent and 4 creative challenges. Here's what it built. 🧵


Chris @chrisdrit·
Running on @modal (per-second billing) through Pi.dev. I had to build the nightly of vLLM to get the support, but it rocks!
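A rough sketch of that kind of setup, assuming Modal's App/Image/function API and using placeholder names throughout: install vLLM from its main branch into the image, then launch the OpenAI-compatible server inside a GPU function. In practice a source build of vLLM needs CUDA build dependencies and takes a while, so treat this as the shape of the thing rather than a working recipe.

```python
# Rough sketch only: a Modal app whose image installs vLLM from source and
# whose GPU function launches the OpenAI-compatible server. GPU type, model
# name, and port are placeholders; a real source build needs CUDA build deps.
import modal

image = (
    modal.Image.debian_slim(python_version="3.11")
    .pip_install("git+https://github.com/vllm-project/vllm.git")  # "nightly" via main branch
)

app = modal.App("vllm-nightly-sketch", image=image)

@app.function(gpu="H100", timeout=60 * 60)
def serve() -> None:
    import subprocess
    # Launch the built-in OpenAI-compatible server; per-second billing stops when this exits.
    subprocess.run(
        ["vllm", "serve", "google/gemma-4-27b", "--port", "8000"],
        check=True,
    )
```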