Steffen Röcker

3.6K posts

@sroecker

OG local LLaMA shill. Sr. Solution Architect @RedHat, ex particle physicist. Born @ 347 ppm CO₂. Personal account, potentially unaligned.

Stuttgart, Germany · Joined March 2009

6.8K Following · 1.9K Followers

Pinned Tweet

Steffen Röcker@sroecker·12 Apr

Your Hermes Agent can now delegate to RLMs 🙌 Recreated the document analyzer example with the converted skill. 136 PDF pages analyzed. Best part: Auto-configures from HERMES_MODEL / HERMES_PROVIDER env vars @NousResearch @Teknium github.com/sroecker/predi…

Gabriel Lespérance@GabLesperance

x.com/i/article/2042…

331

48.1K

Steffen Röcker retweeted

Arnav Chavan@ArnavChavan6·20 May

🚀 Organizing the Efficient Qwen Competition @icmlconf ! Goal: Minimize LLM inference latency for a single GPU without breaking model quality. Prizes: $3K / $2K / $1K + present at ICML 2026, Seoul Getting Started - adaptfm.gitlab.io/call-for-compe… Leaderboard - d1krc5fcnf73gi.cloudfront.net

144

10.3K

Steffen Röcker retweeted

Julien Chaumond@julien_c·6d

What hardware actually powers open-source AI? Not benchmarks. Not vendor marketing. Real-world community usage. We’re launching @huggingface Hardware:
→ trending GPUs & CPUs
→ VRAM distribution
→ inference hardware trends
→ what the OSS AI ecosystem really runs on

412

79.3K

Steffen Röcker retweeted

Dan Alistarh@DAlistarh·19 May

Weight-only quantization powers local LLMs like llama.cpp or Ollama. But SOTA quantized accuracy requires complex kernels that are notoriously hard to implement. Can we get SOTA accuracy and keep things simple? Our new GSQ (Gumbel-Softmax Quantization) method says yes. 🧵

6.1K

Steffen Röcker retweeted

Daniel Han@danielhanchen·13 May

We released experimental MTP Qwen3.6 Unsloth GGUFs! Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU. Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy. Guide + GGUFs + Benchmarks: unsloth.ai/docs/models/qw…

In terms of average speedup, we see a 1.4x for dense models at draft tokens = 2 and for the MoE around 1.15 to 1.2x. We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial. Use `--spec-type mtp --spec-draft-n-max 2`

Thanks to Aman for github.com/ggml-org/llama…!
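The trade-off between draft length and acceptance rate described in the post above can be illustrated with a common textbook model of speculative decoding. This is only a sketch: it assumes the quoted 83% and 50% figures are per-token acceptance probabilities and that acceptances are independent, neither of which the post states. `expected_tokens_per_step` is a hypothetical helper, not part of llama.cpp or Unsloth.

```python
def expected_tokens_per_step(accept_rate: float, n_draft: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes each draft token is accepted independently with probability
    `accept_rate` and that drafting stops at the first rejection; the
    verifying model always contributes one token of its own.
    Equivalent to 1 + a + a^2 + ... + a^n_draft.
    """
    if not 0.0 <= accept_rate <= 1.0:
        raise ValueError("accept_rate must be in [0, 1]")
    if accept_rate == 1.0:
        return float(n_draft + 1)
    return (1 - accept_rate ** (n_draft + 1)) / (1 - accept_rate)


# Figures from the post, read as per-token acceptance rates (an assumption).
two_drafts = expected_tokens_per_step(0.83, 2)
four_drafts = expected_tokens_per_step(0.50, 4)
print(f"2 draft tokens @ 83%: {two_drafts:.2f} tokens/step")
print(f"4 draft tokens @ 50%: {four_drafts:.2f} tokens/step")
```

Under these assumptions the shorter draft yields more tokens per verification pass (roughly 2.5 versus 1.9) while also spending fewer draft passes. The real speed-up is smaller than these numbers because drafting and verification have their own cost, which this sketch ignores.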

117

789

122.7K

Steffen Röcker retweeted

Tom Turney@no_stp_on_snek·11 May

appreciate the comprehensive write-up from @_EldarKurtic, @mgoin_, @RedHat_AI on TurboQuant. data on H100 with native FP8 Tensor Cores looks right for what was tested. few things to add from the non-H100 side, where most of my testing lives:

Eldar Kurtić@_EldarKurtic

TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:

Steffen Röcker retweeted

Eldar Kurtić@_EldarKurtic·11 May

For more details and results check the full blog at vllm.ai/blog/turboquant . This is joint work with @mgoin_ and Alexandre Marques from @RedHat_AI and @vllm_project .

1.4K

Steffen Röcker retweeted

Eldar Kurtić@_EldarKurtic·11 May

322

79.9K

Steffen Röcker retweeted

Armin Ronacher ⇌@mitsuhiko·8 May

I think @antirez ds4.c is important! I wrote down my thoughts on why I built pi-ds4 and why we need to focus our local model efforts stronger than we do currently. lucumr.pocoo.org/2026/5/8/local…

375

30.4K

Steffen Röcker retweeted

tender@tenderizzation·7 May

wow

4.3K

123.2K

Steffen Röcker retweeted

antirez@antirez·7 May

Welcome to DS4, a specialized inference engine for DeepSeek v4 Flash. github.com/antirez/ds4 This project would have been impossible without the existence of llama.cpp and GGML and the work of @ggerganov and all the other contributors. Thanks!

217

1.5K

194.4K

Steffen Röcker retweeted

Yannick Nick@keennay·7 May

>new AMD Instinct MI350P GPU
>CDNA 4
>PCIe Gen 5 x16
>144GB HBM3E 4TB/s
>native MXFP6 and MXFP4 support

AMD@AMD

Don’t just scale AI. Scale ROI. AMD Instinct MI350P PCIe cards deliver 144 GB of HBM3E memory and up to 2299 teraFLOPS (at MXFP4) in a drop-in, air-cooled card built for standard servers. That’s how you scale AI at maximum ROI without redesigning your data center. Interested in drop-in AMD Instinct MI350P PCIe cards? See the specs at the link: bit.ly/4exiAg2

375

38.8K

Steffen Röcker@sroecker·7 May

@spiritbuun Nice! Hope decode doesn't suffer too much.

150

buun@spiritbuun·7 May

PPL-maxxing Qwen 27B at 3.77 BPW. 11.8GB of intelligence.
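As a rough check of the figures in the post above, weight storage follows from parameter count times bits per weight. The sketch below assumes "27B" means exactly 27e9 parameters (the real count may differ) and ignores file metadata and any tensors kept at higher precision; `quantized_size_bytes` is a hypothetical helper name.

```python
def quantized_size_bytes(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage: parameters x bits per weight, in bytes."""
    return n_params * bits_per_weight / 8


# Assumption: "27B" is taken as exactly 27e9 parameters.
size = quantized_size_bytes(27e9, 3.77)
size_gb = size / 1e9      # decimal gigabytes
size_gib = size / 2**30   # binary gibibytes
print(f"{size_gb:.2f} GB, {size_gib:.2f} GiB")
```

With these inputs the estimate comes to about 12.7 GB, or roughly 11.85 GiB, which is in line with the quoted 11.8GB if that figure is in binary units.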

3.2K

Steffen Röcker retweeted

0xSero@0xSero·7 May

New careers will be born, this one is mine.

183

13.7K

Steffen Röcker@sroecker·7 May

First entry to @LottoLabs localmaxxing: 28k prefill with vLLM serving Qwen 3.6 35B A3B REAP (0.5 ratio) in NVFP4 on a 5070 Ti with 16 GiB VRAM localmaxxing.com/models/sroecke… Check out the model and instructions here huggingface.co/sroecker/Qwen3…

126

Steffen Röcker@sroecker·7 May

@above_spec Probably not going to get crazy improvements with MTP and 35B but one user with 3 GPUs did reddit.com/r/LocalLLaMA/s…

121

AboveSpec@above_spec·7 May

For daily use, ncmoe=99 wins. Full 262k context. Higher quality. 3 GB of VRAM sitting idle while you work. If you're benchmarking or doing short bursts, ncmoe=32 for the extra 10%. But as a background assistant that's always on? ncmoe=99 is the reliable pick.

RAM note: expert weights (~19 GB) stream from system RAM via mmap. You only need **32 GB RAM** to run this comfortably. DDR5 preferred: the CPU expert throughput is bandwidth-bound, so faster RAM = faster tokens.

```bash
llama-server \
  --model Qwen3.6-35B-A3B-IQ4_K_R4.gguf \
  -ngl 99 --n-cpu-moe 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 \
  -c 262144 -t 12
```

Model: abovespec/Qwen3.6-35B-A3B-IQ4_K_R4-GGUF
Engine: ik_llama.cpp (IQ4_K_R4 won't load in mainline)
HW: RTX 4060 Ti 8 GB · Ryzen 9 7900X · 93 GB DDR5

Next up: can we squeeze in a small drafter model for spec decoding? That's a separate thread. If you have successfully done spec decoding with 35b a3b please teach me!

2.9K

AboveSpec@above_spec·7 May

Quick update on the 35B / 8GB setup. Switched to IQ4_K_R4 — higher quality quant, without losing much speed — getting ~49tok/s through model's full native 262k context. And VRAM usage is low enough to keep a browser with multiple tabs open the whole time. 🧵

AboveSpec@above_spec

Qwen3.6 35B A3B model. 55+ tokens/sec. $300 GPU. No, this isn't a server card. It's an RTX 4060 Ti 8GB. Previously I posted that I got 41 t/s on this GPU and that post blew up and went viral. I went back and made it 34% faster. And now the speed doesn't drop with context depth at all. New benchmarks + what changed 🧵

191

26.1K

Steffen Röcker retweeted

wd 🔺@populartourist·6 May

One of the most important llama.cpp PRs is still waiting for mainline: github.com/ggml-org/llama… MTP support means models that ship with MTP heads can potentially run significantly faster. Early Qwen3.6 27B MTP-enabled GGUFs are already running around 2x baseline speed. Qwen3.6 27B re-uploads are likely on the horizon, and once MTP lands in llama.cpp, 2x-speed GGUFs could become the new normal. What a time to be alive.

216

19.9K

Steffen Röcker retweeted

vLLM@vllm_project·6 May

🚀 Excited to be the exclusive day-0 launch partner for @lightseekorg's Tokenspeed project! We've integrated Tokenspeed's MLA library, optimized specifically for agentic workloads with long context and multi-turn, purpose-built for Kimi 2.5/2.6 and DeepSeek R1 on NVIDIA Blackwell hardware! Try it out today with our preview image - nightly support coming soon!

LightSeek Foundation@lightseekorg

Introducing TokenSpeed, a speed-of-light LLM inference engine.
> TensorRT LLM level performance
> vLLM level usability
> Built by a lean and mission-driven team in two months
> MIT license, open-source
github.com/lightseekorg/t… lightseek.org/blog/lightseek…

198

32.1K

Steffen Röcker retweeted

LightSeek Foundation@lightseekorg·6 May

125

1.1K

1.8M

Steffen Röcker retweeted

AVB@neural_avb·6 May

More you work with tool calling agents, more you realize you actually needed an RLM. A bunch of activity with ReAct traces is actually just the LLM calling a tool with information within its context repeated verbatim (often a slice of the user input, or the output of another tool call). A normal agent will have to generate this context token by token when calling a new tool, or returning an answer. This gets really bad on really long chunks of texts coz the LLM just keeps reading and writing the same tokens over and over. Coz it can’t store slices of the data in a variable and pass around the reference everywhere. Basically that’s the point of an RLM. Also fyi, you can pass external tools into RLMs as well that they can call inside their repl to transform stuff.

AVB@neural_avb

x.com/i/article/2030…
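The point about passing references instead of regenerating text can be shown with a toy example. This is a minimal sketch, not the API of any real RLM library: `ToyRepl`, its methods, and the `upper` stand-in tool are all hypothetical names made up for illustration.

```python
from typing import Callable, Dict


class ToyRepl:
    """Holds named values so a model can refer to them by name."""

    def __init__(self) -> None:
        self.variables: Dict[str, str] = {}
        self.tools: Dict[str, Callable[[str], str]] = {}

    def store(self, name: str, value: str) -> None:
        """Bind a value to a variable name inside the REPL."""
        self.variables[name] = value

    def call(self, tool: str, arg_name: str, out_name: str) -> None:
        """Run a tool on a stored value and store the result under out_name."""
        self.variables[out_name] = self.tools[tool](self.variables[arg_name])


repl = ToyRepl()
repl.tools["upper"] = str.upper  # stand-in for an external tool
repl.store("doc", "a very long document " * 1000)

# What the model would have to *generate* in each style of tool call:
doc_text = repl.variables["doc"]
inline_call = 'upper("' + doc_text + '")'  # the whole text repeated verbatim
reference_call = "upper(doc)"              # only the variable name

repl.call("upper", "doc", "doc_upper")
print(len(inline_call), len(reference_call))
```

The inline call grows with the size of the data, while the reference call stays a few characters long no matter how large `doc` is.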

182

20.1K

Steffen Röcker retweeted

Steffen Röcker@sroecker·6 May

@ScalaWilliam Not bad, but it could be better ;) Created this NVFP4 GGUF yesterday in order to facilitate testing with only 16GiB VRAM huggingface.co/sroecker/Qwen3…
