Steffen Röcker

3.6K posts

@sroecker

OG local LLaMA shill. Sr. Solution Architect @RedHat, ex particle physicist. Born @ 347 ppm CO₂. Personal account, potentially unaligned.

Stuttgart, Germany · Joined March 2009
6.8K Following · 1.9K Followers
Steffen Röcker retweeted
Julien Chaumond @julien_c
What hardware actually powers open-source AI? Not benchmarks. Not vendor marketing. Real-world community usage. We’re launching @huggingface Hardware:
→ trending GPUs & CPUs
→ VRAM distribution
→ inference hardware trends
→ what the OSS AI ecosystem really runs on
41
71
412
79.3K
Steffen Röcker retweeted
Dan Alistarh @DAlistarh
Weight-only quantization powers local LLMs like llama.cpp or Ollama. But SOTA quantized accuracy requires complex kernels that are notoriously hard to implement. Can we get SOTA accuracy and keep things simple? Our new GSQ (Gumbel-Softmax Quantization) method says yes. 🧵
1
12
51
6.1K
Steffen Röcker retweeted
Daniel Han @danielhanchen
We released experimental MTP Qwen3.6 Unsloth GGUFs! Qwen3.6 27B MTP now runs at 140 tokens/s. Qwen3.6 35B-A3B MTP gets 220 tokens/s generation on a single GPU. Qwen3.6 27B and 35B-A3B have >1.4x speed-up over the original GGUFs without any change in accuracy.
Guide + GGUFs + Benchmarks: unsloth.ai/docs/models/qw…
In terms of average speedup, we see 1.4x for dense models at draft tokens = 2, and around 1.15x to 1.2x for the MoE. We do not recommend more than 2 draft tokens because the acceptance rate drops precipitously from 83% to 50% with 4 draft tokens, and the forward passes for MTP become less beneficial. Use `--spec-type mtp --spec-draft-n-max 2`
Thanks to Aman for github.com/ggml-org/llama…!
61
117
789
122.7K
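A rough way to see why more draft tokens can hurt once acceptance falls: the sketch below computes the expected number of tokens produced per verification pass. It assumes the quoted 83% and 50% figures are per-token acceptance rates and that each drafted token is accepted independently, neither of which the post states. The function name is hypothetical, and the model ignores the cost of the extra MTP forward passes.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_tokens: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumption: every drafted token is accepted independently with the same
    probability, and acceptance stops at the first rejection. The target model
    always contributes one token itself, hence the n + 1 terms in the series.
    """
    a, n = acceptance_rate, draft_tokens
    if a >= 1.0:
        return float(n + 1)
    # Geometric series: 1 + a + a^2 + ... + a^n
    return (1.0 - a ** (n + 1)) / (1.0 - a)


two_drafts = expected_tokens_per_step(0.83, 2)
four_drafts = expected_tokens_per_step(0.50, 4)
print(f"2 draft tokens at 83% acceptance: {two_drafts:.2f} tokens per pass")
print(f"4 draft tokens at 50% acceptance: {four_drafts:.2f} tokens per pass")
```

Under these assumptions, two draft tokens yield roughly 2.5 tokens per pass while four yield roughly 1.9, so the longer draft does more work for less output. The measured 1.4x speedup is lower than 2.5x because drafting and verification are not free.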
Steffen Röcker retweeted
Tom Turney @no_stp_on_snek
appreciate the comprehensive write-up from @_EldarKurtic, @mgoin_, @RedHat_AI on TurboQuant. data on H100 with native FP8 Tensor Cores looks right for what was tested. few things to add from the non-H100 side, where most of my testing lives:
Eldar Kurtić @_EldarKurtic

TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:

1
1
16
2K
Steffen Röcker retweeted
Eldar Kurtić @_EldarKurtic
TurboQuant has drawn a lot of attention recently, but the accompanying evals didn't tell the full story. So we ran what I believe is the first comprehensive study of TurboQuant: where it helps, where it falls short, and how it impacts accuracy, latency, and throughput. Findings:
11
52
322
79.9K
Steffen Röcker retweeted
tender @tenderizzation
wow
tender tweet media
25
98
4.3K
123.2K
Steffen Röcker retweeted
antirez @antirez
Welcome to DS4, a specialized inference engine for DeepSeek v4 Flash. github.com/antirez/ds4 This project would have been impossible without the existence of llama.cpp and GGML and the work of @ggerganov and all the other contributors. Thanks!
47
217
1.5K
194.4K
Steffen Röcker retweeted
buun @spiritbuun
PPL-maxxing Qwen 27B at 3.77 BPW. 11.8GB of intelligence.
4
0
25
3.2K
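The size in the post above can be sanity-checked from the bits-per-weight figure. This is a minimal sketch that assumes a nominal 27 billion parameters (taken from the "27B" label, not an exact count) and reads "GB" as GiB; the function name is hypothetical.

```python
def quantized_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GiB for a quantized model."""
    total_bytes = num_params * bits_per_weight / 8  # bits -> bytes
    return total_bytes / 2**30                      # bytes -> GiB


# Assumption: exactly 27e9 parameters; the real count may differ slightly.
size = quantized_size_gib(27e9, 3.77)
print(f"{size:.2f} GiB")
```

With those assumptions the result is about 11.85 GiB, in line with the 11.8GB quoted.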
Steffen Röcker retweeted
0xSero @0xSero
New careers will be born, this one is mine.
16
16
183
13.7K
AboveSpec @above_spec
For daily use, ncmoe=99 wins. Full 262k context. Higher quality. 3 GB of VRAM sitting idle while you work.
If you're benchmarking or doing short bursts, ncmoe=32 for the extra 10%. But as a background assistant that's always on? ncmoe=99 is the reliable pick.
RAM note: expert weights (~19 GB) stream from system RAM via mmap. You only need **32 GB RAM** to run this comfortably. DDR5 preferred: the CPU expert throughput is bandwidth-bound, so faster RAM = faster tokens.
```bash
llama-server \
  --model Qwen3.6-35B-A3B-IQ4_K_R4.gguf \
  -ngl 99 --n-cpu-moe 99 -fa 1 \
  -ctk q4_0 -ctv q4_0 \
  -c 262144 -t 12
```
Model: abovespec/Qwen3.6-35B-A3B-IQ4_K_R4-GGUF
Engine: ik_llama.cpp (IQ4_K_R4 won't load in mainline)
HW: RTX 4060 Ti 8 GB · Ryzen 9 7900X · 93 GB DDR5
Next up: can we squeeze in a small drafter model for spec decoding? That's a separate thread. If you have successfully done spec decoding with 35b a3b please teach me!
3
0
17
2.9K
AboveSpec @above_spec
Quick update on the 35B / 8GB setup. Switched to IQ4_K_R4 (a higher-quality quant, without losing much speed), getting ~49 tok/s through the model's full native 262k context. And VRAM usage is low enough to keep a browser with multiple tabs open the whole time. 🧵
AboveSpec @above_spec

Qwen3.6 35B A3B model. 55+ tokens/sec. $300 GPU. No, this isn't a server card. It's an RTX 4060 Ti 8GB. Previously I posted that I got 41 t/s on this GPU, and that post blew up and went viral. I went back and made it 34% faster. And now the speed doesn't drop with context depth at all. New benchmarks + what changed 🧵

15
16
191
26.1K
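The "34% faster" figure in the quoted post follows from the two throughput numbers it gives. A small check, using 55 t/s as the lower bound implied by "55+" (the helper name is hypothetical):

```python
def percent_faster(old_tps: float, new_tps: float) -> float:
    """Relative throughput gain of new_tps over old_tps, in percent."""
    return (new_tps / old_tps - 1.0) * 100.0


gain = percent_faster(41.0, 55.0)
print(f"{gain:.1f}% faster")
```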
Steffen Röcker retweeted
wd 🔺 @populartourist
One of the most important llama.cpp PRs is still waiting for mainline: github.com/ggml-org/llama… MTP support means models that ship with MTP heads can potentially run significantly faster. Early Qwen3.6 27B MTP-enabled GGUFs are already running around 2x baseline speed. Qwen3.6 27B re-uploads are likely on the horizon, and once MTP lands in llama.cpp, 2x-speed GGUFs could become the new normal. What a time to be alive.
8
14
216
19.9K
Steffen Röcker retweeted
vLLM @vllm_project
🚀 Excited to be the exclusive day-0 launch partner for @lightseekorg's Tokenspeed project! We've integrated Tokenspeed's MLA library, optimized specifically for agentic workloads with long context and multi-turn, purpose-built for Kimi 2.5/2.6 and DeepSeek R1 on NVIDIA Blackwell hardware! Try it out today with our preview image - nightly support coming soon!
LightSeek Foundation @lightseekorg

Introducing TokenSpeed, a speed-of-light LLM inference engine.
> TensorRT LLM level performance
> vLLM level usability
> Built by a lean and mission-driven team in two months
> MIT license, open-source
github.com/lightseekorg/t… lightseek.org/blog/lightseek…

9
27
198
32.1K
Steffen Röcker retweeted
AVB @neural_avb
The more you work with tool-calling agents, the more you realize you actually needed an RLM. A lot of the activity in ReAct traces is just the LLM calling a tool with information from its context repeated verbatim (often a slice of the user input, or the output of another tool call). A normal agent has to generate this context token by token when calling a new tool or returning an answer. This gets really bad on really long chunks of text, because the LLM keeps reading and writing the same tokens over and over: it can't store slices of the data in a variable and pass the reference around. Basically, that's the point of an RLM. Also, FYI, you can pass external tools into RLMs as well, which they can call inside their REPL to transform stuff.
AVB @neural_avb

x.com/i/article/2030…

9
21
182
20.1K
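The contrast in the post above, re-emitting data versus passing a reference to it, can be illustrated with a toy comparison. This is not an RLM implementation: `ToyRepl`, `uppercase_tool`, and the whitespace token counter are all hypothetical stand-ins chosen for illustration.

```python
from typing import Callable, Dict


def count_tokens(text: str) -> int:
    """Crude whitespace token count, standing in for a real tokenizer."""
    return len(text.split())


class ToyRepl:
    """Holds named values so a model can refer to data by name."""

    def __init__(self) -> None:
        self.variables: Dict[str, str] = {}

    def call(self, tool: Callable[[str], str], arg_name: str, out_name: str) -> None:
        # The tool reads and writes variables; the model only emits the names.
        self.variables[out_name] = tool(self.variables[arg_name])


def uppercase_tool(text: str) -> str:
    """Placeholder for any external tool that transforms text."""
    return text.upper()


document = "lorem ipsum " * 5000  # a long input the agent must hand to a tool

# Pass-by-value: the model re-generates the whole document as the tool argument.
tokens_by_value = count_tokens(document)

# Pass-by-reference: the model emits only a short call naming the variables.
repl = ToyRepl()
repl.variables["doc"] = document
call_text = 'call(uppercase_tool, "doc", "doc_upper")'
repl.call(uppercase_tool, "doc", "doc_upper")
tokens_by_reference = count_tokens(call_text)

print(tokens_by_value, tokens_by_reference)
```

In this toy setup the generated-token cost of the call no longer grows with the size of the data, which is the property the post attributes to working inside a REPL.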