BrainOS
3.5K posts

BrainOS
@BadBrainCode
How can we debug the operating system in our own brains? This is the most important question facing humanity.
Planet Earth · Joined November 2023
739 Following · 214 Followers
BrainOS reposted

@BadBrainCode The hybrid Mamba component breaks vLLM's prefix cache because the cache assumes pure transformer attention. That architectural mismatch is why cache key generation fails silently instead of erroring out.

Running NVIDIA Nemotron-3-Super-120B-A12B (NVFP4 mixed) on DGX Spark /
GB10 (Grace-Blackwell) via vLLM 0.19.2rc1. Architecture resolves as
NemotronHMTPModel — hybrid Mamba + Transformer with MTP draft head.
PROBLEM: prefix caching is non-functional on this model.
With --enable-prefix-caching explicit, vLLM emits this warning on boot:
"Prefix caching in Mamba cache 'all' mode is currently enabled.
Its support for Mamba layers is experimental."
Measured behavior over a real production run:
queries: 6,211
hits: 0
hit rate: 0.00%
Confirmed across thousands of requests with shared system prompts that
should be deduplicating cleanly. They aren't.
This isn't a misconfiguration — NVIDIA's own Nemotron-3-Super deployment
guide (<github.com/NVIDIA-NeMo/Ne…>
SparkDeploymentGuide) deliberately OMITS --enable-prefix-caching. So
upstream knows it doesn't work on NemotronH.
WHY IT MATTERS: We run a 4-way parallel fan-out pattern — brain
decomposes a brief, dispatches 4 concurrent section-writes against vLLM
with an identical 163-token shared system prompt, stitches results.
Measured throughput:
c=1: 15.0 tok/s
c=2: 25.4 tok/s
c=3: 30.1 tok/s
c=4: 41.4 tok/s (← saturation; 4th slot effectively free)
That 2.76× user-facing speedup is real. But every one of those 4 calls
re-prefills the same 163-token system message independently. With a
working prefix cache, that prefill cost drops ~4×. For larger shared
prompts (RAG context, long instruction blocks, few-shot exemplars),
the win compounds enormously.
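A quick check of the arithmetic above, as a minimal sketch. The throughput figures and the 163-token prompt come from the post; the function name and the assumption of an ideal cache that prefills the shared prompt exactly once are illustrative.

```python
# Measured aggregate throughput (tok/s) by concurrency, copied from the post.
measured = {1: 15.0, 2: 25.4, 3: 30.1, 4: 41.4}

SYSTEM_PROMPT_TOKENS = 163  # shared system prompt length from the post
FAN_OUT = 4                 # concurrent section-writes per brief


def scaling_table(throughput):
    """Return (concurrency, speedup vs c=1, fraction of linear scaling)."""
    base = throughput[1]
    rows = []
    for c in sorted(throughput):
        speedup = throughput[c] / base
        rows.append((c, round(speedup, 2), round(speedup / c, 2)))
    return rows


for c, speedup, efficiency in scaling_table(measured):
    print(f"c={c}: {speedup:.2f}x aggregate, {efficiency:.0%} of linear")

# Shared-prompt prefill work per brief, with and without an ideal prefix cache
# (assumption: an ideal cache prefills the shared prompt exactly once).
prefill_without_cache = FAN_OUT * SYSTEM_PROMPT_TOKENS
prefill_with_cache = SYSTEM_PROMPT_TOKENS
print(prefill_without_cache, prefill_with_cache)
```

The last line is where the "~4×" figure comes from: four independent prefills of the same 163 tokens versus one.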
THE STRUCTURAL QUESTION: Is this a fundamental limit? Mamba's selective-
scan state can't be paged like transformer KV — it's a fixed-size
recurrent state, not a sequence of key/value vectors. So for the Mamba
LAYERS of a hybrid model, prefix cache semantics are genuinely unclear.
But for the TRANSFORMER LAYERS interleaved with them, the KV reuse
should work fine.
Has anyone at @NVIDIAAIDev or @NVIDIAAI considered a "hybrid prefix
cache" mode that caches the transformer-layer KV pages for a matched
prefix while re-running the Mamba state forward? Even a partial fix
would eliminate most of the prefill cost on this architecture.
Or — is there a known issue / planned vLLM PR I should be watching?
Happy to share the bench harness if it's useful for repro.
cc @vllm_project — same question your side.
Hardware: DGX Spark, GB10, 121 GiB unified memory, --max-num-seqs 4,
--quantization fp4, MARLIN MoE backend, async scheduling, no MTP
(breaks structured tool-call emission — separate issue).
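For readers less familiar with prefix caching, here is a toy sketch of how a block-level prefix match can be computed for attention KV pages. It is an illustration only, not vLLM's actual code: the block size of 16, the function names, and the token values are all assumptions, and it says nothing about how Mamba state would be handled.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per cache block; an assumed value for illustration


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Return one chained hash per full block of tokens.

    Each key depends on the block's tokens and on the previous key, so two
    requests share a key only if their entire prefix up to that block matches.
    """
    keys = []
    parent = b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        keys.append(parent)
    return keys


def shared_prefix_tokens(a, b, block_size=BLOCK_SIZE):
    """Count how many leading tokens of a and b fall in matching blocks."""
    matched = 0
    for key_a, key_b in zip(block_keys(a, block_size), block_keys(b, block_size)):
        if key_a != key_b:
            break
        matched += block_size
    return matched


system_prompt = list(range(163))  # stand-in for the 163-token shared prompt
req_a = system_prompt + [9001, 9002, 9003]
req_b = system_prompt + [7001, 7002]
print(shared_prefix_tokens(req_a, req_b))  # 160
```

Under these assumptions only full blocks are reusable, so 160 of the 163 shared tokens would be cache hits and the last 3 would be recomputed with the rest of each request.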

Local AI landing page generation on a DGX Spark.
One Gemma-4-26B Q4 GGUF served by llama.cpp with 7 concurrent decode slots.
The orchestrator breaks “landing page” into 6 section briefs:
hero
features
steps
testimonials
pricing
CTA
Then 6 Gemma instances generate the sections in parallel and stitch everything into one Tailwind page.
~3 minutes end to end.
The best part: everything you just watched happens offline, forever. No one can turn it off besides my light company lol
@googlegemma @NVIDIAAIDev
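The fan-out and stitch pattern described above could look roughly like this. It is a sketch under stated assumptions: `generate_section` is a hypothetical stand-in for the real call to the local model server, and the HTML it returns is placeholder output.

```python
import asyncio

SECTIONS = ["hero", "features", "steps", "testimonials", "pricing", "CTA"]
MAX_SLOTS = 7  # concurrent decode slots reported in the post


async def generate_section(name: str, brief: str) -> str:
    """Hypothetical stand-in for a request to the local model server."""
    await asyncio.sleep(0)  # placeholder for the real network round trip
    return f'<section id="{name.lower()}"><!-- {brief} --></section>'


async def build_page(topic: str) -> str:
    slots = asyncio.Semaphore(MAX_SLOTS)  # never exceed the server's slots

    async def run(name: str) -> str:
        async with slots:
            return await generate_section(name, f"{name} section for {topic}")

    # gather() returns results in the order the tasks were passed in,
    # so the stitched page keeps the section order even if they finish
    # out of order.
    parts = await asyncio.gather(*(run(name) for name in SECTIONS))
    return "\n".join(parts)


page = asyncio.run(build_page("landing page"))
print(page.count("<section"))  # 6
```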

@SpaceTimeViking @mr_r0b0t @bridgemindai Concurrency is the way. Maybe Grok 4.3 managing a swarm of parallel calls.

People reporting the DGX Spark is slow just don’t know how to optimize for it.
Understandably, it's a common issue given the lack of good information out there. It does require some first-principles understanding of the hardware and software.
I was running 256 concurrent sessions on a single DGX Spark and getting nearly 2000 tok/s aggregate.

MacBook Pro M5 Max is fully set up and running local models.
I have never seen speeds this fast.
Qwen 3.6 35B and Gemma 4 31B are running blazing fast on 128GB of unified memory.
Faster than both of my stacked NVIDIA DGX Sparks sitting right next to it.
Initial impressions: Apple silicon in 2026 is no joke for local inference.
The M5 Max handles 30B+ parameter models like they're nothing.
Full review and comparison video coming soon.


I’d like to get some feedback on running multiple models on a DGX. Anyone else run into this?
BrainOS @BadBrainCode

@TheAhmadOsman Sorry, I misread your post. I tried multiple smaller models running concurrently.

@TheAhmadOsman I have an article coming out today on DGX concurrency. Could be missing something, but tl;dr: it didn’t work.

Qwen 3.5 27B in NVFP4 w/ full context taking less than 20GB VRAM
You can basically run like 5 agents w/ full context on a single RTX PRO 6000 like this, and they'd be so fast
Tell me I didn't tell you this was gonna happen

Ahmad @TheAhmadOsman
Qwen 3.5 27B is the release of the year for me so far
> Agentic model & great at tool calling
> Claude Sonnet 4.6 quality at home
> ~28GB in NVFP4
> Fits on a single RTX 5090
> with full context (256K)
Amazing model & performance
The prediction below will age like fine wine

@aijoey @NVIDIAAI @PavloMolchanov @Teknium @NousResearch Great job. I tried and never got the harness to work. Had to give up due to time. If you can post your config, it would be helpful.

got NVIDIA’s new Nemotron-Labs-Diffusion 8B running locally on my DGX Spark. Jetson(hermes) made me the tri mode runner
the cool part: it’s one model that can answer in different “gears.”
same prompt, same checkpoint:
- normal mode: 10.98 tok/s
- diffusion mode: 20.58 tok/s
- self-spec mode: 18.01 tok/s
- self-spec + lora: 18.07 tok/s
plain english: diffusion mode was almost 2x faster than normal mode in this tiny first test.
but faster wasn’t automatically better. the fastest mode also started repeating itself, so now the real test is running a bigger prompt suite and checking both:
- how fast it answers
- whether the answer is actually good
early result: the tri-mode idea works locally. next step is figuring out which mode is best for which kind of prompt.
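The relative speedups behind "almost 2x" can be worked out directly from the four numbers above. A minimal sketch; only the tok/s figures come from the post, the variable names are illustrative.

```python
# tok/s figures copied from the post; normal mode is the baseline.
normal = 10.98
modes = {
    "diffusion": 20.58,
    "self-spec": 18.01,
    "self-spec + lora": 18.07,
}

# Speedup of each mode relative to normal decoding.
speedups = {name: round(rate / normal, 2) for name, rate in modes.items()}

for name, factor in speedups.items():
    print(f"{name}: {factor:.2f}x vs normal")
```

This measures speed only. As the post notes, a quality check on the answers is still needed before picking a mode.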


@epaleezeldin We are talking about the deep state. They don't follow laws. In fact, they break them and are protected from prosecution by activist judges and weak DAs.

ENTIRELY fake news! At no point since President Trump was sworn back into office has the Trump EPA authorized the release of ANY genetically modified mosquitoes into Florida or anywhere else for that matter. So much fake news BS being peddled on social media for RTs, Likes, and engagement.
Concerned Citizen @BGatesIsaPyscho
“The EPA just authorise the release of 2 Billion Genetically Modified Mosquitoes” First Ticks & now Mosquitoes - The World is run by insane lunatics.

@aijoey @SpaceTimeViking @Alibaba_Qwen @nvidia Having the best results with this model over the dozen or so I have tested. Its tool use is better than Nemotron. Doing more testing now on nemo to see if something is jacked up. Thank you for your work on this.

local coding models are getting useful fast.
this is Qwen3.6-27B AEON Ultimate Uncensored NVFP4 running locally on my DGX Spark through vLLM. @SpaceTimeViking
three concurrent dev tasks from one local endpoint:
• stack trace triage
• pytest regression test
• commit + PR notes
the model caught a real fixture bug: KeyError on missing timeout_ms, fixed it with a 5000ms default, and the regression test passed.
all local. no cloud API. tested from my phone while i was at work. cc: jetson(hermes)
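The post doesn't show the actual code, so this is only a guess at the shape of that kind of fix: a direct `config["timeout_ms"]` lookup replaced by a lookup with a 5000 ms default, plus a regression test. All names here are hypothetical.

```python
DEFAULT_TIMEOUT_MS = 5000  # default value mentioned in the post


def get_timeout_ms(config: dict) -> int:
    """Read the timeout, falling back to the default when the key is absent.

    A plain config["timeout_ms"] would raise KeyError for a fixture that
    omits the key, which is the failure mode the post describes.
    """
    return config.get("timeout_ms", DEFAULT_TIMEOUT_MS)


def test_missing_timeout_uses_default():
    assert get_timeout_ms({}) == 5000


def test_explicit_timeout_is_respected():
    assert get_timeout_ms({"timeout_ms": 250}) == 250
```

Both tests are written in pytest style and can also be called as plain functions.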

@TheEcomNomad @NVIDIAAI Wall 2 was not speed related. It doesn’t know how to patch files. Changing something as simple as a semicolon requires several full context turns.

@BadBrainCode @NVIDIAAI the 26 tok/s isn't a tuning issue, it's physics. unified memory means you're bottlenecked moving weights from VRAM to compute. batching won't fix it, just adds latency without throughput gains

@NVIDIAAI Can you please help here?
Two walls running Nemotron-3-Super-120B (NVFP4) locally on a DGX Spark (GB10, 121GB unified) for agentic coding — and the fix for the first exposed the second.
Setup: served via vLLM, single-stream decode ~26 tok/s (memory-bandwidth bound, saturates ~batch 2). Great for chat. Then we pointed it at multi-step coding goals that iterate on a file.
Wall 1 — full-file rewrites. The agent “edited” code by rewriting the entire file every time. A one-line fix = regenerating ~700 lines = minutes per edit. One goal ran ~8 hours, almost all of it re-emitting unchanged code. Prompt-steering (“use the patch tool, never rewrite existing files”) — in both the system prompt and the task — did nothing in live runs. The model’s default beat the instruction. What worked was structural: a pre-tool hook that blocks a full-file write when the file already exists and returns “use patch instead.” Edits dropped from ~10 min to ~0.2s. Wall cleared.
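A structural guard of the kind described for wall 1 might be sketched like this. The hook signature, the `write_file` tool name, and the return shape are assumptions for illustration, not the API of any particular agent framework.

```python
import os
import tempfile


def pre_write_hook(tool_name: str, args: dict) -> dict:
    """Hypothetical pre-tool hook: refuse full-file writes over existing files."""
    if tool_name == "write_file" and os.path.exists(args["path"]):
        # Reject the call and tell the model which tool to use instead.
        return {"allow": False, "message": "File already exists: use patch instead."}
    return {"allow": True, "message": ""}


# Usage: an existing file is blocked, a brand-new file is allowed.
with tempfile.TemporaryDirectory() as workdir:
    existing = os.path.join(workdir, "app.py")
    with open(existing, "w") as handle:
        handle.write("print('hi')\n")
    blocked = pre_write_hook("write_file", {"path": existing})
    allowed = pre_write_hook("write_file", {"path": os.path.join(workdir, "new.py")})

print(blocked["allow"], allowed["allow"])  # False True
```

The point of the design is that the check runs outside the model, so it holds regardless of how the model weighs its instructions.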
Wall 2 — which the fix revealed. Forced onto surgical patches, the model couldn’t reliably construct them. A patch needs old_string to be a byte-exact substring of the file, JSON-escaped. The model over-escaped — emitting \\" where the file had a plain " — so the anchor never matched. The tool’s own diagnostic flagged it: “Escape-drift detected… almost always a tool-call serialization [issue].” Add no-op patches (old_string == new_string, changes nothing) and “old_string not found,” and it churned ~4 hours; two requested features silently never landed — yet it reported COMPLETE.
Root cause of wall 2 is three things stacked: (1) faithfully round-tripping quotes/backslashes through the tool-call parser drifts; (2) with reasoning disabled (we force enable_thinking=false to dodge a cuBLAS long-decode crash on GB10), there’s no internal “does this string actually exist in the file?” check, so it fires malformed patches blind; (3) it reconstructs the anchor from context instead of re-reading exact bytes.
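The wall 2 failures (no-op patches, missing anchors, over-escaped quotes) can all be caught before a patch is applied. A minimal sketch, assuming a patch is just an `old_string` / `new_string` pair; the function name, the messages, and the single-layer unescape heuristic are illustrative, not the tool's real diagnostic.

```python
def check_patch(file_text: str, old_string: str, new_string: str) -> str:
    """Classify a proposed patch before applying it."""
    if old_string == new_string:
        return "no-op: old_string equals new_string"
    count = file_text.count(old_string)
    if count == 1:
        return "ok"
    if count > 1:
        return "ambiguous: old_string matches more than once"
    # Heuristic for over-escaping: strip one backslash in front of each
    # double quote ('\\"' is the two characters backslash + quote) and retry.
    unescaped = old_string.replace('\\"', '"')
    if unescaped != old_string and unescaped in file_text:
        return "escape drift: old_string matches only after removing backslashes"
    return "not found"


file_text = 'timeout = cfg["timeout_ms"]\n'
replacement = 'cfg.get("timeout_ms", 5000)'

good = check_patch(file_text, 'cfg["timeout_ms"]', replacement)
drift = check_patch(file_text, 'cfg[\\"timeout_ms\\"]', replacement)
noop = check_patch(file_text, "timeout", "timeout")
print(good, drift, noop, sep="\n")
```

Returning a specific reason rather than a bare failure gives the agent something concrete to act on, for example re-reading the exact bytes of the file before retrying.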
The lesson for local-agent builders on Spark: don’t just optimize tok/s. Agentic coding is gated by two separate failure modes — token economy (rewrites) and tool-call serialization fidelity (patches) — and the fix for one surfaces the other. Both need structural guardrails, not better prompts.









