BrainOS

3.5K posts


@BadBrainCode

How can we debug the operating system in our own brains? This is the most important question facing humanity.

Planet Earth · Joined November 2023
739 Following · 214 Followers
BrainOS retweeted
0xSero
0xSero@0xSero·
Deepseek-v4-flash REAP is done. There's an 80 and 96gb version (weights). I am working through the pain of getting the model to run on sm121 (DGX Spark). If anyone of you has a working DS4 vllm/sglang config/env/docker for Sparks, please share. Once it works I'll make the HF public.
English
24
9
183
7.1K
BrainOS
BrainOS@BadBrainCode·
@TheEcomNomad Thank you for pointing me in the right direction.
English
0
0
1
10
Aaron ⚡️
Aaron ⚡️@TheEcomNomad·
@BadBrainCode The hybrid Mamba component breaks vLLM's prefix cache because the cache assumes pure transformer attention. That architectural mismatch is why cache key generation fails silently instead of erroring out.
English
1
0
1
17
BrainOS
BrainOS@BadBrainCode·
Running NVIDIA Nemotron-3-Super-120B-A12B (NVFP4 mixed) on DGX Spark / GB10 (Grace-Blackwell) via vLLM 0.19.2rc1. Architecture resolves as NemotronHMTPModel: hybrid Mamba + Transformer with MTP draft head.

PROBLEM: prefix caching is non-functional on this model. With --enable-prefix-caching explicit, vLLM emits this warning on boot:

"Prefix caching in Mamba cache 'all' mode is currently enabled. Its support for Mamba layers is experimental."

Measured behavior over a real production run:

queries: 6,211
hits: 0
hit rate: 0.00%

Confirmed across thousands of requests with shared system prompts that should be deduplicating cleanly. They aren't.

This isn't a misconfiguration: NVIDIA's own Nemotron-3-Super deployment guide (<github.com/NVIDIA-NeMo/Ne…> SparkDeploymentGuide) deliberately OMITS --enable-prefix-caching. So upstream knows it doesn't work on NemotronH.

WHY IT MATTERS: We run a 4-way parallel fan-out pattern. The brain decomposes a brief, dispatches 4 concurrent section-writes against vLLM with an identical 163-token shared system prompt, and stitches the results. Measured throughput:

c=1: 15.0 tok/s
c=2: 25.4 tok/s
c=3: 30.1 tok/s
c=4: 41.4 tok/s (← saturation; 4th slot effectively free)

That 2.76× user-facing speedup is real. But every one of those 4 calls re-prefills the same 163-token system message independently. With a working prefix cache, that prefill cost drops ~4×. For larger shared prompts (RAG context, long instruction blocks, few-shot exemplars), the win compounds enormously.

THE STRUCTURAL QUESTION: Is this a fundamental limit? Mamba's selective-scan state can't be paged like transformer KV: it's a fixed-size recurrent state, not a sequence of key/value vectors. So for the Mamba LAYERS of a hybrid model, prefix cache semantics are genuinely unclear. But for the TRANSFORMER LAYERS interleaved with them, the KV reuse should work fine.

Has anyone at @NVIDIAAIDev or @NVIDIAAI considered a "hybrid prefix cache" mode that caches the transformer-layer KV pages for a matched prefix while re-running the Mamba state forward? Even a partial fix would eliminate most of the prefill cost on this architecture.

Or is there a known issue / planned vLLM PR I should be watching? Happy to share the bench harness if it's useful for repro.

cc @vllm_project: same question your side.

Hardware: DGX Spark, GB10, 121 GiB unified memory, --max-num-seqs 4, --quantization fp4, MARLIN MoE backend, async scheduling, no MTP (breaks structured tool-call emission; separate issue).
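The arithmetic behind the figures in this post can be checked with a short sketch. The throughput numbers and the 163-token prompt length come from the post; the function names are illustrative, and the idea that a working prefix cache would prefill the shared prompt exactly once per fan-out is an idealized assumption, not a measurement.

```python
# Throughput figures copied from the post (tokens/second per concurrency level).
measured_tok_s = {1: 15.0, 2: 25.4, 3: 30.1, 4: 41.4}
SYSTEM_PROMPT_TOKENS = 163  # shared system prompt length reported in the post


def speedup(concurrency: int) -> float:
    """Aggregate throughput relative to the single-stream (c=1) baseline."""
    return measured_tok_s[concurrency] / measured_tok_s[1]


def shared_prefill_tokens(fan_out: int, cached: bool) -> int:
    """System-prompt tokens prefilled across one fan-out.

    Assumption (idealized): with a working prefix cache the shared prompt
    is prefilled once and reused by every other call in the fan-out.
    """
    return SYSTEM_PROMPT_TOKENS * (1 if cached else fan_out)


for c in sorted(measured_tok_s):
    print(f"c={c}: {measured_tok_s[c]:.1f} tok/s, {speedup(c):.2f}x vs c=1")

print("prefill without cache:", shared_prefill_tokens(4, cached=False))  # 652
print("prefill with cache:   ", shared_prefill_tokens(4, cached=True))   # 163
```

Under that assumption the shared-prompt prefill falls from 652 to 163 tokens per fan-out, which is where the "~4×" figure comes from.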
English
1
0
2
74
BrainOS
BrainOS@BadBrainCode·
@aijoey This is very helpful.
English
0
0
1
40
Joey
Joey@aijoey·
Local AI landing page generation on a DGX Spark. One Gemma-4-26B Q4 GGUF served by llama.cpp with 7 concurrent decode slots. The orchestrator breaks “landing page” into 6 section briefs: hero, features, steps, testimonials, pricing, CTA. Then 6 Gemma instances generate the sections in parallel and stitch everything into one Tailwind page. ~3 minutes end to end. The best part: everything you just watched happens offline, forever. No one can turn it off besides my light company lol @googlegemma @NVIDIAAIDev
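A minimal sketch of the fan-out and stitch pattern described here, assuming an async client. `generate_section` is a hypothetical stand-in for one request to the local server (the post does not show the actual orchestrator), and the semaphore size of 7 mirrors the decode-slot count mentioned above.

```python
import asyncio

# Section briefs named in the post.
SECTIONS = ["hero", "features", "steps", "testimonials", "pricing", "CTA"]


async def generate_section(brief: str) -> str:
    """Hypothetical stand-in for one generation request to the local server."""
    await asyncio.sleep(0)  # placeholder for the real network call
    return f'<section data-brief="{brief}"></section>'


async def build_page(sections: list, max_slots: int = 7) -> str:
    """Generate all sections concurrently, then stitch them in brief order."""
    slots = asyncio.Semaphore(max_slots)  # never exceed the server's slot count

    async def run(brief: str) -> str:
        async with slots:
            return await generate_section(brief)

    # gather() returns results in the order the tasks were passed in,
    # so the page keeps the brief order even if requests finish out of order.
    parts = await asyncio.gather(*(run(s) for s in sections))
    return "\n".join(parts)


page = asyncio.run(build_page(SECTIONS))
print(page)
```

Capping concurrency at the slot count keeps extra requests from queueing inside the server, which matters once the number of briefs exceeds the available slots.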
English
7
2
18
3K
ÆON FORGE ✨
ÆON FORGE ✨@SpaceTimeViking·
People reporting the DGX Spark is slow just don’t know how to optimize for it. Understandably a common issue with lack of good information out there. It does require some first principles understanding of the hardware and software. I was running 256 concurrent sessions on a single DGX Spark getting nearly 2000 Tok/s aggregate.
English
6
3
26
1.6K
BridgeMind
BridgeMind@bridgemindai·
MacBook Pro M5 Max is fully set up and running local models. I have never seen speeds this fast. Qwen 3.6 35B and Gemma 4 31B are running blazing fast on 128GB of unified memory. Faster than both of my stacked NVIDIA DGX Sparks sitting right next to it. Initial impressions: Apple silicon in 2026 is no joke for local inference. The M5 Max handles 30B+ parameter models like they're nothing. Full review and comparison video coming soon.
English
113
41
842
53.4K
BrainOS
BrainOS@BadBrainCode·
@TheAhmadOsman Sorry, I misread your post. I tried multiple smaller models running concurrently.
English
0
0
0
8
BrainOS
BrainOS@BadBrainCode·
@TheAhmadOsman I have an article coming out today on DGX concurrency. Could be missing something, but tl;dr: it didn’t work.
English
1
0
1
286
Ahmad
Ahmad@TheAhmadOsman·
Qwen 3.5 27B in NVFP4 w/ full context taking less than 20GB VRAM. You can basically run like 5 agents w/ full context on a single RTX PRO 6000 like this, and they'd be so fast. Tell me I didn't tell you this was gonna happen.
Ahmad@TheAhmadOsman

Qwen 3.5 27B is the release of the year for me so far
> Agentic model & great at tool calling
> Claude Sonnet 4.6 quality at home
> ~28GB in NVFP4
> Fits on a single RTX 5090
> with full context (256K)
Amazing model & performance. The prediction below will age like fine wine.

English
36
21
412
35.2K
Nikita Bier
Nikita Bier@nikitabier·
Good morning Tokyo
English
1.7K
470
9.5K
401.4K
BrainOS
BrainOS@BadBrainCode·
@aijoey They are all over the country, self included.
English
0
0
1
33
Joey
Joey@aijoey·
got NVIDIA’s new Nemotron-Labs-Diffusion 8B running locally on my DGX Spark. Jetson(hermes) made me the tri mode runner

the cool part: it’s one model that can answer in different “gears.”

same prompt, same checkpoint:
- normal mode: 10.98 tok/s
- diffusion mode: 20.58 tok/s
- self-spec mode: 18.01 tok/s
- self-spec + lora: 18.07 tok/s

plain english: diffusion mode was almost 2x faster than normal mode in this tiny first test.

but faster wasn’t automatically better. the fastest mode also started repeating itself, so now the real test is running a bigger prompt suite and checking both:
- how fast it answers
- whether the answer is actually good

early result: the tri-mode idea works locally. next step is figuring out which mode is best for which kind of prompt.
English
4
4
39
5.5K
BrainOS
BrainOS@BadBrainCode·
@din__mon @mr_r0b0t It’s the first model that can one-shot small projects when set up as a Hermes profile.
English
1
0
1
59
BrainOS
BrainOS@BadBrainCode·
@epaleezeldin We are talking about the deep state. They don't follow laws. In fact, they break them and are protected from prosecution by activist judges and weak DAs.
English
0
0
0
36
Lee Zeldin
Lee Zeldin@epaleezeldin·
No GE mosquitoes are authorized for use in the United States. NONE!
English
830
1.4K
6.5K
79.8K
Lee Zeldin
Lee Zeldin@epaleezeldin·
ENTIRELY fake news! At no point since President Trump was sworn back into office has the Trump EPA authorized the release of ANY genetically modified mosquitoes into Florida or anywhere else for that matter. So much fake news BS being peddled on social media for RTs, Likes, and engagement.
Concerned Citizen@BGatesIsaPyscho

“The EPA just authorise the release of 2 Billion Genetically Modified Mosquitoes” First Ticks & now Mosquitoes - The World is run by insane lunatics.

English
994
3.8K
10.3K
320.7K
BrainOS
BrainOS@BadBrainCode·
@aijoey @SpaceTimeViking @Alibaba_Qwen @nvidia Having the best results with this model over the dozen or so I have tested. Its tool use is better than Nemotron's. Doing more testing now on nemo to see if something is jacked up. Thank you for your work on this.
English
0
0
1
49
Joey
Joey@aijoey·
local coding models are getting useful fast.

this is Qwen3.6-27B AEON Ultimate Uncensored NVFP4 running locally on my DGX Spark through vLLM. @SpaceTimeViking

three concurrent dev tasks from one local endpoint:
• stack trace triage
• pytest regression test
• commit + PR notes

the model caught a real fixture bug: KeyError on missing timeout_ms, fixed it with a 5000ms default, and the regression test passed.

all local. no cloud API. tested from my phone while i was at work.

cc: jetson(hermes)
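The post does not show the code involved, so the following is only a guess at the shape of such a fix: a lookup that raised KeyError when timeout_ms was absent, replaced by one that falls back to 5000 ms, plus a pytest-style regression test. `get_timeout_ms` and the config dict are hypothetical names.

```python
DEFAULT_TIMEOUT_MS = 5000  # default value mentioned in the post


def get_timeout_ms(config: dict) -> int:
    """Return the configured timeout, falling back to the default.

    A plain config["timeout_ms"] lookup would raise KeyError when the
    key is missing; dict.get with a default avoids that.
    """
    return config.get("timeout_ms", DEFAULT_TIMEOUT_MS)


def test_missing_timeout_uses_default():
    # Regression test: a config without timeout_ms must not raise.
    assert get_timeout_ms({}) == 5000


def test_explicit_timeout_is_kept():
    # An explicit value still wins over the default.
    assert get_timeout_ms({"timeout_ms": 250}) == 250
```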
English
10
4
64
6.2K
Joey
Joey@aijoey·
@julien_c Yes!!! Welcome to the club.
English
1
0
0
253
Julien Chaumond
Julien Chaumond@julien_c·
I finally got me one of those puppies
English
50
8
499
32.2K
BrainOS
BrainOS@BadBrainCode·
@TheEcomNomad @NVIDIAAI Wall 2 was not speed related. It doesn’t know how to patch files. Changing something as simple as a semicolon requires several full-context turns.
English
0
0
1
11
Aaron ⚡️
Aaron ⚡️@TheEcomNomad·
@BadBrainCode @NVIDIAAI the 26 tok/s isn't a tuning issue, it's physics. unified memory means you're bottlenecked moving weights from VRAM to compute. batching won't fix it, just adds latency without throughput gains
English
1
0
0
25
BrainOS
BrainOS@BadBrainCode·
@NVIDIAAI Can you please help here? Two walls running Nemotron-3-Super-120B (NVFP4) locally on a DGX Spark (GB10, 121GB unified) for agentic coding, and the fix for the first exposed the second.

Setup: served via vLLM, single-stream decode ~26 tok/s (memory-bandwidth bound, saturates ~batch 2). Great for chat. Then we pointed it at multi-step coding goals that iterate on a file.

Wall 1: full-file rewrites. The agent “edited” code by rewriting the entire file every time. A one-line fix = regenerating ~700 lines = minutes per edit. One goal ran ~8 hours, almost all of it re-emitting unchanged code. Prompt-steering (“use the patch tool, never rewrite existing files”), in both the system prompt and the task, did nothing in live runs. The model’s default beat the instruction. What worked was structural: a pre-tool hook that blocks a full-file write when the file already exists and returns “use patch instead.” Edits dropped from ~10 min to ~0.2s. Wall cleared.

Wall 2, which the fix revealed. Forced onto surgical patches, the model couldn’t reliably construct them. A patch needs old_string to be a byte-exact substring of the file, JSON-escaped. The model over-escaped, emitting \\" where the file had a plain ", so the anchor never matched. The tool’s own diagnostic flagged it: “Escape-drift detected… almost always a tool-call serialization [issue].” Add no-op patches (old_string == new_string, changes nothing) and “old_string not found,” and it churned ~4 hours; two requested features silently never landed, yet it reported COMPLETE.

Root cause of wall 2 is three things stacked:
(1) faithfully round-tripping quotes/backslashes through the tool-call parser drifts;
(2) with reasoning disabled (we force enable_thinking=false to dodge a cuBLAS long-decode crash on GB10), there’s no internal “does this string actually exist in the file?” check, so it fires malformed patches blind;
(3) it reconstructs the anchor from context instead of re-reading exact bytes.

The lesson for local-agent builders on Spark: don’t just optimize tok/s. Agentic coding is gated by two separate failure modes, token economy (rewrites) and tool-call serialization fidelity (patches), and the fix for one surfaces the other. Both need structural guardrails, not better prompts.
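The two guardrails described in this post could look roughly like the sketch below. It is not the author's harness: `pre_write_hook`, `check_patch`, and `apply_patch` are hypothetical names, the escape-drift check is a simple heuristic chosen for illustration, and matching is done on Python strings rather than raw bytes.

```python
from pathlib import Path
from typing import Optional


def pre_write_hook(path: str) -> Optional[str]:
    """Block full-file writes to existing files; None means the write may go ahead."""
    if Path(path).exists():
        return "File already exists: use patch instead of a full-file write."
    return None


def check_patch(file_text: str, old_string: str, new_string: str) -> Optional[str]:
    """Return a diagnostic if the patch cannot apply cleanly, else None."""
    if old_string == new_string:
        return "No-op patch: old_string equals new_string."
    if old_string in file_text:
        return None
    # Heuristic: strip one level of over-escaping (\" -> ", \\ -> \) and retry.
    unescaped = old_string.replace('\\"', '"').replace("\\\\", "\\")
    if unescaped != old_string and unescaped in file_text:
        return "Escape drift suspected: anchor matches only after removing extra backslashes."
    return "old_string not found: re-read the file and copy the exact text."


def apply_patch(file_text: str, old_string: str, new_string: str) -> str:
    """Apply the patch to the first match, refusing anything check_patch rejects."""
    problem = check_patch(file_text, old_string, new_string)
    if problem is not None:
        raise ValueError(problem)
    return file_text.replace(old_string, new_string, 1)


source = 'print("hi")'
print(check_patch(source, 'print(\\"hi\\")', 'print("bye")'))  # drifted anchor
print(apply_patch(source, 'print("hi")', 'print("bye")'))      # clean anchor
```

Returning a specific diagnostic for each failure (no-op, escape drift, plain miss) gives the agent something concrete to act on, and rejecting no-op patches outright keeps a run from reporting success when nothing changed.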
English
1
0
1
88