BrainOS

3.5K posts

@BadBrainCode

How can we debug the operating system in our own brains? This is the most important question facing humanity.

Planet Earth · Joined November 2023

739 Following · 214 Followers

BrainOS reposted

0xSero@0xSero·6h

Deepseek-v4-flash REAP is done. There's an 80 and a 96gb version (weights). I am working through the pain of getting the model to run on sm121 (DGX Spark). If any of you has a working DS4 vllm/sglang config/env/docker for Sparks, please share. Once it works I'll make the HF public.

BrainOS@BadBrainCode·18h

@TheEcomNomad Thank you for pointing me in the right direction.

Aaron ⚡️@TheEcomNomad·21h

@BadBrainCode The hybrid Mamba component breaks vLLM's prefix cache because the cache assumes pure transformer attention. That architectural mismatch is why cache key generation fails silently instead of erroring out.
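For readers unfamiliar with how a block-based prefix cache keys its entries, here is a toy sketch of the general technique: each full block of tokens is hashed together with its parent block's hash, so two requests share cached blocks exactly as far as their token prefixes agree. This is an illustration only, not vLLM's implementation; the block size of 16, the `ToyPrefixCache` class, and the function names are assumptions made for the example. A fixed-size recurrent state has no per-block entries to key this way, which is the kind of mismatch the reply above is pointing at.

```python
import hashlib

BLOCK_SIZE = 16  # assumed block size, for illustration only


def block_keys(token_ids, block_size=BLOCK_SIZE):
    """Chain-hash every full block: key_i depends on key_(i-1) and block i."""
    keys = []
    parent = b""
    full_len = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full_len, block_size):
        block = token_ids[start:start + block_size]
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        keys.append(digest)
        parent = digest
    return keys


class ToyPrefixCache:
    """Hypothetical cache that only tracks which block keys it has seen."""

    def __init__(self):
        self.blocks = set()
        self.queries = 0  # blocks looked up
        self.hits = 0     # blocks found

    def lookup_and_insert(self, token_ids):
        """Return how many leading tokens were already cached, then cache the rest."""
        keys = block_keys(token_ids)
        matched = 0
        for key in keys:
            if key not in self.blocks:
                break
            matched += 1
        self.queries += len(keys)
        self.hits += matched
        self.blocks.update(keys)
        return matched * BLOCK_SIZE


# Two requests sharing a 163-token system prompt but with different user suffixes.
system_prompt = list(range(163))
cache = ToyPrefixCache()
first = cache.lookup_and_insert(system_prompt + list(range(1000, 1020)))
second = cache.lookup_and_insert(system_prompt + list(range(2000, 2020)))
```

In this toy model the second request reuses the ten full blocks that lie entirely inside the shared prompt (160 of its 163 tokens); the block straddling the prompt boundary differs and is recomputed.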

BrainOS@BadBrainCode·1d

Running NVIDIA Nemotron-3-Super-120B-A12B (NVFP4 mixed) on DGX Spark / GB10 (Grace-Blackwell) via vLLM 0.19.2rc1. Architecture resolves as NemotronHMTPModel: hybrid Mamba + Transformer with MTP draft head.

PROBLEM: prefix caching is non-functional on this model. With --enable-prefix-caching explicit, vLLM emits this warning on boot:

"Prefix caching in Mamba cache 'all' mode is currently enabled. Its support for Mamba layers is experimental."

Measured behavior over a real production run:

queries: 6,211
hits: 0
hit rate: 0.00%

Confirmed across thousands of requests with shared system prompts that should be deduplicating cleanly. They aren't.

This isn't a misconfiguration: NVIDIA's own Nemotron-3-Super deployment guide (<github.com/NVIDIA-NeMo/Ne…> SparkDeploymentGuide) deliberately OMITS --enable-prefix-caching. So upstream knows it doesn't work on NemotronH.

WHY IT MATTERS: We run a 4-way parallel fan-out pattern: brain decomposes a brief, dispatches 4 concurrent section-writes against vLLM with an identical 163-token shared system prompt, stitches results. Measured throughput:

c=1: 15.0 tok/s
c=2: 25.4 tok/s
c=3: 30.1 tok/s
c=4: 41.4 tok/s (← saturation; 4th slot effectively free)

That 2.76× user-facing speedup is real. But every one of those 4 calls re-prefills the same 163-token system message independently. With a working prefix cache, that prefill cost drops ~4×. For larger shared prompts (RAG context, long instruction blocks, few-shot exemplars), the win compounds enormously.

THE STRUCTURAL QUESTION: Is this a fundamental limit? Mamba's selective-scan state can't be paged like transformer KV: it's a fixed-size recurrent state, not a sequence of key/value vectors. So for the Mamba LAYERS of a hybrid model, prefix cache semantics are genuinely unclear. But for the TRANSFORMER LAYERS interleaved with them, the KV reuse should work fine.

Has anyone at @NVIDIAAIDev or @NVIDIAAI considered a "hybrid prefix cache" mode that caches the transformer-layer KV pages for a matched prefix while re-running the Mamba state forward? Even a partial fix would eliminate most of the prefill cost on this architecture. Or is there a known issue / planned vLLM PR I should be watching? Happy to share the bench harness if it's useful for repro.

cc @vllm_project: same question your side.

Hardware: DGX Spark, GB10, 121 GiB unified memory, --max-num-seqs 4, --quantization fp4, MARLIN MoE backend, async scheduling, no MTP (breaks structured tool-call emission; separate issue).
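The arithmetic behind the figures in the post above can be checked with a short sketch. It takes the four measured throughput values as given and derives the speedup, per-slot efficiency, and the number of system-prompt tokens that an ideal prefix cache would avoid re-prefilling. The function names are made up for this example, and "ideal" here assumes every call after the first would hit the cache, which real concurrent arrivals may not achieve.

```python
# Aggregate throughput in tok/s at each concurrency level, copied from the post.
measured = {1: 15.0, 2: 25.4, 3: 30.1, 4: 41.4}


def scaling_table(throughput):
    """Return (concurrency, speedup vs c=1, efficiency, tok/s per stream) rows."""
    base = throughput[1]
    rows = []
    for c in sorted(throughput):
        speedup = throughput[c] / base
        rows.append((c, speedup, speedup / c, throughput[c] / c))
    return rows


def redundant_prefill_tokens(shared_prompt_tokens, fan_out):
    """Shared-prompt tokens re-prefilled by every call after the first."""
    return shared_prompt_tokens * (fan_out - 1)


rows = scaling_table(measured)
for c, speedup, efficiency, per_stream in rows:
    print(f"c={c}: {speedup:.2f}x speedup, {efficiency:.0%} efficiency, "
          f"{per_stream:.1f} tok/s per stream")

wasted = redundant_prefill_tokens(163, 4)
print(f"redundant system-prompt prefill per fan-out: {wasted} tokens")
```

The c=4 row reproduces the 2.76× figure (41.4 / 15.0), and the last line shows 489 of the 652 system-prompt tokens prefilled per fan-out are repeats, which is where the "~4×" estimate comes from.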

BrainOS@BadBrainCode·1d

@per_arneng @mr_r0b0t @Acer Great looking rig. Please report CPU/GPU temps under full load.

𝒫𝑒𝓇 𝒜𝓇𝓃𝑒𝓃𝑔 【🐧λ🦀⎈】@per_arneng·1d

After a lot of research into the DGX Spark clones, I finally settled on and ordered the Acer Veriton GN100 based on thermal performance. #nvidia #dgxspark @Acer

BrainOS@BadBrainCode·1d

#007FirstLightRTX

BrainOS@BadBrainCode·1d

@aijoey This is very helpful.

Joey@aijoey·22 Apr

Local AI landing page generation on a DGX Spark. One Gemma-4-26B Q4 GGUF served by llama.cpp with 7 concurrent decode slots.

The orchestrator breaks “landing page” into 6 section briefs:
hero
features
steps
testimonials
pricing
CTA

Then 6 Gemma instances generate the sections in parallel and stitch everything into one Tailwind page. ~3 minutes end to end.

The best part: everything you just watched happens offline, forever. No one can turn it off besides my light company lol @googlegemma @NVIDIAAIDev
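The orchestration pattern described in that post (split a job into section briefs, generate them concurrently, stitch the results in order) can be sketched roughly as follows. This is not the author's code: `generate_section` is a hypothetical stand-in for a request to a local inference server, and the concurrency limit is an assumed parameter that would be matched to the number of decode slots actually available.

```python
import asyncio


async def generate_section(brief: str) -> str:
    """Hypothetical placeholder for a call to a local inference endpoint."""
    await asyncio.sleep(0)  # stands in for network and decode latency
    return f"<section>{brief}</section>"


async def fan_out(briefs, max_concurrency):
    """Run one generation per brief, with at most max_concurrency in flight."""
    slots = asyncio.Semaphore(max_concurrency)

    async def worker(brief):
        async with slots:
            return await generate_section(brief)

    # asyncio.gather returns results in the order the awaitables were passed,
    # so the stitched page keeps the order of the briefs.
    return await asyncio.gather(*(worker(b) for b in briefs))


def build_page(briefs, max_concurrency=6):
    """Fan out the briefs and join the generated sections into one document."""
    sections = asyncio.run(fan_out(briefs, max_concurrency))
    return "\n".join(sections)


briefs = ["hero", "features", "steps", "testimonials", "pricing", "CTA"]
page = build_page(briefs)
```

The semaphore keeps the client from submitting more requests than the server has slots for; beyond that point extra requests would only queue.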

BrainOS@BadBrainCode·1d

@SpaceTimeViking @mr_r0b0t @bridgemindai Concurrency is the way. Maybe Grok 4.3 managing a swarm of parallel calls.

ÆON FORGE ✨@SpaceTimeViking·1d

People reporting the DGX Spark is slow just don’t know how to optimize for it. Understandably a common issue with lack of good information out there. It does require some first principles understanding of the hardware and software. I was running 256 concurrent sessions on a single DGX Spark getting nearly 2000 Tok/s aggregate.

BridgeMind@bridgemindai·2d

MacBook Pro M5 Max is fully set up and running local models. I have never seen speeds this fast. Qwen 3.6 35B and Gemma 4 31B are running blazing fast on 128GB of unified memory. Faster than both of my stacked NVIDIA DGX Sparks sitting right next to it. Initial impressions: Apple silicon in 2026 is no joke for local inference. The M5 Max handles 30B+ parameter models like they're nothing. Full review and comparison video coming soon.

BrainOS@BadBrainCode·1d

I’d like to get some feedback on running multiple models on a DGX. Anyone else run into this?

BrainOS@BadBrainCode

x.com/i/article/2058…

BrainOS@BadBrainCode·2d

x.com/i/article/2058…

BrainOS@BadBrainCode·2d

@TheAhmadOsman Sorry, I misread your post. I tried multiple smaller models running concurrently.

BrainOS@BadBrainCode·2d

@TheAhmadOsman I have an article coming out today on DGX concurrency. Could be missing something, but tl;dr: it didn’t work.

Ahmad@TheAhmadOsman·2d

Qwen 3.5 27B in NVFP4 w/ full context taking less than 20GB VRAM. You can basically run like 5 agents w/ full context on a single RTX PRO 6000 like this, and they'd be so fast. Tell me I didn't tell you this was gonna happen.

Ahmad@TheAhmadOsman

Qwen 3.5 27B is the release of the year for me so far

> Agentic model & great at tool calling
> Claude Sonnet 4.6 quality at home
> ~28GB in NVFP4
> Fits on a single RTX 5090
> with full context (256K)

Amazing model & performance

The prediction below will age like fine wine

BrainOS@BadBrainCode·2d

@nikitabier @grok wtf is this?

Nikita Bier@nikitabier·2d

Good morning Tokyo

BrainOS@BadBrainCode·2d

@aijoey They are all over the country, self included.

Joey@aijoey·2d

Sounds about right

Polymarket@Polymarket

JUST IN: Marc Andreessen reveals “AI vampires” are emerging in Silicon Valley — coders getting so little sleep because they stay up all night building with agents.

BrainOS@BadBrainCode·3d

@aijoey @NVIDIAAI @PavloMolchanov @Teknium @NousResearch Great job. I tried and never got the harness to work. Had to give up due to time. If you can post your config, it would be helpful.

Joey@aijoey·20 May

got NVIDIA’s new Nemotron-Labs-Diffusion 8B running locally on my DGX Spark. Jetson(hermes) made me the tri mode runner

the cool part: it’s one model that can answer in different “gears.”

same prompt, same checkpoint:
- normal mode: 10.98 tok/s
- diffusion mode: 20.58 tok/s
- self-spec mode: 18.01 tok/s
- self-spec + lora: 18.07 tok/s

plain english: diffusion mode was almost 2x faster than normal mode in this tiny first test.

but faster wasn’t automatically better. the fastest mode also started repeating itself, so now the real test is running a bigger prompt suite and checking both:
- how fast it answers
- whether the answer is actually good

early result: the tri-mode idea works locally. next step is figuring out which mode is best for which kind of prompt.

BrainOS@BadBrainCode·3d

@din__mon @mr_r0b0t It’s the first model that can one shot small projects set up as a Hermes profile.

Dinys@din__mon·3d

@BadBrainCode @mr_r0b0t In what way is it better?

BrainOS@BadBrainCode·3d

x.com/i/article/2058…

BrainOS@BadBrainCode·3d

@epaleezeldin We are talking about the deep state. They don't follow laws. In fact, they break them and are protected from prosecution by activist judges and weak DAs.

Lee Zeldin@epaleezeldin·4d

No GE mosquitoes are authorized for use in the United States. NONE!

Lee Zeldin@epaleezeldin·4d

ENTIRELY fake news! At no point since President Trump was sworn back into office has the Trump EPA authorized the release of ANY genetically modified mosquitoes into Florida or anywhere else for that matter. So much fake news BS being peddled on social media for RTs, Likes, and engagement.

Concerned Citizen@BGatesIsaPyscho

“The EPA just authorised the release of 2 Billion Genetically Modified Mosquitoes” First Ticks & now Mosquitoes - The World is run by insane lunatics.

BrainOS@BadBrainCode·4d

@aijoey @SpaceTimeViking @Alibaba_Qwen @nvidia Having the best results with this model over the dozen or so I have tested. Its tool use is better than Nemotron's. Doing more testing now on nemo to see if something is jacked up. Thank you for your work on this.

Joey@aijoey·11 May

local coding models are getting useful fast.

this is Qwen3.6-27B AEON Ultimate Uncensored NVFP4 running locally on my DGX Spark through vLLM. @SpaceTimeViking

three concurrent dev tasks from one local endpoint:
• stack trace triage
• pytest regression test
• commit + PR notes

the model caught a real fixture bug: KeyError on missing timeout_ms, fixed it with a 5000ms default, and the regression test passed.

all local. no cloud API. tested from my phone while i was at work.

cc: jetson(hermes)
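The kind of fix described in that post (a `KeyError` on a missing `timeout_ms` key, replaced by a 5000 ms default, plus a regression test) might look something like the sketch below. The original code was not shared, so `get_timeout_ms`, the shape of the config dict, and the test names are all hypothetical; only the key name and the default value come from the post.

```python
DEFAULT_TIMEOUT_MS = 5000  # default value quoted in the post


def get_timeout_ms(config: dict) -> int:
    """Read timeout_ms, falling back to the default instead of raising KeyError."""
    # config["timeout_ms"] would raise KeyError when the key is absent;
    # dict.get with a default avoids that.
    return config.get("timeout_ms", DEFAULT_TIMEOUT_MS)


def test_missing_timeout_uses_default():
    # Regression test: a config without timeout_ms must not raise.
    assert get_timeout_ms({}) == 5000


def test_explicit_timeout_is_respected():
    assert get_timeout_ms({"timeout_ms": 250}) == 250
```

Both test functions follow pytest's naming convention, so running `pytest` on a file containing this sketch would collect them.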

BrainOS@BadBrainCode·4d

@aijoey @julien_c We need a DGX anonymous group.

Joey@aijoey·4d

@julien_c Yes!!! Welcome to the club.

Julien Chaumond@julien_c·4d

I finally got me one of those puppies

BrainOS@BadBrainCode·4d

@TheEcomNomad @NVIDIAAI Wall 2 was not speed-related. It doesn’t know how to patch files. Changing something as simple as a semicolon requires several full context turns.

Aaron ⚡️@TheEcomNomad·4d

@BadBrainCode @NVIDIAAI the 26 tok/s isn't a tuning issue, it's physics. unified memory means you're bottlenecked moving weights from VRAM to compute. batching won't fix it, just adds latency without throughput gains
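The memory-bandwidth argument in that reply can be turned into a back-of-the-envelope bound: if every decoded token requires reading all active weights once, single-stream tok/s cannot exceed bandwidth divided by bytes read per token. The sketch below computes that bound. All three example inputs are assumptions, not figures from this thread: 273 GB/s is an assumed bandwidth for the machine (substitute the published number for your hardware), 12 billion active parameters is read off the "A12B" in the model name used elsewhere in this feed, and 0.5 bytes per parameter assumes uniform 4-bit weights.

```python
def decode_tok_s_upper_bound(bandwidth_gb_s, active_params_billions, bytes_per_param):
    """Upper bound on single-stream decode speed when weight reads dominate.

    Assumes each token reads every active parameter exactly once and ignores
    KV/state traffic, activations, and any layers kept at higher precision.
    """
    bytes_per_token = active_params_billions * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token


# Example inputs are assumptions (see the paragraph above), not measurements.
bound = decode_tok_s_upper_bound(
    bandwidth_gb_s=273,         # assumed memory bandwidth
    active_params_billions=12,  # assumed from "A12B"
    bytes_per_param=0.5,        # assumed uniform 4-bit weights
)
print(f"upper bound: {bound:.1f} tok/s")
```

Because the bound ignores everything except weight reads, a measured figure well below it is expected; the sketch is only meant to show the order of magnitude the bandwidth argument implies.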

BrainOS@BadBrainCode·4d

@NVIDIAAI Can you please help here? Two walls running Nemotron-3-Super-120B (NVFP4) locally on a DGX Spark (GB10, 121GB unified) for agentic coding, and the fix for the first exposed the second.

Setup: served via vLLM, single-stream decode ~26 tok/s (memory-bandwidth bound, saturates ~batch 2). Great for chat. Then we pointed it at multi-step coding goals that iterate on a file.

Wall 1: full-file rewrites. The agent “edited” code by rewriting the entire file every time. A one-line fix = regenerating ~700 lines = minutes per edit. One goal ran ~8 hours, almost all of it re-emitting unchanged code. Prompt-steering (“use the patch tool, never rewrite existing files”), in both the system prompt and the task, did nothing in live runs. The model’s default beat the instruction. What worked was structural: a pre-tool hook that blocks a full-file write when the file already exists and returns “use patch instead.” Edits dropped from ~10 min to ~0.2s. Wall cleared.

Wall 2, which the fix revealed. Forced onto surgical patches, the model couldn’t reliably construct them. A patch needs old_string to be a byte-exact substring of the file, JSON-escaped. The model over-escaped, emitting \\" where the file had a plain ", so the anchor never matched. The tool’s own diagnostic flagged it: “Escape-drift detected… almost always a tool-call serialization [issue].” Add no-op patches (old_string == new_string, changes nothing) and “old_string not found,” and it churned ~4 hours; two requested features silently never landed, yet it reported COMPLETE.

Root cause of wall 2 is three things stacked:
(1) faithfully round-tripping quotes/backslashes through the tool-call parser drifts;
(2) with reasoning disabled (we force enable_thinking=false to dodge a cuBLAS long-decode crash on GB10), there’s no internal “does this string actually exist in the file?” check, so it fires malformed patches blind;
(3) it reconstructs the anchor from context instead of re-reading exact bytes.

The lesson for local-agent builders on Spark: don’t just optimize tok/s. Agentic coding is gated by two separate failure modes, token economy (rewrites) and tool-call serialization fidelity (patches), and the fix for one surfaces the other. Both need structural guardrails, not better prompts.
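The two structural guardrails described in the post can be sketched as plain functions: a pre-write hook that refuses a full-file write when the target already exists, and a patch check that rejects no-op patches, missing anchors, and anchors that only match after removing a layer of escaping. This is a minimal illustration, not the harness from the post; the function names, return shapes, and the specific escape-drift heuristic are assumptions made for the example.

```python
import os
from typing import Tuple


def pre_write_hook(path: str) -> Tuple[bool, str]:
    """Block a full-file write when the target already exists."""
    if os.path.exists(path):
        return False, "file exists: use patch instead"
    return True, "ok"


def check_patch(file_text: str, old_string: str, new_string: str) -> Tuple[bool, str]:
    """Validate a patch before applying it; return (ok, reason)."""
    if old_string == new_string:
        return False, "no-op patch: old_string == new_string"
    if old_string in file_text:
        return True, "ok"
    # Assumed heuristic for escape drift: strip one layer of JSON-style
    # escaping and see whether the anchor would then match the file.
    unescaped = old_string.replace('\\"', '"').replace("\\\\", "\\")
    if unescaped != old_string and unescaped in file_text:
        return False, "escape drift: anchor matches only after unescaping"
    return False, "old_string not found"


def apply_patch(file_text: str, old_string: str, new_string: str) -> str:
    """Apply a validated patch to the first occurrence of the anchor."""
    ok, reason = check_patch(file_text, old_string, new_string)
    if not ok:
        raise ValueError(reason)
    return file_text.replace(old_string, new_string, 1)


source = 'x = "a"\n'
drifted = check_patch(source, 'x = \\"a\\"', 'x = "b"')  # over-escaped anchor
patched = apply_patch(source, 'x = "a"', 'x = "b"')
```

Returning a reason string rather than silently failing gives the agent loop something concrete to act on, and lets the caller refuse to report completion while any patch was rejected.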
