Marc Ohmann
@marcohmann
1.6K posts

Building self-learning ontologies to turn existing data into domain-specific AI agents. Context is the product. zeros is the platform https://t.co/V8rZhzvSDO

Taiga · Joined December 2010
320 Following · 516 Followers
Marc Ohmann@marcohmann·
@0xSero How can YouTube have a channel like that buried while only showing me AI hype videos in my feed 🤦‍♂️ Great find!
0xSero@0xSero·
You ever go on Huggingface and see:
- GGUF
- Unsloth
- Llama.cpp
- Dynamic GGUF
- Q_4_M / IQ_4XL
etc.
Here's what's going on under the hood. youtube.com/watch?v=vW30o4…
[YouTube video]
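
For context on what those suffixes mean in practice: quant variants like Q4_K_M or IQ4_XS usually appear as separate .gguf files in a Hugging Face repo, and you download one and load it with llama.cpp. A minimal sketch using the huggingface_hub and llama-cpp-python packages; the repo id and filename below are placeholders, not a real model:

```python
# Sketch: fetch one quantization of a GGUF model and load it with llama.cpp.
# Repo id and filename are hypothetical; real repos list one .gguf per quant
# (Q4_K_M, Q5_K_M, IQ4_XS, ...), and lower-bit quants trade quality for VRAM/RAM.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="someorg/SomeModel-GGUF",   # hypothetical repo
    filename="somemodel-Q4_K_M.gguf",   # pick the quant variant you want
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to GPU if they fit
)

print(llm("Explain what a K-quant is in one sentence.", max_tokens=64)["choices"][0]["text"])
```
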
Marc Ohmann@marcohmann·
Hermes with local Qwen3.5-122B is stout!
[image attached]
Marc Ohmann@marcohmann·
But Qwen can be lazy, which is quite annoying. GLM quickly sniffs out Qwen's excuses and gets things moving again. Then Qwen behaves.
Marc Ohmann@marcohmann·
Qwen3.5-122B-A10B is such a gift to local agentic, multi-step reasoning tasks. It's just a joy to use. It's not GLM-5.1, but it requires less than 1/4 of the VRAM (FP8 to FP8). I've been pushing it pretty hard the last few days. It's fast on my rig even at 262k context. Most of the session is >100 TPS until I get over 80k tokens of context. I still have to fall back to GLM from time to time, but the freedom of having Qwen for the heavy lifting is amazing. This amount of essentially free intelligence at my fingertips is what I've been waiting years for, and it's only going to get better.
Marc Ohmann@marcohmann·
I end up opening the door on the case when I'm hitting it hard. That removes a ton of heat fast. Right now the cards are in slots 1 and 5, but I'm still playing around trying to figure out how I'm going to get 2 more in there. I'll probably need a printed bracket and risers. The card in slot 1 runs 5-7C warmer than the lower card.
Mike Bradley@The_Only_Signal·
Nice. I've been thinking a lot about this. Trying to build for max possible peak power on the two cards, but it's a tricky problem. Even if you get the power sorted, the thermals become a separate problem to untangle. I'm wagering you went max separation on the two cards? Top slot + bottom slot? Anything creative with fans?
Mike Bradley@The_Only_Signal·
2x RTX PRO 6000 tower incoming…
[image attached]
Marc Ohmann@marcohmann·
@The_Only_Signal I capped them at 450W yesterday. I was running full power but started overworking the PSU. 450W has also been very good at helping keep temps down. I could probably go to 500 comfortably but don't really need to.
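
For anyone curious how a cap like that is typically applied: nvidia-smi can set a per-GPU power limit at runtime (root required, and it resets on reboot unless re-applied). A minimal sketch, where the 450 W value comes from the tweet above but the GPU indices are an assumption for a two-card rig:

```python
# Sketch: cap each GPU's power limit with nvidia-smi (requires root privileges).
# GPU indices 0 and 1 are assumed; adjust for your own card layout.
import subprocess

POWER_LIMIT_W = 450

for gpu_index in (0, 1):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )

# Confirm the new limits.
subprocess.run(
    ["nvidia-smi", "--query-gpu=index,power.limit", "--format=csv"],
    check=True,
)
```
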
Mike Bradley@The_Only_Signal·
@marcohmann Nice! Do you run power capped at all on the GPUs in your current build?
Marc Ohmann@marcohmann·
@The_Only_Signal Only 120V right now with the 2x 6000s, but my goal is to add two more soon and switch to 240V. I'm moving to a new office that I had wired just for it 😁
Mike Bradley@The_Only_Signal·
@marcohmann Nice! Out of curiosity, are you running the PSU off 208V or 240V power? I checked that one out, and on my meager 120V supply it would have turned into a 1500W Platinum-rated PSU.
0xSero@0xSero·
I went from sm86 battles to sm120 battles. Why does god give me his hardest battles.
Marc Ohmann@marcohmann·
So when Qwen3.5-122B gets off track, I can switch to full-precision GLM-5.1 and run /verify to straighten things out, then go back to Qwen.
Marc Ohmann@marcohmann·
I've had this opencode /verify command for a while and use it a lot when I sense BS. It gets used more with the smaller models to quickly get them back on track.

First, log this verification request for pattern analysis:

```python
zeros_log_verify_invocation()
```

**STOP and reflect before proceeding.**

Answer these questions honestly:

1. **Confidence**: How confident are you in your current assessment (low/medium/high)? Why?

2. **Gaps**: What did you NOT verify? List specific things you assumed but didn't confirm:
   - Files you referenced but didn't read?
   - Parent classes you didn't check?
   - User sessions you didn't trace?
   - Git history you didn't review?

3. **Improve**: What concrete steps would increase your confidence?
   - What tool calls would verify your assumptions?
   - What additional context would help?

4. **Action**: Based on the above, do you need to:
   - [ ] Go back and verify something before proceeding?
   - [ ] Adjust your conclusion?
   - [ ] Proceed with noted caveats?

Be specific. "I'm confident" without evidence is not acceptable. Show your work.
Marc Ohmann@marcohmann·
I'm running my 600W RTX 6000s side by side in slots 1 and 3. They nearly touch. This was because I wanted to see if it would be feasible to add 2 more in slots 5 and 7 and still dissipate heat. For inference, I haven't throttled them down yet. They can spike to 600W and momentarily hit 80C with multiple agents hitting them at up to 200k context.
[image attached]
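
A quick way to see how close the cards get to those spikes is to poll nvidia-smi while a job runs. A minimal sketch; the polling interval and field selection are my choices, not part of the setup described above:

```python
# Sketch: poll per-GPU temperature, power draw, and utilization once a second.
import subprocess
import time

QUERY = "index,temperature.gpu,power.draw,utilization.gpu"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)   # one line per GPU, e.g. "0, 78, 592.14 W, 99 %"
    time.sleep(1.0)
```
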
Kyle Hessling@KyleHessling1·
Am I mistaken that, if the delta seen between the Qwen 3.6 35B MoE and the Qwen 3.5 35B MoE holds, the 3.6 dense 27B will unseat Kimi K2.5 at less than 3% of the model size? Remember when we were all considering buying 2 or 4 Mac Studios just to run REAP prunes of that model in Q1? We could soon have similar capability on a 3090. Exciting acceleration, to say the least!
[image attached]
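
Rough back-of-the-envelope numbers behind that claim; the parameter counts and bits-per-weight here are my assumptions, not figures from the tweet:

```python
# Back-of-the-envelope check (all sizes are assumptions, not published figures):
# - assume Kimi K2.5 is ~1T total parameters, like Kimi K2
# - assume the 27B dense model is quantized to ~4.5 bits/weight (a Q4_K_M-ish quant)
KIMI_PARAMS = 1_000e9
DENSE_PARAMS = 27e9
BITS_PER_WEIGHT = 4.5

size_ratio = DENSE_PARAMS / KIMI_PARAMS
weights_gb = DENSE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

print(f"27B is {size_ratio:.1%} of an assumed 1T-parameter model")   # ~2.7%, under 3%
print(f"~{weights_gb:.1f} GB of weights at {BITS_PER_WEIGHT} bits")  # ~15 GB, inside a 24 GB 3090 with room left for KV cache
```
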
Marc Ohmann reposted
Kimi.ai@Kimi_Moonshot·
We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token.

This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical.

Validated on a 20x scaled-up Kimi Linear model:
✅ 1.54× throughput
✅ 64% ↓ P90 TTFT
→ Directly translating into lower token cost.

More in Prefill-as-a-Service: arxiv.org/html/2604.1503…
[image attached]
Marc Ohmann@marcohmann·
@WolfHumble Enjoy!

TTFT Benchmark: qwen3.5-122b-fp8
Testing 128,000 tokens...
Input size: 492,798 chars (~123,199 tokens)
TTFT: 10035ms (10.035s)
Throughput: 9.9 tok/s

Pretty linear slowdown, with 1k tokens at 1.08s and 92.4 TPS.
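
For anyone who wants to reproduce a measurement like this against their own box: TTFT is just the time from sending the request to receiving the first streamed token. A minimal sketch against an OpenAI-compatible server; the URL, model name, and prompt construction are assumptions, not the harness used above:

```python
# Sketch: measure time-to-first-token (TTFT) against a local OpenAI-compatible server.
# Endpoint and model name are placeholders for whatever you are serving locally.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

prompt = "word " * 100_000   # crude way to build a large input; swap in real long-context text

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen3.5-122b-fp8",   # whatever name your server exposes
    messages=[{"role": "user", "content": prompt}],
    max_tokens=64,
    stream=True,
)

first_token_at = None
chunks = 0   # streamed chunks roughly correspond to tokens
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

ttft = first_token_at - start
total = time.perf_counter() - start
print(f"TTFT: {ttft*1000:.0f} ms, decode throughput: {chunks / (total - ttft):.1f} tok/s")
```
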
Richard Taubo@WolfHumble·
@marcohmann If you have time to answer, I was wondering: What is the typical Time To First Token (TTFT) for a 128K-token input for this 122B-FP8 model? Thanks! 😊
Marc Ohmann@marcohmann·
Qwen3.5 122B-FP8 on 2x RTX 6000 Blackwells. It got stuck in endless loops with no weights loading until I turned off NBIO IOMMU in the BIOS. Now 66 TPS!
[image attached]
Marc Ohmann@marcohmann·
Qwen3.5 122B-FP8 performs quite well. It's not GLM-5.1, obviously, but it reasons very well and doesn't require much nudging like Gemma 4 and smaller models do.
Marc Ohmann@marcohmann·
@The_Only_Signal This is so similar to my build. Even the same case. Except I went with the HELA 2500Rz PSU
Mike Bradley@The_Only_Signal·
Build List:

Platform
• CPU: AMD Threadripper PRO 7965WX
• Motherboard: ASUS Pro WS WRX90E-SAGE SE (WRX90, EEB, 128 PCIe 5.0 lanes, dual 10GbE, IPMI)
• RAM: 128GB DDR5-4800 ECC RDIMM — 4× Samsung M321R4GA3BB6-CQK

Compute
• 2× NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 ECC each)
• 192GB total VRAM, x16/x16 PCIe 5.0, 500W cap per card

Case
• Corsair 9000D RGB Airflow (SSI-EEB, no fans included)

Power
• MSI MEG Ai1600T PCIE5 — 1600W 80+ Titanium, dual native 12V-2x6
• Dedicated 20A 120V circuit

Cooling
• CPU: Noctua NH-U14S TR5-SP6, dual NF-A15 140mm
• Front intake: 3× iCUE LINK RX140 MAX
• Top exhaust: 3× iCUE LINK RX140 MAX
• Rear exhaust: 2× iCUE LINK RX120 RGB

Storage
• Samsung 9100 PRO 8TB w/heatsink — PCIe 5.0 x4, 14,800 MB/s (OS, models, stack)
• 2TB SSD (scratch — Qdrant, datasets, embeddings)

Networking
• Dual 10GbE onboard (Intel X710)
Marc Ohmann@marcohmann·
@TheAhmadOsman All I think about is how to get non-lobotomized GLM-5.1 locally with decent context
Ahmad@TheAhmadOsman·
Currently running GLM-5.1 locally. Cannot believe this thing is running on my own GPUs; it's really smart.
Ahmad@TheAhmadOsman·
More people should follow this guy. He knows what he's talking about.
Mike Bradley@The_Only_Signal

@TheAhmadOsman Great video talking about the board if you want to play a PCIe-based drinking game 😁
