Marc Ohmann
@marcohmann
1.6K posts

Building self-learning ontologies to turn existing data into domain-specific AI agents. Context is the product. zeros is the platform https://t.co/V8rZhzvSDO

Taiga · Joined December 2010
320 Following · 516 Followers
Marc Ohmann@marcohmann·
@0xSero How can YouTube have a channel like that buried while only showing me AI hype videos in my feed 🤦‍♂️ Great find!
0xSero@0xSero·
You ever go on Huggingface and see:
- GGUF
- Unsloth
- Llama.cpp
- Dynamic GGUF
- Q_4_M / IQ_4XL
etc.
Here's what's going on under the hood. youtube.com/watch?v=vW30o4…
[YouTube video]
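
For context on what those suffixes mean in practice: quant variants like Q4_K_M or IQ4_XS usually appear as separate .gguf files in a Hugging Face repo, and you download one and load it with llama.cpp. A minimal sketch using the huggingface_hub and llama-cpp-python packages; the repo id and filename below are placeholders, not a real model:

```python
# Sketch: fetch one quantization of a GGUF model and load it with llama.cpp.
# Repo id and filename are hypothetical; real repos list one .gguf per quant
# (Q4_K_M, Q5_K_M, IQ4_XS, ...), and lower-bit quants trade quality for VRAM/RAM.
from huggingface_hub import hf_hub_download
from llama_cpp import Llama

model_path = hf_hub_download(
    repo_id="someorg/SomeModel-GGUF",   # hypothetical repo
    filename="somemodel-Q4_K_M.gguf",   # pick the quant variant you want
)

llm = Llama(
    model_path=model_path,
    n_ctx=8192,        # context window to allocate
    n_gpu_layers=-1,   # offload all layers to GPU if they fit
)

print(llm("Explain what a K-quant is in one sentence.", max_tokens=64)["choices"][0]["text"])
```
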
Marc Ohmann@marcohmann·
Hermes with local Qwen3.5-122B is stout!
[image attached]
Marc Ohmann@marcohmann·
But Qwen can be lazy, which is quite annoying. GLM quickly sniffs out Qwen's excuses and gets things moving again. Then Qwen behaves.
Marc Ohmann@marcohmann·
Qwen3.5-122B-A10B is such a gift to local agentic, multi-step reasoning tasks. It's just a joy to use. It's not GLM-5.1, but it requires less than 1/4 of the VRAM (FP8 to FP8). I've been pushing it pretty hard the last few days. It's fast on my rig even at 262k context. Most of the session is >100 TPS until I get over 80k tokens of context. I still have to fall back to GLM from time to time, but the freedom of having Qwen for the heavy lifting is amazing. This amount of essentially free intelligence at my fingertips is what I've been waiting years for, and it's only going to get better.
Marc Ohmann@marcohmann·
I end up opening the door on the case when I'm hitting it hard. That removes a ton of heat fast. Right now the cards are in slots 1 and 5, but I'm still playing around trying to figure out how I'm going to get 2 more in there. I'll probably need a printed bracket and risers. The card in slot 1 runs 5-7C warmer than the lower card.
Mike Bradley@The_Only_Signal·
Nice. I've been thinking a lot about this. Trying to build for max possible peak power on the two cards, but it's a tricky problem. Even if you get the power sorted, the thermals become a separate problem to untangle. I'm wagering you went max separation on the two cards? Top slot + bottom slot? Anything creative with fans?
Mike Bradley@The_Only_Signal·
2x RTX PRO 6000 tower incoming…
[image attached]
Marc Ohmann@marcohmann·
@The_Only_Signal I capped them at 450W yesterday. I was running full power but started overworking the PSU. 450W has also been very good at helping keep temps down. I could probably go to 500 comfortably but don't really need to.
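
For anyone curious how a cap like that is typically applied: nvidia-smi can set a per-GPU power limit at runtime (root required, and it resets on reboot unless re-applied). A minimal sketch, where the 450 W value comes from the tweet above but the GPU indices are an assumption for a two-card rig:

```python
# Sketch: cap each GPU's power limit with nvidia-smi (requires root privileges).
# GPU indices 0 and 1 are assumed; adjust for your own card layout.
import subprocess

POWER_LIMIT_W = 450

for gpu_index in (0, 1):
    subprocess.run(
        ["nvidia-smi", "-i", str(gpu_index), "-pl", str(POWER_LIMIT_W)],
        check=True,
    )

# Confirm the new limits.
subprocess.run(
    ["nvidia-smi", "--query-gpu=index,power.limit", "--format=csv"],
    check=True,
)
```
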
Mike Bradley@The_Only_Signal·
@marcohmann Nice! Do you run power capped at all on the GPUs in your current build?
Marc Ohmann@marcohmann·
@The_Only_Signal Only 120V right now with the 2x 6000s, but my goal is to add two more soon and switch to 240V. I'm moving to a new office that I had wired just for it 😁
Mike Bradley@The_Only_Signal·
@marcohmann Nice! Out of curiosity, are you running the PSU off 208V or 240V power? I checked that one out, and on my meager 120V supply it would have turned into a 1500W Platinum-rated PSU.
0xSero@0xSero·
I went from sm86 battles to sm120 battles. Why does god give me his hardest battles.
Marc Ohmann@marcohmann·
So when Qwen3.5-122B gets off track, I can switch to full-precision GLM-5.1 and run /verify to straighten things out, then go back to Qwen.
Marc Ohmann@marcohmann·
I've had this opencode /verify command for a while and use it a lot when I sense BS. It gets used more with the smaller models to quickly get them back on track.

First, log this verification request for pattern analysis:

```python
zeros_log_verify_invocation()
```

**STOP and reflect before proceeding.**

Answer these questions honestly:

1. **Confidence**: How confident are you in your current assessment (low/medium/high)? Why?

2. **Gaps**: What did you NOT verify? List specific things you assumed but didn't confirm:
   - Files you referenced but didn't read?
   - Parent classes you didn't check?
   - User sessions you didn't trace?
   - Git history you didn't review?

3. **Improve**: What concrete steps would increase your confidence?
   - What tool calls would verify your assumptions?
   - What additional context would help?

4. **Action**: Based on the above, do you need to:
   - [ ] Go back and verify something before proceeding?
   - [ ] Adjust your conclusion?
   - [ ] Proceed with noted caveats?

Be specific. "I'm confident" without evidence is not acceptable. Show your work.
Marc Ohmann@marcohmann·
I'm running my 600W RTX 6000s side by side in slots 1 and 3. They nearly touch. This was because I wanted to see if it would be feasible to add 2 more in slots 5 and 7 and still dissipate heat. For inference, I haven't throttled them down yet. They can spike to 600W and momentarily hit 80C with multiple agents hitting them at up to 200k context.
[image attached]
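
A quick way to see how close the cards get to those spikes is to poll nvidia-smi while a job runs. A minimal sketch; the polling interval and field selection are my choices, not part of the setup described above:

```python
# Sketch: poll per-GPU temperature, power draw, and utilization once a second.
import subprocess
import time

QUERY = "index,temperature.gpu,power.draw,utilization.gpu"

while True:
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    print(out)   # one line per GPU, e.g. "0, 78, 592.14 W, 99 %"
    time.sleep(1.0)
```
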
Kyle Hessling@KyleHessling1·
Am I mistaken that, if the delta seen between the Qwen 3.6 35B MoE and the Qwen 3.5 35B MoE holds, the 3.6 dense 27B will unseat Kimi K2.5 at less than 3% of the model size? Remember when we were all considering buying 2 or 4 Mac Studios just to run REAP prunes of that model in Q1? We could soon have similar capability on a 3090. Exciting acceleration, to say the least!
[image attached]
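
Rough back-of-the-envelope numbers behind that claim; the parameter counts and bits-per-weight here are my assumptions, not figures from the tweet:

```python
# Back-of-the-envelope check (all sizes are assumptions, not published figures):
# - assume Kimi K2.5 is ~1T total parameters, like Kimi K2
# - assume the 27B dense model is quantized to ~4.5 bits/weight (a Q4_K_M-ish quant)
KIMI_PARAMS = 1_000e9
DENSE_PARAMS = 27e9
BITS_PER_WEIGHT = 4.5

size_ratio = DENSE_PARAMS / KIMI_PARAMS
weights_gb = DENSE_PARAMS * BITS_PER_WEIGHT / 8 / 1e9

print(f"27B is {size_ratio:.1%} of an assumed 1T-parameter model")   # ~2.7%, under 3%
print(f"~{weights_gb:.1f} GB of weights at {BITS_PER_WEIGHT} bits")  # ~15 GB, inside a 24 GB 3090 with room left for KV cache
```
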
Marc Ohmann reposted
Kimi.ai@Kimi_Moonshot·
We push Prefill/Decode disaggregation beyond a single cluster: cross-datacenter + heterogeneous hardware, unlocking the potential for significantly lower cost per token.

This was previously blocked by KV cache transfer overhead. The key enabler is our hybrid model (Kimi Linear), which reduces KV cache size and makes cross-DC PD practical.

Validated on a 20x scaled-up Kimi Linear model:
✅ 1.54× throughput
✅ 64% ↓ P90 TTFT
→ Directly translating into lower token cost.

More in Prefill-as-a-Service: arxiv.org/html/2604.1503…
[image attached]
Marc Ohmann@marcohmann·
@WolfHumble Enjoy!

TTFT Benchmark: qwen3.5-122b-fp8
Testing 128,000 tokens...
Input size: 492,798 chars (~123,199 tokens)
TTFT: 10035ms (10.035s)
Throughput: 9.9 tok/s

Pretty linear slowdown, with 1k tokens at 1.08s and 92.4 TPS.
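
For anyone who wants to reproduce a measurement like this against their own box: TTFT is just the time from sending the request to receiving the first streamed token. A minimal sketch against an OpenAI-compatible server; the URL, model name, and prompt construction are assumptions, not the harness used above:

```python
# Sketch: measure time-to-first-token (TTFT) against a local OpenAI-compatible server.
# Endpoint and model name are placeholders for whatever you are serving locally.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

prompt = "word " * 100_000   # crude way to build a large input; swap in real long-context text

start = time.perf_counter()
stream = client.chat.completions.create(
    model="qwen3.5-122b-fp8",   # whatever name your server exposes
    messages=[{"role": "user", "content": prompt}],
    max_tokens=64,
    stream=True,
)

first_token_at = None
chunks = 0   # streamed chunks roughly correspond to tokens
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()
        chunks += 1

ttft = first_token_at - start
total = time.perf_counter() - start
print(f"TTFT: {ttft*1000:.0f} ms, decode throughput: {chunks / (total - ttft):.1f} tok/s")
```
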
Richard Taubo@WolfHumble·
@marcohmann If you have time to answer, I was wondering: What is the typical Time To First Token (TTFT) for a 128K-token input for this 122B-FP8 model? Thanks! 😊
Marc Ohmann@marcohmann·
Qwen3.5 122B-FP8 on 2x RTX 6000 Blackwells. It got stuck in endless loops with no weights loading until I turned off NBIO IOMMU in the BIOS. Now 66 TPS!
[image attached]
Marc Ohmann@marcohmann·
Qwen3.5 122B-FP8 performs quite well. It's not GLM-5.1, obviously, but it reasons very well and doesn't require much nudging like Gemma 4 and smaller models do.
Marc Ohmann@marcohmann·
@The_Only_Signal This is so similar to my build. Even the same case. Except I went with the HELA 2500Rz PSU
Mike Bradley@The_Only_Signal·
Build List:

Platform
• CPU: AMD Threadripper PRO 7965WX
• Motherboard: ASUS Pro WS WRX90E-SAGE SE (WRX90, EEB, 128 PCIe 5.0 lanes, dual 10GbE, IPMI)
• RAM: 128GB DDR5-4800 ECC RDIMM — 4× Samsung M321R4GA3BB6-CQK

Compute
• 2× NVIDIA RTX PRO 6000 Blackwell (96GB GDDR7 ECC each)
• 192GB total VRAM, x16/x16 PCIe 5.0, 500W cap per card

Case
• Corsair 9000D RGB Airflow (SSI-EEB, no fans included)

Power
• MSI MEG Ai1600T PCIE5 — 1600W 80+ Titanium, dual native 12V-2x6
• Dedicated 20A 120V circuit

Cooling
• CPU: Noctua NH-U14S TR5-SP6, dual NF-A15 140mm
• Front intake: 3× iCUE LINK RX140 MAX
• Top exhaust: 3× iCUE LINK RX140 MAX
• Rear exhaust: 2× iCUE LINK RX120 RGB

Storage
• Samsung 9100 PRO 8TB w/heatsink — PCIe 5.0 x4, 14,800 MB/s (OS, models, stack)
• 2TB SSD (scratch — Qdrant, datasets, embeddings)

Networking
• Dual 10GbE onboard (Intel X710)
Marc Ohmann@marcohmann·
@TheAhmadOsman All I think about is how to get non-lobotomized GLM-5.1 locally with decent context
Ahmad@TheAhmadOsman·
Currently running GLM-5.1 locally. Cannot believe this thing is running on my own GPUs; it's really smart.
Ahmad@TheAhmadOsman·
More people should follow this guy. He knows what he's talking about.
Mike Bradley@The_Only_Signal

@TheAhmadOsman Great video talking about the board if you want to play a PCIe-based drinking game 😁
