Tech2Wild
148 posts

@Tech2Wild
🎮 Tech, gaming, AI, and everything in between. 🤖 Building with it, not just talking about it. 🔥 From the mind of @ToNYD2WiLD
Joined March 2026
59 Following, 99 Followers


@Tech2Wild Two 3090s stacked on top of each other, just sitting on the PSU outside the case, got me feeling a type of way

@Tech2Wild Qwen3.6 35B A3B AutoRound fits on a single 24GB GPU with 262K context and an fp8 KV cache, and runs at 160 tps on an RTX 3090 via vLLM... It produces much better code than Gemma 4 12B. Unfair to compare them.

@YahiaAh87164950 @sakurayukiai 2 GPUs gave me more speed on 27B: it went from 70 tok/s to 120

@Tech2Wild I can’t say anything negative about either model other than that 12B’s native vernacular is too informal for my liking… it has grok4.1/deepseek sentence structure and punctuation. Otherwise, I think that 12B is the better chat model and 35B the better reasoner/researcher.

@malikwas1f Good call, I been running your recipes bro, thanks for what you do

@sakurayukiai I have 35B running now. The issue I’m having is that 2 GPUs on 27B give me almost the same speed as 1 GPU on 35B

@Tech2Wild If you can fit the 35B footprint, Qwen is wild. Only 3B active params means it runs circles around Gemma's 12B dense decode speeds, but Gemma 4 is way friendlier on a single consumer GPU.
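
A rough sketch of why 3B active params decodes so much faster than a 12B dense model, assuming single-stream decoding is limited by memory bandwidth (every active weight is read once per token). The 4-bit weight size is an assumption for illustration, KV cache reads and kernel overhead are ignored, and `decode_ceiling_tps` is a hypothetical helper, so treat the numbers as ceilings rather than predictions.

```python
# Back-of-the-envelope decode ceiling for a bandwidth-bound GPU.
# Assumption: each generated token reads every *active* weight once.

def decode_ceiling_tps(active_params_billion, bytes_per_param, bandwidth_gb_s):
    """Upper bound on tokens/s if weight reads were the only cost."""
    bytes_per_token = active_params_billion * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

RTX_3090_BANDWIDTH_GB_S = 936   # spec-sheet memory bandwidth
BYTES_PER_PARAM = 0.5           # assumption: roughly 4-bit weights

moe_tps = decode_ceiling_tps(3, BYTES_PER_PARAM, RTX_3090_BANDWIDTH_GB_S)
dense_tps = decode_ceiling_tps(12, BYTES_PER_PARAM, RTX_3090_BANDWIDTH_GB_S)

print(f"3B active (MoE) ceiling: {moe_tps:.0f} tok/s")
print(f"12B dense ceiling:       {dense_tps:.0f} tok/s")
print(f"ratio:                   {moe_tps / dense_tps:.1f}x")
```

The ratio depends only on the active parameter counts as long as both models use the same weight precision. The MoE model still has to hold all 35B params in VRAM, which is the "if you can fit the footprint" part.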

@gospaceport Sir, I literally just watched your video on your Quad Build from 9 months ago 🙏🏽. Debating whether to go to Gen 5 or just grab one of the motherboards you showed and stay on Gen 4.

@Tech2Wild Wrote a breakdown on how the mechanics and math of speculative decoding actually work: leetllm.com/learn/speculat…
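
For a feel of the math involved, here is a minimal sketch of the standard expected-tokens formula for speculative decoding. It is not taken from the linked breakdown. It assumes each drafted token is accepted independently with probability `alpha`, and that the draft model proposes `gamma` tokens per target-model pass (both names are illustrative).

```python
# Expected tokens produced per target-model forward pass when a draft
# model proposes `gamma` tokens and each is accepted with probability
# `alpha` (assumed independent and identical across positions).

def expected_tokens_per_step(alpha, gamma):
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if gamma < 0:
        raise ValueError("gamma must be non-negative")
    if alpha == 1.0:
        # Every draft token accepted, plus one token from the target model.
        return float(gamma + 1)
    # Closed form of the geometric sum 1 + alpha + ... + alpha**gamma.
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

for alpha in (0.5, 0.8, 0.9):
    tokens = expected_tokens_per_step(alpha, gamma=4)
    print(f"alpha={alpha}: {tokens:.2f} tokens per target pass")
```

The real speedup is lower than this count, because the draft model's own passes are not free.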

@Tech2Wild Q2 perplexity hit on a 550B is so brutal you're basically paying a massive latency and hardware tax to get the reasoning of a solid 70B. Wild engineering flex, but the math is pretty unforgiving.

Ran NVIDIA Nemotron-3-Ultra-550B fully local across 2 DGX Sparks (188GB split via llama.cpp RPC) 🤯 Findings: it works and reasons, but at ~5 tok/s, since RPC is round-trip-bound (dual-node is slower per token than a single node; it's a capacity play). But I question whether bigger is better: 2-bit 550B barely tops a clean 4-bit ~285B. Can we agree?


@outsource_ Model: unsloth/NVIDIA-Nemotron-3-Ultra-550B-A55B-GGUF, UD-Q2_K_XL (~188 GiB, 6 shards)
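
A quick sanity check on the sizes quoted in this thread, as a sketch: it works out the average bits per weight implied by a ~188 GiB file for 550B params, and the raw weight size of a ~285B model at exactly 4 bits. Real 4-bit quants carry extra scale metadata, so the second figure is a lower bound. The helper names are made up for illustration.

```python
# Size arithmetic for quantized checkpoints (weights only, no KV cache).

GIB = 2 ** 30  # bytes in one GiB

def bits_per_weight(file_size_gib, params_billion):
    """Average bits stored per parameter for a given file size."""
    return file_size_gib * GIB * 8 / (params_billion * 1e9)

def weight_size_gb(params_billion, bits):
    """Raw weight size in decimal GB at a flat bit width."""
    return params_billion * 1e9 * bits / 8 / 1e9

bpw_550b = bits_per_weight(188, 550)
size_285b_4bit = weight_size_gb(285, 4)

print(f"550B at ~188 GiB: {bpw_550b:.2f} bits per weight on average")
print(f"285B at 4.0 bits: {size_285b_4bit:.1f} GB of raw weights")
```

So the "2-bit" file averages closer to 3 bits per weight, and the two configurations being compared sit in a broadly similar memory class.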