
Michał Piszczek

@cdiamond
CTO @ Archdesk | Systems where physics meets economics. Ex-Hacker. Ex-Fintech CEO. Nullius in verba. 🖖 AI does not fail. Human judgment does.






16 parallel runs of Gemma 4 26B A4B on a single NVIDIA DGX Spark! Pushing 18 tok/s per instance and a 300 tok/s aggregate. It can even hit 32 parallel runs. This level of concurrency highlights how efficient the architecture is.



Introducing Mistral OCR 4. It creates structure with bounding boxes, block classification, and inline confidence scores in 170 languages. 🧵👇




1-bit GLM-5.2 GGUF vs. Claude 4.8 Opus vs. GPT-5.5

We gave 3 models the same prompt and compared one-shot outputs. The 1-bit GLM-5.2 GGUF ran locally on a Mac Studio M3 Ultra with 256GB RAM at ~21.6 tok/s. Which output do you like best?

GGUF: huggingface.co/unsloth/GLM-5.…


Wait we actually just broke 1T tokens in a day for the first time on OpenRouter :O Please keep contributing to the most awesome project I've ever worked on to help make Hermes Agent the best software stack on the planet! Thank you contributors🍻🍻





Why is AI writing still so bad









The Emperor Has No Clothes: Why the AI Infrastructure Buildout Math Doesn't Work

I have to give IBM CEO Arvind Krishna credit. He's saying what many of us in this industry have been thinking but haven't been willing to say out loud. The math just doesn't add up.

Here's what I'm seeing that's deeply troubling. We're in the middle of another mass hallucination. Just like the dot-com bubble, just like blockchain, just like the metaverse, everyone is convinced that building massive data centers will automatically create massive wealth. But here's the thing about building infrastructure: you actually have to sell what's inside it.

Let's talk numbers. The planned data center buildout over the next 5-10 years is staggering. We're talking about commitments in the hundreds of gigawatts globally. The capital expenditure commitments are in the trillions. Yet when you look at the actual demand signals (not the projections, not the potential, but the actual consumption patterns), there's a massive gap. These AI companies are betting everything on demand that simply doesn't exist at the scale they're planning for.

Let me be direct. AI services are expensive. Enterprise adoption is slow. Consumer AI is still finding its footing. And the compute capacity being promised by the hyperscalers requires a level of demand that would represent a fundamental shift in how businesses consume technology. That's a big ask.

I've seen this pattern before. The overbuilding. The belief that if you build it, they will come. The groupthink that turns critical analysis into heresy. The result is always the same. Companies are going to touch the stove. We're going to see massive write-downs. We're going to see pivots, shutdowns, and strategic reviews. We're going to see companies that spent years and billions trying to be the AI infrastructure leader become case studies in how not to read a market.

The IBM CEO is right. The math doesn't work. And unlike 1999, we don't have the excuse of "we didn't know." We know exactly what's happening. We just don't want to believe it because the alternative, being a skeptic while everyone else is piling in, feels like career suicide. It's not. The ones who survive the next decade will be the ones who built for reality, not fantasy. Wake up. The emperor has no clothes.

As reported by Futurism, Krishna laid out striking calculations: a 1 gigawatt data center costs roughly $80 billion today. If one company commits 20-30 gigawatts, that's $1.5 trillion in capital expenditure. The total commitments across the industry for chasing AGI are approximately 100 gigawatts, equaling $8 trillion. To break even, you'd need $800 billion in profit just to cover the interest. That's not investment. That's hoping.

futurism.com/artificial-int…
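The figures quoted above can be checked with a few lines of arithmetic. This is a rough sketch, not Krishna's own model: the cost per gigawatt and the 100 GW total come from the post, while the 10% annual interest rate is an assumption inferred from the ratio of $800 billion to $8 trillion. The function names are illustrative. Note that at $80 billion per gigawatt, 20-30 GW works out to $1.6-2.4 trillion, so the quoted $1.5 trillion reads as a rounded lower-end figure.

```python
# Back-of-the-envelope check of the capex figures quoted in the post.
COST_PER_GW_USD = 80e9          # from the post: ~$80 billion per gigawatt
INDUSTRY_COMMITMENT_GW = 100    # from the post: ~100 GW across the industry
ASSUMED_INTEREST_RATE = 0.10    # ASSUMPTION: implied by $800B / $8T, not stated in the post


def capex_usd(gigawatts, cost_per_gw=COST_PER_GW_USD):
    """Capital expenditure for a given capacity at a flat cost per gigawatt."""
    return gigawatts * cost_per_gw


def annual_interest_usd(capex, rate=ASSUMED_INTEREST_RATE):
    """Yearly interest on the capex at a simple (non-compounding) rate."""
    return capex * rate


industry_capex = capex_usd(INDUSTRY_COMMITMENT_GW)
industry_interest = annual_interest_usd(industry_capex)

print(f"Single company, 20 GW: ${capex_usd(20) / 1e12:.1f}T")
print(f"Single company, 30 GW: ${capex_usd(30) / 1e12:.1f}T")
print(f"Industry, 100 GW:      ${industry_capex / 1e12:.1f}T")
print(f"Interest at 10%:       ${industry_interest / 1e9:.0f}B per year")
```

Changing `ASSUMED_INTEREST_RATE` shows how sensitive the break-even figure is to the cost of capital: at 5% the yearly interest halves.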





Google's Gemma 4 26B A4B QAT hits 25+ tokens/sec decode and 320+ tokens/sec prefill on 8 GB VRAM (RTX 4060) + 16 GB RAM using TurboQuant

Prefill just went from 200 → 320+ tok/s on the same 8GB card. 1.6x, no new hardware, no new quant, just a KV cache trick stacked on top of the Gemma 4 26B MoE setup from a few days ago.

A few days ago I posted Gemma 4 26B A4B hitting 28 tok/s decode on 8GB VRAM using native MTP. Prefill was stuck around 200 tok/s. Fair callout by the community.

So today I tested something I'd already been meaning to try: TheTom/llama-cpp-turboquant, the TurboQuant KV cache fork by Tom Turney (@no_stp_on_snek). (GitHub link in the comments.) Thanks to him, the fork just got resynced to mainline, so MTP + TurboQuant now run together cleanly. (I didn't see any meaningful gains from using MTP with this setup, but you can try.)

The flags (no MTP):
-m gemma-4-26B-A4B-it-qat-UD-Q4_K_XL.gguf -cnv -c 64000 --cache-type-k q8_0 --cache-type-v turbo3

Results on the same RTX 4060 8GB, tested with a 27k token prompt at 64k context loaded:
Prefill: 200 tok/s → 320+ tok/s
Decode: stayed above 25 tok/s (without MTP)

Why it works: TurboQuant uses Walsh-Hadamard rotation + polar quantization on the KV cache. Keys are sensitive to compression, values much less so, so it splits the difference: K stays at q8_0, V drops to turbo3 (~3 bits).

Bonus from the memory savings: the same 8GB card can now stretch to 100-120k context with minimal decode penalty. It should now be snappier with any agent harness such as Hermes Agent, without compromising on intelligence.

If you're already running Gemma 4 on a small card, this stacks on top for free. Try --cache-type-k q8_0 --cache-type-v turbo3 on your setup and report back what your prefill/decode split looks like. Unsloth model GGUF and llama.cpp TurboQuant fork links in the comments.

What's your prefill number before vs. after?
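To illustrate the Walsh-Hadamard rotation mentioned above, here is a minimal pure-Python sketch. It is not the fork's implementation and does not cover polar quantization or the turbo3 format; the function names `fwht` and `l2_norm` are made up for this example. It only shows the general property that makes such a rotation useful before low-bit quantization: the transform is orthonormal, so it preserves the vector's length while spreading a single large outlier evenly across all coordinates.

```python
import math


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(v) for v in values]
    h = 1
    while h < n:
        # Butterfly step: combine pairs of entries that are h apart.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)  # normalise so the transform is orthonormal
    return [v * scale for v in out]


def l2_norm(values):
    """Euclidean length of a vector."""
    return math.sqrt(sum(v * v for v in values))


# A vector dominated by one outlier, which is hard to quantize at low bit width.
outlier = [8.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
rotated = fwht(outlier)

print("max |x| before:", max(abs(v) for v in outlier))
print("max |x| after: ", max(abs(v) for v in rotated))
print("norm before/after:", l2_norm(outlier), l2_norm(rotated))
```

Because the normalised transform is its own inverse, applying `fwht` a second time recovers the original vector, which is why a rotation like this can be undone after the quantized values are read back.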










