

Sudo su (@sudoingX)
GPU/local LLM. more RAM and OSS... everywhere
Putting out a wish to the universe: I need more compute. If I can get more, I will make sure every machine, from a small phone to a bootstrapped RTX 3090 node, can run frontier intelligence fast with minimal intelligence loss. I have hit page 2 of Hugging Face, released 3 model family compressions, and got GLM-4.7 running on a MacBook (quantization sketch below): huggingface.co/0xsero. My beast just isn't enough, and I've already spent $2K USD renting GPUs on top of the credits provided by Prime Intellect and Hotaisle.
———
If you believe in what I do, help me get this to Nvidia; maybe they will bless me with the power to keep making local AI more accessible 🙏
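For the MacBook part, this is roughly the flow: a minimal sketch using the mlx-lm toolchain. The Hugging Face repo id below is a placeholder, not a confirmed GLM-4.7 release, and the quant settings are the usual defaults, not my exact recipe:

```python
# Sketch: 4-bit quantize a HF checkpoint for Apple Silicon with mlx-lm.
# The repo id is a placeholder, not a confirmed GLM-4.7 weights path.
from mlx_lm import convert, load, generate

convert(
    hf_path="zai-org/GLM-4.7",    # placeholder repo id
    mlx_path="glm-4.7-mlx-4bit",  # local output directory
    quantize=True,
    q_bits=4,                     # 4-bit weights to fit in unified memory
)

model, tokenizer = load("glm-4.7-mlx-4bit")
print(generate(model, tokenizer, prompt="Hello", max_tokens=32))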



3 months ago, I realized I was hopelessly dependent on corporations that only care about power, money, and control. By then Cursor, Claude, and OpenAI had all rug-pulled their unlimited plans. I wanted a Mac M3 Ultra with 512GB RAM; Ahmad and Pewdiepie convinced me otherwise. Here's what I learned building my own AI rig.

-----------------------------
The Build ($3K-$10K)

This is the best performance you can get for under $10K:
• 4x RTX 3090s with 2x NVLink
• Epyc CPU with 128 PCIe lanes
• 256-512GB DDR4 RAM
• ROMED8-2T motherboard
• Custom rack + fan cooling
• AX1600i PSU + quality risers

Cost: $5K in the US, $8K in the EU (thanks, VAT).

Performance Reality Check

More 3090s = larger models, but diminishing returns kick in fast. The next step up is 8-12 GPUs to run GLM-4.5/4.6 in AWQ 4-bit or a BF16 mix, but at that point you've hit the limits of consumer hardware (rough VRAM arithmetic in the first sketch after this post).

----------------------------------------
Models That Work

S-Tier Models (The Gold Standard)
• GLM-4.5-Air: Matches Sonnet 4.0 and codes flawlessly. I got it to a steady 50 tps with 4K tok/s prefill on vLLM (serving sketch below).
• Hermes-70B: Tells you anything without jailbreaking.

A-Tier Workhorses
• Qwen line
• Mistral line
• GPT-OSS

B-Tier Options
• Gemma line
• Llama line

------------------------------------
The Software Stack That Actually Works

For coding/agents:
• Claude Code + Router: GLM-4.5-Air runs perfectly behind it (local-endpoint client sketch below).
• Roocode Orchestrator: Define modes (coding, security, reviewer, researcher). The orchestrator manages scope, spins up local LLMs with fragmented context, then synthesizes the results. You can use GPT-5 or Opus/GLM-4.6 as the orchestrator and local models for everything else (mode-routing sketch below)!

Scaffolding Options (Ranked)
1. vLLM: Peak performance + usability; blazing fast if the model fits in VRAM.
2. exllamav3: Faster than llama.cpp and supports every quant size, but weak scaffolding.
3. llama.cpp: Easy to start with and quick at first, but throughput degrades as context grows (quickstart sketch below).

UI Recommendations
• LM Studio: Locked to llama.cpp, but great UX.
• 3 Sparks: Apple app for local LLMs.
• Jan AI: Fine, but feature-limited.

-------------------------------
Bottom Line

A Mac M3 Ultra gets you 60-80% of this performance, with MLX access. But if you want the absolute best, you need Nvidia. This journey taught me that real independence comes from understanding and building your own tools. If you're interested in benchmarks, I've posted plenty on my profile.
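Here's the back-of-envelope VRAM math behind the "diminishing returns" point, assuming GLM-4.5-Air's ~106B total parameters and ~15% overhead for KV cache and activations (both assumptions, not measured numbers):

```python
# Rule of thumb: weight bytes ~= params * bits/8, plus overhead for KV/activations.
def vram_gb(params_b: float, bits: int, overhead: float = 0.15) -> float:
    """Approximate VRAM needed to hold a model at a given bit width."""
    return params_b * bits / 8 * (1 + overhead)

print(f"BF16:     {vram_gb(106, 16):.0f} GB")  # ~244 GB -> needs 8-12 GPUs
print(f"AWQ 4-bit: {vram_gb(106, 4):.0f} GB")  # ~61 GB  -> fits 4x 3090 (96 GB)
```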
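Serving sketch: roughly how I'd load an AWQ 4-bit GLM-4.5-Air across the four 3090s with vLLM's Python API. The quantized repo id is a placeholder, and the memory/context numbers are starting points to tune, not my exact config:

```python
# Sketch: tensor-parallel GLM-4.5-Air (AWQ 4-bit) on 4x RTX 3090 with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(
    model="someuser/GLM-4.5-Air-AWQ",  # placeholder 4-bit AWQ checkpoint
    tensor_parallel_size=4,            # one shard per 3090
    quantization="awq",
    gpu_memory_utilization=0.90,       # leave headroom for activations
    max_model_len=32768,               # cap context so the KV cache fits 24GB cards
)

params = SamplingParams(temperature=0.6, max_tokens=512)
out = llm.generate(["Write a function that merges two sorted lists."], params)
print(out[0].outputs[0].text)
```

The same arguments work as flags to `vllm serve`, which exposes an OpenAI-compatible endpoint that agent tools can talk to.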
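Local-endpoint client sketch: that endpoint is how Claude Code Router and friends see the local model; anything that speaks the OpenAI API works. The URL and served model name here are assumptions matching the serve sketch above:

```python
# Sketch: point any OpenAI-compatible client at the local vLLM endpoint.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # vLLM's OpenAI-compatible server
    api_key="not-needed-locally",         # vLLM doesn't check this by default
)

resp = client.chat.completions.create(
    model="someuser/GLM-4.5-Air-AWQ",     # must match the served model name
    messages=[{"role": "user", "content": "Review this diff for security issues."}],
    temperature=0.2,
)
print(resp.choices[0].message.content)
```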
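Mode-routing sketch: this is not Roocode's actual API, just the shape of the orchestrator pattern: a mode table routes fragmented tasks to local models while a frontier model plans and synthesizes. Mode names match the post; the model assignments are illustrative:

```python
# Hypothetical sketch of the orchestrator pattern, not Roocode's real config.
MODES = {
    "coding":     {"model": "GLM-4.5-Air", "system": "Write and edit code."},
    "security":   {"model": "Hermes-70B",  "system": "Audit for vulnerabilities."},
    "reviewer":   {"model": "GLM-4.5-Air", "system": "Review diffs critically."},
    "researcher": {"model": "Qwen-72B",    "system": "Summarize docs and sources."},
}

def run_mode(client, mode: str, task: str) -> str:
    """Dispatch one fragmented-context task to the local model for `mode`."""
    cfg = MODES[mode]
    resp = client.chat.completions.create(
        model=cfg["model"],
        messages=[{"role": "system", "content": cfg["system"]},
                  {"role": "user", "content": task}],
    )
    return resp.choices[0].message.content
```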
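Quickstart sketch for the llama.cpp "easy start" path, via llama-cpp-python; the GGUF path is a placeholder for whatever quant you download:

```python
# Sketch: the llama.cpp easy-start path through llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="./glm-4.5-air-q4_k_m.gguf",  # placeholder local GGUF file
    n_gpu_layers=-1,  # offload all layers to GPU
    n_ctx=8192,       # modest context; speed degrades as this grows
)
out = llm("Q: What does NVLink do? A:", max_tokens=64)
print(out["choices"][0]["text"])
```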


@sudoingX Are you open to taking donations on the GitHub?

this guy has 29 models on huggingface, ranked on page 2. no lab behind him. no sponsorship. $2,000 from his own pocket on GPU rentals. he compressed GLM-4.7 to run on a MacBook and quantized Nemotron Super the week it dropped. all public. all free.

nvidia is a trillion-dollar company with hundreds of teams, but they are not the ones quantizing models in the middle of the night and pushing them out before sunrise. if nvidia stopped tomorrow, their employees would stop working. people like @0xSero would not. that is the difference between a paycheck and a mission.

@NVIDIAAI you talk about making AI accessible. the people actually doing it are right here: 29 models deep, burning their own compute, with no ask except more hardware to keep going. you do not need to build another program. just look at who is already building for you. one GPU to this man would produce more public value than a hundred internal sprints. i am not asking for charity. i am asking you to invest in someone who already proved it.

