

David Hendrickson (@TeksEdge)
CEO & Founder | PhD | Startup Advisor | @Columbia | Author, Generative Software Engineering https://t.co/9oqvHuTX5f

⁉️ So get this: AMD is making a bold move to own the affordable personal inferencing market by launching a Mini PC in June, a 128GB shared-memory inferencing box 🎇 They call it the Halo Box.

🧾 It's a Ryzen AI MAX+ 395 (16 Zen 5 cores + 40 RDNA 3.5 CUs + XDNA 2 NPU)
✅ Up to 128GB LPDDR5X-8533 unified memory
✅ Full ROCm support + day-0 AI model optimization
🧪 Built for local AI development (up to ~200B param models)
📈 A direct shot at NVIDIA’s $4,699 DGX Spark, and it could cost $2,000–$3,000 (in line with what comparable boxes sell for today)

🤔 Why launch now, during the RAM shortage? While memory makers divert capacity to HBM for AI data centers (spiking LPDDR5X prices and pushing NVIDIA to raise the DGX Spark’s price by $700), AMD is moving to lock up the affordable, high-memory AI mini-PC segment before the crisis worsens.

💡 My speculation: AMD could be using its contracts, relationships, and strategic priority to secure better memory access than many traditional OEMs, giving it an edge in launching the Halo Box during the shortage. Smart timing or risky bet?

🔥 This is AMD aggressively fighting for the local AI developer market.
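A quick sanity check on that "up to ~200B params" claim: weight memory is roughly params × bits-per-weight ÷ 8, plus runtime overhead. The sketch below is mine; the 4-bit quantization width and 10% overhead figure are my assumptions, not AMD's numbers.

```python
def model_memory_gb(params_b: float, bits_per_weight: float,
                    overhead: float = 0.10) -> float:
    """Rough unified-memory footprint for an LLM's weights.

    params_b: parameter count in billions
    bits_per_weight: quantization width (16 = BF16, 4 = Q4, etc.)
    overhead: fudge factor for KV cache, activations, runtime buffers
    """
    weight_gb = params_b * bits_per_weight / 8  # bytes per param = bits / 8
    return weight_gb * (1 + overhead)

# A ~200B dense model at 4-bit just squeezes into 128GB unified memory...
print(f"200B @ 4-bit:  {model_memory_gb(200, 4):.0f} GB")   # ~110 GB
# ...while the same model at BF16 is far out of reach.
print(f"200B @ 16-bit: {model_memory_gb(200, 16):.0f} GB")  # ~440 GB
# The 60-100GB quant models mentioned below fit comfortably:
print(f"120B @ 4.5-bit: {model_memory_gb(120, 4.5):.0f} GB")  # ~74 GB
```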

This is a great example of the difficult position local inferencers face when running quant models 60–100GB in size. The DGX Spark, at $4,700 retail (now), is the only reasonable option (vs. $10K–$14K alternatives), but it’s slooooow.

🏆 LLMStats just dropped a fresh leaderboard update. This is my trusted ranking.

📊 The "TrueSkill" composite score is the real deal: the most conservative, battle-tested "Uber benchmark" in the game (μ − 3σ across GPQA, SWE-Bench, coding arenas & more).

👀 Current Standings

🏆 Overall #1: Claude Mythos Preview (@AnthropicAI) — 70.1
Unreleased monster. 94.6% on GPQA Diamond. This thing is going to be an absolute banger 🚀

🥇 Best Open Weights: Kimi K2.6 (@moonshot) — 58.7
Undisputed leader among open models right now. 90.5% GPQA + only $0.95/M tokens. Insanely good value 💎

Quick Hits
🏆 Gemini 3.1 Pro → dominating coding arenas
👑 Llama 4 Scout → 10M-context king
⚡ Mercury 2 → fastest model at 1,720 tok/s

🔥 Bottom line: If you care about real capability per dollar, Kimi K2.6 is the one to watch in the open-source world right now. And when Mythos drops… the game changes.
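For anyone curious how a μ − 3σ composite works mechanically, here's a minimal sketch using the open-source `trueskill` Python package. The model names and match outcomes are invented for illustration; I'm not claiming this is LLMStats' actual pipeline.

```python
import trueskill  # pip install trueskill

env = trueskill.TrueSkill(draw_probability=0.0)

# Hypothetical head-to-head "matches" across benchmarks: (winner, loser).
# Purely illustrative data, not LLMStats results.
matches = [
    ("mythos", "kimi"), ("mythos", "gemini"),
    ("kimi", "gemini"), ("mythos", "kimi"),
]

ratings = {}
for winner, loser in matches:
    rw = ratings.setdefault(winner, env.create_rating())
    rl = ratings.setdefault(loser, env.create_rating())
    ratings[winner], ratings[loser] = env.rate_1vs1(rw, rl)

# Conservative "uber score": mu - 3*sigma penalizes uncertainty, so a
# model only ranks high once it has won consistently, not luckily.
for name, r in sorted(ratings.items(),
                      key=lambda kv: kv[1].mu - 3 * kv[1].sigma,
                      reverse=True):
    print(f"{name:8s} mu={r.mu:6.2f} sigma={r.sigma:5.2f} "
          f"score={r.mu - 3 * r.sigma:6.2f}")
```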

Running DeepSeek V4 from @deepseek_ai on @vllm_project? Upgrade to v0.20.1 — 10+ bug fixes and optimizations, fully tested and verified by the open-source community! A huge thank-you to @FireworksAI_HQ, @baseten, @novita, @lightseekorg, @daocloud, @nvidia, @redhatai, and more for helping report, fix, and verify the stability and speed of vLLM. 🙏

🔧 DeepSeek V4 Productionization Reliability:
• Persistent top-k cooperative deadlock at TopK=1024
• AOT compile cache import error
• Repeated RoPE cache initialization
• Non-streaming tool-call type conversion (DSV3.2/V4)
• torch inductor error on V4

⚡ Optimizations:
• Multi-stream pre-attention GEMM + configurable knob
• BF16 / MXFP8 all-to-all on FlashInfer one-sided comm
• PTX `cvt` for faster FP32 → FP4 conversion
• Integrated `head_compute_mix_kernel` for head computation

📖 Full notes → github.com/vllm-project/v…
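If you want to kick the tires after upgrading, a minimal offline-inference script with vLLM's standard Python API looks like the sketch below. The model ID is a placeholder (the post doesn't give the exact Hugging Face repo), and the parallelism setting is an assumption for a multi-GPU box.

```python
# pip install -U "vllm==0.20.1"  # the release named in the post above
from vllm import LLM, SamplingParams

# NOTE: "deepseek-ai/DeepSeek-V4" is a placeholder model ID, not a
# confirmed repo name -- check the vLLM release notes for the real one.
llm = LLM(
    model="deepseek-ai/DeepSeek-V4",
    tensor_parallel_size=8,   # shard across 8 GPUs; tune for your setup
    trust_remote_code=True,
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Explain mixture-of-experts routing."], params)
print(outputs[0].outputs[0].text)
```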

How slow does a 128B DENSE model run locally? Qwen3 27B and Gemma 31B are the popular dense models everyone tests. But what happens when you 4x the params? Mistral Medium 3.5 128B, side-by-side on 4x4090 vs 4x5090 vs RTX PRO 6000 vs DGX Spark:

🔴 4x4090: 12.06 tok/s decode, 680ms TTFT
🟢 4x5090: 19.57 tok/s decode, 572ms TTFT
🟡 PRO 6000: 18.12 tok/s decode, 538ms TTFT
🟣 DGX Spark: 2.58 tok/s decode, 2243ms TTFT
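To translate those numbers into wall-clock time: end-to-end latency is roughly TTFT + tokens ÷ decode rate. Quick sketch below, with a 1,000-token response length as my assumed workload.

```python
# Rough end-to-end latency: TTFT + generated_tokens / decode_rate.
# Hardware numbers are the ones quoted in the post above; the
# 1,000-token response length is an assumption for illustration.
results = {
    "4x4090":    (12.06, 0.680),   # (decode tok/s, TTFT seconds)
    "4x5090":    (19.57, 0.572),
    "PRO 6000":  (18.12, 0.538),
    "DGX Spark": (2.58,  2.243),
}

GEN_TOKENS = 1_000

for rig, (decode_tps, ttft_s) in results.items():
    total_s = ttft_s + GEN_TOKENS / decode_tps
    print(f"{rig:10s} ~{total_s:6.1f}s for {GEN_TOKENS} tokens")
# DGX Spark: ~390s vs ~52s on 4x5090 -- that's the "slooooow" above.
```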