mindfury

4.3K posts

@mindfury1980

🇺🇸 | ♎️ | I’ve lived a strangely providential life. | 🐻‍❄️

Joined February 2024
145 Following · 126 Followers
mindfury reposted
OpenRouter
OpenRouter@OpenRouter·
Four open-weight models have crossed into territory where they are powering real agentic pipelines. New post in our Insights blog about why companies are choosing them in June: openrouter.ai/blog/insights/…
English
23
110
1K
54.3K
mindfury
mindfury@mindfury1980·
@jun_song Wouldn't surprise me. We need a market answer between DGX Spark and DGX Station. Looks like Mac Studio M5 Ultra/512 could be it...
English
0
0
0
68
Jun Song
Jun Song@jun_song·
Mac Studio M5 Ultra 768GB price prediction:
Calculation assuming they maintain the post-hike pricing:
> M3 Ultra full chip 96GB base price: $6,799
> MBP 64GB -> 128GB RAM upgrade cost: $1,600
By simple math: $6,799 + $16,800 + $500 (price bump for base 2TB SSD) = $24,099.
This is just a simple calculation. To actually run a 5-6TB model, I expect you'll have to pay at least $30k+. My prediction is it will land somewhere around $35k to $40k.
English
53
16
322
50.8K
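
The arithmetic behind the $24,099 figure can be reproduced in a few lines. This is only a sketch: it assumes, as the post implies but does not state, that every additional 64GB between 96GB and 768GB costs the quoted $1,600. The function and parameter names are illustrative.

```python
def predicted_price(base_price, base_gb, target_gb, cost_per_step, step_gb, ssd_bump):
    """Linear extrapolation: base price + RAM upgrade steps + SSD price bump.

    Assumes RAM is priced at a flat cost_per_step for every step_gb added.
    Returns (total, ram_cost) in dollars.
    """
    extra_gb = target_gb - base_gb      # 768 - 96 = 672 GB of extra memory
    steps = extra_gb / step_gb          # 672 / 64 = 10.5 upgrade steps
    ram_cost = steps * cost_per_step    # 10.5 * $1,600
    return base_price + ram_cost + ssd_bump, ram_cost


total, ram_cost = predicted_price(
    base_price=6799, base_gb=96, target_gb=768,
    cost_per_step=1600, step_gb=64, ssd_bump=500,
)
print(f"RAM upgrade: ${ram_cost:,.0f}")
print(f"Estimated total: ${total:,.0f}")
```

The result matches the post's "simple math" line; the $35k to $40k prediction sits on top of this and is the author's judgment, not something the formula produces.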
Solyricon
Solyricon@Solyricon·
Dear future husband, Today sucked. I miss you. Love Your future wife P.S. come home already
English
60
7.9K
26.7K
416.6K
mindfury reposted
Mia
Mia@MiaAI_lab·
Ornith-1.0-9B for agentic coding? It broke on tool calling mid-task and got stuck in a loop with no way to continue. Only real upside: it's less bad than Gemma-4-12B... but that's not saying much. TL;DR: Skip it.
English
60
12
291
24.3K
mindfury
mindfury@mindfury1980·
@smar_shall @elonmusk You can’t just spin up semiconductor foundries on the fly. These take years. And memory companies have been bitten by hype cycles before. Nobody predicted this was coming.
English
0
0
1
40
Shaun
Shaun@smar_shall·
Someone needs to start making more memory. Anyone should be able to make a profit in a market where retail is 4X what it was a year ago while input costs are much the same. If you wanted to build only 100 data centres and put 100PB of ram in each, buying the chips would cost you around $100b. Building factories that could make and deliver that is $40b. Lead time, 5 years. You could save $60b on 100 data centres alone, and you secure the supply! Even better is that I only charge a 1% recovered cost savings fee for the idea.
English
2
1
12
5.7K
Elon Musk
Elon Musk@elonmusk·
Tim Cook, who told The Wall Street Journal that the jump in costs was unlike anything he had seen “in any area in over 40 years.” Biggest price jump in anything I’ve ever seen too. wsj.com/economy/the-da…
English
2.2K
3.5K
17.2K
4.5M
mindfury reposted
vLLM
vLLM@vllm_project·
🙏 Thanks to the @NVIDIAAI team for highlighting DFlash support on vLLM! With DFlash speculative decoding, swapping EAGLE-3 for a DFlash checkpoint is a config-only change — no code edits needed. It runs through the open-source Speculators library, which links the DFlash drafter to the target model's hidden states in the vLLM inference path. On Gemma-4 31B on a single Blackwell Ultra GPU, this delivers up to 5.8x higher throughput at the same concurrency over autoregressive decoding: 🧮 Math500 — 5.8x ➕ GSM8K — 5.3x 💻 HumanEval — 5.6x 🐍 MBPP — 4.4x Read the blog here! 👇
NVIDIA AI@NVIDIAAI

Increase inference performance by up to 15x without sacrificing responsiveness. DFlash, an open source lightweight block diffusion model designed for speculative decoding, delivers up to 15x higher throughput on NVIDIA Blackwell while maintaining the same user interactivity target. Instead of drafting tokens one at a time, it proposes a whole block in a single pass for the main model to verify in parallel. Adoption is drop-in with support in @lmsysorg SGLang, TensorRT-LLM, and @vllm_project.

English
9
33
333
58.2K
mindfury
mindfury@mindfury1980·
@keyelifeai @UnslothAI As others have said, wait for the M5. Also… depending on where you’re getting it, only 96 GB is available brand new.
English
0
0
4
494
mindfury reposted
Unsloth AI
Unsloth AI@UnslothAI·
1-bit GLM-5.2 GGUF vs. Claude 4.8 Opus vs. GPT-5.5 We gave 3 models the same prompt and compared one-shot outputs. The 1-bit GLM-5.2 GGUF ran locally on a Mac Studio M3 Ultra with 256GB RAM at ~21.6 tok/s. Which output do you like best? GGUF: huggingface.co/unsloth/GLM-5.…
Unsloth AI@UnslothAI

GLM-5.2 can now be run locally!🔥 The 2-bit model retains ~82% accuracy after we shrunk it from 1.51TB to 238GB (-84% size). Run on a 256GB Mac or RAM/VRAM setups. GLM-5.2 is the strongest open model to date. Guide: unsloth.ai/docs/models/gl… GGUF: huggingface.co/unsloth/GLM-5.…

English
172
410
3.6K
1.5M
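
The size figures in the quoted post can be sanity-checked with a quick calculation. This sketch assumes decimal units (1.51TB taken as 1510GB) and treats the 256GB Mac as having exactly 256GB available, which ignores whatever the operating system reserves.

```python
def size_reduction(original_gb, quantized_gb):
    """Fraction of the original size removed by quantization."""
    return 1 - quantized_gb / original_gb


ORIGINAL_GB = 1510    # 1.51TB, assuming decimal units
QUANTIZED_GB = 238
MACHINE_GB = 256

reduction = size_reduction(ORIGINAL_GB, QUANTIZED_GB)
headroom_gb = MACHINE_GB - QUANTIZED_GB   # memory left for everything else

print(f"Size reduction: {reduction:.1%}")
print(f"Headroom on a {MACHINE_GB}GB machine: {headroom_gb}GB")
```

The reduction rounds to the 84% quoted in the post, and the weights leave under 20GB of nominal headroom on a 256GB machine.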
mindfury
mindfury@mindfury1980·
@sudoingX You’ve talked me into it… I’m bringing MiniMax-M2.7 down and will try this one.
English
0
0
0
77
Sudo su
Sudo su@sudoingX·
genuinely fucking wild how underrated stepfun's step 3.7 flash is in the dgx spark world. barely anyone's talking about it, and it's the single best model you can run on one spark, full stop. 198B, vision, the full context, i've been living in it and the gap between how good it is and how little it gets discussed is absurd. dropping my benchmarks tonight so you can see exactly what a single dgx spark is capable of in local ai.
English
24
5
178
13.1K
mindfury
mindfury@mindfury1980·
@StefanMaier @NVIDIAAI Nice! Unsloth Studio? I just downloaded it an hour ago after seeing a video on YouTube about it
English
1
0
1
36
Cesarus
Cesarus@StefanMaier·
Second @NVIDIAAI DGX Spark ordered. That was fast. Besides inferencing I will also do some fine-tuning. Looking forward to the second device and new experiences!
English
7
0
29
1.2K
mindfury reposted
Sudo su
Sudo su@sudoingX·
i'm hunting for models that actually fit and run well on a single dgx spark, and most don't make it. here's the trap almost nobody posts about: a model fitting in 128gb is not the same as it being usable. the full deepseek v4 flash is 112gb on this box, technically it fits, but that leaves about 6gb for context, so the moment you hand it a real task with real files it chokes. loading the weights is the easy part, leaving room to actually think is the hard part. so every candidate gets three checks:
> fits with real context room, not just the weights crammed in
> runs fast enough to actually use, not a token every few seconds
> output still holds up, the cut didn't cost the quality
this one clears the first bar. @0xSero REAP'd deepseek v4 flash down to 82gb, prunes the redundant experts out of the MoE and tunes what's left for this exact box. that 30gb he cut is the difference between no room to breathe and real working context. now i find out if it clears the other two. pulling it now, original quant numbers ready to check the quality against.
Sudo su@sudoingX

the DGX Spark nvidia sent me is a full supercomputer that fits on my desk. GB10 grace blackwell, 128GB unified memory, sips power off a wall socket, runs models that needed a server rack two years ago. spent last night pushing it on the hardest case i could find, a dense 27B model, to map exactly where it flies and where physics bites. all measured.
> 1. dense decode is memory bound everywhere, not a spark thing. to write one token the gpu reads all 27B weights out of memory, and the spark's compute is so far ahead of its bandwidth that the chip sits half idle waiting. baseline 7.64 tok/s. hold that number.
> 2. this is where the spark shines: speculative decoding. the model guesses a few tokens ahead, the spark verifies them in one batched pass, one weight read confirms ~4 tokens. 7.64 to 17 tok/s, a clean 2.2x, output byte-identical, one flag. it works precisely because the spark has spare compute sitting idle, and this finally puts it to work. the headroom you paid for earns its keep.
> 3. honest catch: that's a short-context win. at 256k tokens it fades to 1.37x because you cross from memory bound to compute bound on attention. physics on any box, not the spark.
> 4. the payoff, the spark flexing: switch to a same size MoE and it does 21.7 tok/s at that same 256k, nothing special turned on. a 256k context window at usable speed on a box next to my coffee. that's what nvidia is putting on developers' desks.
the lesson: match the model to the machine and the spark is a beast. dense plus spec decode for short sharp work, MoE when you need the long context. either way it's a datacenter's job running off a wall socket. bookmark it, all measured on one box.
(nvidia sent me the spark, no money changed hands, every number is mine)

English
37
20
271
49.4K
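
The "fits is not the same as usable" point in the post above comes down to simple subtraction. The sketch below uses the post's figures (128GB total, 112GB full weights, 82GB pruned weights, about 6GB left for context). The 10GB reserved for the OS and runtime is an assumption, inferred from 128 - 112 - 6, and the 16GB minimum in the check is a hypothetical threshold, not a number from the post.

```python
def context_headroom_gb(total_gb, weights_gb, reserved_gb):
    """Memory left for KV cache and working context after loading weights."""
    return total_gb - weights_gb - reserved_gb


def passes_first_check(headroom_gb, min_context_gb):
    """First of the three checks: real context room, not just weights crammed in."""
    return headroom_gb >= min_context_gb


TOTAL_GB = 128
RESERVED_GB = 10       # assumption: inferred from the post's "about 6gb" figure
MIN_CONTEXT_GB = 16    # hypothetical threshold for "real working context"

full_headroom = context_headroom_gb(TOTAL_GB, 112, RESERVED_GB)
pruned_headroom = context_headroom_gb(TOTAL_GB, 82, RESERVED_GB)

for label, headroom in [("full", full_headroom), ("pruned", pruned_headroom)]:
    verdict = "ok" if passes_first_check(headroom, MIN_CONTEXT_GB) else "too tight"
    print(f"{label}: {headroom}GB for context ({verdict})")
```

Under these assumptions the 30GB cut moves the headroom from 6GB to 36GB, which is the difference the post describes. The other two checks (speed and output quality) need measurement and cannot be derived this way.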
mindfury
mindfury@mindfury1980·
@ClankerQueen @nvidia I got my second Spark going last night. Got Qwen3-235B-A22B-NVFP4. About 15 tok/s… not too bad but memory bandwidth is killer… I hope the market improves in this regard.
English
1
0
1
16
Clanker Queen
Clanker Queen@ClankerQueen·
The dual-node @nvidia DGX Spark cluster is screaming this morning. If anyone tells you the spark isn't worth it... They are using it wrong! ​I've just successfully achieved local distributed split-training on an absolute monster-class (nearly 300B) MoE model architecture. The local hardware infrastructure handled the heat, the checkpoints are passing validation, and the 'custom firmware' layer for Brikie is officially taking shape. ​The next step? Moving from smoke tests to full behavioral evolution. ​Big things are brewing. #AgenticAI #LLM #Brikie #Nvidia #Spark @NVIDIARTXSpark
English
4
2
22
1.5K
mindfury
mindfury@mindfury1980·
@hotschmoe Yeah. I missed the boat on a Mac Studio M3 Ultra 512 when it was still under $10k. Quick search showed one for $30k. Nah…
English
1
0
1
159
StrongEngineer_
StrongEngineer_@hotschmoe·
I almost bought a dgx spark at $2999, then I almost bought a rtx pro 5000 for $3400. Watched as prices continued to march. Not getting left behind, I decided to throw my hat in the ring with Intel, got 2xB70s for $1900 Has barely been 2 days and I can't believe how much fun this is. Excited to contribute to Intel optimizations and provide useful AI to my family and friends
English
14
1
130
17.1K
mindfury
mindfury@mindfury1980·
@sudoingX Exactly. It’s like building a house with a bunch of shoddy work. Sure… it will put a roof over your head and do what you need probably, but it’s going to suck when things break over time or mold develops.
English
0
0
0
65
Sudo su
Sudo su@sudoingX·
i get called out a lot for how long my build is taking, when someone can vibe-code an app in a weekend. anon the honest answer going to sting some of you. i'm not just building. i'm learning to tell good code from bad code. give the same prompt to five frontier models and you get five different answers, and if you can't tell which one is actually good, you're not building software, you're stacking a tower of stuff you don't understand. it runs great right up until it breaks, and then you're standing in a codebase you can't read with no idea what to touch. that's the part the "you don't need to learn to code" crowd never mentions. and notice who's pushing that line, it's the people selling you the thing that replaces learning. follow the incentive, you don't even need a conspiracy. vibe-coding is genuinely great until the day the model can't fix the bug and neither can you. that's the moment you find out you're in the middle of the ocean and you never learned to swim. the fundamentals, the unit tests, knowing WHY the code is shaped the way it is, that's your way back to shore. nobody's swimming out to get you. so here's the real take, this is the best time in history to learn to code, not the excuse to skip it. these models make you faster at what you understand and dangerous at what you don't. learn the fundamentals, write the tests, actually understand the thing. be a better engineer, not a faster button presser. i'll take the slower build i can fix at 3am over the weekend demo that falls apart the first time it touches reality.
English
30
5
120
10.1K
mindfury
mindfury@mindfury1980·
@sfxnz @TheAhmadOsman Ollama is like the Fisher Price of inference. It’ll work on anything but it’s not optimized for throughput. It’s more “I just want it to work.” SGLang and vLLM give you more tunables.
English
0
0
1
29
Sufyan
Sufyan@sfxnz·
@TheAhmadOsman I notice that you say dont use ollama. Could you explain why?
English
1
0
0
307
mindfury reposted
Ahmad
Ahmad@TheAhmadOsman·
Local AI hardware = capacity × bandwidth × software stack
- Capacity tells you what fits
- Bandwidth tells you how hard the box can breathe
- The software stack tells you how much of the spec sheet you can actually cash out.

Hardware by Memory Bandwidth
- Mac Studio M3 Ultra: up to 512GB @ 819 GB/s
- RTX PRO 6000 Blackwell: 96GB @ 1792 GB/s
- RTX 5090: 32GB @ 1792 GB/s
- RTX 4090: 24GB @ 1008 GB/s
- RX 7900 XTX: 24GB @ 960 GB/s
- Radeon PRO W7900: 48GB @ 864 GB/s
- AMD Radeon AI PRO R9700: 32GB @ 640 GB/s
- Intel Arc Pro B65: 32GB @ ~608 GB/s
- Tenstorrent Wormhole n300: 24GB @ 576 GB/s
- Tenstorrent Blackhole p150: 32GB @ 512 GB/s + 800G
- MacBook Pro M5 Max: 460-614 GB/s
- MacBook Pro M5 Pro: 307 GB/s
- DGX Spark: 128GB @ 273 GB/s (coherent + CUDA)
- Mac mini M4 Pro: 273 GB/s
- Ryzen AI Max / Strix Halo: ~256 GB/s (~96GB usable GPU)
- MacBook Air M5: 153 GB/s
- Snapdragon X2 Elite: 152-228 GB/s
- Intel Lunar Lake: 136 GB/s
- Snapdragon X Elite: 135 GB/s
- Mac mini M4: 120 GB/s
- Arc Pro B60: 24GB @ ~456 GB/s

Verdict
- GPUs are still the bandwidth kings
- Apple wins: stupid amounts of memory, don’t want to shard across GPUs
- Apple loses: when raw tokens/sec & concurrency matter more
- DGX Spark: coherent memory + NVIDIA stack
- Strix Halo / Ryzen AI Max: first real x86 unified-memory contender
- Tenstorrent: fully OSS stack, excited to see this mature

Fitting ≠ serving
Even if it fits, you still pay for
- bandwidth during decode
- KV cache growth
- dequantization
- batching + concurrency
- scheduler quality
- framework overhead

The only mental model that matters:
1. What must fit?
2. What bandwidth tier do I need?
3. What software stack can actually deliver it?

In short:
- NVIDIA → fastest raw speed
- Apple Studio M3 Ultra → biggest one-box memory
- Strix Halo → first real x86 unified
- DGX Spark → coherent NVIDIA dev appliance
- AMD / Intel Arc → rising alternatives
- Tenstorrent → fully opensource stack

Do ask: “which bottleneck am I buying?”
Not: “which hardware is best?”
Ahmad@TheAhmadOsman

x.com/i/article/2041…

English
81
263
1.7K
222.8K
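
The first two questions in the post above ("what must fit?" and "what bandwidth tier do I need?") can be roughed out in code. This is a sketch under strong assumptions: a dense model that reads all of its weights once per generated token, no KV cache, no batching, and no framework overhead, so the number it prints is a ceiling rather than a prediction. Capacities and bandwidths are taken from the list in the post; the 60GB model size is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    capacity_gb: float
    bandwidth_gb_s: float


def fits(box, model_gb):
    """Capacity check: weights only, ignoring context and runtime overhead."""
    return model_gb <= box.capacity_gb


def decode_ceiling_tok_s(box, model_gb):
    """Upper bound on single-stream decode speed for a dense model.

    Assumes each token requires reading every weight once from memory.
    """
    return box.bandwidth_gb_s / model_gb


boxes = [
    Box("Mac Studio M3 Ultra", 512, 819),
    Box("RTX PRO 6000 Blackwell", 96, 1792),
    Box("RTX 5090", 32, 1792),
    Box("DGX Spark", 128, 273),
]

MODEL_GB = 60  # hypothetical dense model size after quantization

for box in boxes:
    if fits(box, MODEL_GB):
        print(f"{box.name}: fits, ceiling ~{decode_ceiling_tok_s(box, MODEL_GB):.1f} tok/s")
    else:
        print(f"{box.name}: does not fit ({box.capacity_gb:.0f}GB < {MODEL_GB}GB)")
```

The third factor, the software stack, is what decides how close a real system gets to this ceiling, and it is not modeled here.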
Jun Song
Jun Song@jun_song·
Everyone’s saying you need a $400k H200 rack just to run GLM-5.2. That's just wrong. The absolute minimum hardware required is just two DGX Sparks. Depending on where you live, the price changes, but in Korea, you can find Asus OEM versions for about $3k a piece. Of course, you take around a 10% hit in performance from dynamic quants, and bandwidth limitations will slow down your decode speed. But if you’re using it for agent workflows, prefill is fast enough to get the job done. Paying $6k+ to kickstart your sovereign AI is honestly not expensive at all.
English
53
19
362
37.8K
mindfury reposted
Sudo su
Sudo su@sudoingX·
the DGX Spark nvidia sent me is a full supercomputer that fits on my desk. GB10 grace blackwell, 128GB unified memory, sips power off a wall socket, runs models that needed a server rack two years ago. spent last night pushing it on the hardest case i could find, a dense 27B model, to map exactly where it flies and where physics bites. all measured.
> 1. dense decode is memory bound everywhere, not a spark thing. to write one token the gpu reads all 27B weights out of memory, and the spark's compute is so far ahead of its bandwidth that the chip sits half idle waiting. baseline 7.64 tok/s. hold that number.
> 2. this is where the spark shines: speculative decoding. the model guesses a few tokens ahead, the spark verifies them in one batched pass, one weight read confirms ~4 tokens. 7.64 to 17 tok/s, a clean 2.2x, output byte-identical, one flag. it works precisely because the spark has spare compute sitting idle, and this finally puts it to work. the headroom you paid for earns its keep.
> 3. honest catch: that's a short-context win. at 256k tokens it fades to 1.37x because you cross from memory bound to compute bound on attention. physics on any box, not the spark.
> 4. the payoff, the spark flexing: switch to a same size MoE and it does 21.7 tok/s at that same 256k, nothing special turned on. a 256k context window at usable speed on a box next to my coffee. that's what nvidia is putting on developers' desks.
the lesson: match the model to the machine and the spark is a beast. dense plus spec decode for short sharp work, MoE when you need the long context. either way it's a datacenter's job running off a wall socket. bookmark it, all measured on one box.
(nvidia sent me the spark, no money changed hands, every number is mine)
English
26
16
202
42.2K
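
The speculative decoding numbers in the post above invite a quick back-of-the-envelope check. The sketch below uses a deliberately simple model that is an assumption, not the author's method: speedup equals tokens confirmed per verify pass divided by the cost of one draft-plus-verify cycle, measured in plain decode steps. Only the 7.64 tok/s baseline, the 17 tok/s result, and the "~4 tokens" per pass come from the post.

```python
BASELINE_TOK_S = 7.64      # plain autoregressive decode, from the post
SPEC_TOK_S = 17.0          # with speculative decoding, from the post
TOKENS_PER_PASS = 4        # "~4 tokens" confirmed per verify pass, from the post

# Measured speedup over the baseline.
speedup = SPEC_TOK_S / BASELINE_TOK_S

# Under the simple model: speedup = tokens_per_pass / cycle_cost,
# so the implied cost of one draft+verify cycle (in baseline decode steps) is:
cycle_cost = TOKENS_PER_PASS / speedup

print(f"Measured speedup: {speedup:.2f}x")
print(f"Implied cycle cost: {cycle_cost:.2f} baseline decode steps")
```

If the model held exactly, a cycle with no overhead would give a 4x speedup. The measured figure suggests each cycle costs a bit under two plain decode steps, which would cover drafting and the batched verify pass.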
mindfury retweetledi
BURKOV
BURKOV@burkov·
For the last three days, I've been using GLM 5.2 with OpenCode instead of Codex and I don't see any difference. There wasn't any bug that GLM would fail to fix or a feature it would fail to add as requested. The only downside is that this model cannot see, so if it's simpler to explain an issue by pasting a screenshot, I would still use Codex. Otherwise, GLM would be my choice. Will continue to use it for two more weeks and, if it keeps just working, I will cancel my $100/month subscription with OpenAI. I already cancelled my Anthropic subscription and have no regrets. No moat isn't hypothetical anymore.
English
165
126
2.4K
190.5K