Topi Santakivi

756 posts

@sandst1

Architecting AI solutions @LuotoCompany

Helsinki, Finland · Joined March 2011

298 Following · 152 Followers

Topi Santakivi@sandst1·15h

@MichaelZima @KuittinenPetri @antirez On Mac I use OMLX, and a bit of MTPLX too. It runs nicely; there's been a lot of progress in the past months, and the multi-token prediction stuff has been boosting inference speeds on Mac too. If you already have a Studio, there's plenty of good stuff to run.

Zima@MichaelZima·16h

@sandst1 @KuittinenPetri @antirez Qwen3-Coder-Next was stable for me. Still questioning how good MLX is for inference...

Petri Kuittinen@KuittinenPetri·23h

Many people say Nvidia DGX Spark is too slow and not worth the money. I'm getting crazy speed with qwen3.5-35b-a3b-nvfp4 on my ASUS Ascent GX10: over 200 token/s, 495k prefill. In real-life use performance is lower, but still 100+ token/s.
sparkrun run @atlas/qwen3.5-35b-a3b-nvfp4

Topi Santakivi@sandst1·19h

@KuittinenPetri @MichaelZima Qwen 3.6 27B, different quants, atm via llama-cpp as it just landed MTP support. Also the DwarfStar by @antirez is running nicely on the GB10. Those are my main models atm, then also a bit of Qwen 3.6-35B-A3B and Qwen3-Coder-Next.

Petri Kuittinen@KuittinenPetri·19h

Interesting. My setup is: ASUS Ascent GX10 + Beelink GTR9 Pro AMD Ryzen™ AI Max+ 395. Both have 128 GB LPDDR5X unified RAM, but the former has CUDA and many more AI TOPS than Strix Halo, so prefill goes faster. Plus there aren't many well-optimized Docker images for Strix Halo. I am also using the little Nvidia GB10 box mostly for inference, serving with vllm or llama.cpp and connecting over local Wi-Fi 6E, which gives around ~500 Mbit/s at best, as they are not in the same room and probably need to hop over multiple nodes. What models do you use?

Topi Santakivi@sandst1·19h

@KuittinenPetri @mr_r0b0t wait: did Spark / sm121 NVFP4 support already land on vllm main branch? or did you build some custom setup?

Petri Kuittinen@KuittinenPetri·20h

@mr_r0b0t BTW, I didn't use any concurrency yet, just the single-user chat case, and still got 100+ token/s. The reason that vllm docker container is so fast is not just MoE: it also uses hybrid multi-token generation (MTP) featuring Gated DeltaNet.

Topi Santakivi@sandst1·19h

@MichaelZima @KuittinenPetri That's exactly what i'm doing atm

Zima@MichaelZima·23h

@KuittinenPetri I am starting to think my Studio will run my Hermes and a Spark will be my inference server.

Topi Santakivi@sandst1·2d

@techedgedaily @rohanpaul_ai Yeah. Now anybody can vibe-code their software into existence, and if it's good software, they can have other people pay them to use it. Oh wait..

TechEdgeDaily@techedgedaily·2d

@rohanpaul_ai Dario saying software will be essentially free is the CEO of a $1T AI company telling you the SaaS business model has an expiration date.

Rohan Paul@rohanpaul_ai·2d

Anthropic CEO Dario Amodei : "Software is going to become cheap, maybe essentially free. The premise that you need to amortize a piece of software you build across millions of users, that may start to be false. But at the same time, there are whole jobs, whole careers that we've built for decades that may not be present. And, you know, I think we can deal with it. I think we can adjust to it. But I don't, I don't think there's an awareness at all of what, of what is coming here and the magnitude of it." --- From "The Wall Street Journal" YT channel (link in comment)

Topi Santakivi@sandst1·2d

@MemoryReboot_ The only thing missing is the M3 Ultra in stock.

Mass@MemoryReboot_·2d

What's the point of buying a DGX Spark for $5k? Imo to fully unlock the potential you need at least two, that's already $10k.
Mac Studio M3 Ultra 256GB for $7.5k looks like the better play. It's faster and all in one box.
The only reason is that it just looks cool af.
What am I missing?

Topi Santakivi@sandst1·2d

@indes_yo @nash_su MTP is speculative decoding, specifically for token generation.
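To make the reply above concrete, here is a minimal sketch of one draft-and-verify step of greedy speculative decoding. Everything in it is an assumption for illustration: `target_model` and `draft_model` are toy stand-ins (in the MTP case the draft would be the extra prediction heads), and real engines verify all guesses in a single batched forward pass and may use probabilistic acceptance instead of exact matching.

```python
from typing import Callable, List

Token = int
NextToken = Callable[[List[Token]], Token]


def speculative_step(context: List[Token], draft: NextToken,
                     target: NextToken, k: int) -> List[Token]:
    """Run one draft-and-verify step and return the newly accepted tokens."""
    # 1. The cheap draft guesses k tokens ahead.
    guesses: List[Token] = []
    for _ in range(k):
        guesses.append(draft(context + guesses))

    # 2. The full model checks the guesses in order. This loop only mimics
    #    the outcome of what would be one batched pass in a real engine.
    accepted: List[Token] = []
    for guess in guesses:
        expected = target(context + accepted)
        if guess != expected:
            # First mismatch: keep the full model's token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(guess)

    # 3. Every guess matched, so the verification also yields one bonus token.
    accepted.append(target(context + accepted))
    return accepted


def target_model(ctx: List[Token]) -> Token:
    """Toy 'big' model: counts upward modulo 10."""
    return (ctx[-1] + 1) % 10


def draft_model(ctx: List[Token]) -> Token:
    """Toy draft: agrees with the target except when the answer is 5."""
    nxt = (ctx[-1] + 1) % 10
    return 0 if nxt == 5 else nxt
```

With these toy models, `speculative_step([1], draft_model, target_model, 3)` accepts four tokens in one step (`[2, 3, 4, 5]`), while `speculative_step([3], draft_model, target_model, 3)` stops at the first wrong guess and returns `[4, 5]`. Either way the output matches what the full model alone would have produced; only the number of tokens gained per step changes.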

nash_su - e/acc@nash_su·3d

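Mac inference speed doubled 🚀 MTPLX is an integrated MLX + MTP solution, optimized specifically for model inference on Apple Silicon. By using models with a custom MTP head added, it can deliver double the inference speed. I tested it: with Qwen3.6-27B, inference was twice as fast as in LMStudio, and it also integrates fan management. Very nice 👍 Project link here: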

Topi Santakivi@sandst1·2d

@0xSero With MoE, the experts are not activated just based on prompts, the expert selection is done for _each token_ separately. This is why the full MoE usually needs to be loaded in memory even if a fraction of it is used per token.
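A toy sketch of the point above, assuming a made-up router: `router_score` is an arbitrary fixed function standing in for a learned gating layer, and `N_EXPERTS` and `TOP_K` are illustrative values, not taken from any model in this thread. Real MoE models also route separately in every MoE layer, which this ignores.

```python
from typing import List, Set

N_EXPERTS = 8  # assumed toy value
TOP_K = 2      # experts activated per token (assumed)


def router_score(token_id: int, expert: int) -> int:
    """Stand-in for a learned router: a fixed, arbitrary score per (token, expert)."""
    return (token_id * 31 + expert * 17) % 97


def route(token_id: int) -> List[int]:
    """Pick the TOP_K highest-scoring experts for a single token."""
    ranked = sorted(range(N_EXPERTS),
                    key=lambda e: router_score(token_id, e),
                    reverse=True)
    return ranked[:TOP_K]


def experts_touched(token_ids: List[int]) -> Set[int]:
    """Union of all experts used while processing a sequence of tokens."""
    used: Set[int] = set()
    for token_id in token_ids:
        used.update(route(token_id))  # routing is redone for every token
    return used


if __name__ == "__main__":
    sequence = [3, 14, 15, 92, 65]
    for token_id in sequence:
        print(token_id, "->", route(token_id))
    print("experts touched:", sorted(experts_touched(sequence)))
```

Each token activates only `TOP_K` experts, but a short sequence already touches more than `TOP_K` of them, which is why all experts need to be resident in memory even though only a fraction does work per token.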

0xSero@0xSero·3d

1. Dense Models - Slow and Smart
Example: Qwen3.6-27B / Gemma-4-31B
What it means:
- when a prompt is sent
- it gets tokenised (words are mapped to tokens)
- token generation starts
- the 27B means 27 billion parameters
- each of those parameters will be activated
- 27 billion matrix multiplications
- for every token generated
Active parameter counts are positively correlated with intelligence. That's why Gemma-4-31B is able to compete with Mixture of Experts (MoEs) 10 times their size.
2. Mixture of Expert models - Fast and Efficient
Example: Deepseek-V4-Flash / Qwen3.5-397B
What it means:
- when a prompt is sent it's tokenised
- it's sent to a router
- a router was trained to match prompts with experts
- experts are sub-networks of the model
- when found the experts are activated
- tokens are generated with only a fraction of the params
For example: Deepseek-v4-flash has 284 billion params, 11x larger than the dense Qwen3.6-27b. But only 13B of those 284B will activate per token, which is less than half of the size of Qwen3.6-27B.
----
Dense Pros:
- Dense models are easier to train
- They tend to be smaller overall
- They can be very smart per token
Dense Cons:
- Competitive dense models are on average slower than their MoE peers.
- Less parameters to train and specialise.
MoE Pros:
- Can be much larger and be trained longer
- Faster token generation
MoE Cons:
- Larger vram requirements
- Harder to train
--------
Lmk if there's anything i'm wrong with or missing

Topi Santakivi@sandst1·2d

@usr_bin_roygbiv @LottoLabs Parallel processing. The mem bandwidth is low for a single token stream but there's a lot of compute in that box to handle X tasks at a time. Parallel subagents etc. If you only need single-user single-stream, go M5 Max. dendro-logic.com/engineering/nv…

Roy@usr_bin_roygbiv·3d

@LottoLabs what is the point of the gb10s? I feel like mlx is better even if you have a mac already they're like the same speed

Lotto@LottoLabs·3d

It might be time. I want a gb10 to test out but don’t want to spend that much, and I don’t really need 128gb? Or should I just put $3500 towards a RTX5000, but then I need a whole build or sell the 3090s.
Canadian dollars
amzn.to/4tIcCga

Topi Santakivi@sandst1·3d

@bjornmuh @jun_song Prefill is compute bound, not mem bandwidth bound. Mem bandwidth dictates how fast the LLM can produce new tokens.
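A back-of-envelope sketch of that split, under loud assumptions: decode is treated as reading every active weight once per generated token, prefill as roughly 2 FLOPs per parameter per prompt token, and KV-cache traffic, activations, and all overheads are ignored. The 273 GB/s and 614 GB/s figures and the 27B 4-bit model are the ones quoted elsewhere in this thread; the 100 TFLOPS value in the usage note is a hypothetical placeholder, not a spec of any device mentioned here.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float,
                         active_params_billion: float,
                         bits_per_weight: float) -> float:
    """Upper bound on single-stream decode speed, limited by memory bandwidth."""
    # GB of weights that must be streamed from memory for each new token.
    gb_per_token = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / gb_per_token


def prefill_seconds(prompt_tokens: int,
                    active_params_billion: float,
                    sustained_tflops: float) -> float:
    """Rough prefill time, limited by compute (about 2 FLOPs per param per token)."""
    flops_needed = 2 * active_params_billion * 1e9 * prompt_tokens
    return flops_needed / (sustained_tflops * 1e12)


if __name__ == "__main__":
    # 27B dense model at 4 bits per weight is about 13.5 GB read per token.
    print(round(decode_ceiling_tok_s(273, 27, 4), 1))  # ceiling at 273 GB/s
    print(round(decode_ceiling_tok_s(614, 27, 4), 1))  # ceiling at 614 GB/s
    # Hypothetical 100 TFLOPS sustained, 10k-token prompt.
    print(round(prefill_seconds(10_000, 27, 100), 1))
```

Under these assumptions the 273 GB/s ceiling for a 27B 4-bit model comes out at about 20 tok/s, so real-world numbers will land below that. Bandwidth only enters the decode estimate and compute only enters the prefill estimate, which is the distinction the reply draws.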

bjornmuh@bjornmuh·3d

@jun_song Why will prefill be slower on Apple/MLX with faster mem bandwidth?

송준 Jun Song@jun_song·3d

Best mid-range local LLM hardware: DGX Spark vs Mac Studio M5 Max 128GB (upcoming)
Price: $4.7k (cheaper if used or OEM) vs ~$5k (est)
Decode: 273 GB/s vs 614 GB/s (Mac wins by 2.2x)
Prefill: DGX is ~2x faster + supports batching
RAM: 128GB unified on both
Power: 240W vs 200W (insanely efficient)
Thermals: Both quiet, but DGX runs hot
Perks: CUDA vs MLX optimization allows Deepseek V4 Flash on your desk.

Topi Santakivi@sandst1·3d

@NVIDIAGeForce #007FirstLightRTX

NVIDIA GeForce@NVIDIAGeForce·3d

Recruits, your first prize is here... A custom GeForce RTX 5080 Founders Edition + PC copy of the game. Comment #007FirstLightRTX to win 👇

Topi Santakivi@sandst1·4d

@ivanfioravanti Yes. And even better, deterministic build, lint, test, typecheck steps that run outside of the context window after a task has been done and if they give errors, having the agent rework it until it passes. That can be implemented in various ways, i often use hooks or scripts.
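One way such a loop could look, as a minimal sketch rather than the author's actual hooks or scripts: `run_checks` runs each deterministic step as a subprocess outside the model's context, and `verify_loop` hands any failures to a `rework` callback, which is a hypothetical stand-in for "ask the agent to fix it". The check commands themselves are whatever your project uses.

```python
import subprocess
from typing import Callable, List, Sequence, Tuple

Check = Tuple[str, Sequence[str]]  # (label, command to run)


def run_checks(checks: List[Check]) -> List[str]:
    """Run every check and return one error report per failing check."""
    failures: List[str] = []
    for name, cmd in checks:
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode != 0:
            failures.append(
                f"[{name}] exit {proc.returncode}\n{proc.stdout}{proc.stderr}"
            )
    return failures


def verify_loop(checks: List[Check],
                rework: Callable[[List[str]], None],
                max_rounds: int = 3) -> bool:
    """Re-run the checks until they pass, calling rework() on each failure.

    Returns True once all checks pass, False if they still fail after
    max_rounds rework attempts.
    """
    for round_no in range(max_rounds + 1):
        failures = run_checks(checks)
        if not failures:
            return True
        if round_no < max_rounds:
            # Only the failure output goes back to the agent.
            rework(failures)
    return False
```

Only the error output of failing steps is fed back, so passing runs cost no context at all, and the `max_rounds` cap keeps a stuck agent from looping forever.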

Ivan Fioravanti ᯅ@ivanfioravanti·5d

The best way to succeed in coding with AI is setting metrics, benchmarks, tests that coding harnesses can run to verify correctness of their artifacts! In this way they must stay on the road, they can't go outside of the boundaries you define. I'm having very good results in this way! I'm focusing more on the boundaries and rules than code itself lately.

Topi Santakivi@sandst1·12 May

@KuittinenPetri @TheAhmadOsman Found yesterday a nice article on parallel LLM calls and GB10, seems to be doing quite fine: dendro-logic.com/engineering/nv…

Petri Kuittinen@KuittinenPetri·11 May

Important Reminder: If your aim is agentic AI workloads, CPU & system memory account for 50% - 90% of latency, not your GPU or its memory bandwidth, unless you are mostly just using the LLMs to generate huge amounts of text or code files, e.g. for fine-tuning another model.
Research in late 2025 and 2026 indicates that in agentic workflows, the GPU is often waiting for the CPU to complete tasks.
Tool Processing: Python interpretation, file system searches, web searches, and database queries are all CPU-bound.
Orchestration Load: Managing dozens of concurrent agents, scheduling sub-tasks, and passing data between agents falls on the CPU.
So if you really want good speed from agentic workloads, do not forget that you will ideally need a powerful CPU plus enough fast memory and SSD for it as well. This is actually where some of the unified memory computers are surprisingly good, e.g. the often mocked AMD Ryzen AI Max+ PRO 395. It is not the token/s king, but 16 cores (no economy cores) and 32 threads paired with 128 GB LPDDR5X 8000 MHz, a high-end Samsung SSD, and fast internet actually make many real-life tasks fly. Browser use, database use etc. are no problem. Ideally you'd want something like that paired with an Nvidia RTX 6000, as that would scale well for simultaneous agent calls.
I haven't yet had enough time to test my Nvidia GB10 box, but I suspect its 20-core ARM CPU is weaker than the AMD Ryzen, not to mention the 64+ core Threadrippers, but it surely wins massively in parallel prefill token/s. I also never owned any Apple silicon, so I cannot comment on that, but I would assume Apple M5 Max is pretty fast in everything.
In some agentic loads, GPU and token generation are just part of the story.
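The arithmetic behind that claim can be sketched with Amdahl's law. This is an illustration under the post's own premise (CPU and system memory take 50% to 90% of latency, so the GPU share is 10% to 50%); the function name and the 2x GPU speedup are assumptions chosen for the example.

```python
def overall_speedup(gpu_share: float, gpu_speedup: float) -> float:
    """Amdahl's law: end-to-end speedup when only the GPU part gets faster.

    gpu_share   -- fraction of total latency spent on the GPU (0..1)
    gpu_speedup -- how many times faster the GPU part becomes (> 0)
    """
    if not 0.0 <= gpu_share <= 1.0:
        raise ValueError("gpu_share must be between 0 and 1")
    if gpu_speedup <= 0:
        raise ValueError("gpu_speedup must be positive")
    return 1.0 / ((1.0 - gpu_share) + gpu_share / gpu_speedup)


if __name__ == "__main__":
    # GPU is 50% of latency (CPU side is the other 50%), GPU gets 2x faster.
    print(round(overall_speedup(0.5, 2.0), 2))
    # GPU is only 10% of latency (CPU side is 90%), GPU gets 2x faster.
    print(round(overall_speedup(0.1, 2.0), 2))
```

If the premise holds, doubling GPU speed yields about a 1.33x end-to-end gain when the GPU accounts for half the latency, and only about 1.05x when it accounts for a tenth.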

Ahmad@TheAhmadOsman·11 May

If you’re interested in Local AI, I highly recommend reading those 2 articles BEFORE making any hardware purchases Find them under the articles tab on my profile

Topi Santakivi@sandst1·11 May

@ttunguz @no_stp_on_snek Here's one comparison i did of qwen 3.6 variants vs sonnet 4.5 on a couple tasks: github.com/sandst1/qwen3.…

Topi Santakivi@sandst1·11 May

@ttunguz @no_stp_on_snek From my daily work experience (sw dev), on avg sonnet is clearly the more capable but i'd say at least around 20-30% of tasks are such that qwen 3.6 27B handles them well enough. 35B is great but lacks on more complex codebases and changes that need comprehension of the whole.

Tomasz Tunguz@ttunguz·11 May

Localmaxxing : pushing more inference to local models. Over five weeks, I tested how much of my daily work can run on a local 35B model instead of cloud frontier models. The answer : half. Many reasons to use local models : privacy, cost, asset depreciation. But the only one that really matters is latency. I ran a head-to-head benchmark. Qwen 3.6 35B-A3B-4bit on my MacBook Pro M5 vs Claude Opus 4.5 via API. Result : 2.1x faster locally. Mean 2.8s vs 5.8s. The local model isn't smarter. Opus scores ~20% higher on reasoning benchmarks. Local models lag frontier by 3-4 months, and for complex tasks, that gap matters. But for routine agent tasks, it rarely does. If half the work runs 2x faster on my laptop, I'll take that trade every time. My little computer is about to earn its keep. tomtunguz.com/localmaxxing/

Topi Santakivi@sandst1·10 May

@KuittinenPetri @sudoingX Yup. Got an M4 Pro 64GB atm, with the same 273 GB/s. Qwen 3.6 27B 4-bit does 10-12 tok/s, and better with MTP. That's not fast but not too bad either; it's the prefill that takes most of the time. So for that, DGX or GX10 would be a good fit.

Petri Kuittinen@KuittinenPetri·10 May

When I made my order, Kasuplan wasn't among the shops selling it. ☹️😀 So I just picked the best option available at that moment. I feel the prices might spike again soon. Nvidia DGX Spark is relatively slow in decode (only 273 GB/s memory bandwidth), but it will likely do very well in agentic workloads, far better than pretty much any other unified memory computer you can buy, as the Apple M3 Ultras with over 96 GB RAM are already sold out. Nvidia GB10 computers give so much faster prefill, the difference is drastic. Plus you could actually run video, music and image models with ComfyUI on these without needing to wait forever. And those 20 ARM cores seem pretty snappy as well, but it seems AMD Ryzen™ AI Max+ 395 might still be the compute king among similar computers: it has 16 cores (all full cores, no economy cores) and 32 threads.

Sudo su@sudoingX·10 May

i love my dgx spark. this thing just completed an entire codebase with a full test suite while i was arguing on twitter. thank you nvidia. i wouldn't mind another one.

Topi Santakivi@sandst1·10 May

@KuittinenPetri @sudoingX That's a VAT0 price though

Topi Santakivi@sandst1·10 May

@KuittinenPetri @sudoingX Seems like a good deal, especially for 4tb.

Topi Santakivi@sandst1·10 May

@KuittinenPetri @sudoingX AI hardware from Siilinjärvi certainly wasn't on the top of my list :D

Topi Santakivi@sandst1·10 May

@KuittinenPetri @sudoingX dang this description is not helping my wallet 😅 Surely not a GPU but as agentic loops are so prefill heavy, GX10 might be a good pick. I anyway tend to plan more stuff beforehand, press a button and get back for the next iteration after ~1-3 hours.

Petri Kuittinen@KuittinenPetri·10 May

I just recently got my ASUS Ascent GX10 and started installing it yesterday afternoon with a baby girl in my arms. She wanted to type commands as well and of course starting with sudo... luckily all went well. I have been very happy with this purchase. It ain't the decode king, but it's very fast in prefill, even with a pretty vanilla llama.cpp / llama-server setup. I have gotten ~1500 token/s prefill - nice! Plus it is really easy to use. I thought local networking would require some setup, but I just gave it a local name and despite a dynamic IP on my Wifi Mesh, it seems to work flawlessly. I can ssh into it from anywhere at home, and use screen -r for a persistent shell and multiple windows under one ssh connection. I can serve models to other computers, and it seems to stay cool & silent, consuming only about 43 Watts under inference. Sure, an Nvidia RTX 6000 with 96 GB would be faster, but it would also use more power and it costs much more. I will later explore vllm, SGLang, ComfyUI, plus those optimized docker images. Wanna recommend some?
