Yves Van Den Broek

34 posts


@Yves_From_BE

Joined September 2025

14 Following · 3 Followers

@Arcardus @SchallerDomenic Backup and restore is deeply integrated and has severe security implications. If you want that convenience, you should consider other options. Just like on a PC, with Google products you have a backup ... if you want all the rest 🤷‍♂️

Arcardy (nonbinary 💜)@Arcardus·2d

@Yves_From_BE @SchallerDomenic I use Google apps and not everything is backed up. This is what I am talking about. Why can't I back up my iPhone to Google Drive?

Domenic Schaller@SchallerDomenic·3d

🇪🇺 The EU is going after Apple again 🫣 The European Commission is investigating Apple because full iCloud backups are only possible with Apple's own service. Third-party backup apps are put at a severe disadvantage as a result. The authorities see this as an obstruction of competition. What do you think? The right move by the EU, or typical overregulation again? 👇 #iCloud #Apple #EU

140

174

352.4K

Yves Van Den Broek@Yves_From_BE·2d

@KyleHessling1 Hmmm, saw a test on YT, Qwen vs Qwopus … even with thinking, Qwen was faster in total time, used fewer tokens and ran at about the same t/s … the results were not that different either, so curious where the difference comes from …

Kyle Hessling@KyleHessling1·12 Jun

Qwopus 3.6 27b-Coder is now live! Scores a 67% on a full run of SWE bench verified with thinking completely disabled! Q5_K_M

This model is lightning fast for dense class! With a natively finetuned MTP head, it achieves 100 tps on a single 5090!

The biggest upgrade here, though, is its stability in programming and tool calling within @NousResearch Hermes agent, with thinking off! Wall time is crazy fast this way, which makes Hermes feel "native" and snappy, like they were meant for each other. The freedom of running without thinking at all makes you part of the thinking process, and you never get caught waiting 15 minutes for it to finish a thought string, like with the base models.

Thinking on and temp high, .9-1 seems to produce really incredible design and svg results. I reran the Boat survival prompt through a few turns, thinking on, and it seemed to render more fancy models in HTML canvas, but it was much more of a start-a-prompt and wait experience vs the snappy and active iteration with it disabled. It may be worth turning it off and on throughout the build process if you want to get really creative with design.

Really looking forward to seeing how this one performs for y'all! Please post comments with your opinions and use cases below! As always with our fine-tunes, mess with the temperature setting, and run them much hotter than the base!

Please check out the Boat Survival game I posted yesterday, made in 12 turns using Hermes and this model, with thinking off. Link below! Full swe bench repo-specific breakdown also posted in the comments for those interested!

Happy building, everyone! We're looking forward to your thoughts! Quants uploading now! huggingface.co/Jackrong/Qwopu…

105

145

1.3K

159.6K

Yves Van Den Broek@Yves_From_BE·2d

@Arcardus @SchallerDomenic Use Google apps, so everything is backed up … no need to open up to third parties … What happens if you cannot restore your backup to a new iPhone: will you complain to Apple or to the EU? 😜

Arcardy (nonbinary 💜)@Arcardus·3d

@SchallerDomenic I've been demanding this for YEARS. Finally. Thank you, EU!! Why do I HAVE to subscribe to Apple's crappy iCloud when I already have Google Drive?!

3.4K

Yves Van Den Broek@Yves_From_BE·2d

@theharshpat @siIiconway @pcuenq No because then he would have said for 192GB 😜


Harsh@theharshpat·3d

@Yves_From_BE @siIiconway @pcuenq For 2 of them


Pedro Cuenca@pcuenq·3d

GLM 5.2 has just been released 🔥 Here it's already running with MLX on two Mac Studios (M3 Ultra). This is comparable to the latest closed models, with weights you can download, quantize, distill, fine-tune, run.


722

81.7K

Yves Van Den Broek@Yves_From_BE·3d

@siIiconway @pcuenq 11K? I see 3.9 for the 60-core and 5.4 for the 80-core

Siliconway@siIiconway·3d

@pcuenq U r kidding, 14 weeks. 96Gb 11k$. For what? 30tk/s? Nah thanks


1.3K

Yves Van Den Broek@Yves_From_BE·3d

@NTWR_LaL @ukrroot @pcuenq Yes they can: since macOS 26.2 you have RDMA. I see a 3x speed improvement with 4 Macs in this kind of setup

natwarlal@NTWR_LaL·3d

@ukrroot @pcuenq Yes exactly. They can’t share memory. How and why would i connect two?


Yves Van Den Broek@Yves_From_BE·3d

@ukrroot @pcuenq With RDMA over Thunderbolt there is no real slowdown. In fact, dense models scale quite well. MoE models see less benefit, though prefill speeds up as well. I see a serious speedup going from 1 to 2 Macs with the same model. The condition is that you use tensor parallelism over RDMA and not pipeline parallelism over TCP

Rompel@ukrroot·3d

@pcuenq Interconnect is the wall here. Real question: decode tok/s on the two-Studio split, and what quant fits unified memory so you skip the inter-box hop entirely? Single-box at lower precision likely beats split-box at higher. Got numbers?


Yves Van Den Broek@Yves_From_BE·21 May

@adrgrondin @lmstudio Which qwen 3.6 exactly and what code was it reviewing, swift, python …


144

Adrien Grondin@adrgrondin·20 May

Subagents running locally and simultaneously on MacBook Pro M5 with Codex CLI + @lmstudio to review code and find bugs using Qwen 3.6. Powered by the updated MLX engine with batching, in beta in the app. The batching speed boost is noticeable

659

75K

Yves Van Den Broek@Yves_From_BE·19 May

@jundotkim @bstnxbt Nice! Is RDMA support in the roadmap?


Jun Kim@jundotkim·19 May

oMLX 0.3.9rc1 released. Highlights:
- Low-memory Macs stay stable instead of getting killed by the OS
- DFlash bumped to v0.1.7 (thanks to @bstnxbt's dflash-mlx). Qwen thinking/GDN fix, etc.
- Chunked prefill. A long prompt no longer blocks decode for everyone else
- Multi-tasking in the admin chat. Run multiple chats in parallel
- Real-time memory bar in the admin dashboard
- Hermes Agent quick launch, "omlx launch hermes"

Plus a lot of bug fixes and new contributors in this cycle. Thanks everyone! github.com/jundot/omlx/re…

121

8.3K

Yves Van Den Broek@Yves_From_BE·15 May

@Kannanigga @alexocheema @itsry16 It’s not the speed that counts, it’s the latency; not that much data is sent. The preprocessing speedup is real. For inference, MoEs gain not so much, but dense models scale like x1.8 for 2 and x3 for 4 …
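
To put the dense-model figures quoted in this reply (x1.8 on 2 Macs, x3 on 4) into perspective, here is a minimal sketch that converts them into parallel efficiency, i.e. the fraction of ideal linear scaling. The function name `parallel_efficiency` is just an illustrative helper, and the two speedups are the only data used; nothing here is measured independently.

```python
# Speedups quoted in the post for dense models: number of Macs -> speedup.
reported_speedup = {2: 1.8, 4: 3.0}


def parallel_efficiency(nodes: int, speedup: float) -> float:
    """Fraction of ideal linear scaling achieved (speedup divided by node count)."""
    return speedup / nodes


efficiency = {n: parallel_efficiency(n, s) for n, s in reported_speedup.items()}

for n, e in efficiency.items():
    print(f"{n} Macs: {e:.0%} of linear scaling")
```

On these numbers the 2-Mac setup reaches 90% of linear scaling and the 4-Mac setup 75%.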


Kannannigga@Kannanigga·15 May

@alexocheema @itsry16 4 MacBooks are not the same as one 512GB GPU/server. RDMA over Thunderbolt 5 helps but the interconnect is still around 80Gbps bidirectional up to 120Gbps in boost, which is tiny compared with each MacBook’s 614GB/s local memory bandwidth. Stop posting misinformation. Ty!
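
The bandwidth gap described in this post mixes units (Gbps for the link, GB/s for memory), so a small sketch of the conversion may help. It uses only the figures quoted in the post (80 and 120 Gbps, 614 GB/s) and compares raw bandwidth only; it says nothing about latency, which is what the reply above argues actually matters.

```python
def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


LOCAL_MEMORY_GB_S = 614.0                    # per-MacBook figure quoted in the post
link_gbps = {"base": 80.0, "boost": 120.0}   # Thunderbolt 5 figures quoted in the post

link_gb_s = {name: gbps_to_gigabytes_per_s(g) for name, g in link_gbps.items()}
ratio = {name: LOCAL_MEMORY_GB_S / bw for name, bw in link_gb_s.items()}

for name in link_gbps:
    print(f"{name}: link {link_gb_s[name]:.1f} GB/s, "
          f"local memory is {ratio[name]:.1f}x faster")
```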


663

Alex Cheema@alexocheema·14 May

It’s kind of crazy but the shitstorm of supply chain issues has created a new best-in-class local AI deployment: M5 Max MacBook clusters.
- The memory unit economics are great - each MacBook has 128GB @ 614GB/s for $5k
- M5 Max added tensor cores (Apple Neural Accelerators) with 4x compute of M4 Max (~60TFLOPS fp16)
- You can cluster them with RDMA over Thunderbolt 5 and @exolabs for ~linear scaling (so memory bandwidth really is additive: 4 x MacBooks are 512GB @ 2456GB/s)

You can’t get Mac Studios right now, so customers are buying MacBook clusters instead. A government is running this in prod. This setup is best for low-batch decode-heavy inference (all memory bound) and transcription (super fast and cheap on apple silicon).

Alex Cheema@alexocheema

It is unconventional but it actually works, depending on the workload of course. There are strengths and weaknesses for sure. There are some real deployments (governments, big companies) running this setup in production (pods of 4 MacBooks). It's the best price to performance for many workloads (e.g. transcription, low batch LLM inference). They landed on this themselves as the best hardware to run their workloads on. Can share more in private if you are interested (don't want to turn this into a sales pitch for exo). If Apple actually sold us an M5 Max / M5 Ultra Mac Studio, then we'd use that. But we could be waiting until October for that (or longer, the supply chain issues seem pretty bad). It's the same M5 Max chip in the MacBook as the Mac Studio, and it goes up to 128GB unified memory. Each chip has 614GB/s memory bandwidth (2.24x DGX Spark). I would say the main downside (which we should make more clear) is the software ecosystem - it's still quite immature. It has got much better in the last year e.g. clustering came a long way with low-latency RDMA in macOS 26.2.
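
The aggregate figures in the post above (4 MacBooks giving 512GB at 2456GB/s) follow from simple addition, under the post's own assumption of roughly linear scaling. The sketch below reproduces that arithmetic and adds a common rule of thumb for memory-bound decode: tokens per second cannot exceed bandwidth divided by the bytes of weights read per token. The 100 GB weight footprint is a purely hypothetical example value, and the ceiling ignores interconnect latency, KV-cache reads and compute.

```python
PER_NODE_MEMORY_GB = 128       # quoted in the post
PER_NODE_BANDWIDTH_GB_S = 614  # quoted in the post
NODES = 4

# Assumes perfectly additive memory and bandwidth ("~linear scaling" in the post).
total_memory_gb = NODES * PER_NODE_MEMORY_GB
total_bandwidth_gb_s = NODES * PER_NODE_BANDWIDTH_GB_S


def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_read_gb: float) -> float:
    """Upper bound on decode speed if every token must stream the weights once."""
    return bandwidth_gb_s / weights_read_gb


# Hypothetical model: 100 GB of weights touched per generated token.
HYPOTHETICAL_WEIGHTS_GB = 100.0
ceiling = decode_ceiling_tok_s(total_bandwidth_gb_s, HYPOTHETICAL_WEIGHTS_GB)

print(f"cluster: {total_memory_gb} GB @ {total_bandwidth_gb_s} GB/s")
print(f"decode ceiling for {HYPOTHETICAL_WEIGHTS_GB:.0f} GB of weights: {ceiling:.2f} tok/s")
```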


212

36.6K

Yves Van Den Broek@Yves_From_BE·29 Apr

@ab198499 @moofeez @kylebrussell @grok Asking a native English speaker to solve a French puzzle …


Andrew@ab198499·29 Apr

@moofeez @kylebrussell @grok explain this to me like im in highschool


232

mufeez@moofeez·28 Apr

I post-trained Qwen3-Coder to fix bugs using an actual debugger.

The result:
Solve rate: 70% → 89%
Median turns to fix: 46 → 19 (-59%)

Instead of just reading code or print-debugging, it:
- reasons from execution
- inspects live variables and call stacks
- sets breakpoints, steps, and evaluates expressions

118

1.6K

124.4K

Yves Van Den Broek@Yves_From_BE·16 Apr

@_karthik @FarzaTV Wrong link


Karthik@_karthik·16 Apr

@FarzaTV link: github.com/Shoshin23/form…


102

Karthik@_karthik·16 Apr

> filling forms on the web sucks! you're usually giving the same info over and over again
> so i made clacky, inspired by clicky from @FarzaTV. drop an id, a resume, anything once - cmd+click any form field on your mac and it fills.
>> dont panic. its all offline, powered by @googlegemma 4. nothing leaves your computer ever
> works on web, pdf forms, slack text fields - ANYWHERE on your mac
> 100% open source and free.

shoutout to @osanseviero and team for building such a neat model and @Prince_Canuma for the mlx port! nice to get back into hacking for fun!

2.1K

Yves Van Den Broek@Yves_From_BE·15 Apr

@bstnxbt In chat I see the 3x speedup in tokens (Qwen3.5-9B): 27 vs 84 t/s. The server, however, is slower than the default MLX server when using RAG, though it claims to use the draft and reports a 68% acceptance …
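
As a rough aid for reading the numbers in this reply, the sketch below computes the measured speedup (27 vs 84 t/s) and, separately, the textbook expectation for speculative decoding: with a per-token acceptance rate `a` and `k` drafted tokens, a verification pass yields on average (1 - a^(k+1)) / (1 - a) tokens. That formula assumes independent per-token acceptance and is not taken from dflash-mlx; the draft length `k = 4` is a hypothetical value, and the reported 68% may be defined differently by the server.

```python
BASELINE_TOK_S = 27.0     # quoted in the post (without draft)
WITH_DRAFT_TOK_S = 84.0   # quoted in the post (with draft)
REPORTED_ACCEPTANCE = 0.68

measured_speedup = WITH_DRAFT_TOK_S / BASELINE_TOK_S


def expected_tokens_per_pass(acceptance: float, draft_len: int) -> float:
    """Mean tokens emitted per verification pass under i.i.d. acceptance (textbook model)."""
    if acceptance >= 1.0:
        return float(draft_len + 1)
    return (1.0 - acceptance ** (draft_len + 1)) / (1.0 - acceptance)


HYPOTHETICAL_DRAFT_LEN = 4  # assumption for illustration only
tokens_per_pass = expected_tokens_per_pass(REPORTED_ACCEPTANCE, HYPOTHETICAL_DRAFT_LEN)

print(f"measured speedup in chat: {measured_speedup:.2f}x")
print(f"expected tokens per pass at a={REPORTED_ACCEPTANCE}, k={HYPOTHETICAL_DRAFT_LEN}: "
      f"{tokens_per_pass:.2f}")
```

Tokens per pass is only an upper bound on the speedup, since drafting and verification add their own cost; this is one way a server can report a healthy acceptance rate and still end up slower overall.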


167

bstn 👁️@bstnxbt·14 Apr

dflash-mlx v0.1.1 dflash-serve now supports tools, reasoning, streaming, and full OpenAI-compatible serving. Works with OpenCode, aider, Continue, Open WebUI. Also available via oMLX (thanks jundot). github.com/bstnxbt/dflash…


182

33.9K

Yves Van Den Broek@Yves_From_BE·5 Mar

@alexocheema @exolabs @alexocheema What “RAG” solution are they using to retrieve info from their documents?


Alex Cheema@alexocheema·5 Mar

China is way ahead on AI adoption. A school in Beijing has repurposed old macs to run personalised AI agents 100% locally using @exolabs The macs were previously used in their film studies lab, for video editing. They have ingested their entire corpus of school data: curriculums, reports, instructional materials and learning objectives - so it’s grounded to all their data in realtime. In order to get accurate answers, they need frontier models, which are BIG - memory is the constraint (not FLOPS). Apple devices with unified memory have a lot of high speed memory so stacking enough of them makes it possible to run massive models. A big concern of schools and parents is data privacy - when students or teachers use models in the cloud they are sending all their data in plaintext to the model provider. Even if schools have policies around this there’s always the risk someone accidentally copy-pastes sensitive data into the model - data leakage is inevitable.


129

893

92K

Yves Van Den Broek@Yves_From_BE·28 Feb

@SIskyee @AdamZmenak @alexocheema @exolabs Not memory: the M5 has 4x faster prompt processing with only a 30% increase in memory bandwidth. It is all in MatMul, aka the new Neural Accelerators in the M5 GPU. If the agents were smarter and cached the function calls once …

TerraHub@SIskyee·27 Feb

@AdamZmenak @alexocheema @exolabs Exactly this. Token speed means nothing when burning 30 seconds on prompt processing. Bottleneck shifted from generation to context ingestion. Solution: prompt caching, smaller system prompts, pre-computed embeddings. Memory bandwidth is the new constraint.


Alex Cheema@alexocheema·27 Feb

Super in-depth guide setting up Qwen3.5-122B-A10B on 2 x 128GB M4 Max Mac Studios from scratch w/ @exolabs. Total cost $8k. Runs at 52 tok/sec. Imagine M5 Max mac mini. The amount of edge compute accessible to consumers will be huge: ~27.5% more memory bandwidth & ~4x FLOPS.

Trevin Peterson@TrevinPeterson

x.com/i/article/2027…


881

315.7K

Yves Van Den Broek@Yves_From_BE·17 Feb

MLX Distributed / RDMA / Tensor Parallelism / KV Cache / Batching / Streaming


Yves Van Den Broek@Yves_From_BE·16 Feb

@awnihannun Nice! Wouldn't PreFill be faster if you run it on 4 M3U's distributed? Maybe 4 256GB make more sense than 512GB ...


Awni Hannun@awnihannun·15 Feb

Running four high-level OpenCode agents + subagents with mlx_lm.server continuous batching and MiniMax M2.5 (6-bit). Fits easily on a 512GB M3 Ultra. Generation is quite fast. But prefill is still slow compared to cloud servers.


276

26K

Yves Van Den Broek@Yves_From_BE·16 Feb

@matteoianni @Patrick1Kennedy Using MLX distributed, you slash the TTFT / prompt processing time. In my setup, going from 1 to 2 Macs cuts the TTFT by more than a factor of 2 🤷‍♂️ Add prompt caching, where you cache the OpenClaw tool call list, and you are good.

Matteo@matteoianni·15 Feb

@Patrick1Kennedy Prompt processing is shit. OpenClaw would take 3 days for a single task.


287

Patrick J Kennedy@Patrick1Kennedy·15 Feb

This is MiniMax-M2.5 MLX running in LM Studio on an Apple Mac Studio M3 Ultra 512GB. Fast enough out of the box for hosting OpenClaw, n8n workflows, and Open WebUI for the team.


606

76.4K

Yves Van Den Broek@Yves_From_BE·15 Feb

@ryanrhughes @Patrick1Kennedy The solution is to cluster M4 Max or M3 Ultra machines with RDMA; it even reduces time to first token for MoE architectures. Awni had a post on that, and I saw this with a Mac mini M4 Pro and an MBP M4 Max, a pairing that is not advised as they are too different in architecture.

199

Ryan R. Hughes@ryanrhughes·15 Feb

What does the time to first token look like when you throw it a real prompt? This has been a consistent holdup for me with my M4 MacBook Pro. It's great at processing simple prompts, but the moment you throw it a 100k request, you're waiting 1 min for prompt processing. I'm wondering how much better the Ultra chip is for that.

2.1K

Yves Van Den Broek@Yves_From_BE·9 Feb

Using MLX distributed / RDMA capabilities. MacBook Pro M4 Max + Mac mini M4 Pro connected via TB5.


276
