Yves Van Den Broek

34 posts

@Yves_From_BE

Joined September 2025
14 Following · 3 Followers
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@Arcardus @SchallerDomenic Backup and restore is deeply integrated and has severe security implications; if you want that convenience, you should consider other options. Just like on a PC, you already have a backup with Google products ... if you want all the rest 🤷‍♂️
English
0
0
0
6
Domenic Schaller
Domenic Schaller@SchallerDomenic·
🇪🇺 The EU is taking action against Apple again 🫣 The European Commission is investigating Apple because full iCloud backups are only possible with Apple's own service. Third-party backup apps are put at a serious disadvantage as a result. The authorities see this as an obstruction of competition. What do you think? The right move by the EU, or typical overregulation once again? 👇 #iCloud #Apple #EU
Deutsch
140
12
174
352.4K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@KyleHessling1 Hmmm, saw a test on YT, Qwen vs Qwopus … even with thinking, Qwen was faster in total time, used fewer tokens and had about the same t/s … the results were not that different either, so I'm curious where the difference comes from …
English
0
0
0
16
Kyle Hessling
Kyle Hessling@KyleHessling1·
Qwopus 3.6 27b-Coder is now live! Scores a 67% on a full run of SWE bench verified with thinking completely disabled! Q5_K_M
This model is lightning fast for dense class! With a natively finetuned MTP head, it achieves 100 tps on a single 5090!
The biggest upgrade here, though, is its stability in programming and tool calling within @NousResearch Hermes agent, with thinking off! Wall time is crazy fast this way, which makes Hermes feel "native" and snappy, like they were meant for each other. The freedom of running without thinking at all makes you part of the thinking process, and you never get caught waiting 15 minutes for it to finish a thought string, like with the base models.
Thinking on and temp high, .9-1 seems to produce really incredible design and svg results. I reran the Boat survival prompt through a few turns, thinking on, and it seemed to render more fancy models in HTML canvas, but it was much more of a start-a-prompt and wait experience vs the snappy and active iteration with it disabled. It may be worth turning it off and on throughout the build process if you want to get really creative with design.
Really looking forward to seeing how this one performs for y'all! Please post comments with your opinions and use cases below! As always with our fine-tunes, mess with the temperature setting, and run them much hotter than the base!
Please check out the Boat Survival game I posted yesterday, made in 12 turns using Hermes and this model, with thinking off. Link below! Full swe bench repo-specific breakdown also posted in the comments for those interested!
Happy building, everyone! We're looking forward to your thoughts! Quants uploading now! huggingface.co/Jackrong/Qwopu…
English
105
145
1.3K
159.6K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@Arcardus @SchallerDomenic Use Google apps, so everything is backed up … no need to open up to third parties … what happens if you cannot restore your backup to a new iPhone, will you complain to Apple or to the EU 😜
English
1
0
0
86
Arcardy (nonbinary 💜)
@SchallerDomenic I've been demanding this for YEARS. Finally. Thank you, EU!! Why MUST I subscribe to Apple's crappy iCloud when I already have Google Drive?!
Deutsch
16
0
20
3.4K
Pedro Cuenca
Pedro Cuenca@pcuenq·
GLM 5.2 has just been released 🔥 Here it's already running with MLX on two Mac Studios (M3 Ultra). This is comparable to the latest closed models, with weights you can download, quantize, distill, fine-tune, run.
English
42
49
722
81.7K
Siliconway
Siliconway@siIiconway·
@pcuenq U r kidding, 14 weeks. 96Gb 11k$. For what? 30tk/s? Nah thanks
English
3
0
6
1.3K
natwarlal
natwarlal@NTWR_LaL·
@ukrroot @pcuenq Yes exactly. They can’t share memory. How and why would i connect two?
English
2
0
0
62
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@ukrroot @pcuenq With RDMA over Thunderbolt there is no real slowdown. In fact, dense models scale quite well. MoE models see less benefit, though prefill speeds up as well. I see a serious speedup going from 1 to 2 Macs with the same model. The condition is that you use Tensor/RDMA and not pipeline/TCP.
English
1
0
1
45
Rompel
Rompel@ukrroot·
@pcuenq Interconnect is the wall here. Real question: decode tok/s on the two-Studio split, and what quant fits unified memory so you skip the inter-box hop entirely? Single-box at lower precision likely beats split-box at higher. Got numbers?
English
4
0
3
3K
Adrien Grondin
Adrien Grondin@adrgrondin·
Subagents running locally and simultaneously on MacBook Pro M5 with Codex CLI + @lmstudio to review code and find bugs using Qwen 3.6.
Powered by the updated MLX engine with batching in beta in the app.
The batching speed boost is noticeable.
English
36
33
659
75K
Jun Kim
Jun Kim@jundotkim·
oMLX 0.3.9rc1 released. Highlights:
- Low-memory Macs stay stable instead of getting killed by the OS
- DFlash bumped to v0.1.7 (thanks to @bstnxbt's dflash-mlx). Qwen thinking/GDN fix, Etc.
- Chunked prefill. A long prompt no longer blocks decode for everyone else
- Multi-tasking in the admin chat. Run multiple chats in parallel
- Real-time memory bar in the admin dashboard
- Hermes Agent quick launch, "omlx launch hermes"
Plus a lot of bug fixes and new contributors in this cycle. Thanks everyone! github.com/jundot/omlx/re…
English
10
13
121
8.3K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@Kannanigga @alexocheema @itsry16 It’s not the speed that counts, it’s the latency; not that much data is sent. The preprocessing speedup is real. For MoEs the inference speedup is not so large, but for dense models it scales like x1.8 for 2 and x3 for 4 …
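A quick way to read those scaling claims is as parallel efficiency, that is, the reported speedup divided by the number of machines. The sketch below only restates the two figures from the post (x1.8 on 2 Macs, x3 on 4 Macs); the function name is made up for illustration, and nothing here is a measurement.

```python
def parallel_efficiency(speedup: float, n_machines: int) -> float:
    """Fraction of ideal linear scaling achieved: speedup / n."""
    if n_machines < 1:
        raise ValueError("n_machines must be at least 1")
    return speedup / n_machines


# Dense-model speedups as quoted in the post (not independently measured).
reported = {2: 1.8, 4: 3.0}

for n, speedup in reported.items():
    eff = parallel_efficiency(speedup, n)
    print(f"{n} Macs: {speedup:.1f}x speedup, {eff:.0%} of linear")
```

By this reading the quoted numbers correspond to roughly 90% of linear scaling on two machines and 75% on four.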
English
0
0
0
39
Kannannigga
Kannannigga@Kannanigga·
@alexocheema @itsry16 4 MacBooks are not the same as one 512GB GPU/server. RDMA over Thunderbolt 5 helps but the interconnect is still around 80Gbps bidirectional up to 120Gbps in boost, which is tiny compared with each MacBook’s 614GB/s local memory bandwidth. Stop posting misinformation. Ty!
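The gap this post points at can be made concrete by putting both numbers in the same unit. The sketch below converts the quoted link speeds (80 and 120 Gbps) to GB/s and compares them with the quoted 614 GB/s local memory bandwidth. It assumes 8 bits per byte and ignores protocol overhead, so the real link throughput would be somewhat lower.

```python
def gbps_to_gb_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


LOCAL_MEMORY_GB_S = 614.0  # per-MacBook memory bandwidth, as quoted in the post
LINK_GBPS = {"base": 80.0, "boost": 120.0}  # Thunderbolt 5 figures, as quoted

for mode, gbps in LINK_GBPS.items():
    link_gb_s = gbps_to_gb_per_s(gbps)
    ratio = LOCAL_MEMORY_GB_S / link_gb_s
    print(f"{mode}: {link_gb_s:.1f} GB/s link, local memory is {ratio:.1f}x faster")
```

Whether that gap matters depends on how much data actually crosses the link per token, which is the point the reply makes about latency versus bandwidth.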
English
4
0
3
663
Alex Cheema
Alex Cheema@alexocheema·
It’s kind of crazy but the shitstorm of supply chain issues has created a new best-in-class local AI deployment: M5 Max MacBook clusters.
- The memory unit economics are great - each MacBook has 128GB @ 614GB/s for $5k
- M5 Max added tensor cores (Apple Neural Accelerators) with 4x compute of M4 Max (~60TFLOPS fp16)
- You can cluster them with RDMA over Thunderbolt 5 and @exolabs for ~linear scaling (so memory bandwidth really is additive: 4 x MacBooks are 512GB @ 2456GB/s)
You can’t get Mac Studios right now, so customers are buying MacBook clusters instead. A government is running this in prod.
This setup is best for low-batch decode-heavy inference (all memory bound) and transcription (super fast and cheap on apple silicon).
Alex Cheema@alexocheema

It is unconventional but it actually works, depending on the workload of course. There are strengths and weaknesses for sure. There are some real deployments (governments, big companies) running this setup in production (pods of 4 MacBooks). It's the best price to performance for many workloads (e.g. transcription, low batch LLM inference). They landed on this themselves as the best hardware to run their workloads on. Can share more in private if you are interested (don't want to turn this into a sales pitch for exo). If Apple actually sold us an M5 Max / M5 Ultra Mac Studio, then we'd use that. But we could be waiting until October for that (or longer, the supply chain issues seem pretty bad). It's the same M5 Max chip in the MacBook as the Mac Studio, and it goes up to 128GB unified memory. Each chip has 614GB/s memory bandwidth (2.24x DGX Spark). I would say the main downside (which we should make more clear) is the software ecosystem - it's still quite immature. It has got much better in the last year e.g. clustering came a long way with low-latency RDMA in macOS 26.2.
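The cluster figures quoted in this thread (4 MacBooks giving 512GB @ 2456GB/s) are the simple sums of the per-machine numbers, which is the ideal case of perfectly linear scaling. A minimal sketch of that arithmetic follows; the `Node` type and `cluster_totals` helper are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    memory_gb: int
    bandwidth_gb_s: int


def cluster_totals(nodes):
    """Sum memory and bandwidth across nodes (the ideal linear-scaling case)."""
    return (
        sum(n.memory_gb for n in nodes),
        sum(n.bandwidth_gb_s for n in nodes),
    )


# Per-MacBook figures as quoted in the thread: 128GB @ 614GB/s.
m5_max = Node(memory_gb=128, bandwidth_gb_s=614)

total_mem, total_bw = cluster_totals([m5_max] * 4)
print(f"{total_mem} GB @ {total_bw} GB/s")
```

The summed bandwidth is an upper bound: it is only reached when each machine reads just its own shard of the weights and the interconnect adds no stall time.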

English
24
14
212
36.6K
mufeez
mufeez@moofeez·
I post-trained Qwen3-Coder to fix bugs using an actual debugger.
The result:
Solve rate: 70% → 89%
Median turns to fix: 46 → 19 (-59%)
Instead of just reading code or print-debugging, it:
- reasons from execution
- inspects live variables and call stacks
- sets breakpoints, steps, and evaluates expressions
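The quoted -59% can be checked from the two median-turn figures, and the same helper shows the solve-rate change both in percentage points and in relative terms. This is only arithmetic on the numbers in the post; the helper name is illustrative.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, as a fraction of `before`."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


# Figures as quoted in the post.
turns_before, turns_after = 46, 19
solve_before, solve_after = 70, 89  # solve rate in percent

print(f"median turns: {relative_change(turns_before, turns_after):+.1%}")
print(f"solve rate: {solve_after - solve_before:+d} points, "
      f"{relative_change(solve_before, solve_after):+.1%} relative")
```

The median-turn change works out to about -58.7%, which rounds to the -59% stated in the post.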
English
91
118
1.6K
124.4K
Karthik
Karthik@_karthik·
> filling forms on the web sucks! you're usually giving the same info over and over again > so i made clacky, inspired by clicky from @FarzaTV. drop an id, a resume, anything once - cmd+click any form field on your mac and it fills. >> dont panic. its all offline, powered by @googlegemma 4. nothing leaves your computer ever > works on web, pdf forms, slack text fields - ANYWHERE on your mac > 100% open source and free. shoutout to @osanseviero and team for building such a neat model and @Prince_Canuma for the mlx port! nice to get back into hacking for fun!
English
3
4
15
2.1K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@bstnxbt In chat I do see the 3x speedup in tokens (Qwen3.5-9B): 27 vs 84 t/s. With RAG, however, the server is slower than the default mlx server, even though it claims to use draft and reports a 68% acceptance …
English
0
0
0
167
bstn 👁️
bstn 👁️@bstnxbt·
dflash-mlx v0.1.1 dflash-serve now supports tools, reasoning, streaming, and full OpenAI-compatible serving. Works with OpenCode, aider, Continue, Open WebUI. Also available via oMLX (thanks jundot). github.com/bstnxbt/dflash…
English
7
28
182
33.9K
Alex Cheema
Alex Cheema@alexocheema·
China is way ahead on AI adoption. A school in Beijing has repurposed old macs to run personalised AI agents 100% locally using @exolabs The macs were previously used in their film studies lab, for video editing. They have ingested their entire corpus of school data: curriculums, reports, instructional materials and learning objectives - so it’s grounded to all their data in realtime. In order to get accurate answers, they need frontier models, which are BIG - memory is the constraint (not FLOPS). Apple devices with unified memory have a lot of high speed memory so stacking enough of them makes it possible to run massive models. A big concern of schools and parents is data privacy - when students or teachers use models in the cloud they are sending all their data in plaintext to the model provider. Even if schools have policies around this there’s always the risk someone accidentally copy-pastes sensitive data into the model - data leakage is inevitable.
English
47
129
893
92K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@SIskyee @AdamZmenak @alexocheema @exolabs Not memory: the M5 has 4x faster prompt processing with only a 30% increase in memory bandwidth. It is all in MatMul, aka the new Neural Accelerators in the M5 GPU. If only the agents were smarter and cached the function calls once …
English
0
0
0
11
TerraHub
TerraHub@SIskyee·
@AdamZmenak @alexocheema @exolabs Exactly this. Token speed means nothing when burning 30 seconds on prompt processing. Bottleneck shifted from generation to context ingestion. Solution: prompt caching, smaller system prompts, pre-computed embeddings. Memory bandwidth is the new constraint.
English
1
0
1
76
Alex Cheema
Alex Cheema@alexocheema·
Super in-depth guide setting up Qwen3.5-122B-A10B on 2 x 128GB M4 Max Mac Studios from scratch w/ @exolabs. Total cost $8k. Runs at 52 tok/sec. Imagine M5 Max mac mini. The amount of edge compute accessible to consumers will be huge: ~27.5% more memory bandwidth & ~4x FLOPS.
Trevin Peterson@TrevinPeterson

x.com/i/article/2027…

English
64
67
881
315.7K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
MLX Distributed / RDMA / Tensor Parallelism / KV Cache / Batching / Streaming
English
0
0
1
39
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@awnihannun Nice! Wouldn't PreFill be faster if you run it on 4 M3U's distributed? Maybe 4 256GB make more sense than 512GB ...
English
0
0
0
4
Awni Hannun
Awni Hannun@awnihannun·
Running four high-level OpenCode agents + subagents with mlx_lm.server continuous batching and MiniMax M2.5 (6-bit). Fits easily on a 512GB M3 Ultra. Generation is quite fast. But prefill is still slow compared to cloud servers.
English
21
19
276
26K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@matteoianni @Patrick1Kennedy Using MLX distributed, you slash the TTFT / prompt processing. In my setup, going from 1 to 2 Macs cuts the TTFT by more than a factor of 2 🤷‍♂️ Combine that with prompt caching, where you cache the OpenClaw tool call list, and you are good.
English
0
0
0
27
Matteo
Matteo@matteoianni·
@Patrick1Kennedy Prompt processing is shit. OpenClaw would take 3 days for a single task.
English
2
0
1
287
Patrick J Kennedy
Patrick J Kennedy@Patrick1Kennedy·
This is MiniMax-M2.5 MLX running in LM Studio on an Apple Mac Studio M3 Ultra 512GB. Fast enough out of the box for hosting OpenClaw, n8n workflows, and Open WebUI for the team.
English
47
27
606
76.4K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
@ryanrhughes @Patrick1Kennedy The solution is to cluster M4 Max or M3 Ultra machines with RDMA; it even reduces time to first token for MoE architectures. Awni had a post on that, and I saw this with a Mac mini M4 Pro and an MBP M4 Max, a pairing that is not advised as they are too different in architecture.
English
0
0
0
199
Ryan R. Hughes
Ryan R. Hughes@ryanrhughes·
What does the time to first token look like when you throw it a real prompt? This has been a consistent holdup for me with my M4 MacBook Pro. It's great at processing simple prompts, but the moment you throw it a 100k request, you're waiting 1 min for prompt processing. I'm wondering how much better the Ultra chip is for that.
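To put that wait in throughput terms, the sketch below assumes "100k request" means about 100,000 prompt tokens and "1min" means 60 seconds, then derives the implied prefill rate and the wait at a few hypothetical prompt sizes. It treats throughput as constant, which is a simplification, since attention cost grows with context length.

```python
def prefill_rate(prompt_tokens: int, seconds: float) -> float:
    """Average prompt-processing throughput in tokens per second."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return prompt_tokens / seconds


def time_to_first_token(prompt_tokens: int, rate_tok_s: float) -> float:
    """Estimated wait before the first output token, ignoring other overhead."""
    return prompt_tokens / rate_tok_s


# Assumption: about 100,000 prompt tokens processed in 60 seconds.
rate = prefill_rate(100_000, 60.0)
print(f"implied prefill rate: {rate:.0f} tok/s")

for tokens in (10_000, 50_000, 100_000):  # hypothetical prompt sizes
    wait = time_to_first_token(tokens, rate)
    print(f"{tokens:>7} tokens -> about {wait:.0f} s")
```

Under those assumptions the implied rate is on the order of 1,700 tokens per second, so any prefill speedup from clustering or caching translates directly into a shorter wait.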
English
3
0
3
2.1K
Yves Van Den Broek
Yves Van Den Broek@Yves_From_BE·
Using MLX distributed / RDMA capabilities. MacBook Pro M4 Max + Mac mini M4 Pro connected via TB5.
English
0
1
2
276