Neil Mehta

We made the MLX engine a lot faster in the last release. Give it a try!
Neil Mehta@ostensiblyneil

Cool demo here showing the capabilities of the beta LM Studio MLX engine. The engine intelligently manages caching and batching for these parallel agents.
Adrien Grondin@adrgrondin
Subagents running locally and simultaneously on MacBook Pro M5 with Codex CLI + @lmstudio to review code and find bugs using Qwen 3.6. Powered by the updated MLX engine with batching, in beta in the app. The batching speed boost is noticeable.

@TokenFires Hey, we published an updated runtime for MLX yesterday that should improve the LM Studio performance. If you want to try it out, update the app to 0.4.13 and then run `lms runtime get mlx --channel beta`. Would be curious to hear your experience!

I just ran Qwen3.6 35B A3B on oMLX instead of LM Studio and all that delay in prompt processing…GONE. I’m feeding about 45k tokens into my agent each turn and chat has become as fast as frontier. PP started at 40k and went to 250k tokens per second. I’m a little blown away. This doesn’t seem real. M5 Max MacBook, top end temps have dropped 10-20 degrees F, fan spin dropped from 5700 rpm (maximum) to 3200 rpm on coding tasks. Putting this setup on my Mac mini this weekend…

@ostensiblyneil Thanks. I can now confirm the new SHA-256 24ad4d1... for LM-Studio-0.3.36-1-x64.exe.
I have one more thing to confirm.
At the time of my report, LM-Studio-0.3.36-1-arm64.exe had SHA-256 f3df3be....
It is now c22b5bf...
Is this change also intentional?
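
For anyone wanting to repeat this kind of check, here is a minimal sketch in Python of hashing a downloaded installer and comparing it against a published SHA-256 prefix. The function names `sha256_of_file` and `matches_published_prefix` are hypothetical helpers, not part of LM Studio or its tooling, and the truncated hashes quoted above stay truncated: the comparison only checks that the computed digest starts with whatever prefix was published.

```python
import hashlib
import sys


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_prefix(path, published_prefix):
    """Check whether the file's digest starts with a (possibly truncated) published hash.

    Trailing dots such as in "24ad4d1..." are treated as an elision marker and ignored.
    """
    prefix = published_prefix.strip().rstrip(".").lower()
    return sha256_of_file(path).startswith(prefix)


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python check_hash.py <installer path> <published hash or prefix>
    installer, published = sys.argv[1], sys.argv[2]
    print(sha256_of_file(installer))
    print("match" if matches_published_prefix(installer, published) else "MISMATCH")
```

A prefix match on seven hex characters is only a quick sanity check; comparing against the full published digest is the stronger test.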

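The hash of LM-Studio-0.3.36-1-x64.exe seems to have changed since release, which scared me enough that I opened an issue.
github.com/lmstudio-ai/lm…
There are plenty of possible explanations: it was simply updated, the file got corrupted, it is a problem specific to my environment, and so on...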

@teknium @ggerganov The important piece is the engine selection on the bottom right. CUDA 12 is recommended for the best performance on 5090 cards.

gpt-oss is a great model
IMO OpenAI showed us the blueprint for winning local AI:
- Interleaved SWA
- Small head sizes in the attention
- Attention sinks
- Mixture of Experts FFN
- 4-bit training
All of these parts combined together result in the best architecture suitable for regular users. Very lightweight and efficient for inference on pretty much any hardware.
Qwen models are also great. The MoE works really well. I think they should just adopt iSWA and 4-bit training to become the best.
Gemma models are also great. They already have the 4-bit QAT figured out. It seems they just need to adopt the MoE architecture. And maybe reduce the head size a bit.
p.s. don't know if this makes sense, just my overall impression and intuitive understanding
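
To make two items from the list above concrete, here is a rough pure-Python sketch of interleaved sliding-window attention (iSWA) and of a softmax with an attention sink. Everything in it is an illustrative assumption rather than any model's actual implementation: the function names, the window size, the "every second layer is full attention" pattern, and the modeling of the sink as one extra logit that absorbs probability mass in the softmax denominator.

```python
import math


def causal_mask(seq_len, window=None):
    """mask[q][k] is True where query position q may attend to key position k.

    window=None means full causal attention; otherwise only the last
    `window` positions (including q itself) are visible.
    """
    return [
        [k <= q and (window is None or q - k < window) for k in range(seq_len)]
        for q in range(seq_len)
    ]


def layer_windows(num_layers, window, full_every=2):
    """Assumed interleave pattern: every `full_every`-th layer is full attention."""
    return [None if (i + 1) % full_every == 0 else window for i in range(num_layers)]


def cached_keys_per_layer(seq_len, windows):
    """Keys each layer must keep in its KV cache for a sequence of seq_len tokens."""
    return [seq_len if w is None else min(seq_len, w) for w in windows]


def softmax_with_sink(scores, sink_logit):
    """Softmax over scores plus one sink logit; the sink's share is discarded,
    so the returned weights can sum to less than 1."""
    m = max(max(scores), sink_logit)
    exps = [math.exp(s - m) for s in scores]
    denom = sum(exps) + math.exp(sink_logit - m)
    return [e / denom for e in exps]


if __name__ == "__main__":
    windows = layer_windows(num_layers=4, window=128)
    print(windows)
    print(cached_keys_per_layer(45_000, windows))
```

The cache count is where the "lightweight" claim shows up: in this sketch the sliding-window layers keep a fixed number of keys no matter how long the prompt gets, and only the full-attention layers grow with context length.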

@teknium @ggerganov Hey @teknium could you please check the runtimes page in the app (ctrl+shift+R)? The default selection should be CUDA 12 llama.cpp v1.47.0 (or greater), and note that the model needs to be reloaded after changing the default selection.

I used to have 2x 4090s on the pc, which definitely did cause a lot of issues - when I tested without the 2nd 4090 back then it sped everything up dramatically.
But now, just a single 5090 on here - here's my fire hazard dusty ass rig xD apologies for all the sadness this image will cause people 😂

MLX-VLM v0.3.2 is here 🔥
What’s new:
- Migrated to .toml
- UI and Audio dependencies are optional
- Added CUDA support
- Support text-only training
- Lots of fixes and refactoring
Thanks to everyone who contributed to this release ❤️ (@ActuallyIsaak, Neil from @lmstudio, Saurav and Zhnext)
Get started today:
> pip install -U mlx-vlm
Please leave us a star ⭐
github.com/Blaizzy/mlx-vlm