Jun Kim
19 posts

Jun Kim
@jundotkim
Joined February 2026
141 Following · 272 Followers

Jun Kim retweeted
bstn 👁️ @bstnxbt
dflash-mlx v0.1.1
dflash-serve now supports tools, reasoning, streaming, and full OpenAI-compatible serving. Works with OpenCode, aider, Continue, Open WebUI. Also available via oMLX (thanks jundot). github.com/bstnxbt/dflash…
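Since dflash-serve exposes an OpenAI-compatible API, any standard OpenAI client should be able to drive it. A minimal sketch using the official Python client; the port, API key, and model id below are placeholders, not documented dflash-serve defaults.

```python
# Minimal sketch of calling a local OpenAI-compatible server (placeholder values).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local endpoint, adjust to your server
    api_key="not-needed-locally",         # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="qwen3.5-mlx-4bit",             # placeholder model id
    messages=[{"role": "user", "content": "Explain what a KV cache does in one sentence."}],
)
print(resp.choices[0].message.content)
```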
Jun Kim @jundotkim
@d4mations @Prince_Canuma @bstnxbt Thanks so much for building such a great sub! Honestly I check it more than 10 times a day, it's become part of my daily routine. I just never seem to have enough time to actually sit down and post something myself. Really appreciate all the support you give.
Jun Kim @jundotkim
oMLX 0.3.9.dev2 released. Highlights:
- Gemma 4 MTP on the vision path (thanks to @Prince_Canuma's mlx-vlm). Image+text decodes much faster now
- Gemma 4 on the DFlash engine (thanks to @bstnxbt's dflash-mlx)
- ParoQuant support
- omlx launch copilot joins claude / codex / opencode / openclaw / pi
- Restart server button right in the admin UI
- oQ auto-builds a proxy when the model can't fit in RAM
Plus a lot of bug fixes and 20 new contributors in this cycle. Thanks everyone! github.com/jundot/omlx/re…
Jun Kim retweeted
AI✖️Satoshi⏩️ @AiXsatoshi
If you're running local LLMs on a Mac, oMLX is very much worth a look 👍 The wait times that used to be the big pain point for local LLMs on Mac are dramatically shorter; depending on the task it can be less than half, and in some cases around 1/10, of what it was. It's especially comfortable when you have long shared prompts that get reused over and over, such as system prompts or tool definitions 🚀 The speedup comes from saving the prompt's computed results to memory or SSD and reusing them on subsequent requests.
Jun Kim @jundotkim

oMLX 0.3.9.dev2 released. Highlights:
- Gemma 4 MTP on the vision path (thanks to @Prince_Canuma's mlx-vlm). Image+text decodes much faster now
- Gemma 4 on the DFlash engine (thanks to @bstnxbt's dflash-mlx)
- ParoQuant support
- omlx launch copilot joins claude / codex / opencode / openclaw / pi
- Restart server button right in the admin UI
- oQ auto-builds a proxy when the model can't fit in RAM
Plus a lot of bug fixes and 20 new contributors in this cycle. Thanks everyone! github.com/jundot/omlx/re…
Jun Kim retweeted
Berryxia.AI @berryxia
Lately I've been championing Apple's on-device models and the advantages of unified memory! First there was MLX, and now the ecosystem keeps expanding, like oMLX, which I've shared before and which just got another update!
Local AI on Apple Silicon has already erased many of the advantages of cloud LLMs.
oMLX 0.3.9.dev2 packs in Gemma 4's MTP vision path, the DFlash engine, and ParoQuant, with a big jump in image+text decoding speed; it adds omlx launch copilot for one-click hookups to top tools like Claude / Codex / OpenClaw; oQ automatically builds a proxy when memory isn't enough; and the admin UI gets a restart server button.
Local AI used to feel like it was "not quite there"; now its speed, integration, and ease of use are getting absurdly strong. This is what actually pulling AI back from the cloud onto your own computer looks like.
Project: github.com/jundot/omlx
Jun Kim @jundotkim
Appreciate the shoutout and the thoughtful comparison. oMLX started from a simple frustration: waiting too long every time my coding agent shifted context. The SSD KV cache was built to fix that. Nice to see MTPLX taking a different angle with speculative decoding. Different problems, both worth solving.
Alex @AlexJonesax

Two open-source MLX inference servers worth knowing about if you run LLMs on Mac:
MTPLX (@youssofal): Uses a model's own MTP heads for speculative decoding. No draft model needed. ~63 tok/s on Qwen3.6-27B (M5Max). Mathematically exact sampling too; not just greedy prefix matching.
oMLX (@jundot): Tiered KV cache that persists to SSD across restarts. Huge for coding agents where you're sending the same codebase context repeatedly. Also serves LLMs, VLMs, embeddings, rerankers, and audio simultaneously.
They're solving different problems; MTPLX maximizes tok/s, oMLX maximizes workflow efficiency. Both have OpenAI- and Anthropic-compatible APIs, and both work with Claude Code/OpenCode/Cursor out of the box. Running both depending on the task, but both are worth checking out.

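On the MTPLX side of this comparison, speculative decoding has cheap draft tokens (here, from the model's own MTP heads) verified by the full model in a single pass. The toy sketch below shows only the common greedy-verification baseline; MTPLX's mathematically exact sampling uses a different acceptance rule that is not reproduced here, and all values are made up.

```python
# Toy greedy-verification step for speculative decoding (illustration only).
import numpy as np

def greedy_verify(draft_tokens, target_logits):
    """Accept drafted tokens while the target model's argmax agrees; on the first
    disagreement, substitute the target's own token and stop."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_choice = int(np.argmax(target_logits[i]))
        if tok == target_choice:
            accepted.append(tok)            # target agrees: keep the drafted token
        else:
            accepted.append(target_choice)  # disagreement: take the target's token, stop
            return accepted
    return accepted                         # all drafts accepted

# Made-up logits over a 5-token vocabulary for 3 drafted positions.
logits = np.array([
    [0.1, 0.9, 0.0, 0.0, 0.0],  # target's best token: 1
    [0.0, 0.0, 0.8, 0.1, 0.1],  # target's best token: 2
    [0.7, 0.1, 0.1, 0.1, 0.0],  # target's best token: 0
])
print(greedy_verify([1, 2, 3], logits))  # [1, 2, 0]: two drafts kept, third corrected
```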
Jun Kim @jundotkim
oMLX dev here!
- context shifts = when the prompt mutates mid-session, oMLX still serves a cache hit up to the matching prefix. Vanilla KV cache can be trimmed by most servers, but Gemma/Qwen3.5 use specialized cache layouts (sliding window, hybrid attn, arrayscache) where any divergence normally forces a full re-prefill. oMLX does partial hits across ALL cache types: critical for coding agents!
- compaction = Claude Code only auto-compacts at 200k. oMLX advertises a threshold based on the actual model's context window so it triggers at the right time for whatever you're running.
- tool result pruning = on slow hardware, an agent reading a huge file can take 15+ min. You cap the per-read size; the model sees a slice and asks for the next chunk only if needed.
Happy to go deeper on any of these!
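To make the "partial hit up to the matching prefix" point concrete: stored KV caches can be keyed by the token ids they were computed for, and a new request reuses the longest shared prefix so only the tail needs a fresh prefill. A toy sketch of that idea, not oMLX's actual data structures or cache layouts.

```python
# Toy prefix-reuse lookup (illustration only, not oMLX's implementation).

def common_prefix_len(a, b):
    """Length of the shared leading run of two token-id sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    def __init__(self):
        self._entries = {}  # tuple of token ids -> opaque KV-cache handle

    def put(self, tokens, kv_handle):
        self._entries[tuple(tokens)] = kv_handle

    def best_partial_hit(self, tokens):
        """Return (matched_length, handle) for the stored cache sharing the longest prefix."""
        best_len, best_handle = 0, None
        for cached_tokens, handle in self._entries.items():
            n = common_prefix_len(cached_tokens, tokens)
            if n > best_len:
                best_len, best_handle = n, handle
        return best_len, best_handle

# The agent mutates the middle of its prompt; the shared head is still reusable.
cache = PrefixCache()
cache.put([1, 2, 3, 4, 5], "kv-after-5-tokens")
hit_len, handle = cache.best_partial_hit([1, 2, 3, 9, 9, 9])
print(hit_len, handle)  # 3 kv-after-5-tokens -> trim to 3 tokens, prefill only the tail
```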
Mario Zechner @badlogicgames
baby steps with mlx server + Qwen3.6-27B. no prefix cache, at least i couldn't figure out how to get that to work. tool calls are parsed post-hoc after the full assistant message has been generated.
Esteban @breath_mirror
been spending the past days integrating @bstnxbt's DFlash into my fork of oMLX
96 t/s at 1024 tokens generated
now adding @0xClandestine's Mirror SD on top of it, let's see if it can boost the t/s
then i'll add @runsonai's DDTree
no idea where this will lead me but fun indeed
Jun Kim @jundotkim
@0xClandestine @breath_mirror @bstnxbt @runsonai Hey! I'm jundot, creator of oMLX. I do have an X account, just haven't been active here since I've been balancing a day job and maintaining oMLX on the side. Glad to see oMLX getting some love!
Jun Kim retweeted
Awni Hannun @awnihannun
There were some exceptionally cool demos from @ollama and omlx using MLX to run Qwen 3.5 and Gemma 4 on Apple silicon. The capabilities of local LLMs and the surrounding ecosystem have come a long way in the past couple years.
Todd Dailey @twid

Watching @awnihannun at @ollama

Jun Kim retweeted
GitHubDaily @GitHub_Daily
Deploying local LLMs on a Mac makes memory management a real hassle, and finding a solution that is both efficient and flexible isn't easy.
I recently came across oMLX on GitHub, an inference tool optimized specifically for Apple Silicon. It puts the LLM server right in the macOS menu bar, so you can manage it with a click.
The core highlight is the hot/cold tiered KV cache: frequently accessed context stays in memory, and whatever doesn't fit is automatically stored on the SSD. Even if you switch models or restart the server midway, the previous context is restored instantly, saving the long recomputation time.
GitHub: github.com/jundot/omlx
It ships with an intuitive web admin console that monitors memory in real time, exposes fine-grained parameters, and lets you search for and download all kinds of open-source models. There's a native Mac client you can install by drag-and-drop, plus one-command deployment via Homebrew. It's fully compatible with the OpenAI and Anthropic APIs, so it plugs straight into the chat apps we already use.
It makes the day-to-day experience of running local models very smooth, great for anyone who leans heavily on a Mac for LLMs.
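The hot/cold tiering described above can be pictured as an in-memory LRU that spills evicted entries to disk and reloads them on demand, which is also what survives a restart. A toy sketch under that assumption; a real KV cache such as oMLX's would store model tensors, not pickled Python objects.

```python
# Toy two-tier (RAM + disk) cache sketch (illustration only, not oMLX's implementation).
import os
import pickle
from collections import OrderedDict

class TieredCache:
    def __init__(self, spill_dir, max_hot=2):
        self.hot = OrderedDict()      # LRU of hot entries kept in memory
        self.max_hot = max_hot
        self.spill_dir = spill_dir
        os.makedirs(spill_dir, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.max_hot:            # evict coldest entry to disk
            cold_key, cold_val = self.hot.popitem(last=False)
            with open(self._path(cold_key), "wb") as f:
                pickle.dump(cold_val, f)

    def get(self, key):
        if key in self.hot:                            # hot path: already in RAM
            self.hot.move_to_end(key)
            return self.hot[key]
        if os.path.exists(self._path(key)):            # cold path: reload from disk
            with open(self._path(key), "rb") as f:
                value = pickle.load(f)
            self.put(key, value)
            return value
        return None

cache = TieredCache("/tmp/kv_spill", max_hot=2)
for i in range(4):
    cache.put(f"prompt-{i}", {"kv": list(range(i))})
print(cache.get("prompt-0"))  # was spilled to /tmp/kv_spill, restored on demand
```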
Jun Kim @jundotkim
oMLX owner here. Angelos is right — this is likely on our side, not MLX or MLX-LM. We've shipped numerous tool-calling fixes for small reasoning models since v0.2.6, and as of v0.2.16 (latest), oMLX guarantees token-level identical output with mlx-lm's BatchGenerator. Please try upgrading to v0.2.16 and let us know if the multi-turn tool calling degradation persists. @LotusDecoder Happy to dig in if you share a repro against oMLX.
Angelos Katharopoulos @angeloskath
@LotusDecoder Just got time to look at it. The 4-bit completes 70/70 rounds just fine using mlx_lm.server. It doesn't seem to be an MLX or MLX-LM issue, perhaps oMLX or something else in your env?
LotusDecoder @LotusDecoder
Ran another round of comparison experiments based on real-world testing and I think I've roughly figured it out. The main reason: the Qwen3.5 architecture is too new and MLX's quantization is lagging behind. For running claude code, neither MLX 4-bit nor 8-bit is really usable; you still have to use GGUF. Current conclusion: use GGUF quantization for coding productivity, and MLX quantization when you're chasing speed.
LotusDecoder @LotusDecoder

Never mind; the lower-tier qwen3.5 models aren't productive enough for claude code. MLX/Qwen3.5-35B-8bit is fast and reasonably smart, but it doesn't really hold up in claude code; it degrades easily, starting around turn 13 or so. And since a single task commonly hits the read / write / bash tools five or six times, it's basically very hard to use for real work.

Jun Kim @jundotkim
@kbhero21 The loading issue was likely because you picked the wrong model variant — make sure you're selecting a model with MLX in the name, as those are the ones optimized to run locally via the MLX framework.
Delia Dou @kbhero21
Learning from Cell and building in public.
1/ Today I completed 2 AI hands-on projects. The first one made me truly understand why people say the deepest layer of the AI era is: data + devices + compute.
Project 1: I tried running a Qwen model on my 16GB MacBook with omlx.
Reality check: my machine didn't have enough memory for the tutorial's Qwen 3.5 6B. At first, AI suggested Qwen 3.5 4B. I reset the model API, changed Max Tokens, adjusted Max Context Window... still failed.
In the end, I had to downgrade to Qwen 2.5-Coder-1.5B-Instruct-4bit. That finally worked. It was fast, but very limited:
– mostly only good for Q&A
– code could only appear in the output panel
– no real file reading or writing
Jun Kim retweeted
Michal Komar @michalkomar
oMLX might be the game changer for local LLM inference on Apple Silicon Macs. M5 Neural Accelerator support and disk caching included. github.com/jundot/omlx
Jun Kim retweeted
GitHub Projects Community @GithubProjects
Serve multiple LLMs locally on your Mac with optimized memory and caching
Jun Kim retweeted
Brian Roemmele @BrianRoemmele
Testing this now. Quite useful on my laptop: oMLX, an LLM inference server with continuous batching & SSD caching for Apple Silicon, managed from the macOS menu bar. github.com/jundot/omlx