Jun Kim
19 posts

Jun Kim
@jundotkim
Joined February 2026
141 Following · 272 Followers

Jun Kim retweeted
bstn 👁️ @bstnxbt
dflash-mlx v0.1.1
dflash-serve now supports tools, reasoning, streaming, and full OpenAI-compatible serving. Works with OpenCode, aider, Continue, Open WebUI. Also available via oMLX (thanks jundot). github.com/bstnxbt/dflash…
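Since dflash-serve exposes an OpenAI-compatible API, any standard OpenAI client should be able to drive it. A minimal sketch using the official Python client; the port, API key, and model id below are placeholders, not documented dflash-serve defaults.

```python
# Minimal sketch of calling a local OpenAI-compatible server (placeholder values).
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # assumed local endpoint, adjust to your server
    api_key="not-needed-locally",         # local servers typically ignore the key
)

resp = client.chat.completions.create(
    model="qwen3.5-mlx-4bit",             # placeholder model id
    messages=[{"role": "user", "content": "Explain what a KV cache does in one sentence."}],
)
print(resp.choices[0].message.content)
```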
Jun Kim @jundotkim
@d4mations @Prince_Canuma @bstnxbt Thanks so much for building such a great sub! Honestly I check it more than 10 times a day, it's become part of my daily routine. I just never seem to have enough time to actually sit down and post something myself. Really appreciate all the support you give.
Jun Kim @jundotkim
oMLX 0.3.9.dev2 released. Highlights:
- Gemma 4 MTP on the vision path (thanks to @Prince_Canuma's mlx-vlm). Image+text decodes much faster now
- Gemma 4 on the DFlash engine (thanks to @bstnxbt's dflash-mlx)
- ParoQuant support
- omlx launch copilot joins claude / codex / opencode / openclaw / pi
- Restart server button right in the admin UI
- oQ auto-builds a proxy when the model can't fit in RAM
Plus a lot of bug fixes and 20 new contributors in this cycle. Thanks everyone! github.com/jundot/omlx/re…
Jun Kim retweeted
AI✖️Satoshi⏩️ @AiXsatoshi
If you're running local LLMs on a Mac, oMLX is very much worth a look 👍 The wait times that used to be the big pain point for local LLMs on Mac are dramatically shorter; depending on the task it can be less than half, and in some cases around 1/10, of what it was. It's especially comfortable when you have long shared prompts that get reused over and over, such as system prompts or tool definitions 🚀 The speedup comes from saving the prompt's computed results to memory or SSD and reusing them on subsequent requests.
Jun Kim @jundotkim

oMLX 0.3.9.dev2 released. Highlights:
- Gemma 4 MTP on the vision path (thanks to @Prince_Canuma's mlx-vlm). Image+text decodes much faster now
- Gemma 4 on the DFlash engine (thanks to @bstnxbt's dflash-mlx)
- ParoQuant support
- omlx launch copilot joins claude / codex / opencode / openclaw / pi
- Restart server button right in the admin UI
- oQ auto-builds a proxy when the model can't fit in RAM
Plus a lot of bug fixes and 20 new contributors in this cycle. Thanks everyone! github.com/jundot/omlx/re…
Jun Kim retweeted
Berryxia.AI @berryxia
Lately I've been championing Apple's on-device models and the advantages of unified memory! First there was MLX, and now the ecosystem keeps expanding, like oMLX, which I've shared before and which just got another update!
Local AI on Apple Silicon has already erased many of the advantages of cloud LLMs.
oMLX 0.3.9.dev2 packs in Gemma 4's MTP vision path, the DFlash engine, and ParoQuant, with a big jump in image+text decoding speed; it adds omlx launch copilot for one-click hookups to top tools like Claude / Codex / OpenClaw; oQ automatically builds a proxy when memory isn't enough; and the admin UI gets a restart server button.
Local AI used to feel like it was "not quite there"; now its speed, integration, and ease of use are getting absurdly strong. This is what actually pulling AI back from the cloud onto your own computer looks like.
Project: github.com/jundot/omlx
Jun Kim @jundotkim
Appreciate the shoutout and the thoughtful comparison. oMLX started from a simple frustration: waiting too long every time my coding agent shifted context. The SSD KV cache was built to fix that. Nice to see MTPLX taking a different angle with speculative decoding. Different problems, both worth solving.
Alex @AlexJonesax

Two open-source MLX inference servers worth knowing about if you run LLMs on Mac:
MTPLX (@youssofal): Uses a model's own MTP heads for speculative decoding. No draft model needed. ~63 tok/s on Qwen3.6-27B (M5Max). Mathematically exact sampling too; not just greedy prefix matching.
oMLX (@jundot): Tiered KV cache that persists to SSD across restarts. Huge for coding agents where you're sending the same codebase context repeatedly. Also serves LLMs, VLMs, embeddings, rerankers, and audio simultaneously.
They're solving different problems; MTPLX maximizes tok/s, oMLX maximizes workflow efficiency. Both have OpenAI- and Anthropic-compatible APIs, and both work with Claude Code/OpenCode/Cursor out of the box. Running both depending on the task, but both are worth checking out.

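On the MTPLX side of this comparison, speculative decoding has cheap draft tokens (here, from the model's own MTP heads) verified by the full model in a single pass. The toy sketch below shows only the common greedy-verification baseline; MTPLX's mathematically exact sampling uses a different acceptance rule that is not reproduced here, and all values are made up.

```python
# Toy greedy-verification step for speculative decoding (illustration only).
import numpy as np

def greedy_verify(draft_tokens, target_logits):
    """Accept drafted tokens while the target model's argmax agrees; on the first
    disagreement, substitute the target's own token and stop."""
    accepted = []
    for i, tok in enumerate(draft_tokens):
        target_choice = int(np.argmax(target_logits[i]))
        if tok == target_choice:
            accepted.append(tok)            # target agrees: keep the drafted token
        else:
            accepted.append(target_choice)  # disagreement: take the target's token, stop
            return accepted
    return accepted                         # all drafts accepted

# Made-up logits over a 5-token vocabulary for 3 drafted positions.
logits = np.array([
    [0.1, 0.9, 0.0, 0.0, 0.0],  # target's best token: 1
    [0.0, 0.0, 0.8, 0.1, 0.1],  # target's best token: 2
    [0.7, 0.1, 0.1, 0.1, 0.0],  # target's best token: 0
])
print(greedy_verify([1, 2, 3], logits))  # [1, 2, 0]: two drafts kept, third corrected
```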
Jun Kim @jundotkim
oMLX dev here!
- context shifts = when the prompt mutates mid-session, oMLX still serves a cache hit up to the matching prefix. Vanilla KV cache can be trimmed by most servers, but Gemma/Qwen3.5 use specialized cache layouts (sliding window, hybrid attn, arrayscache) where any divergence normally forces a full re-prefill. oMLX does partial hits across ALL cache types: critical for coding agents!
- compaction = Claude Code only auto-compacts at 200k. oMLX advertises a threshold based on the actual model's context window so it triggers at the right time for whatever you're running.
- tool result pruning = on slow hardware, an agent reading a huge file can take 15+ min. You cap the per-read size; the model sees a slice and asks for the next chunk only if needed.
Happy to go deeper on any of these!
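To make the "partial hit up to the matching prefix" point concrete: stored KV caches can be keyed by the token ids they were computed for, and a new request reuses the longest shared prefix so only the tail needs a fresh prefill. A toy sketch of that idea, not oMLX's actual data structures or cache layouts.

```python
# Toy prefix-reuse lookup (illustration only, not oMLX's implementation).

def common_prefix_len(a, b):
    """Length of the shared leading run of two token-id sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

class PrefixCache:
    def __init__(self):
        self._entries = {}  # tuple of token ids -> opaque KV-cache handle

    def put(self, tokens, kv_handle):
        self._entries[tuple(tokens)] = kv_handle

    def best_partial_hit(self, tokens):
        """Return (matched_length, handle) for the stored cache sharing the longest prefix."""
        best_len, best_handle = 0, None
        for cached_tokens, handle in self._entries.items():
            n = common_prefix_len(cached_tokens, tokens)
            if n > best_len:
                best_len, best_handle = n, handle
        return best_len, best_handle

# The agent mutates the middle of its prompt; the shared head is still reusable.
cache = PrefixCache()
cache.put([1, 2, 3, 4, 5], "kv-after-5-tokens")
hit_len, handle = cache.best_partial_hit([1, 2, 3, 9, 9, 9])
print(hit_len, handle)  # 3 kv-after-5-tokens -> trim to 3 tokens, prefill only the tail
```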
Mario Zechner @badlogicgames
baby steps with mlx server + Qwen3.6-27B. no prefix cache, at least i couldn't figure out how to get that to work. tool calls are parsed post-hoc after the full assistant message has been generated.
Esteban @breath_mirror
been spending the past days integrating @bstnxbt's DFlash into my fork of oMLX
96 t/s at 1024 tokens generated
now adding @0xClandestine's Mirror SD on top of it, let's see if it can boost the t/s
then i'll add @runsonai's DDTree
no idea where this will lead me but fun indeed
Jun Kim @jundotkim
@0xClandestine @breath_mirror @bstnxbt @runsonai Hey! I'm jundot, creator of oMLX. I do have an X account, just haven't been active here since I've been balancing a day job and maintaining oMLX on the side. Glad to see oMLX getting some love!
Jun Kim retweeted
Awni Hannun @awnihannun
There were some exceptionally cool demos from @ollama and omlx using MLX to run Qwen 3.5 and Gemma 4 on Apple silicon. The capabilities of local LLMs and the surrounding ecosystem have come a long way in the past couple years.
Todd Dailey @twid

Watching @awnihannun at @ollama

Jun Kim retweeted
GitHubDaily @GitHub_Daily
Deploying local LLMs on a Mac makes memory management a real hassle, and finding a solution that is both efficient and flexible isn't easy.
I recently came across oMLX on GitHub, an inference tool optimized specifically for Apple Silicon. It puts the LLM server right in the macOS menu bar, so you can manage it with a click.
The core highlight is the hot/cold tiered KV cache: frequently accessed context stays in memory, and whatever doesn't fit is automatically stored on the SSD. Even if you switch models or restart the server midway, the previous context is restored instantly, saving the long recomputation time.
GitHub: github.com/jundot/omlx
It ships with an intuitive web admin console that monitors memory in real time, exposes fine-grained parameters, and lets you search for and download all kinds of open-source models. There's a native Mac client you can install by drag-and-drop, plus one-command deployment via Homebrew. It's fully compatible with the OpenAI and Anthropic APIs, so it plugs straight into the chat apps we already use.
It makes the day-to-day experience of running local models very smooth, great for anyone who leans heavily on a Mac for LLMs.
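The hot/cold tiering described above can be pictured as an in-memory LRU that spills evicted entries to disk and reloads them on demand, which is also what survives a restart. A toy sketch under that assumption; a real KV cache such as oMLX's would store model tensors, not pickled Python objects.

```python
# Toy two-tier (RAM + disk) cache sketch (illustration only, not oMLX's implementation).
import os
import pickle
from collections import OrderedDict

class TieredCache:
    def __init__(self, spill_dir, max_hot=2):
        self.hot = OrderedDict()      # LRU of hot entries kept in memory
        self.max_hot = max_hot
        self.spill_dir = spill_dir
        os.makedirs(spill_dir, exist_ok=True)

    def _path(self, key):
        return os.path.join(self.spill_dir, f"{key}.pkl")

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.max_hot:            # evict coldest entry to disk
            cold_key, cold_val = self.hot.popitem(last=False)
            with open(self._path(cold_key), "wb") as f:
                pickle.dump(cold_val, f)

    def get(self, key):
        if key in self.hot:                            # hot path: already in RAM
            self.hot.move_to_end(key)
            return self.hot[key]
        if os.path.exists(self._path(key)):            # cold path: reload from disk
            with open(self._path(key), "rb") as f:
                value = pickle.load(f)
            self.put(key, value)
            return value
        return None

cache = TieredCache("/tmp/kv_spill", max_hot=2)
for i in range(4):
    cache.put(f"prompt-{i}", {"kv": list(range(i))})
print(cache.get("prompt-0"))  # was spilled to /tmp/kv_spill, restored on demand
```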
Jun Kim @jundotkim
oMLX owner here. Angelos is right — this is likely on our side, not MLX or MLX-LM. We've shipped numerous tool-calling fixes for small reasoning models since v0.2.6, and as of v0.2.16 (latest), oMLX guarantees token-level identical output with mlx-lm's BatchGenerator. Please try upgrading to v0.2.16 and let us know if the multi-turn tool calling degradation persists. @LotusDecoder Happy to dig in if you share a repro against oMLX.
Angelos Katharopoulos @angeloskath
@LotusDecoder Just got time to look at it. The 4-bit completes 70/70 rounds just fine using mlx_lm.server. It doesn't seem to be an MLX or MLX-LM issue, perhaps oMLX or something else in your env?
LotusDecoder @LotusDecoder
Ran another round of comparison experiments based on real-world testing and I think I've roughly figured it out. The main reason: the Qwen3.5 architecture is too new and MLX's quantization is lagging behind. For running claude code, neither MLX 4-bit nor 8-bit is really usable; you still have to use GGUF. Current conclusion: use GGUF quantization for coding productivity, and MLX quantization when you're chasing speed.
LotusDecoder @LotusDecoder

Never mind; the lower-tier qwen3.5 models aren't productive enough for claude code. MLX/Qwen3.5-35B-8bit is fast and reasonably smart, but it doesn't really hold up in claude code; it degrades easily, starting around turn 13 or so. And since a single task commonly hits the read / write / bash tools five or six times, it's basically very hard to use for real work.

Jun Kim @jundotkim
@kbhero21 The loading issue was likely because you picked the wrong model variant — make sure you're selecting a model with MLX in the name, as those are the ones optimized to run locally via the MLX framework.
Delia Dou @kbhero21
Learning from Cell and building in public.
1/ Today I completed 2 AI hands-on projects. The first one made me truly understand why people say the deepest layer of the AI era is: data + devices + compute.
Project 1: I tried running a Qwen model on my 16GB MacBook with omlx.
Reality check: my machine didn't have enough memory for the tutorial's Qwen 3.5 6B. At first, AI suggested Qwen 3.5 4B. I reset the model API, changed Max Tokens, adjusted Max Context Window... still failed.
In the end, I had to downgrade to Qwen 2.5-Coder-1.5B-Instruct-4bit. That finally worked. It was fast, but very limited:
– mostly only good for Q&A
– code could only appear in the output panel
– no real file reading or writing
Jun Kim retweeted
Michal Komar @michalkomar
oMLX might be the game changer for local LLM inference on Apple Silicon Macs. M5 Neural Accelerator support and disk caching included. github.com/jundot/omlx
Jun Kim retweeted
GitHub Projects Community @GithubProjects
Serve multiple LLMs locally on your Mac with optimized memory and caching
Jun Kim retweeted
Brian Roemmele @BrianRoemmele
Testing this now. Quite useful on my laptop: oMLX, an LLM inference server with continuous batching & SSD caching for Apple Silicon, managed from the macOS menu bar. github.com/jundot/omlx