Patrick Devine

1.4K posts

@pdev110

Software Guy @ Ollama

San Francisco Bay Area · Joined May 2014
226 Following · 853 Followers
Kris Krakowiak
Kris Krakowiak@KrakowiakK·
@awnihannun I am waiting for the moment when MLX will be ahead of vLLM.. might never happen ;)
English
1
0
2
193
Quant Capital
Quant Capital@QuantCapitalX·
Why are nvfp4 models not supported on the NVIDIA DGX-Spark? Should be running fine! "admin@dgx-spark:~$ ollama run qwen3.5:35b-a3b-nvfp4 pulling manifest Error: pull model manifest: 412: this model requires macOS"
English
2
0
0
972
ollama
ollama@ollama·
Ollama is now updated to run the fastest on Apple silicon, powered by MLX, Apple's machine learning framework. This change unlocks much faster performance to accelerate demanding work on macOS: - Personal assistants like OpenClaw - Coding agents like Claude Code, OpenCode, or Codex
English
286
728
5.8K
747.6K
Patrick Devine
Patrick Devine@pdev110·
@D_Twitt3r @ollama You _may_ be able to get them to import if you create a Modelfile in the safetensors directory and then use `ollama create --experimental -f <path/to/Modelfile>`. You can use the `--quantization` parameter to quantize to nvfp4/int4/etc.
English
0
0
1
43
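A rough sketch of the import flow described in the reply above; the directory and model name are placeholders, and the --experimental and --quantization flags are as named in the reply:

```
# Hypothetical paths and names; flags as described in the reply above.
cd ~/models/my-safetensors-model          # directory containing config.json + *.safetensors
printf 'FROM .\n' > Modelfile             # minimal Modelfile pointing at this directory
ollama create my-model --experimental -f Modelfile --quantization nvfp4
```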
D.
D.@D_Twitt3r·
@ollama Will this updated Ollama support other MLX and/or nvfp4 models downloaded from Hugging Face? Or do we need to wait for you to make more adjustments to them and post them in your own catalog?
English
2
0
1
1.4K
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
@ollama YES!!!!!!!! Thanks team! I'll benchmark this like crazy and send feedback!
English
2
0
21
1.3K
Patrick Devine
Patrick Devine@pdev110·
@micheltamanda @hxiao @0xRaghuboi Take a look through the docs in docs/development.md (specifically the MLX Engine (optional) section). You'll need to install Xcode and also run `xcodebuild -downloadComponent MetalToolchain` and then the correct cmake commands (they're covered in the doc).
English
0
0
0
61
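A condensed sketch of those steps from a clone of the ollama repo; the cmake invocation below is generic, and the MLX-specific flags it needs are the ones listed in docs/development.md (not reproduced here):

```
# See docs/development.md, "MLX Engine (optional)"; requires Xcode.
xcodebuild -downloadComponent MetalToolchain   # fetch the Metal toolchain
cmake -B build                                 # plus the MLX flags from the doc
cmake --build build
```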
Han Xiao
Han Xiao@hxiao·
uh..Qwen3.5-35B-A3B on llama.cpp re-prefills on every request, ~4x slower than it should be. anyone solved this? Thought people have happily deployed & used it locally? But if this is not solved yet, the perf is quite limited. Root cause: GDN layers are recurrent → pos_min tracks full sequence → but llama.cpp validates cache using an SWA threshold that defaults to 1 for non-SWA models → pos_min > 1 always true → cache always discarded → full re-prefill every time?
Han Xiao tweet media
English
33
27
271
27.1K
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti @dfi @0xSero Do you mean it works better than the stock affine quants? I think the problem is that the default group size is 64, which isn't super great.
English
0
0
0
39
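For context, a hedged illustration of the group-size knob using mlx-lm's converter (a separate tool from Ollama, assumed installed via pip; the model path is a placeholder):

```
# mlx-lm flags; quantize to 4-bit with a smaller group size than the default 64.
pip install mlx-lm
mlx_lm.convert --hf-path <hf-repo-or-local-path> -q --q-bits 4 --q-group-size 32
```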
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
My for mlx-openbench is working like a charm and crunching evals like crazy. My M3 Ultra are on fire! 🔥 Here testing Qwen3.5-35B-A3B-REAP-pile10k-15p-MLX-q8 by @dfi, created with reap-mlx by @0xSero. Don't ask me how this is possible, but the REAP model is working better on evals 👀
Ivan Fioravanti ᯅ tweet media
English
8
2
62
9.4K
Patrick Devine
Patrick Devine@pdev110·
@micheltamanda @hxiao @0xRaghuboi I just posted some updates to make it run a lot faster w/ Ollama w/ the MLX backend. Also, there are some caching changes coming too which should make it pretty zippy.
English
1
0
1
32
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti Do you know if oMLX sets the presence_penalty correctly? Qwen3.5 is super sensitive to this and tends to really overthink if the hyperparameters are set incorrectly.
English
1
0
2
447
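One way to rule out a server's defaults is to pin the sampler settings on the request itself. A minimal sketch against an OpenAI-compatible endpoint (Ollama's is shown; the model tag and values are placeholders, so check the model card for the recommended settings):

```
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen3.5",
        "messages": [{"role": "user", "content": "Hi"}],
        "presence_penalty": 1.0,
        "temperature": 0.7
      }'
```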
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
Yes, Qwen3.5-9B thinks a bit too much. 22 seconds to reply to Hi. Note: here I'm using oMLX (it seems really good!)
Ivan Fioravanti ᯅ tweet media
English
17
0
61
6.6K
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti @awnihannun I've definitely felt the "cognitive debt" issue already with LLMs writing something that "works", but it turns out it isn't the thing I actually asked it to write. The flip side is it's now really easy to just _ask_ the LLM to explain itself.
English
0
0
1
41
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
@awnihannun "As humans write less code our mental model of the underlying systems and algorithms will decay or never develop in the first place. This is cognitive debt." Very few people will be able to intervene in case of need, this could become a real issue in the mid term.
English
2
0
27
6K
Patrick Devine
Patrick Devine@pdev110·
@awnihannun I think the main difference between models and hw is that the cost of duplicating the weights is effectively zero. If you've got the hw, you can always run another copy of the model.
English
0
0
2
53
Awni Hannun
Awni Hannun@awnihannun·
In some ways AI models are more like hardware than software.
- Predictable scaling
- Immense capex to produce
- High per-unit margin (each token is profitable for inference)
- Finite shelf life. Shorter half-life than hardware, to be sure
- Updated on reasonably predetermined schedules. And at this point I think it’s fair to say those schedules are known at least a year, maybe multiple years, in advance
- Distinct tape-out (training), and after that you can only improve / fix them in post-training (firmware?)
I wonder if AI models will ever be sold in the same way devices are sold.
English
8
8
72
5K
Patrick Devine
Patrick Devine@pdev110·
@vickcodes Pro tip: We just added Ctrl-G (same as Claude and Codex) in `ollama run` so you can use your favourite editor to write long prompts.
English
1
0
0
24
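A quick way to try it, assuming the shortcut opens whatever editor $EDITOR points at (an assumption, not confirmed in the thread; the model tag is a placeholder):

```
# Assumption: Ctrl-G opens the editor named by $EDITOR.
export EDITOR=vim
ollama run qwen3.5    # at the prompt, press Ctrl-G to compose in the editor
```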
Vikas
Vikas@vickcodes·
👩‍🚀 Up and Running. 🚀 #llama #minimax
Vikas tweet media
ollama@ollama

❤️ We are partnering with @MiniMax_AI to give Ollama users free usage of MiniMax M2.5 for the next couple of days!
ollama run minimax-m2.5:cloud
Use MiniMax M2.5 with OpenCode, Claude Code, Codex, OpenClaw via ollama launch!
OpenCode: ollama launch opencode --model minimax-m2.5:cloud
Claude: ollama launch claude --model minimax-m2.5:cloud

English
2
0
8
978
Patrick Devine retweeted
CloudAI-X
CloudAI-X@cloudxdev·
New @ollama update is god's work btw. Give it a try. Just run `ollama` and that's it.
CloudAI-X tweet media
English
1
5
42
8.5K
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti In my testing on GLM 4.7 Flash, 4-bit affine quants with a high group setting often make it go off the rails (and I still see this even with mixed precision between the various tensors). 8-bit is definitely a lot more resilient.
English
0
0
1
62
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
Using 4-bit coding models locally gives you more generation speed, but less precision. This means the model and harness will keep iterating on wrongly created files to fix them. TL;DR: better to use 6-bit or above for local coding.
English
18
2
106
8.7K
Patrick Devine
Patrick Devine@pdev110·
@MrBerneker @ollama it's not in the UI yet, but does work in the CLI. Here's z-image-turbo running on the MLX backend in Ollama in an iterm2 terminal running on macOS:
Patrick Devine tweet media
English
1
0
1
97
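A hedged sketch of trying it from the CLI; the exact invocation for image models isn't shown in the thread, so treat the prompt-passing form (and the model tag) as an assumption:

```
# Assumption: the image model takes a prompt the same way text models do,
# and iTerm2 renders the resulting image inline.
ollama run z-image-turbo "a neon asteroids arcade screen"
```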
Brian Berneker
Brian Berneker@MrBerneker·
Hey @ollama, I have a feature request for you! I love that you added z-image-turbo and flux2-klein for use with Ollama, but it would be even better if you made your chat interface able to send it prompts and display generated images too! What do you think? Doable?
English
3
0
11
3.7K
Wassollichhier
Wassollichhier@wassollichhier·
@pdev110 @ollama Interesting, when I one-shot a Pong game with GLM 4.7 Flash, it made a "Neon" Pong as well. The model really likes neon style for old-school games ^^ Btw, I haven't asked Flash to make it neon.
English
1
0
0
138
Patrick Devine
Patrick Devine@pdev110·
1-shot prompt Asteroids game w/ GLM-4.7-Flash on the experimental MLX backend for @ollama. This was w/ 8-bit affine quantization on an MBP.
Patrick Devine tweet media
English
4
7
28
10.2K