Patrick Devine

1.4K posts

@pdev110

Software Guy @ Ollama

San Francisco Bay Area · Joined May 2014
226 Following · 853 Followers
Kris Krakowiak
Kris Krakowiak@KrakowiakK·
@awnihannun I am waiting for the moment when MLX will be ahead of vLLM.. might never happen ;)
English
1
0
2
193
Quant Capital
Quant Capital@QuantCapitalX·
Why are nvfp4 models not supported on the NVIDIA DGX-Spark? Should be running fine! "admin@dgx-spark:~$ ollama run qwen3.5:35b-a3b-nvfp4 pulling manifest Error: pull model manifest: 412: this model requires macOS"
English
2
0
0
972
ollama
ollama@ollama·
Ollama is now updated to run the fastest on Apple silicon, powered by MLX, Apple's machine learning framework. This change unlocks much faster performance to accelerate demanding work on macOS: - Personal assistants like OpenClaw - Coding agents like Claude Code, OpenCode, or Codex
English
286
728
5.8K
747.6K
Patrick Devine
Patrick Devine@pdev110·
@D_Twitt3r @ollama You _may_ be able to get them to import if you create a Modelfile in the safetensors directory and then use `ollama create --experimental -f <path/to/Modelfile>`. You can use the `--quantization` parameter to quantize to nvfp4/int4/etc.
English
0
0
1
43
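A rough sketch of the import flow described in the reply above; the directory and model name are placeholders, and the --experimental and --quantization flags are as named in the reply:

```
# Hypothetical paths and names; flags as described in the reply above.
cd ~/models/my-safetensors-model          # directory containing config.json + *.safetensors
printf 'FROM .\n' > Modelfile             # minimal Modelfile pointing at this directory
ollama create my-model --experimental -f Modelfile --quantization nvfp4
```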
D.
D.@D_Twitt3r·
@ollama Will this updated Ollama support other MLX and/or nvfp4 models downloaded from Hugging Face? Or do we need to wait for you to make more adjustments to them and post them in your own catalog?
English
2
0
1
1.4K
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
@ollama YES!!!!!!!! Thanks team! I'll benchmark this like crazy and send feedback!
English
2
0
21
1.3K
Patrick Devine
Patrick Devine@pdev110·
@micheltamanda @hxiao @0xRaghuboi Take a look through the docs in docs/development.md (specifically the MLX Engine (optional) section). You'll need to install Xcode and also run `xcodebuild -downloadComponent MetalToolchain` and then the correct cmake commands (they're covered in the doc).
English
0
0
0
61
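A condensed sketch of those steps from a clone of the ollama repo; the cmake invocation below is generic, and the MLX-specific flags it needs are the ones listed in docs/development.md (not reproduced here):

```
# See docs/development.md, "MLX Engine (optional)"; requires Xcode.
xcodebuild -downloadComponent MetalToolchain   # fetch the Metal toolchain
cmake -B build                                 # plus the MLX flags from the doc
cmake --build build
```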
Han Xiao
Han Xiao@hxiao·
uh..Qwen3.5-35B-A3B on llama.cpp re-prefills on every request, ~4x slower than it should be. anyone solved this? Thought people have happily deployed & used it locally? But if this is not solved yet, the perf is quite limited. Root cause: GDN layers are recurrent → pos_min tracks full sequence → but llama.cpp validates cache using an SWA threshold that defaults to 1 for non-SWA models → pos_min > 1 always true → cache always discarded → full re-prefill every time?
Han Xiao tweet media
English
33
27
271
27.1K
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti @dfi @0xSero Do you mean it works better than the stock affine quants? I think the problem is that the default group size is 64, which isn't super great.
English
0
0
0
39
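For context, a hedged illustration of the group-size knob using mlx-lm's converter (a separate tool from Ollama, assumed installed via pip; the model path is a placeholder):

```
# mlx-lm flags; quantize to 4-bit with a smaller group size than the default 64.
pip install mlx-lm
mlx_lm.convert --hf-path <hf-repo-or-local-path> -q --q-bits 4 --q-group-size 32
```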
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
My for mlx-openbench is working like a charm and crunching evals like crazy. My M3 Ultra are on fire! 🔥 Here testing Qwen3.5-35B-A3B-REAP-pile10k-15p-MLX-q8 by @dfi, created with reap-mlx by @0xSero. Don't ask me how this is possible, but the REAP model is working better on evals 👀
Ivan Fioravanti ᯅ tweet media
English
8
2
62
9.4K
Patrick Devine
Patrick Devine@pdev110·
@micheltamanda @hxiao @0xRaghuboi I just posted some updates to make it run a lot faster w/ Ollama w/ the MLX backend. Also, there are some caching changes coming too which should make it pretty zippy.
English
1
0
1
32
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti Do you know if oMLX sets the presence_penalty correctly? Qwen3.5 is super sensitive to this and tends to really overthink if the hyperparameters are set incorrectly.
English
1
0
2
447
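One way to rule out a server's defaults is to pin the sampler settings on the request itself. A minimal sketch against an OpenAI-compatible endpoint (Ollama's is shown; the model tag and values are placeholders, so check the model card for the recommended settings):

```
curl -s http://localhost:11434/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
        "model": "qwen3.5",
        "messages": [{"role": "user", "content": "Hi"}],
        "presence_penalty": 1.0,
        "temperature": 0.7
      }'
```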
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
Yes, Qwen3.5-9B thinks a bit too much. 22 seconds to reply to Hi. Note: here I'm using oMLX (it seems really good!)
Ivan Fioravanti ᯅ tweet media
English
17
0
61
6.6K
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti @awnihannun I've definitely felt the "cognitive debt" issue already with LLMs writing something that "works", but it turns out it isn't the thing I actually asked it to write. The flip side is it's now really easy to just _ask_ the LLM to explain itself.
English
0
0
1
41
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
@awnihannun "As humans write less code our mental model of the underlying systems and algorithms will decay or never develop in the first place. This is cognitive debt." Very few people will be able to intervene in case of need, this could become a real issue in the mid term.
English
2
0
27
6K
Patrick Devine
Patrick Devine@pdev110·
@awnihannun I think the main difference between models and hw is that the cost of duplicating the weights is effectively zero. If you've got the hw, you can always run another copy of the model.
English
0
0
2
53
Awni Hannun
Awni Hannun@awnihannun·
In some ways AI models are more like hardware than software.
- Predictable scaling
- Immense capex to produce
- High per-unit margin (each token is profitable for inference)
- Finite shelf life. Shorter half-life than hardware, to be sure
- Updated on reasonably predetermined schedules. And at this point I think it’s fair to say those schedules are known at least a year, maybe multiple years, in advance
- Distinct tape-out (training), and after that you can only improve / fix them in post-training (firmware?)
I wonder if AI models will ever be sold in the same way devices are sold.
English
8
8
72
5K
Patrick Devine
Patrick Devine@pdev110·
@vickcodes Pro tip: We just added Ctrl-G (same as Claude and Codex) in `ollama run` so you can use your favourite editor to write long prompts.
English
1
0
0
24
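A quick way to try it, assuming the shortcut opens whatever editor $EDITOR points at (an assumption, not confirmed in the thread; the model tag is a placeholder):

```
# Assumption: Ctrl-G opens the editor named by $EDITOR.
export EDITOR=vim
ollama run qwen3.5    # at the prompt, press Ctrl-G to compose in the editor
```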
Vikas
Vikas@vickcodes·
👩‍🚀 Up and Running. 🚀 #llama #minimax
Vikas tweet media
ollama@ollama

❤️ We are partnering with @MiniMax_AI to give Ollama users free usage of MiniMax M2.5 for the next couple of days!
ollama run minimax-m2.5:cloud
Use MiniMax M2.5 with OpenCode, Claude Code, Codex, OpenClaw via ollama launch!
OpenCode: ollama launch opencode --model minimax-m2.5:cloud
Claude: ollama launch claude --model minimax-m2.5:cloud

English
2
0
8
978
Patrick Devine retweeted
CloudAI-X
CloudAI-X@cloudxdev·
New @ollama update is god's work btw. Give it a try. Just run `ollama` and that's it.
CloudAI-X tweet media
English
1
5
42
8.5K
Patrick Devine
Patrick Devine@pdev110·
@ivanfioravanti In my testing on GLM 4.7 Flash, 4-bit affine quants with a high group setting often make it go off the rails (and I still see this even with mixed precision between the various tensors). 8-bit is definitely a lot more resilient.
English
0
0
1
62
Ivan Fioravanti ᯅ
Ivan Fioravanti ᯅ@ivanfioravanti·
Using 4-bit coding models locally gives you more generation speed, but less precision. This means the model and harness will keep iterating on wrongly created files to fix them. TL;DR: better to use 6-bit or above for local coding.
English
18
2
106
8.7K
Patrick Devine
Patrick Devine@pdev110·
@MrBerneker @ollama it's not in the UI yet, but does work in the CLI. Here's z-image-turbo running on the MLX backend in Ollama in an iterm2 terminal running on macOS:
Patrick Devine tweet media
English
1
0
1
97
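A hedged sketch of trying it from the CLI; the exact invocation for image models isn't shown in the thread, so treat the prompt-passing form (and the model tag) as an assumption:

```
# Assumption: the image model takes a prompt the same way text models do,
# and iTerm2 renders the resulting image inline.
ollama run z-image-turbo "a neon asteroids arcade screen"
```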
Brian Berneker
Brian Berneker@MrBerneker·
Hey @ollama, I have a feature request for you! I love that you added z-image-turbo and flux2-klein for use with Ollama, but it would be even better if you made your chat interface able to send it prompts and display generated images too! What do you think? Doable?
English
3
0
11
3.7K
Wassollichhier
Wassollichhier@wassollichhier·
@pdev110 @ollama Interesting, when I one-shot a Pong game with GLM 4.7 Flash, it made a "Neon" Pong as well. The model really likes neon style for old-school games ^^ Btw, I haven't asked Flash to make it neon.
English
1
0
0
138
Patrick Devine
Patrick Devine@pdev110·
1-shot prompt Asteroids game w/ GLM-4.7-Flash on the experimental MLX backend for @ollama. This was w/ 8-bit affine quantization on an MBP.
Patrick Devine tweet media
English
4
7
28
10.2K