murat 🍥


what if i told you... computer use can be faster on local models
moondream3's photon update today adds mac support: it can see your screen and use it with 1s latency, ty @vikhyatk
here we have whisper+qwen+moondream triple model pipeline working offline flawlessly
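the whisper+qwen+moondream handoff can be sketched as a three-stage loop: speech-to-text, then a small LLM picks an action, then a VLM grounds the target to screen coordinates. a minimal sketch; all names and signatures here are assumptions, and the three models are injected as plain callables rather than any specific runtime:

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

@dataclass
class Action:
    kind: str                              # "click", "hotkey", "type", ...
    target: str                            # UI element description for the VLM
    point: Optional[Tuple[int, int]] = None

def run_turn(transcribe: Callable[[bytes], str],
             plan: Callable[[str], Action],
             locate: Callable[[str], Tuple[int, int]],
             audio: bytes) -> Action:
    # 1. speech-to-text (e.g. a local whisper model)
    text = transcribe(audio)
    # 2. small LLM maps the transcript to an action (e.g. qwen)
    action = plan(text)
    # 3. VLM grounds the click target to coordinates (e.g. moondream)
    if action.kind == "click" and action.point is None:
        action.point = locate(action.target)
    return action
```

keeping the orchestration model-agnostic like this means each stage can be swapped (parakeet for whisper, gemma for qwen) without touching the loop.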

@nuwandavek @vikhyatk gemma could be better at certain tasks or prompts too i wouldn't be surprised

@mayfer @vikhyatk nice! i was playing around with a similarish idea (finding all untagged basketball/tennis courts in sf on google maps by browsing around)
gemma4 + efficientsam3 was hilariously good for the size. will try qwen!
github.com/SimonZeng7108/…

@Laythe_li_suwi @vikhyatk converting the user's command to actions, in this case it's just "click <item>" but it can do more things like use keyboard shortcuts, applescript, or type things directly
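the command-to-action step described here could be a small dispatcher over macOS automation tools. a sketch under assumptions (cliclick for mouse events, osascript for keys and text; the action schema is made up), returning the argv it would run instead of executing it, for illustration:

```python
def to_command(action: dict) -> list:
    """Map a parsed LLM action to the argv of a macOS automation tool.
    Hypothetical schema: {"kind": "click"|"hotkey"|"type", ...}."""
    kind = action["kind"]
    if kind == "click":
        x, y = action["point"]
        # cliclick's c:x,y syntax clicks at absolute screen coordinates
        return ["cliclick", f"c:{x},{y}"]
    if kind == "hotkey":
        # e.g. {"kind": "hotkey", "key": "s"} -> cmd+s via System Events
        return ["osascript", "-e",
                f'tell application "System Events" to keystroke "{action["key"]}" using command down']
    if kind == "type":
        return ["osascript", "-e",
                f'tell application "System Events" to keystroke "{action["text"]}"']
    raise ValueError(f"unknown action kind: {kind}")
```

in a real harness these would go through subprocess.run; returning argv keeps the mapping testable.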

@nuwandavek @vikhyatk i've tried basically every small llm in existence and settled on qwen3.5 4B at q4_k_m
there may be different prompts that make gemma work too but with my prompts gemma was unusably worse. qwen has really great general ability

@mayfer @yacineMTB @vikhyatk I wonder whether it uses a screenshot loop to predict the coordinates, or accessibility trees?
If you want your LLMs to control the desktop via accessibility trees, fully headless, you can use this:
github.com/lahfir/agent-d…

@tombielecki @vikhyatk whisper-large-v3-turbo or whatever it's called. it's hands down the best, especially with a text prompt prefix to set the context. if you need something faster due to hardware restrictions, parakeet or apple speech recognition are ok
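the "text prompt prefix" trick maps to openai-whisper's `initial_prompt` parameter on `transcribe`, which biases decoding toward expected vocabulary. a tiny helper, with the app name and word list purely illustrative:

```python
def build_initial_prompt(app: str, vocab: list) -> str:
    """Context prefix biasing Whisper toward the words actually on screen."""
    return f"Voice commands for {app}. Likely words: " + ", ".join(vocab)

# with the openai-whisper package this plugs in as:
#   model = whisper.load_model("large-v3-turbo")
#   result = model.transcribe(
#       "command.wav",
#       initial_prompt=build_initial_prompt("Finder", ["Downloads", "Trash"]))
```

seeding the decoder with on-screen labels helps most with proper nouns and UI strings the model would otherwise mis-hear.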

@mayfer Depends on how detailed/accurate you need the image analysis to be and how fast you need the loop. For live computer-use tasks local will prob win out, but for detailed aesthetics / design analysis or accurate text extraction, cloud will be ahead / faster for a while imo.

yeah this is super significant
it's absurdly bad capital allocation to run low latency AI image processing on cloud.
local computer use will win really hard
vik@vikhyatk
Running on Apple Silicon will never be as fast as an H100. But for interactive workloads like computer use, wall-clock latency is dominated by the network, not the accelerator. Skipping large image uploads buys you more than the H100 buys back. x.com/mayfer/status/…
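the network-dominated claim is easy to sanity check: shipping a screenshot to a cloud GPU costs transfer time plus a round trip before any inference runs. back-of-envelope, with all the numbers illustrative:

```python
def upload_seconds(image_mb: float, uplink_mbps: float, rtt_ms: float) -> float:
    """Wall-clock cost just to move one screenshot to a remote accelerator:
    serialization time over the uplink plus one network round trip."""
    return image_mb * 8 / uplink_mbps + rtt_ms / 1000

# a 2 MB retina screenshot over a 20 Mbps uplink with 40 ms RTT:
cloud_overhead = upload_seconds(2.0, 20.0, 40.0)   # 0.84 s before inference starts
```

that overhead alone can exceed the whole local loop's latency budget, which is the point of the tweet.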

@joodalooped it helps but doesn't solve it until you make text unreadable

the harness is GoatRemote goatremote.com
it has an absurdly optimized qwen3.5 pipeline: the LLM call that decides which action to take from the user's request runs in 300ms
imo forget traditional agent harnesses, they can't achieve this kind of latency, they're not built for it

@mayfer @vikhyatk What harness does it use to achieve 1s latency? In my tests, Desktop Control CLI achieves ~400-500ms for local perception, but you also need to add LLM call latency. See demo:
x.com/yaroshevych/st…
Oleg 🇺🇦@yaroshevych
I learned to appreciate fast models: Mercury model by @_inception_ai, driven by @opencode via @OpenRouter. By the numbers, #DesktopCtl took under 600ms for most UI operations (mostly driven by OCR cost), while model latency was under 2-3 sec. @lmstudio was used for demo purposes.