galo

3.9K posts


@galogimenez

[email protected] | @HP | all things distributed | 🎿

Boise/Barcelona · Joined March 2009
722 Following · 245 Followers
galo reposted
Georgi Gerganov @ggerganov
I think the consensus is that Qwen3.5 is a step change, so atm I would recommend exploring that, given that it covers a range of sizes suitable for all devices.

Note that the main issues people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile but also developed by different parties. So it's difficult to consolidate the entire stack, and you have to keep in mind that what you are currently observing is, with very high probability, still broken in some subtle way along that chain. But things are improving on all levels, and everything will become better across the board soon.

Best way to evaluate things IMO:
- Start with full-quality models that fit on your hardware
- Make sure you know what your harness actually does. For example, don't expect to hook Claude Code or Codex to some local model and the magic to happen. The developers of CC don't care (yet) whether it is compatible with Qwen3.5. Best is to write your own harness so you know what happens every step of the way (a minimal sketch follows this post). Or use llama-server's webui (we now have MCP support out of the box)
- When things start to click, look for optimizations to make it faster. Here is where you can start quantizing for speed or look for advice in the community on optimal parameters

So I can just say that on the low-level inference side, we will ship the right solution for sure. We still need to make the user-facing stack work better with local models. I'm hoping this will happen, though I feel less able to control that.

And to answer your question more directly, I've experimented with the following models and have found useful applications (mostly around chat, MCP and coding) with all of them:
- gpt-oss-120b
- Qwen3-Coder-30B
- GLM-4.7-Flash
- MiniMax-M2.5
- Qwen3.5-35B-A3B

With the exception of gpt-oss-120b and MiniMax-M2.5, I've used Q8_0 variants to keep most of the original quality.

Unfortunately, I am not familiar with tool-calling benchmarks specifically, so I cannot recommend one. From my PoV, as long as we make sure the fundamental inference computation is correct, tool-calling efficiency will depend just on:
- Model intelligence (something we do not control)
- Chat template parsing (something we are still actively improving on our end in llama.cpp)
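
Since the post recommends writing your own harness so you know what happens at every step, here is a minimal sketch of that idea: a direct client for llama-server's OpenAI-compatible chat endpoint, so every request and response stays visible. It assumes a llama-server instance already listening on the default port 8080; the prompt and sampling values are illustrative.

```python
# Minimal harness: talk directly to llama-server's OpenAI-compatible
# endpoint instead of trusting a third-party client to build prompts.
# Assumes `llama-server -m model.gguf` is running on the default port.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"

def chat(messages, temperature=0.7):
    """Send one chat turn and return the assistant's reply text."""
    resp = requests.post(LLAMA_SERVER, json={
        "messages": messages,
        "temperature": temperature,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

history = [{"role": "user", "content": "Explain what a chat template does."}]
reply = chat(history)
history.append({"role": "assistant", "content": reply})
print(reply)
```

From here, tool calling, MCP wiring, or request logging can be layered on explicitly, which is exactly the visibility the post argues for.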
11 replies · 33 reposts · 370 likes · 98.8K views
galo reposted
Georgi Gerganov @ggerganov
llama.cpp at 100k stars

Now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄

Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on the project and the state of AI from the perspective of local applications. There is a lot to say and discuss, and yet it feels less and less important to try to make a point. Opinions about the viability of local LLMs are strongly polarized, details are overlooked, the scientific approach is lacking. Arguments are predominantly based on vibes and hype waves. One thing is clear though: local LLMs are used more and more. I expect this trend to continue, and 2026 will likely end up being one of the most important years for the local AI movement.

I admit that I didn't expect the agentic era to come so quickly to the local LLM space. One year ago, the available models were too computationally expensive for long-context tasks. There wasn't an obvious path towards meaningful agentic applications. The memory and compute requirements were huge. Last summer, with the release of gpt-oss, things started to change. It was the first time we saw a glimpse of tool calling that actually works well within the resource constraints of our daily devices. Later in the year, even better models were released, and by now, useful local agentic workflows are a reality.

Comparing local vs hosted capabilities at a given moment in time is pointless. To try to put things into perspective:
- We don't need frontier intelligence to automate searches and send emails
- We don't need trillion-parameter models to summarize articles or technical documents
- We don't need massive GPU data centers to control our home appliances or turn the lights off in the garage

I believe there is a certain level of intelligence we as humans can comprehend and meaningfully utilize to improve our working process. Beyond that level, access to more intelligence becomes unnecessary at best and counterproductive at worst. I also believe that that level of useful artificial intelligence is completely within reach locally, and it has always been just a matter of implementing the right software stack to bring it to the end user. With llama.cpp, I am confident that we continue to be on the right track of building that software stack!

The llama.cpp project is going stronger than ever. With more than 1500 contributors, the project keeps growing steadily. From a technical point of view, I think that llama.cpp + ggml is the only solution that actually makes sense. That is, the software stack must run efficiently on every possible device, hardware and operating system. The technology is too important to be vendor-locked. It has to be developed in the open, by the community, together with the independent hardware vendors. This is the only right way to build something that will truly make a difference in the long run.

I won't try to convince you about what is currently and will be possible with local AI. We will just continue to build as usual. I am confident that after the smoke clears and we look objectively at what we have built together, the benefits will be obvious to everyone.

Big shoutout to all llama.cpp maintainers. I feel extremely lucky to be able to work together with so many talented contributors. Every day I learn something new, and I feel there is so much more cool stuff that we are going to build.

Also, I am really thankful that the project continues to have reliable partners supporting it! Cheers!
148 replies · 285 reposts · 2.1K likes · 188.3K views
galo reposted
Nick Frosst @nickfrosst
@cohere Transcribe: SOTA open-source transcription model running in the browser :) Weights on @huggingface, link below
60 replies · 130 reposts · 1.4K likes · 185.8K views
galo reposted
Claude @claudeai
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.
5K replies · 14.4K reposts · 139K likes · 77.5M views
galo reposted
Unsloth AI @UnslothAI
Introducing Unsloth Studio ✨ A new open-source web UI to train and run LLMs.
• Run models locally on Mac, Windows, Linux
• Train 500+ models 2x faster with 70% less VRAM
• Supports GGUF, vision, audio, embedding models
• Auto-create datasets from PDF, CSV, DOCX
• Self-healing tool calling and code execution
• Compare models side by side + export to GGUF
GitHub: github.com/unslothai/unsl…
Blog and Guide: unsloth.ai/docs/new/studio
Available now on Hugging Face, NVIDIA, Docker and Colab.
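
For context, the Studio UI sits on top of the existing unsloth Python library, so a rough programmatic equivalent of a "4-bit LoRA training" run looks something like the sketch below. The checkpoint id and LoRA hyperparameters are illustrative assumptions, not values from the announcement.

```python
# Rough library-level equivalent of a Studio training run, using the
# existing unsloth Python API. Checkpoint id and hyperparameters below
# are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-30B",  # hypothetical checkpoint id
    max_seq_length=4096,
    load_in_4bit=True,  # 4-bit loading is where the big VRAM savings come from
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: train small adapters instead of all weights
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# From here, training proceeds with a standard TRL SFTTrainer, and the
# result can be exported to GGUF for local inference.
```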
219 replies · 832 reposts · 5.2K likes · 1.6M views
galo reposted
Just a Dude Who Invests @DudeWhoInvests
This hour-long lecture by Nvidia $NVDA CEO Jensen Huang teaches you more about business than any college could…
6 replies · 96 reposts · 445 likes · 44.7K views
galo reposted
Connor Waslo @ConnorWaslo
we cut half of our product and it's so much better now. huge glow up and the makeover is just getting started.
26 replies · 24 reposts · 1.2K likes · 171.3K views
galo reposted
The Register @TheRegister
Anthropic's Claude Opus 4.6 spends $20K trying to write a C compiler dlvr.it/TQs4db
0 replies · 2 reposts · 6 likes · 2.2K views
galo reposted
antirez @antirez
Yesterday @MistralAI released an open-weights transcription model able to work in real time, Voxtral Mini 4B. Today, following the whisper.cpp lesson, here is a C inference pipeline, ready to use as a library. I hope you'll enjoy it: github.com/antirez/voxtra…
28 replies · 95 reposts · 974 likes · 54.5K views
galo reposted
Lukas Ziegler @lukas_m_ziegler
End-to-end neural networks racing drones in Abu Dhabi! 🚁 Check out the drone racing team from Delft University of Technology!

A completely end-to-end neural network solution, from pixels to direct motor commands. No Kalman filters. No computer vision feature detectors. Just neurons flying the drone.

The challenge is extreme. These drones fly at high speeds and need split-second decisions with minimal onboard resources: a single rolling-shutter camera and an IMU.

Their approach is called SkyDreamer, based on the Dreamer-v3 reinforcement learning algorithm. First, a world model is trained in simulation. Then, the neural network learns how to fly in its dreams through reinforcement learning. The network's internal state can be read out to see where it thinks it is on the track or how fast it's going. Even better, the drone estimates some of its own body characteristics during flight, like the camera angle relative to the body, eliminating time-consuming manual calibration.

The system uses only a single camera and the gyros from the IMU, ignoring the accelerometers, just like human FPV pilots do.

~~
♻️ Join the weekly robotics newsletter, and never miss any news → ziegler.substack.com
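
To make the "learns how to fly in its dreams" idea concrete, here is a toy sketch (in no way the SkyDreamer code) of learning in imagination: fit a world model on real transitions, then improve a controller using only imagined rollouts through that model. Dreamer-v3 uses a recurrent latent world model and actor-critic updates; both are reduced to linear models on a 1D system here just to show the structure.

```python
# Toy "learning in imagination": fit a world model on real transitions,
# then improve a controller using only rollouts through that model.
import numpy as np

rng = np.random.default_rng(0)

def real_env(s, a):
    # True dynamics, only touched while collecting data.
    return 0.9 * s + 0.5 * a + rng.normal(0.0, 0.01)

# 1) Collect real experience with a random policy.
S, A, S2 = [], [], []
s = 0.0
for _ in range(500):
    a = rng.uniform(-1.0, 1.0)
    s2 = real_env(s, a)
    S.append(s); A.append(a); S2.append(s2)
    s = s2

# 2) Fit a linear world model s' ~= w0*s + w1*a by least squares.
X = np.column_stack([S, A])
w, *_ = np.linalg.lstsq(X, np.array(S2), rcond=None)

# 3) Improve a proportional controller a = k*(target - s) purely "in
#    dreams": candidate gains are scored on imagined rollouts only.
def imagined_return(k, target=1.0, horizon=30):
    s, ret = 0.0, 0.0
    for _ in range(horizon):
        a = np.clip(k * (target - s), -1.0, 1.0)
        s = w[0] * s + w[1] * a        # dream step: model, never the env
        ret -= (target - s) ** 2       # reward = negative tracking error
    return ret

best_k = max(np.linspace(0.0, 5.0, 51), key=imagined_return)
print(f"model: s' = {w[0]:.2f}s + {w[1]:.2f}a, best dreamed gain k = {best_k:.1f}")
```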
74 replies · 420 reposts · 3.1K likes · 206.5K views
galo reposted
Zhijian Liu @zhijianliu_
Holiday cooking finally ready to serve! 🥳 Introducing DFlash — speculative decoding with block diffusion.
🚀 6.2× lossless speedup on Qwen3-8B
⚡ 2.5× faster than EAGLE-3
Diffusion vs AR doesn't have to be a fight. At today's stage:
• dLLMs = fast, highly parallel, but lossy
• AR LLMs = accurate, sequential, but slow
DFlash = diffusion drafts, AR verifies.
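
For readers unfamiliar with the pattern, here is a minimal sketch of the draft/verify loop that speculative decoding is built on, in its simple greedy-acceptance form. DFlash's contribution is drafting whole blocks in parallel with diffusion instead of an autoregressive draft model; `draft_model` and `target_model` below are placeholder callables, not the DFlash API.

```python
# Minimal draft/verify loop behind speculative decoding (greedy
# acceptance). DFlash would produce the whole draft block in parallel
# via diffusion; here the draft is sequential for simplicity.
def speculative_step(prefix, draft_model, target_model, k=8):
    # 1) Draft k tokens cheaply.
    seq, proposed = list(prefix), []
    for _ in range(k):
        t = draft_model(seq)
        proposed.append(t)
        seq.append(t)
    # 2) Verify all k positions with one target pass: the target scores
    #    prefix + proposed once and yields its own choice at each slot.
    target_choices = target_model(list(prefix), proposed)  # k + 1 choices
    # 3) Accept the longest agreeing prefix, then take the target's token
    #    at the first disagreement, so every step emits >= 1 token.
    accepted = []
    for i, t in enumerate(proposed):
        if target_choices[i] == t:
            accepted.append(t)
        else:
            accepted.append(target_choices[i])
            break
    else:
        accepted.append(target_choices[k])  # bonus token when all k match
    return accepted

# Toy check with integer "tokens": the target always counts up by one,
# the draft occasionally errs; the step emits 5 tokens (4 accepted
# drafts plus the target's correction).
draft = lambda seq: seq[-1] + (2 if len(seq) % 5 == 0 else 1)
target = lambda prefix, prop: [t + 1 for t in [prefix[-1]] + prop]
print(speculative_step([0], draft, target))  # -> [1, 2, 3, 4, 5]
```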
62 replies · 230 reposts · 1.8K likes · 211.2K views
galo reposted
Mistral AI @MistralAI
Introducing Voxtral Transcribe 2, next-gen speech-to-text models by @MistralAI. State-of-the-art transcription, speaker diarization, sub-200ms real-time latency. Details in 🧵
119 replies · 439 reposts · 3.9K likes · 653.1K views
galo reposted
The Register @TheRegister
HP CEO prints final page after six years, moves to PayPal dlvr.it/TQkXn7
1 reply · 4 reposts · 3 likes · 1.8K views
galo reposted
The Register @TheRegister
Marketing 'genius' destroyed a printer by trying to fix a paper jam dlvr.it/TQYvBc
2 replies · 2 reposts · 1 like · 1.6K views
galo reposted
Yawar Siddiqui @yawarnihal
Introducing ShapeR, a method for robust conditional 3D shape generation from casually captured sequences. ShapeR leverages a rectified flow transformer conditioned on per-object multimodal data to turn casual image sequences into full metric scene reconstructions. Project Page: facebookresearch.github.io/ShapeR Paper: arxiv.org/abs/2601.11514 Links to code and huggingface below ⬇️
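
As a reference point for "rectified flow", here is a generic Euler sampler sketch: rectified flow learns a velocity field v(x, t) whose near-straight integration paths carry noise at t=0 to data at t=1. ShapeR's actual velocity network is a conditional transformer over 3D shape latents, so `velocity` below is a stand-in.

```python
# Generic rectified-flow Euler sampler: integrate dx/dt = v(x, t)
# from Gaussian noise (t = 0) to a data sample (t = 1).
import numpy as np

def sample_rectified_flow(velocity, shape, steps=50, seed=0):
    """Euler-integrate the learned velocity field from x_0 ~ N(0, I)."""
    x = np.random.default_rng(seed).standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + velocity(x, i * dt) * dt  # one straight-line Euler step
    return x

# Sanity check: for a point-mass target x1, the ideal conditional field
# v(x, t) = (x1 - x) / (1 - t) drives any start to exactly x1 at t = 1.
x1 = np.ones(3)
out = sample_rectified_flow(lambda x, t: (x1 - x) / (1.0 - t), shape=3)
assert np.allclose(out, x1)
```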
17 replies · 143 reposts · 1K likes · 61.8K views
galo reposted
GPU MODE @GPU_MODE
Tomorrow, Jan 17 at 10am PST, we'll have @LoubnaBenAllal1 going over her book "The Smol Training Playbook: The Secrets to Building World-Class LLMs". It's a wonderful and comprehensive reference for those of us who care about open models. youtube.com/watch?v=y9zOZH…
7 replies · 29 reposts · 203 likes · 21.8K views