galo

3.9K posts


@galogimenez

[email protected] | @HP | all things distributed | 🎿

Boise/Barcelona · Joined March 2009
722 Following · 245 Followers
galo reposted
Georgi Gerganov @ggerganov
I think the consensus is that Qwen3.5 is a step change, so atm I would recommend exploring that, given that it covers a range of sizes suitable for all devices.

Note that the main issues people currently unknowingly face with local models mostly revolve around the harness and some intricacies around model chat templates and prompt construction. Sometimes there are even pure inference bugs. From typing the task in the client to the actual result, there is a long chain of components that atm are not only fragile but also developed by different parties. So it's difficult to consolidate the entire stack, and you have to keep in mind that what you are currently observing is, with very high probability, still broken in some subtle way along that chain. But things are improving on all levels, and everything will become better across the board soon.

Best way to evaluate things IMO:
- Start with full-quality models that fit on your hardware
- Make sure you know what your harness actually does. For example, don't expect to hook Claude Code or Codex to some local model and the magic to happen. The developers of CC don't care (yet) whether it is compatible with Qwen3.5. Best is to write your own harness so you know what happens every step of the way (a minimal sketch follows this post). Or use llama-server's webui (we now have MCP support out of the box)
- When things start to click, look for optimizations to make it faster. Here is where you can start quantizing for speed or look for advice in the community on optimal parameters

So I can just say that on the low-level inference side, we will ship the right solution for sure. We still need to make the user-facing stack work better with local models. I'm hoping this will happen, though I feel less able to control that.

And to answer your question more directly, I've experimented with the following models and have found useful applications (mostly around chat, MCP and coding) with all of them:
- gpt-oss-120b
- Qwen3-Coder-30B
- GLM-4.7-Flash
- MiniMax-M2.5
- Qwen3.5-35B-A3B

With the exception of gpt-oss-120b and MiniMax-M2.5, I've used Q8_0 variants to keep most of the original quality.

Unfortunately, I am not familiar with tool-calling benchmarks specifically, so I cannot recommend one. From my PoV, as long as we make sure the fundamental inference computation is correct, tool-calling efficiency will depend just on:
- Model intelligence (something we do not control)
- Chat template parsing (something we are still actively improving on our end in llama.cpp)
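
Since the post recommends writing your own harness so you know what happens at every step, here is a minimal sketch of that idea: a direct client for llama-server's OpenAI-compatible chat endpoint, so every request and response stays visible. It assumes a llama-server instance already listening on the default port 8080; the prompt and sampling values are illustrative.

```python
# Minimal harness: talk directly to llama-server's OpenAI-compatible
# endpoint instead of trusting a third-party client to build prompts.
# Assumes `llama-server -m model.gguf` is running on the default port.
import requests

LLAMA_SERVER = "http://localhost:8080/v1/chat/completions"

def chat(messages, temperature=0.7):
    """Send one chat turn and return the assistant's reply text."""
    resp = requests.post(LLAMA_SERVER, json={
        "messages": messages,
        "temperature": temperature,
    })
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]

history = [{"role": "user", "content": "Explain what a chat template does."}]
reply = chat(history)
history.append({"role": "assistant", "content": reply})
print(reply)
```

From here, tool calling, MCP wiring, or request logging can be layered on explicitly, which is exactly the visibility the post argues for.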
11 replies · 33 reposts · 370 likes · 98.8K views
galo reposted
Georgi Gerganov @ggerganov
llama.cpp at 100k stars

Now that 90% of the code worldwide is being written by AI agents, I predict that within 3-6 months, 90% of all AI agents will be running locally with llama.cpp 😄

Jokes aside, I am going to use this small milestone as an opportunity to reflect a bit on the project and the state of AI from the perspective of local applications. There is a lot to say and discuss, and yet it feels less and less important to try to make a point. Opinions about the viability of local LLMs are strongly polarized, details are overlooked, the scientific approach is lacking. Arguments are predominantly based on vibes and hype waves. One thing is clear though: local LLMs are used more and more. I expect this trend to continue, and 2026 will likely end up being one of the most important years for the local AI movement.

I admit that I didn't expect the agentic era to come so quickly to the local LLM space. One year ago, the available models were too computationally expensive for long-context tasks. There wasn't an obvious path towards meaningful agentic applications. The memory and compute requirements were huge. Last summer, with the release of gpt-oss, things started to change. It was the first time we saw a glimpse of tool calling that actually works well within the resource constraints of our daily devices. Later in the year, even better models were released, and by now, useful local agentic workflows are a reality.

Comparing local vs hosted capabilities at a given moment in time is pointless. To try to put things into perspective:
- We don't need frontier intelligence to automate searches and send emails
- We don't need trillion-parameter models to summarize articles or technical documents
- We don't need massive GPU data centers to control our home appliances or turn the lights off in the garage

I believe there is a certain level of intelligence we as humans can comprehend and meaningfully utilize to improve our working process. Beyond that level, access to more intelligence becomes unnecessary at best and counterproductive at worst. I also believe that that level of useful artificial intelligence is completely within reach locally, and it has always been just a matter of implementing the right software stack to bring it to the end user. With llama.cpp, I am confident that we continue to be on the right track of building that software stack!

The llama.cpp project is going stronger than ever. With more than 1500 contributors, the project keeps growing steadily. From a technical point of view, I think that llama.cpp + ggml is the only solution that actually makes sense. That is, the software stack must run efficiently on every possible device, hardware and operating system. The technology is too important to be vendor-locked. It has to be developed in the open, by the community, together with the independent hardware vendors. This is the only right way to build something that will truly make a difference in the long run.

I won't try to convince you about what is currently and will be possible with local AI. We will just continue to build as usual. I am confident that after the smoke clears and we look objectively at what we have built together, the benefits will be obvious to everyone.

Big shoutout to all llama.cpp maintainers. I feel extremely lucky to be able to work together with so many talented contributors. Every day I learn something new, and I feel there is so much more cool stuff that we are going to build.

Also, I am really thankful that the project continues to have reliable partners supporting it! Cheers!
148 replies · 285 reposts · 2.1K likes · 188.3K views
galo reposted
Nick Frosst @nickfrosst
@cohere Transcribe: SOTA open-source transcription model running in the browser :) Weights on @huggingface, link below
60 replies · 130 reposts · 1.4K likes · 185.8K views
galo reposted
Claude @claudeai
You can now enable Claude to use your computer to complete tasks. It opens your apps, navigates your browser, fills in spreadsheets—anything you'd do sitting at your desk. Research preview in Claude Cowork and Claude Code, macOS only.
5K replies · 14.4K reposts · 139K likes · 77.5M views
galo reposted
Unsloth AI @UnslothAI
Introducing Unsloth Studio ✨ A new open-source web UI to train and run LLMs.
• Run models locally on Mac, Windows, Linux
• Train 500+ models 2x faster with 70% less VRAM
• Supports GGUF, vision, audio, embedding models
• Auto-create datasets from PDF, CSV, DOCX
• Self-healing tool calling and code execution
• Compare models side by side + export to GGUF
GitHub: github.com/unslothai/unsl…
Blog and Guide: unsloth.ai/docs/new/studio
Available now on Hugging Face, NVIDIA, Docker and Colab.
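
For context, the Studio UI sits on top of the existing unsloth Python library, so a rough programmatic equivalent of a "4-bit LoRA training" run looks something like the sketch below. The checkpoint id and LoRA hyperparameters are illustrative assumptions, not values from the announcement.

```python
# Rough library-level equivalent of a Studio training run, using the
# existing unsloth Python API. Checkpoint id and hyperparameters below
# are illustrative assumptions.
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Qwen3-Coder-30B",  # hypothetical checkpoint id
    max_seq_length=4096,
    load_in_4bit=True,  # 4-bit loading is where the big VRAM savings come from
)
model = FastLanguageModel.get_peft_model(
    model,
    r=16,  # LoRA rank: train small adapters instead of all weights
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_alpha=16,
)
# From here, training proceeds with a standard TRL SFTTrainer, and the
# result can be exported to GGUF for local inference.
```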
219 replies · 832 reposts · 5.2K likes · 1.6M views
galo reposted
Just a Dude Who Invests @DudeWhoInvests
This hour-long lecture by Nvidia $NVDA CEO Jensen Huang teaches you more about business than any college could…
6 replies · 96 reposts · 445 likes · 44.7K views
galo reposted
Connor Waslo @ConnorWaslo
we cut half of our product and it's so much better now. huge glow up and the makeover is just getting started.
26 replies · 24 reposts · 1.2K likes · 171.3K views
galo reposted
The Register @TheRegister
Anthropic's Claude Opus 4.6 spends $20K trying to write a C compiler dlvr.it/TQs4db
0 replies · 2 reposts · 6 likes · 2.2K views
galo reposted
antirez @antirez
Yesterday @MistralAI released an open-weights transcription model able to work in real time, Voxtral Mini 4B. Today, following the whisper.cpp lesson, here is a C inference pipeline, ready to use as a library. I hope you'll enjoy it: github.com/antirez/voxtra…
28 replies · 95 reposts · 974 likes · 54.5K views
galo reposted
Lukas Ziegler @lukas_m_ziegler
End-to-end neural networks racing drones in Abu Dhabi! 🚁 Check out the drone racing team from Delft University of Technology!

A completely end-to-end neural network solution, from pixels to direct motor commands. No Kalman filters. No computer vision feature detectors. Just neurons flying the drone.

The challenge is extreme. These drones fly at high speeds and need split-second decisions with minimal onboard resources: a single rolling-shutter camera and an IMU.

Their approach is called SkyDreamer, based on the Dreamer-v3 reinforcement learning algorithm. First, a world model is trained in simulation. Then, the neural network learns how to fly in its dreams through reinforcement learning. The network's internal state can be read out to see where it thinks it is on the track or how fast it's going. Even better, the drone estimates some of its own body characteristics during flight, like the camera angle relative to the body, eliminating time-consuming manual calibration.

The system uses only a single camera and the gyros from the IMU, ignoring the accelerometers, just like human FPV pilots do.

~~
♻️ Join the weekly robotics newsletter, and never miss any news → ziegler.substack.com
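
To make the "learns how to fly in its dreams" idea concrete, here is a toy sketch (in no way the SkyDreamer code) of learning in imagination: fit a world model on real transitions, then improve a controller using only imagined rollouts through that model. Dreamer-v3 uses a recurrent latent world model and actor-critic updates; both are reduced to linear models on a 1D system here just to show the structure.

```python
# Toy "learning in imagination": fit a world model on real transitions,
# then improve a controller using only rollouts through that model.
import numpy as np

rng = np.random.default_rng(0)

def real_env(s, a):
    # True dynamics, only touched while collecting data.
    return 0.9 * s + 0.5 * a + rng.normal(0.0, 0.01)

# 1) Collect real experience with a random policy.
S, A, S2 = [], [], []
s = 0.0
for _ in range(500):
    a = rng.uniform(-1.0, 1.0)
    s2 = real_env(s, a)
    S.append(s); A.append(a); S2.append(s2)
    s = s2

# 2) Fit a linear world model s' ~= w0*s + w1*a by least squares.
X = np.column_stack([S, A])
w, *_ = np.linalg.lstsq(X, np.array(S2), rcond=None)

# 3) Improve a proportional controller a = k*(target - s) purely "in
#    dreams": candidate gains are scored on imagined rollouts only.
def imagined_return(k, target=1.0, horizon=30):
    s, ret = 0.0, 0.0
    for _ in range(horizon):
        a = np.clip(k * (target - s), -1.0, 1.0)
        s = w[0] * s + w[1] * a        # dream step: model, never the env
        ret -= (target - s) ** 2       # reward = negative tracking error
    return ret

best_k = max(np.linspace(0.0, 5.0, 51), key=imagined_return)
print(f"model: s' = {w[0]:.2f}s + {w[1]:.2f}a, best dreamed gain k = {best_k:.1f}")
```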
74 replies · 420 reposts · 3.1K likes · 206.5K views
galo reposted
Zhijian Liu @zhijianliu_
Holiday cooking finally ready to serve! 🥳 Introducing DFlash — speculative decoding with block diffusion.
🚀 6.2× lossless speedup on Qwen3-8B
⚡ 2.5× faster than EAGLE-3
Diffusion vs AR doesn't have to be a fight. At today's stage:
• dLLMs = fast, highly parallel, but lossy
• AR LLMs = accurate, sequential, but slow
DFlash = diffusion drafts, AR verifies.
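
For readers unfamiliar with the pattern, here is a minimal sketch of the draft/verify loop that speculative decoding is built on, in its simple greedy-acceptance form. DFlash's contribution is drafting whole blocks in parallel with diffusion instead of an autoregressive draft model; `draft_model` and `target_model` below are placeholder callables, not the DFlash API.

```python
# Minimal draft/verify loop behind speculative decoding (greedy
# acceptance). DFlash would produce the whole draft block in parallel
# via diffusion; here the draft is sequential for simplicity.
def speculative_step(prefix, draft_model, target_model, k=8):
    # 1) Draft k tokens cheaply.
    seq, proposed = list(prefix), []
    for _ in range(k):
        t = draft_model(seq)
        proposed.append(t)
        seq.append(t)
    # 2) Verify all k positions with one target pass: the target scores
    #    prefix + proposed once and yields its own choice at each slot.
    target_choices = target_model(list(prefix), proposed)  # k + 1 choices
    # 3) Accept the longest agreeing prefix, then take the target's token
    #    at the first disagreement, so every step emits >= 1 token.
    accepted = []
    for i, t in enumerate(proposed):
        if target_choices[i] == t:
            accepted.append(t)
        else:
            accepted.append(target_choices[i])
            break
    else:
        accepted.append(target_choices[k])  # bonus token when all k match
    return accepted

# Toy check with integer "tokens": the target always counts up by one,
# the draft occasionally errs; the step emits 5 tokens (4 accepted
# drafts plus the target's correction).
draft = lambda seq: seq[-1] + (2 if len(seq) % 5 == 0 else 1)
target = lambda prefix, prop: [t + 1 for t in [prefix[-1]] + prop]
print(speculative_step([0], draft, target))  # -> [1, 2, 3, 4, 5]
```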
62 replies · 230 reposts · 1.8K likes · 211.2K views
galo reposted
Mistral AI @MistralAI
Introducing Voxtral Transcribe 2, next-gen speech-to-text models by @MistralAI. State-of-the-art transcription, speaker diarization, sub-200ms real-time latency. Details in 🧵
119 replies · 439 reposts · 3.9K likes · 653.1K views
galo reposted
The Register @TheRegister
HP CEO prints final page after six years, moves to PayPal dlvr.it/TQkXn7
1 reply · 4 reposts · 3 likes · 1.8K views
galo reposted
The Register @TheRegister
Marketing 'genius' destroyed a printer by trying to fix a paper jam dlvr.it/TQYvBc
2 replies · 2 reposts · 1 like · 1.6K views
galo reposted
Yawar Siddiqui @yawarnihal
Introducing ShapeR, a method for robust conditional 3D shape generation from casually captured sequences. ShapeR leverages a rectified flow transformer conditioned on per-object multimodal data to turn casual image sequences into full metric scene reconstructions. Project Page: facebookresearch.github.io/ShapeR Paper: arxiv.org/abs/2601.11514 Links to code and huggingface below ⬇️
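
As a reference point for "rectified flow", here is a generic Euler sampler sketch: rectified flow learns a velocity field v(x, t) whose near-straight integration paths carry noise at t=0 to data at t=1. ShapeR's actual velocity network is a conditional transformer over 3D shape latents, so `velocity` below is a stand-in.

```python
# Generic rectified-flow Euler sampler: integrate dx/dt = v(x, t)
# from Gaussian noise (t = 0) to a data sample (t = 1).
import numpy as np

def sample_rectified_flow(velocity, shape, steps=50, seed=0):
    """Euler-integrate the learned velocity field from x_0 ~ N(0, I)."""
    x = np.random.default_rng(seed).standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        x = x + velocity(x, i * dt) * dt  # one straight-line Euler step
    return x

# Sanity check: for a point-mass target x1, the ideal conditional field
# v(x, t) = (x1 - x) / (1 - t) drives any start to exactly x1 at t = 1.
x1 = np.ones(3)
out = sample_rectified_flow(lambda x, t: (x1 - x) / (1.0 - t), shape=3)
assert np.allclose(out, x1)
```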
17 replies · 143 reposts · 1K likes · 61.8K views
galo reposted
GPU MODE @GPU_MODE
Tomorrow, Jan 17 at 10am PST, we'll have @LoubnaBenAllal1 going over her book "The Smol Training Playbook: The Secrets to Building World-Class LLMs". It's a wonderful and comprehensive reference for those of us who care about open models. youtube.com/watch?v=y9zOZH…
7 replies · 29 reposts · 203 likes · 21.8K views