
Ollama 0.19 shipped this week with a meaningful architecture shift. The local inference engine now runs on Apple's MLX framework, and on M5-series chips the results are concrete: 1,851 tokens per second on prefill and 134 tokens per second on decode when running Qwen3.5-35B-A3B quantized to NVFP4.
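Those prefill and decode figures map directly onto the timing fields Ollama's `/api/generate` endpoint already reports (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). A minimal sketch of the arithmetic, using illustrative sample values rather than a real measurement:

```python
# Sketch: derive prefill and decode throughput from the timing fields
# in an Ollama /api/generate response. Durations are in nanoseconds.
# The sample dict below is illustrative, not a benchmark result.

def throughput(resp: dict) -> tuple[float, float]:
    """Return (prefill_tok_s, decode_tok_s) from Ollama response stats."""
    prefill = resp["prompt_eval_count"] / (resp["prompt_eval_duration"] / 1e9)
    decode = resp["eval_count"] / (resp["eval_duration"] / 1e9)
    return prefill, decode

# Hypothetical run: 1,024 prompt tokens prefilled in ~0.55 s,
# 256 tokens decoded in ~1.9 s.
sample = {
    "prompt_eval_count": 1024,
    "prompt_eval_duration": 550_000_000,
    "eval_count": 256,
    "eval_duration": 1_900_000_000,
}
prefill_tps, decode_tps = throughput(sample)
print(f"prefill: {prefill_tps:.0f} tok/s, decode: {decode_tps:.0f} tok/s")
```

In a real session you would feed `throughput` the JSON body returned with `"done": true`; the point is only that both headline numbers come straight from these four fields.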
The NVFP4 detail is the part practitioners should actually care about. NVIDIA's 4-bit floating point format is increasingly standard in production cloud deployments, and by supporting it locally, Ollama closes the gap between what runs on your machine and what runs in production. Quantization variance, the longstanding bugbear of local AI development, shrinks: you are no longer debugging a Q4_K_M artifact that behaves differently from the int4 deployment your team uses in production.
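Why the format matters is easy to see in miniature. Below is a toy sketch of block-scaled 4-bit float quantization in the spirit of NVFP4: values are snapped to the signed FP4 (E2M1) grid with a per-block scale. Real NVFP4 stores an FP8 scale per 16-element block; the full-precision scale and 8-element block here are simplifications for readability.

```python
# Toy block-scaled FP4 quantizer. Not NVIDIA's implementation --
# real NVFP4 uses 16-element blocks with FP8 (E4M3) scales.

E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # positive FP4 values

def quantize_block(xs: list[float]) -> tuple[list[float], float]:
    """Snap one block to the signed E2M1 grid; return (codes, scale)."""
    scale = max(abs(x) for x in xs) / 6.0 or 1.0  # map block max to 6.0
    codes = []
    for x in xs:
        mag = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(mag if x >= 0 else -mag)
    return codes, scale

def dequantize(codes: list[float], scale: float) -> list[float]:
    return [c * scale for c in codes]

block = [0.07, -0.31, 0.12, 0.55, -0.02, 0.44, -0.18, 0.29]
codes, s = quantize_block(block)
restored = dequantize(codes, s)
err = max(abs(a - b) for a, b in zip(block, restored))
print(f"max abs error: {err:.4f}")
```

The error is small but nonzero, and crucially it depends on the grid and scaling scheme: a Q4_K_M artifact and an NVFP4 artifact round the same weights differently, which is exactly the behavioral drift the format alignment removes.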
The catch sits where it always has with Apple Silicon: 32GB of unified memory minimum. This is not a democratic release. It is a performance release for users who already own the right hardware. Older Intel Macs, 16GB baseline machines, and anyone outside the Apple ecosystem see little change. For this tier of users, the benchmark numbers are largely academic.
That said, the trajectory matters. Apple Silicon's memory architecture has been theoretically ideal for LLM inference since the M1 launched, but the software stack never fully exploited it. Ollama's MLX work suggests the gap is finally narrowing. If this performance advantage holds for larger models, it raises a genuine question for the local AI community: is Apple quietly becoming the dominant inference platform for developers who can afford the ticket price?