
Sam Selvanathan
@samselvanathan_
I build AI agents, ship browser LLMs, and let AI drive my kid's toy car. engineer turned product manager, still building software, ex-PayPal, startups. SF.







I've been talking to AI models a lot, and I don't think they reason at a PhD level at all. They seem to be good at math-style problems, where you tell them A, B, and C are true, and then ask them to figure out D.

They're extremely bad at anything involving what I would call mature scholarship: cases where A, B, and C are only partially confirmed to various extents in the literature, and there are multiple conflicting, competing perspectives on what might be true. Here they reason like naive undergrads. They try to force everything into one box called "the truth."

If a framework is a standard part of their training data, like Bayesianism, they do seem able to write about things from that perspective. But if they need to construct perspectives on the fly, and keep track of competing frameworks based on a novel research direction, they easily get lost about who is saying what and why.

This is basic scholarship: the ability to apprehend the state of the literature on a given topic. It is literally the minimum of what you need to do to be a PhD-level scholar. And AI models are terrible at it.


Moonshot’s Kimi K2.6 is the new leading open-weights model. Kimi K2.6 lands at #4 on the Artificial Analysis Intelligence Index (54), behind only Anthropic, Google, and OpenAI (all 57).

Key takeaways:

➤ Increase in performance on agentic tasks: @Kimi_Moonshot's Kimi K2.6 achieves an Elo of 1520 on our GDPval-AA evaluation, a marked improvement over Kimi K2.5’s Elo of 1309. GDPval-AA is our leading metric for general agentic performance, measuring performance on knowledge-work tasks such as preparing presentations and analyses. Models are given code-execution and web-browsing tools in an agentic loop via our open-source reference agentic harness, Stirrup. Kimi K2.6 also continues the line's strength in tool use, maintaining a 96% score on τ²-Bench Telecom, which places it among the other frontier models in this category.

➤ Low hallucination rate: Kimi K2.6 scores 6 on the AA-Omniscience Index, our knowledge evaluation measuring both accuracy and hallucination rate. The score is primarily driven by a comparatively low hallucination rate of 39% (down from Kimi K2.5’s 65%), indicating a greater capability to abstain rather than fabricate when the model is uncertain. Kimi K2.6’s low hallucination rate places it alongside models such as Claude Opus 4.7 (36%) and MiniMax-M2.7 (34%).

➤ High token usage: Kimi K2.6 demonstrates high token usage, but is in line with other frontier models in the same intelligence tier. To run the full Artificial Analysis Intelligence Index, Kimi K2.6 used ~160M reasoning tokens. This is slightly lower than Claude Sonnet 4.6 (~190M reasoning tokens) but much higher than GPT 5.4 (~110M reasoning tokens).

➤ Open weights: Kimi K2.6 is a Mixture-of-Experts (MoE) model with 1T total parameters and 32B active, the same as the previous two generations, Kimi K2 Thinking and Kimi K2.5. Kimi K2.6 again pushes the open-weights frontier in intelligence.
➤ Third-party access: Kimi K2.6 is accessible through Moonshot’s first-party API as well as the third-party API providers Novita, Baseten, Fireworks, and Parasail.

➤ Multimodality: Kimi K2.6 natively supports image and video input and text output. The model’s max context length remains 256k.

Further analysis in the threads below.
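To put the GDPval-AA jump in perspective: assuming the evaluation uses the standard Elo formula on the usual 400-point logistic scale (an assumption; the post doesn't specify the scaling), the reported ratings can be converted into an expected head-to-head preference rate between the two Kimi generations. A minimal sketch:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of A vs B under the
    standard Elo model with a 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Kimi K2.6 (1520) vs Kimi K2.5 (1309), ratings as reported above.
p = elo_expected_score(1520, 1309)
print(f"K2.6 preferred over K2.5 in ~{p:.0%} of comparisons")
```

Under these assumptions, a 211-point gap corresponds to K2.6's output being preferred roughly three times out of four.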

Software used to be gated by roughly 20 million professional developers up until last year. Good ideas still needed engineers, co-founders, time, and months of app work. Now, anyone can build. ~ Wabi CEO Eugenia Kuyda




We've redesigned Claude Code on desktop. You can now run multiple Claude sessions side by side from one window, with a new sidebar to manage them all.









Run OpenClaw with Gemma 4 and Atomic Chat
MacBook Air M4 · 16 GB RAM · 25 tok/s
No cloud! No subscription fees!
Open-source local model. Runs on your regular device.
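For a feel of what the quoted ~25 tok/s means in practice, a quick back-of-the-envelope conversion from generation speed to wall-clock time (this ignores prompt-processing time, which adds latency before the first token):

```python
def generation_time_s(n_tokens: int, tokens_per_s: float = 25.0) -> float:
    """Seconds to generate n_tokens at a given decode speed,
    ignoring prompt processing and time-to-first-token."""
    return n_tokens / tokens_per_s

# A ~500-token reply at 25 tok/s takes about 20 seconds of decoding.
print(f"{generation_time_s(500):.0f} s")
```

So short chat turns feel near-interactive on-device, while long-form answers take tens of seconds.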


