
Sam Selvanathan
@samselvanathan_
I build AI agents, ship browser LLMs, and let AI drive my kid's toy car. engineer turned product manager, still building software, ex-PayPal, startups. SF.







I've been talking to AI models a lot, and I don't think they reason at a PhD level at all. They seem to be good at math-style problems, where you tell them A, B, and C are true, and then ask them to figure out D.

They're extremely bad at anything involving what I would call mature scholarship: cases where A, B, and C are only partially confirmed to various extents in the literature, and there are multiple conflicting, competing perspectives on what might be true. Here they reason like naive undergrads. They try to force everything into one box called "the truth."

If a framework is a standard part of their training data, like Bayesianism, they do seem able to write about things from that perspective. But if they need to construct perspectives on the fly, and keep track of competing frameworks based on a novel research direction, they easily get lost about who is saying what and why.

This is basic scholarship: the ability to apprehend the state of the literature on a given topic. It is literally the minimum of what you need to do to be a PhD-level scholar. And AI models are terrible at it.


Moonshot’s Kimi K2.6 is the new leading open-weights model. Kimi K2.6 lands at #4 on the Artificial Analysis Intelligence Index (54), behind only Anthropic, Google, and OpenAI (all 57).

Key takeaways:

➤ Increase in performance on agentic tasks: @Kimi_Moonshot's Kimi K2.6 achieves an Elo of 1520 on our GDPval-AA evaluation, a marked improvement over Kimi K2.5’s Elo of 1309. GDPval-AA is our leading metric for general agentic performance, measuring performance on knowledge-work tasks such as preparing presentations and analyses. Models are given code-execution and web-browsing tools in an agentic loop via our open-source reference agentic harness, Stirrup. Kimi K2.6 also continues the line's strength in tool use, maintaining a 96% score on τ²-Bench Telecom, which places it among the other frontier models in this category.

➤ Low hallucination rate: Kimi K2.6 scores 6 on the AA-Omniscience Index, our knowledge evaluation measuring both accuracy and hallucination rate. The score is primarily driven by a comparatively low hallucination rate of 39% (down from Kimi K2.5’s 65%), indicating a greater capability to abstain rather than fabricate when the model is uncertain. Kimi K2.6’s low hallucination rate places it alongside models such as Claude Opus 4.7 (36%) and MiniMax-M2.7 (34%).

➤ High token usage: Kimi K2.6 demonstrates high token usage, but is in line with other frontier models in the same intelligence tier. To run the full Artificial Analysis Intelligence Index, Kimi K2.6 used ~160M reasoning tokens. This is slightly lower than Claude Sonnet 4.6 (~190M reasoning tokens) but much higher than GPT 5.4 (~110M reasoning tokens).

➤ Open weights: Kimi K2.6 is a Mixture-of-Experts (MoE) model with 1T total parameters and 32B active, the same as the previous two generations, Kimi K2 Thinking and Kimi K2.5. Kimi K2.6 again pushes the open-weights frontier in intelligence.
➤ Third-party access: Kimi K2.6 is accessible through Moonshot’s first-party API as well as the third-party API providers Novita, Baseten, Fireworks, and Parasail.

➤ Multimodality: Kimi K2.6 natively supports image and video input and text output. The model’s max context length remains 256k.

Further analysis in the threads below.
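To put the GDPval-AA jump in perspective: assuming the evaluation uses the standard Elo formula on the usual 400-point logistic scale (an assumption; the post doesn't specify the scaling), the reported ratings can be converted into an expected head-to-head preference rate between the two Kimi generations. A minimal sketch:

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability) of A vs B under the
    standard Elo model with a 400-point logistic scale."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# Kimi K2.6 (1520) vs Kimi K2.5 (1309), ratings as reported above.
p = elo_expected_score(1520, 1309)
print(f"K2.6 preferred over K2.5 in ~{p:.0%} of comparisons")
```

Under these assumptions, a 211-point gap corresponds to K2.6's output being preferred roughly three times out of four.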

Software used to be gated by roughly 20 million professional developers up until last year. Good ideas still needed engineers, co-founders, time, and months of app work. Now, anyone can build. ~ Wabi CEO Eugenia Kuyda




We've redesigned Claude Code on desktop. You can now run multiple Claude sessions side by side from one window, with a new sidebar to manage them all.









Run OpenClaw with Gemma 4 and Atomic Chat
MacBook Air M4 · 16 GB RAM · 25 tok/s
No cloud! No subscription fees!
Open-source local model. Runs on your regular device.
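For a feel of what the quoted ~25 tok/s means in practice, a quick back-of-the-envelope conversion from generation speed to wall-clock time (this ignores prompt-processing time, which adds latency before the first token):

```python
def generation_time_s(n_tokens: int, tokens_per_s: float = 25.0) -> float:
    """Seconds to generate n_tokens at a given decode speed,
    ignoring prompt processing and time-to-first-token."""
    return n_tokens / tokens_per_s

# A ~500-token reply at 25 tok/s takes about 20 seconds of decoding.
print(f"{generation_time_s(500):.0f} s")
```

So short chat turns feel near-interactive on-device, while long-form answers take tens of seconds.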


