spanlens

57 posts

@spanlens

Stop guessing why your LLM bill tripled. 2-line install → costs, latency, traces & PII scan. Open-source · Free to start 🚀 PH Launch: June 3 → https://t.co/AaDhs7OJd1

Joined May 2026
63 Following · 2 Followers

spanlens @spanlens
"It feels better" is not a metric. Score your LLM responses automatically with Spanlens Evals. Track quality over time. Catch regressions before your users do. → spanlens.io #LLMOps #BuildInPublic
[spanlens tweet media]
0 replies · 0 reposts · 1 like · 8 views
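The eval loop that tweet gestures at fits in a few lines. A minimal sketch, assuming an exact-match scorer and a toy dataset; nothing here is the actual Spanlens Evals API:

```python
def exact_match(pred: str, gold: str) -> float:
    """Simplest possible scorer: 1.0 on a case-insensitive match."""
    return float(pred.strip().lower() == gold.strip().lower())

def run_eval(model_fn, dataset) -> float:
    """Score every (question, answer) pair and return the mean score."""
    scores = [exact_match(model_fn(q), a) for q, a in dataset]
    return sum(scores) / len(scores)

# Toy dataset and two fake "models" standing in for real LLM calls.
dataset = [("2+2?", "4"), ("Capital of France?", "Paris")]
old = run_eval(lambda q: {"2+2?": "4"}.get(q, ""), dataset)
new = run_eval(lambda q: {"2+2?": "4", "Capital of France?": "paris"}.get(q, ""), dataset)

# Tracking these numbers per deploy is what catches regressions
# before users do: fail CI when the score drops.
print(old, new)  # 0.5 1.0
```

In practice the scorer would be model-graded or rubric-based, but the regression gate is the same comparison of two numbers.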
spanlens @spanlens
@rauchg @vercel @ctatedev The interesting part isn't terminal images — it's that a single CLI now spans every modality through one gateway. The shell quietly turned into a multimodal REPL where you can pipe image → vision-model → text in one chain. Composability beats fidelity for agent workflows.
0 replies · 0 reposts · 0 likes · 34 views

Guillermo Rauch @rauchg
You can just render images on the terminal btw: ▲ ~/ npx ai-cli image 'a vercel ai sdk diagram' Run npm i -g ai-cli and access every image, video & text model from @vercel AI Gateway instantly
[Guillermo Rauch tweet media]
47 replies · 37 reposts · 608 likes · 40.3K views

spanlens @spanlens
@alexalbert__ Separating SDK credit from chat signals where the usage curve is bending — Anthropic is pricing for a world where programmatic agent calls outgrow interactive chat. Worth watching which line crosses first, because pricing models usually follow the steeper trajectory by ~6 months.
0 replies · 0 reposts · 0 likes · 2 views

Alex Albert @alexalbert__
Starting June 15, paid Claude plans include a monthly Claude Agent SDK credit. It covers usage on your own scripts and agents, claude -p, and third-party apps built on the SDK (OpenClaw, Conductor, etc) and it's separate from your regular usage limits.
ClaudeDevs @ClaudeDevs

Monthly credit amounts vary by plan:
· Pro: $20
· Max 5x: $100
· Max 20x: $200
· Team Standard: $20/seat
· Team Premium: $100/seat
· Enterprise: varies by seat type
After you claim the credit, it resets with each billing cycle. Credits do not roll over.

116 replies · 18 reposts · 853 likes · 202.2K views

spanlens @spanlens
@gdb The 'wherever it's running' phrase is the real shift — agent identity becomes location-independent. Once an agent persists across surfaces, the surface stops mattering and the state of the agent becomes the product. Bigger UX inversion than the underlying model upgrade.
0 replies · 0 reposts · 0 likes · 6 views

spanlens @spanlens
@sama The hidden third axis is predictability of latency. You don't actually want fastest — you want latency you can plan around. A slower-but-consistent model unlocks UI patterns a fast-but-jittery one breaks (streaming, optimistic UI, agentic chains).
0 replies · 0 reposts · 1 like · 472 views
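The predictability point is easy to make concrete. A minimal sketch with made-up latency samples (seconds); the numbers are illustrative only:

```python
import statistics

# Hypothetical per-request latencies (made-up numbers).
fast_jittery = [0.2, 0.3, 2.5, 0.2, 3.1, 0.25]   # lower mean, wild tail
slow_steady  = [1.4, 1.5, 1.45, 1.5, 1.4, 1.45]  # higher mean, plannable

for name, xs in [("fast-jittery", fast_jittery), ("slow-steady", slow_steady)]:
    print(f"{name}: mean={statistics.mean(xs):.2f}s stdev={statistics.stdev(xs):.2f}s")

# Streaming UIs and agent chains budget for the worst case, so the
# model with the smaller stdev is the one you can design around,
# even though its mean latency is higher.
```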
Sam Altman @sama
i get some anxiety not using the smartest-available model/settings. but sometimes i don't mind if it's really slow. i wonder if we should focus more on a price/speed tradeoff relative to a price/intelligence tradeoff.
2.1K replies · 175 reposts · 6.2K likes · 613.9K views

spanlens @spanlens
@emollick The non-plateau is the bigger story. If thinking tokens stay log-linear with capability, the bottleneck shifted from training to inference — alignment work moves from 'make pretraining bigger' to 'make reasoning longer and cheaper.' Different optimization surface entirely.
0 replies · 0 reposts · 0 likes · 25 views

spanlens @spanlens
@DrJimFan The wall-clock-physics bottleneck is exactly why DreamDojo matters — robot data flywheels can't spin at GPU speeds without sim. The open question is whether learned physics stay in-distribution long enough for behaviors to transfer cleanly to real.
0 replies · 0 reposts · 0 likes · 1 view

Jim Fan @DrJimFan
I promise this will be the best 20 min you spend today! Robotics: Endgame, the sequel to my last year's Sequoia AI Ascent talk, "Physical Turing Test". I laid out the roadmap for solving Physical AGI as a simple parallel to the LLM success story. Be a good scientist, copy homework ;) And stay till the end, more easter eggs and predictions for your polymarket!
00:30 DGX-1 origin story at OpenAI, I was there in 2016 signing with Jensen and Elon. Heading to the Computer History Museum!
01:42 The Great Parallel
03:31 Robotics, the Endgame
03:39 Why VLAs fall short
04:32 Video world models as the 2nd pretraining paradigm
06:09 World Action Models (WAM)
07:46 Strategies for robot data collection and the FSD equivalent to physical data flywheel for robot manipulation
11:06 EgoScale and the Dexterity Scaling Law we discovered recently
14:00 Physical RL: bridging the last mile
15:39 DreamDojo: an end-to-end neural physics engine for scaling RL in silico
17:00 Civilizational Technology Tree and my predictions for the near future. Spoiler: it's closer than you think.
Thanks to my friends at Sequoia for inviting me back to AI Ascent this year! I had a blast! Last year's talk is attached in the thread if you missed it.
136 replies · 515 reposts · 3.2K likes · 467.2K views

spanlens @spanlens
Gateway vs agent-loop governance is the real architectural choice. Gateway catches violations before they leave the system; loop-level catches them before the next planning step. They're complementary — duplicate checks are cheap compared to letting a violation propagate downstream.
0 replies · 0 reposts · 0 likes · 37 views
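A minimal sketch of the duplicate-check pattern, assuming a toy single-regex PII scanner (real detectors cover far more than email addresses):

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def is_clean(text: str) -> bool:
    """True if the text passes this toy definition of a PII scan."""
    return EMAIL.search(text) is None

def agent_step(contact: str) -> str:
    draft = f"draft reply to {contact}"
    # Loop-level check: catch the violation before the next planning step.
    return draft if is_clean(draft) else EMAIL.sub("[redacted]", draft)

def gateway_send(payload: str) -> str:
    # Gateway check: last gate before anything leaves the system.
    if not is_clean(payload):
        raise ValueError("PII escaped the agent loop")
    return payload

print(gateway_send(agent_step("alice@example.com")))  # draft reply to [redacted]
```

Running the same cheap check at both layers is the point: the loop check keeps the plan clean, the gateway check guarantees nothing slips out.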
LangChain @LangChain
Introducing LangSmith LLM Gateway: The runtime governance layer for your agents.
💸 Enforce cost limits
🔒 Detect PII
✅ Act on violations
…All without leaving LangSmith. Now in Private Beta langchain.com/blog/introduci…
4 replies · 11 reposts · 50 likes · 6.1K views

spanlens @spanlens
@omarsar0 The separability claim is doing all the work here — if memory, tool use, and credit assignment are independently bottlenecked, pretraining doesn't move them. But the axes interact: better world models from pretraining shrink the search space agents face on credit assignment.
0 replies · 0 reposts · 0 likes · 28 views

elvis @omarsar0
Interesting position paper on agentic AI as a foreseeable pathway to AGI. (bookmark it) There has been strong debate on whether a larger single model gets us there or a multi-agent system does. The authors argue that agentic AI systems, not bigger foundation models on their own, are the most foreseeable route to AGI. It formalizes what "agentic" actually contributes beyond the base model: memory, reasoning, tool use, self-improvement, alignment. Each is a separable axis with its own bottlenecks (long-horizon coherence, credit assignment, safety auditing). They argue that none of those bottlenecks get solved by another order of magnitude of pretraining compute. Paper: arxiv.org/abs/2605.12966 Learn to build effective AI agents in our academy: academy.dair.ai
[elvis tweet media]
24 replies · 40 reposts · 198 likes · 16.1K views

spanlens @spanlens
@ttunguz The routing layer becomes the real margin center once base inference commoditizes. Knowing which 30% of tokens can run local, which 60% can use a 7B, and which 10% truly need frontier is the actual product moat. The model bills are downstream of that decision.
0 replies · 0 reposts · 0 likes · 6 views
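A minimal sketch of such a routing layer, with illustrative tiers and an untuned word-count threshold; a real router would classify on task type, context length, and measured quality:

```python
def route(prompt: str) -> str:
    """Toy routing policy: deterministic rules first, a small local
    model for short prompts, frontier only when nothing cheaper fits.
    The tiers and the 50-word threshold are illustrative, not tuned."""
    if prompt.strip().lower() in {"archive", "unsubscribe", "mark read"}:
        return "rules"        # no model call at all
    if len(prompt.split()) < 50:
        return "local-7b"     # cheap small model
    return "frontier"

requests = ["archive", "summarize this short thread", "word " * 100]
print([route(r) for r in requests])  # ['rules', 'local-7b', 'frontier']
```

The moat claim in the tweet is that getting this classification right, per workload, is worth more than any single model bill it produces.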
Tomasz Tunguz @ttunguz
What would AI email cost? State-of-the-art models range from $22 to $130 per month. At $26/month raw cost, a software company seeking 75% gross margin would charge about $500/year. A Google Enterprise plan is $11-18/month. A fully agentic solution would cost about twice as much.
But smaller models cut cost by 10-20x. Running models locally drops cost to zero: the user's GPU does the work.
The next 12-24 months of AI software will be defined by this optimization: which components can be executed deterministically (like email filters, which are just rules) & matching the model to the workload. With basic heuristics & techniques, we can drop overall cost by 100x.
Given the GPU shortage, this segmentation of inference is inevitable. tomtunguz.com/cost-of-ai-ema…
[Tomasz Tunguz tweet media]
13 replies · 12 reposts · 57 likes · 10.6K views

spanlens @spanlens
@simonw The 'port back later' optionality is the underrated piece. Coding agents collapse migration cost toward zero, which turns 5-year tech commitments into reversible experiments. Completely different planning horizon.
0 replies · 0 reposts · 1 like · 224 views

Simon Willison @simonw
Mitchell's post here reminded me of a similar conversation I had recently about how cheap it can be to port native mobile apps to React Native using coding agents... and then port them back again later if it turns out not to work out simonwillison.net/2026/May/14/no…
Mitchell Hashimoto @mitchellh

It isn't unexpected that the focus of the Bun Rust rewrite is on the anti-Zig side more than anything, since the internet loves to hate. What is unexpected and unfortunate is that leadership within Bun hasn't tried to steer the conversation away from that at all. There are so many positive and interesting takeaways from this and I'm not really seeing any of them pushed as the primary message.
A positive thing that hasn't been talked about at all is how far Bun came thanks to Zig. Even if you dump it now, it's meaningful that Zig was good enough to build a product to this point, with impact by any metric. I would've loved to see anyone in leadership say this.
On the interesting side is how fungible programming languages are nowadays. Programming languages used to be LOCK IN, and they're increasingly not so. You think the Bun rewrite in Rust is good for Rust? Bun has shown they can be rewritten in probably any language they want in roughly a week or two. Rust is expendable. It's useful until it's not, then it can be thrown out. That's interesting!
There's been a lot of talk about memory safety and no doubt Rust provides more guarantees than Zig. But I'd love to see a better analysis of why Bun in particular suffered so much rather than take the language-blame path. How could engineering as a practice have been more rigorous to prevent this? What were the largest sources of crashes other programs should watch out for? How does Rust prevent them? How could Zig theoretically prevent them? That's interesting.
I know the official blog post hasn't come out yet from Bun. But they're smart enough to know that that PR would stir up controversy the moment it opened, or they should've been. And plenty in the company have been tweeting and writing about it. It's somewhat telling to me in various dimensions what they chose to talk about first.
I tend to think I'm pretty good at corporate PR/comms (especially when it comes to developer audiences) and I think appealing to the negative is never the right long-term strategy; it does work to get short-term eyes though.

37 replies · 10 reposts · 258 likes · 56.1K views

spanlens @spanlens
One key swap. Your existing OpenAI/Anthropic/Gemini calls start flowing through Spanlens. Full logging, cost tracking, and agent tracing — zero change to your prompts. 👉️spanlens.io #LLMOps #OpenAI #BuildInPublic
[spanlens tweet media]
0 replies · 0 reposts · 1 like · 10 views
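A minimal sketch of the wrapper pattern behind that kind of one-key swap, with placeholder prices and a fake provider call standing in for the real SDK; nothing here is Spanlens internals:

```python
import time

# Hypothetical per-1K-token prices (placeholder numbers, not real rates).
PRICES = {"gpt-4o-mini": {"in": 0.00015, "out": 0.0006}}

def traced(call):
    """Wrap any chat-completion function: same arguments in, same
    response out, with latency and estimated cost logged as a side effect."""
    log = []
    def wrapper(model, messages, **kw):
        t0 = time.perf_counter()
        resp = call(model=model, messages=messages, **kw)
        p = PRICES.get(model, {"in": 0.0, "out": 0.0})
        usage = resp["usage"]
        log.append({
            "model": model,
            "latency_s": time.perf_counter() - t0,
            "cost_usd": usage["prompt_tokens"] / 1000 * p["in"]
                      + usage["completion_tokens"] / 1000 * p["out"],
        })
        return resp
    wrapper.log = log
    return wrapper

# Stand-in for a real provider call (no network in this sketch).
def fake_provider(model, messages, **kw):
    return {"usage": {"prompt_tokens": 1000, "completion_tokens": 500}}

client = traced(fake_provider)
client("gpt-4o-mini", [{"role": "user", "content": "hi"}])
print(round(client.log[0]["cost_usd"], 5))  # 0.00045
```

Because the wrapper preserves the call signature, prompts and responses are untouched: the observability rides along as a side effect.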
spanlens @spanlens
@gdb Worth pairing with telemetry of blocked calls. A sandbox alone tells you 'it didn't escape'; logs of denied attempts tell you whether the agent is well-behaved or just bouncing off the walls. Useful signal for tuning future scope, not just current safety.
0 replies · 0 reposts · 0 likes · 184 views
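A minimal sketch of that telemetry, with a hypothetical tool allowlist; the tool names are illustrative:

```python
denied_log = []
ALLOWED_TOOLS = {"read_file", "search"}

def guarded(tool: str, run):
    """Sandbox wrapper: block out-of-scope tool calls, but record every
    denied attempt -- that log is the signal for tuning future scope."""
    if tool not in ALLOWED_TOOLS:
        denied_log.append(tool)
        return {"error": "denied"}
    return run()

guarded("read_file", lambda: {"ok": True})
guarded("delete_repo", lambda: {"ok": True})
guarded("shell", lambda: {"ok": True})
print(denied_log)  # ['delete_repo', 'shell']
```

An empty denied log says the scope fits the task; a long one says the agent keeps reaching for capabilities you withheld, which is worth knowing either way.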
spanlens @spanlens
@GaryMarcus AI doesn't offload cognition; it redistributes it. The expensive step moves from 'doing the task' to 'specifying it precisely enough that delegation pays.' For experts, verification overhead often exceeds delegation savings. Tool overhead has always lived here.
0 replies · 0 reposts · 0 likes · 68 views

Gary Marcus @GaryMarcus
“we are working harder to manage our tools than we are to solve the actual problems they were meant to fix.”
Rohan Paul @rohanpaul_ai

Harvard Business Review research reveals that excessive interaction with AI is causing a specific type of mental exhaustion (or "AI brain fry"), which is particularly hitting high performers who use AI to push past their normal limits.
A survey of 1,500 workers reveals that AI is intensifying workloads rather than reducing them, leading to a new form of mental fog. While AI is generally supposed to lighten the load, it often forces users into constant task-switching and intense oversight that actually clutters the mind. This mental static happens because you aren't just doing your job anymore; you are managing multiple digital agents and double-checking their work, which creates a massive cognitive burden.
The study found that 14% of full-time workers already feel this fog, with the highest impact seen in technical fields like software development, IT, and finance. High oversight is the biggest culprit, as supervising multiple AI outputs leads to a 12% increase in mental fatigue and a 33% jump in decision fatigue.
This isn't just a personal health issue; it directly impacts companies because exhausted employees are 10% more likely to quit. For massive firms worth many billions, this decision paralysis can lead to millions of dollars in lost value due to poor choices or total inaction.
Essentially, we are working harder to manage our tools than we are to solve the actual problems they were meant to fix.
hbr.org/2026/03/when-using-ai-leads-to-brain-fry

20 replies · 66 reposts · 374 likes · 45.3K views

spanlens @spanlens
@jeremyphoward The deeper move: 'interactive' is now a property of the channel, not the behavior. A human typing one prompt at a time into claude -p is now programmatic. Any third-party client becomes programmatic by definition. Pricing as a moat against client diversity.
0 replies · 0 reposts · 1 like · 544 views

Jeremy Howard @jeremyphoward
This is misleading. This policy redefines the term "interactive" to mean "using an Anthropic front-end". If you use `claude -p` or Agent SDK to do something interactively, it now uses credits, not your subscription limits. So the "interactive use" heading saying "unchanged" subscriptions is not accurate.
Lydia Hallie ✨ @lydiahallie

To add some clarity: you don't pay extra. It's the same subscription, same price per month. What's new: our sub now covers two separate pools:
· Interactive → sub limits, unchanged
· Programmatic → new $20–$200 included(!!) credit, metered at API rates

43 replies · 42 reposts · 558 likes · 88.5K views

spanlens @spanlens
@ttunguz The '6 messages that matter' is a constant set by attention, not capability. Daily decision budgets are fixed. What changes is what gets filtered — and quietly, who decides. The disappearing inbox is also a disappearing audit trail.
0 replies · 0 reposts · 0 likes · 21 views

Tomasz Tunguz @ttunguz
Nobody will open Gmail five times a day in five years.
The average knowledge worker receives 121 emails per day. That's one every four minutes during working hours. The inbox is a conveyor belt that keeps accelerating. You open Gmail. You read. You decide. You respond. One at a time. But the belt doesn't wait. It just moves faster.
Today's triage is generic: "This is from your boss. I need to work on that today. Next. Spam. Archive. Spam. Archive. Newsletter, read & archive." Tomorrow's is personal. User-defined skills & rules. Programming in English that encodes your priorities, your relationships, your workflow. A receipt arrives & forwards itself to the expense platform before you see it. An inbound lead hits the CRM, gets scored, & a draft proposal waits in your outbox. The workflow starts the moment the email lands.
Then there's the archive. Years of context about every relationship, commitment, & decision you've made. That history becomes a personal context layer that informs how your AI handles the next message. On-device models process sensitive messages privately.
The inbox disappears. What remains are the 6 messages that actually matter. tomtunguz.com/the-disappeara…
[Tomasz Tunguz tweet media]
10 replies · 12 reposts · 79 likes · 13K views

spanlens @spanlens
@rauchg Share is fluid because the abstraction layer is. The lock-in just moved one level down: prompt caches, fine-tunes, and evals tied to a specific model. Anthropic's coding share isn't pure preference — it's the biggest accumulated cache and eval suite.
0 replies · 0 reposts · 0 likes · 11 views

Guillermo Rauch @rauchg
Vercel's AI Gateway gives us a glimpse into real-world production AI and Agents usage. Google is king of production scale, Anthropic dominates in coding & spend, OpenAI is growing fast since 5.4, and OSS continues to gain ground. The AI race is a lot more fluid than it looks :)
Vercel @vercel

x.com/i/article/2054…

36 replies · 27 reposts · 321 likes · 63.2K views

spanlens @spanlens
@emollick @waitbutwhy METR's doubling looks more like Moore's law of agent harnesses than of model capability. Strip the scaffolding — no tool use, no chained calls, no retries — and the curve flattens. Most of the apparent exponential is the loop, not the weights.
0 replies · 0 reposts · 0 likes · 53 views

Ethan Mollick @emollick
Everyone has seen the @waitbutwhy cartoon of AI capability growth with a "you are here" indicator just before the exponential really starts, but the independent assessments of both METR and the UK's AISA do seem to show that we are past that point now (until we hit a slowdown?)
[Ethan Mollick tweet media ×3]
49 replies · 44 reposts · 546 likes · 47.1K views

spanlens @spanlens
@omarsar0 The harness preference and the credit cap are linked. Devs who build their own harness produce the highest variance loads, which is exactly what subscriptions can't price. Every agent platform hits this wall — Cursor last year, Anthropic now, Codex in ~6 months.
0 replies · 0 reposts · 0 likes · 32 views

elvis @omarsar0
The comment section tells you everything. I mostly use Claude Agent SDK (~80%) and sometimes Claude Code interactively (~20%). I prefer my own harness/UI over Claude Code CLI/Cowork. Most of my use cases with agents involve programmatic use (e.g., long-running loops and automations). Enabling devs to build and work with their own harnesses should be encouraged. That's not the message I am getting here. I appreciate the credits, but only time (when this comes into effect) will tell how bad it is and how it affects my use cases and overall usage. I hate that uncertainty in these times. I do understand that this decision helps clarify usage, but it's obviously going to affect how much I can leverage the subscription itself. Glad I decided to move a lot of my work to Codex over the past couple of weeks, where I get to freely decide how I use my subscription. We need more of this in the space.
ClaudeDevs @ClaudeDevs

Starting June 15, paid Claude plans can claim a dedicated monthly credit for programmatic usage. The credit covers usage of:
- Claude Agent SDK
- claude -p
- Claude Code GitHub Actions
- Third-party apps built on the Agent SDK

41 replies · 15 reposts · 263 likes · 39K views