
Phil Glazer



AI assistants like Claude can seem shockingly human—expressing joy or distress, and using anthropomorphic language to describe themselves. Why? In a new post we describe a theory that explains why AIs act like humans: the persona selection model. anthropic.com/research/perso…

On Claude Code, we’re introducing agent teams. Spin up multiple agents that coordinate autonomously and work in parallel—best for tasks that can be split up and tackled independently. Agent teams are in research preview: code.claude.com/docs/en/agent-…

Anthropic's computer use running locally - slow but good, can see it being great in 6-12 months

are we misunderstanding this? the implication is you can't insert any content that anthropic didn't know to have generated. this breaks things like switching models mid session and a dozen other things harnesses rely on. i switch between claude and gpt all the time :(


yes things are changing fast, but also I see companies (even faang) way behind the frontier for no reason. you are guaranteed to lose if you fall behind. the no unforced-errors ai leader playbook:
For your team:
- use coding agents. give all engineers their pick of harnesses, models, background agents: Claude code, Cursor, Devin, with closed/open models. Hearing Meta engineers are forced to use Llama 4. Opus 4.5 is the baseline now.
- give your agents tools to ALL dev tooling: Linear, GitHub, Datadog, Sentry, any Internal tooling. If agents are being held back because of lack of context that’s your fault.
- invest in your codebase specific agent docs. stop saying “doesn’t do X well”. If that’s an issue, try better prompting, agents.md, linting, and code rules. Tell it how you want things. Every manual edit you make is an opportunity for agent.md improvement
- invest in robust background agent infra - get a full development stack working on VM/sandboxes. yes it’s hard to set up but it will be worth it, your engineers can run multiple in parallel. Code review will be the bottleneck soon.
- figure out security issues. stop being risk averse and do what is needed to unblock access to tools.
in your product:
- always use the latest generation models in your features (move things off of last gen models asap, unless robust evals indicate otherwise). Requires changes every 1-2 weeks - eg: GitHub copilot mobile still offers code review with gpt 4.1 and Sonnet 3.5 @jaredpalmer. You are leaving money on the table by being on Sonnet 4, or gpt 4o
- Use embedding semantic search instead of fuzzy search. Any general embedding model will do better than Levenshtein / fuzzy heuristics.
- leave no form unfilled. use structured outputs and whatever context you have on the user to do a best-effort pre-fill
- allow unstructured inputs on all product surfaces - must accept freeform text and documents. Forms are dead.
- custom finetuning is dead. Stop wasting time on it. Frontier is moving too fast to invest 8 weeks into finetuning. Costs are dropping too quickly for price to matter. Better prompting will take you very far and this will only become more true as instruction following improves
- build evals to make quick model-upgrade decisions. they don’t need to be perfect but at least need to allow you to compare models relative to each other. most decisions become clear on a Pareto cost vs benchmark perf plot
- encourage all engineers to build with ai: build primitives to call models from all code bases / models: structured output, semantic similarity endpoints, sandbox code execution, etc.
What else am I missing?
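a rough sketch of the "Pareto cost vs benchmark perf" idea from the evals point in the playbook, assuming each model has already been run through your own eval and boiled down to one cost number and one score. the `ModelResult` fields, the model names, and the numbers in the demo are all made up for illustration, not real pricing or benchmark results:

```python
from typing import List, NamedTuple


class ModelResult(NamedTuple):
    name: str
    cost: float   # hypothetical: cost to run the eval set (any consistent unit)
    score: float  # hypothetical: eval score, higher is better


def pareto_frontier(results: List[ModelResult]) -> List[ModelResult]:
    """Keep only models that no other model beats on both cost and score.

    Sort cheapest-first (ties broken by higher score), then keep a model
    only if it scores strictly higher than everything cheaper than it.
    Exact duplicates (same cost and score) collapse to the first one seen.
    """
    frontier: List[ModelResult] = []
    best_score = float("-inf")
    for r in sorted(results, key=lambda r: (r.cost, -r.score)):
        if r.score > best_score:
            frontier.append(r)
            best_score = r.score
    return frontier


if __name__ == "__main__":
    # Placeholder data: invented names and numbers, purely illustrative.
    demo = [
        ModelResult("model-a", cost=1.0, score=0.60),
        ModelResult("model-b", cost=3.0, score=0.72),
        ModelResult("model-c", cost=5.0, score=0.70),
        ModelResult("model-d", cost=10.0, score=0.85),
    ]
    for r in pareto_frontier(demo):
        print(f"{r.name}: cost={r.cost}, score={r.score}")
```

anything that drops off the frontier costs more than some other model without scoring higher, so it's a candidate to move off of. the eval only has to rank models relative to each other for this to be useful.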


fun exploration from the past couple of days, an excel add in that can both:
- passively observe your work and suggest edits, like tab complete in an IDE
- create plans and build models

Code was the killer app for AI cause it verifies really well. Linting, unit tests and instant feedback for frontend make it so outputs are verified in real time. If you want to think about what gets automated by AI next look at the verification mechanisms. We’re seeing this start in fields like math and biology which have excellent verification systems. Art went quickly because you can instantly tell if it’s good enough. All of this is still maturing but the rate at which the industry matures is directly related to how fast the verification loop is. Medicine and law will have a p99 problem and take forever to diffuse. What’s next? Probably accounting and finance since they’re super easy to verify.


You can start building and testing apps in ChatGPT with the Apps SDK preview, which we're releasing today as an open standard built on MCP. Later this year, we’ll begin accepting app submissions for publication. developers.openai.com/apps-sdk



what do people think about Opus 4.5 for coding so far? what are the behavioral problems or limitations you still want to see improved? we're hungry for feedback 🙏


They're burying a lot here. There's a 66% price cut from Opus 4.1 to $5/$25, it uses fewer tokens to solve problems, upgrades to Claude Code in the app, no more length limits on conversations, no more Opus-specific plan caps...










