Chris Bee

1.2K posts

@chrisbeetweets

Building Devplan. Former CTO. ex-Zillow/Uber/Amazon. Dad. Dreamer.

Seattle, WA · Joined November 2008
2.7K Following · 1.2K Followers

Pinned Tweet
Chris Bee @chrisbeetweets
Foundational models get you 80% of the way, but the last 20% is where accuracy and usefulness actually live, and that comes from context and specialization. That last 20% is why we are building devplan.com. We feed in your company’s product docs, code, and workflows so the AI isn’t guessing. The result: specs, user stories, and tasks that are specific to your product and codebase, not generic boilerplate.

Chris Bee @chrisbeetweets
Read something this week: the guy who built Claude Code runs 10-15 sessions in parallel. His edge isn't the model. It's the rules he runs every session with:

Self-Improvement Loop
- After ANY correction from the user: update tasks/lessons.md
- Write rules for yourself that prevent the same mistake
- Review lessons at the start of every session

Verification Before Done
- Never mark a task complete without proving it works
- Ask yourself: "Would a staff engineer approve this?"
- Run tests, check logs, demonstrate correctness

Autonomous Bug Fixing
- When given a bug report: just fix it. Don't ask for hand-holding
- Point at logs, errors, failing tests, then resolve them
- Zero context switching required from the user

The longer you use it, the fewer corrections you make. Most teams are still catching the same mistakes every session.
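
A minimal sketch of what that tasks/lessons.md might contain (the file name comes from the tweet; every entry below is invented for illustration):

```markdown
# tasks/lessons.md
<!-- Reviewed at the start of every session; updated after ANY correction. -->

## Rules
- Never mark a task done without running the test suite and checking the logs.
- Reproduce a bug from its failing test or error log before editing code.

## Lessons
- Correction (hypothetical): marked a schema migration complete without
  running it against realistic data. New rule: prove migrations on a
  production-like copy first.
```

The loop is the point: each correction becomes a rule, and the rules get re-read before the next session starts.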

Chris Bee @chrisbeetweets
The vibe shift is real. I talked to a number of leaders, academics, and old-school developers in the past few weeks who openly admitted they never thought the models would be able to generate production-quality code the way they can right now. All of them are changing their mindset.

Chris Bee @chrisbeetweets
@aakashgupta The cost story is real, but the deeper moat is inference-time self-summarization trained on your own agent harness. Every token of real developer sessions tightens that loop. Distribution at scale is the best training signal.

Aakash Gupta @aakashgupta
Three months ago, the consensus was that Cursor was cooked. Claude Code crossed $2.5B in run-rate revenue. Google paid $2.4B for Windsurf’s IP and poached its leadership into DeepMind. OpenAI acquired Astral, the team behind Python’s uv package manager, to feed Codex. Viral tweets were circulating about developers ditching Cursor for Claude Code. The usage-based pricing switch last July had users posting surprise bills on Reddit. Consumer subscriptions were running at negative margins because every token served was profit for Anthropic or OpenAI. The company that popularized vibe coding was getting buried by the model providers it depended on.

Then Cursor shipped four major releases in 15 days. JetBrains support on March 4. Automations on March 5. Plugin marketplace with 30+ partners on March 11. And now Composer 2, their own model that moggs Opus 4.6 on cost while matching it on performance.

Look at the chart. Composer 2: 61.3 on CursorBench at $0.50 per million input tokens. Opus 4.6: 58.2 at $5.00. GPT-5.4: 63.9 at $2.50. The performance gaps are single digits. The cost gap between Composer and Opus is 10x.

The part nobody’s pressing on: Cursor still won’t name the base model. Their blog says “our first continued pretraining run,” which means they took an existing model and continued training on code. When the original Composer launched in October, developers kept catching it responding in Chinese. Same tokenizer patterns as DeepSeek. Nathan Lambert congratulated the research team by tweeting “open weight base models + incredible ML teams in a specific niche can create immense value.” Co-founder Aman Sanger told Bloomberg it was trained exclusively on code. Can’t do taxes, can’t write poems.

A Chinese open-source chassis, refined with what Cursor calls compaction-in-the-loop RL, and fed by a billion lines of user code flowing through the editor every day. That data flywheel is the one asset no API provider can replicate.

The honest read requires some skepticism though. CursorBench is Cursor’s own internal benchmark. They built the test, then showed you they pass it. GPT-5.4 still leads on Terminal-Bench 2.0, which is independently maintained. And Opus 4.6 at high thinking effort still outscores Composer 2 on raw accuracy. The cost advantage is real. The performance parity claim needs external validation before anyone should take this chart at face value.

But here’s why the chart matters anyway. This was the P0 coming out of the holidays. Building their own model was existential. Every dollar Cursor paid Anthropic per token was margin funding the competitor building Claude Code to replace them. Every dollar paid to OpenAI funded Codex. The only way to stop bleeding cash to the companies trying to kill you is to stop using their models.

Four hundred employees. $2B ARR. Reportedly raising at $50B. Entering the model race against labs with thousands of researchers and tens of billions in compute. That chart is the fundraising slide. Whether it holds up in production against Opus and GPT-5.4 is a different question. But three months ago, the question was whether Cursor would survive at all.
Cursor @cursor_ai

Composer 2 is now available in Cursor.


Chris Bee @chrisbeetweets
If you use Claude Code or Cursor for research, strategy, or discovery work, here's the stack I use to cut context setup from 15 minutes to under a minute.

The problem: every new session starts from zero. You paste background, decisions, prior research into the chat window before you can do any real work. It compounds badly when you're running multiple workstreams.

Two tools fixed this for me.

First: Obsidian. Every note becomes a local markdown file on your machine. Yours permanently. No platform lock-in, readable by any tool.

Second: qmd, a CLI built by Tobi Lütke (Shopify's CEO). It indexes your markdown folder using three search methods running entirely on your laptop: BM25 full-text, vector semantic search via a 300MB local embedding model, and LLM reranking. Nothing leaves your machine.

Setup:
1. Install Obsidian, import your notes from wherever they live now
2. Install qmd: github.com/tobi/qmd
3. Run `qmd embed` to index your collection
4. Open Claude Code, point it at the folder

Now you can ask Claude to find every decision you've made about a specific product area, or pull every research note on a particular problem, without copying anything manually. It searches, reads the relevant files, and starts from that context.

The part I didn't expect: it compounds. Every note I add makes future sessions more useful without any extra effort. The time I used to spend on setup now goes into the actual work.
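
For the curious, here is a minimal Python sketch of the BM25-plus-embeddings idea described above. This is an illustration of the approach, not qmd's actual code; the rank-bm25 and sentence-transformers libraries, the model name, and the "vault" folder are my own assumptions:

```python
# Illustrative hybrid search over a folder of markdown notes.
# Not qmd's implementation; libraries and model choice are assumptions.
from pathlib import Path

import numpy as np
from rank_bm25 import BM25Okapi                        # pip install rank-bm25
from sentence_transformers import SentenceTransformer  # pip install sentence-transformers

notes = {p: p.read_text() for p in Path("vault").rglob("*.md")}
paths, texts = list(notes), list(notes.values())

# Lexical signal: BM25 over whitespace-tokenized documents.
bm25 = BM25Okapi([t.lower().split() for t in texts])

# Semantic signal: a small local embedding model, normalized so dot
# products behave like cosine similarity.
model = SentenceTransformer("all-MiniLM-L6-v2")
doc_vecs = model.encode(texts, normalize_embeddings=True)

def search(query: str, k: int = 5) -> list[Path]:
    lexical = bm25.get_scores(query.lower().split())
    semantic = doc_vecs @ model.encode(query, normalize_embeddings=True)
    # Blend both signals; a real tool adds an LLM reranking pass on top.
    blended = 0.5 * (lexical / (lexical.max() + 1e-9)) + 0.5 * semantic
    return [paths[i] for i in np.argsort(blended)[::-1][:k]]

print(search("decisions about onboarding"))
```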

Chris Bee @chrisbeetweets
Karpathy's autoresearch strips the loop down to its minimum. The human writes the .md file. The agent iterates on the code. When you remove everything else, what survives on the human side is the specification. Not the code. The intent behind it. Teams have spent two years trying to go faster. The constraint was never generation speed. It was always the quality of the instructions going in.

Chris Bee retweeted
Jacob Andreou @jacobandreou
your favorite founders’ favorite founder

Chris Bee @chrisbeetweets
Something is shifting on the product side. Specs are starting to matter in a way they didn't before. The teams leaning into that are moving differently. Curious what others are seeing right now. Where do you think this is in a month?

Chris Bee @chrisbeetweets
Roughly a year ago Dario said AI would write 90% of code within 6 months. Most people laughed it off. Looking at where tooling actually landed, the pace of Claude Code adoption, the PR volume teams are pushing now, the prediction looks a lot less crazy than it did in March 2025. Wild to watch a forecast age into something close to true in real time.

Chris Bee @chrisbeetweets
Nobody tracks whether their team actually understands every aspect of what they've built. Velocity, DORA metrics, story points. All covered. But whether the people who shipped something last month could explain why it works the way it does today? Nobody's measuring that.

There's a term for what accumulates when you don't: cognitive debt. It's what happens when your team ships faster than they can understand. The code works. The tests pass. Nobody knows why. The agent wrote it. It looked right. It's in production. Three months later someone asks why a system behaves the way it does and the person who built it pulls up the diff. The intent behind it exists nowhere.

Technical debt shows up in your logs. Cognitive debt shows up when your team is too scared to touch something they own. You can have a perfectly green CI pipeline and a team that's slowly losing the plot at the same time. Velocity metrics will never show you this.

Five things worth putting in your CLAUDE.md before every agent session (sketch below):
1. What we're building and why
2. What constraints the agent cannot override
3. What decisions have already been made
4. What done actually looks like
5. What would make us regret this in three months
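
To make the list concrete, here is one way those five sections could look in an actual CLAUDE.md; the project and all of its details are hypothetical:

```markdown
# CLAUDE.md (hypothetical example)

## What we're building and why
Invoice reconciliation service, so finance can close the books three days faster.

## Constraints the agent cannot override
- Never write to ledger tables directly; go through the posting API.
- No new runtime dependencies without human sign-off.

## Decisions already made
- Idempotency via client-generated keys, not database locks.

## What done actually looks like
- Unit and integration tests pass; failure paths log enough context to debug.

## What would make us regret this in three months
- A matching heuristic nobody on the team can explain.
```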

Chris Bee retweeted
Joe Heitzeberg @jheitzeb
Kicking off AI Tinkerers - Seattle Dev Tools Track with @chrisbeetweets Devplan and Actual AI - approx 100 on the waitlist tonight

Chris Bee @chrisbeetweets
Agentic engineering has a weird accountability problem. You own the output, but you may not have written the decisions that produced it. Most teams are chasing faster execution. That's fine. But fast with bad instructions is just deferred confusion, and that compounds. If a feature shipped sideways, don't debug the model. Trace back to the instruction layer. That's almost always where it fell apart.

Chris Bee @chrisbeetweets
79% of companies paying for OpenAI also pay for Anthropic. Not choosing. Hedging. The model isn't the bottleneck for most of these teams. Getting everyone aligned on what to build and how to spec it out is. No subscription fixes that.

Chris Bee @chrisbeetweets
LLMs have dramatically improved how product managers work. You can synthesize ten customer interviews in seconds. You can analyze hundreds of support tickets and extract clear themes. Drafting PRDs, refining strategy, even pressure-testing ideas is faster than ever.

Huge step forward, but there’s still a structural gap. Today, the synthesis layer is manual. A PM gathers context from Slack threads, sales calls, churn reports, dashboards, roadmap docs, and data warehouse queries. They paste it into an LLM or pull from an MCP connection, ask for insights, and then repeat the process a week or a month later. It works, but it’s episodic and dependent on one person holding everything together.

What we don’t really have yet is an always-on product agent that continuously ingests company context, tracks patterns over time, and proactively suggests high-leverage projects and priorities based on what’s actually happening across the business. Claude and ChatGPT are powerful. But they only do what we ask and see what we give them. Continuous, system-level context that feeds product decisions feels like a gap.

I’m curious how others are thinking about this or if anyone has wired up a solution in this arena that they like.
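
To make the gap concrete, here is a rough sketch of the loop such an always-on agent might run. Everything in it is hypothetical: the source names, the stubbed functions, and where the LLM call would sit:

```python
# Hypothetical skeleton of an "always-on" product agent.
# All sources, names, and the LLM call are illustrative stubs.
import time
from dataclasses import dataclass

@dataclass
class Signal:
    source: str  # e.g. "slack", "churn_report", "support_tickets"
    text: str

def pull_new_signals() -> list[Signal]:
    """Poll each connected source for items since the last run (stub)."""
    return []  # wire to Slack exports, ticket dumps, warehouse queries, etc.

def summarize(signals: list[Signal], themes: list[str]) -> str:
    """Ask an LLM to weigh new signals against accumulated themes (stub)."""
    return "digest of emerging patterns and suggested priorities"

themes: list[str] = []  # would live in durable storage in a real system

while True:
    signals = pull_new_signals()
    if signals:
        digest = summarize(signals, themes)
        themes.append(digest)
        # The proactive step: surface suggested priorities instead of
        # waiting for a PM to paste context into a chat window.
        print(digest)
    time.sleep(3600)  # runs continuously on a schedule, not per-request
```

The difference from today's workflow is the trigger: the agent runs on a schedule over accumulated context instead of waiting for a prompt.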

Chris Bee @chrisbeetweets
@aakashgupta The 90% prototype / 10% spec split tracks for simpler UI-heavy features. However, it can be the opposite for deeper system features with complex business logic and dependencies.

Aakash Gupta @aakashgupta
The AI prototyping conversation is splitting PMs into two camps.

Camp one is treating these tools like toys. They open Bolt or Lovable, prompt something, get a half-baked output, and go back to writing PRDs in Google Docs. They tried it, it was mid, they moved on.

Camp two is doing what Nadav Abrahami describes in this episode. He co-founded Wix, spent 20 years building visual editors, then pulled 30 engineers out of the company to start Dazzle because he saw the workflow shift before most people priced it in. His team used to dedicate three developers for weeks to build functional prototypes for big features. That investment meant prototyping was rare, reserved for only the most complex or politically important initiatives. Now at Dazzle, every feature goes through multiple AI prototypes before anyone writes production code. The constraint that used to gate prototyping, developer time, evaporated.

But the part most people will miss from this conversation is what he says about prompting. PMs are treating AI tools like order windows. Type what you want, hit enter, complain when it's wrong. Nadav's approach: go to discuss mode first. Tell the AI what you're planning. Ask it to reflect back its understanding. Because the failure mode with AI isn't that it can't build what you asked. The failure mode is that it builds exactly what you said, and what you said had three ambiguities you didn't notice.

He frames the new PM deliverable as prototype plus PRD. The prototype covers 90% of the flows. The PRD covers edge cases. If a developer has any questions after seeing both, something is missing from one of them.

Camp one is going to spend the next two years wondering why their specs keep getting misbuilt. Camp two already has users clicking through a functional prototype before the first standup.
Aakash Gupta @aakashgupta

AI prototyping is a superpower. Explore 5 variations, and then what you ship in the product is 10x better. I got the co-founder of Wix to give me a masterclass:
3:03 - When PMs should use AI prototyping
11:40 - Design system template workflow
58:21 - Engineer handoff


Chris Bee @chrisbeetweets
Another earth-shattering AI announcement today from @OpenAI. We are not early. We are not late. We are all just along for the ride.

Chris Bee @chrisbeetweets
@lofidewanto @Grady_Booch Faster task generation from specs is real. The part people underestimate is how much spec quality determines whether any of that matters.

Dr. Jawa @lofidewanto
@Grady_Booch This time, it’s not about a higher level of abstraction, since you already had that in SDD before AI. Textual specs were already in place. This time, it’s about the ability to generate tasks and implementations from specs and user stories faster and better. A better generator.

Grady Booch @Grady_Booch
Very little about software engineering has changed over the past three months. A great deal has changed about coding, not unlike when we saw the rise of high-order programming languages and compilers, the difference today being that the number of developers is far larger and distribution channels are such that the velocity and breadth of change is far greater. The entire history of software engineering is one of raising the level of abstraction.
Jared Friedman @snowmaker

Software engineering changed more in the last 3 months than the preceding 30 years. Everything about running a software company needs to be rethought from first principles.
