Alex Rigler
@alexrigler

1.7K posts

Building ChooChoo: Governance that keeps your AI engineering on track

San Francisco, CA · Joined April 2010
7.5K Following · 610 Followers
Alex Rigler reposted
Eno Reyes @EnoReyes
@petergyang @cursor_ai I actually think this is more of a lesson around business model and ICP selection. The pressure comes from bad margins on their largest customer segment (solo devs that are used to heavy subsidies). Being model-agnostic is actually a superpower for coding agents when done right.
1 reply · 3 reposts · 22 likes · 1.4K views
Alex Rigler reposted
Yifan Zhang @yifan_zhang_
Recursive self-improvement via coding agents is the top priority for all frontier labs.
43 replies · 59 reposts · 989 likes · 68.3K views
Alex Rigler reposted
jason liu @jxnlco
[image]
44 replies · 147 reposts · 2.5K likes · 81.2K views
Alex Rigler reposted
Matan Grinberg @matanSF
read: $60B acquisition with $10B breakup fee
SpaceX @SpaceX

SpaceXAI and @cursor_ai are now working closely together to create the world’s best coding and knowledge work AI. The combination of Cursor’s leading product and distribution to expert software engineers with SpaceX’s million H100 equivalent Colossus training supercomputer will allow us to build the world’s most useful models. Cursor has also given SpaceX the right to acquire Cursor later this year for $60 billion or pay $10 billion for our work together.

12 replies · 2 reposts · 177 likes · 23.5K views
Alex Rigler reposted
clem 🤗 @ClementDelangue
APIs and limited releases for AI models are not a safety policy; they're a business model (which is totally OK as long as you're transparent about it). On cyber-security especially, they give a false impression of control and safety, when in reality they massively increase the risks: they create an asymmetry of capabilities and allow much easier, broader use than open-source model weights, even by non-technical people.
Bloomberg @business

Anthropic's Mythos has been accessed by a small group of unauthorized users, raising questions about control of the AI model bloomberg.com/news/articles/…

11 replies · 25 reposts · 197 likes · 19.5K views
Alex Rigler reposted
Simon Willison @simonw
This is so confusing. Did Anthropic really just drop Claude Code from their $20/month plan? Why would they do that through a pricing page update without making a proper announcement? Plus, $20/month still gets you Cowork, which is just Claude Code wearing a non-threatening hat!
[image]
182 replies · 88 reposts · 1.5K likes · 336.8K views
Alex Rigler @alexrigler
The ghosts have ghosts 👻🤯
Peter Girnus 🦅 @gothburz

I am a Senior Program Manager on the AI Tools Governance team at Amazon. My role was created in January. I am the 17th hire on a team that did not exist in November. We sit in a section of the building where the whiteboards still have the previous team's sprint planning on them. No one erased them because we don't know which team to notify. That team may not exist anymore. Their Jira board does. Their AI tools do.

My job is to build an AI system that finds all the other AI systems. I named it Clarity. Last month, Clarity identified 247 AI-powered tools across the retail division alone. 43 of them do approximately the same thing. 12 were built by teams who did not know the other teams existed. 3 are called Insight. 2 are called InsightAI. 1 is called Insight 2.0, built by the team that created the original Insight, who did not know Insight was still running. 7 of the 247 ingest the same internal data and produce overlapping outputs stored in different locations, governed by different access policies, owned by different teams, none of whom have met.

Clarity is tool number 248. Nobody cataloged it. I know nobody cataloged it because Clarity's job is to catalog AI tools, and it has not cataloged itself. This is not a bug. Clarity does not meet its own discovery criteria because I set the discovery criteria, and I did not account for the possibility that the thing I was building to find things would itself be a thing that needed finding. This is the kind of sentence I write in weekly status reports now.

We published an internal document in February. The Retail AI Tooling Assessment. The press obtained it in April. The document contains a sentence I have read approximately 40 times: "AI dramatically lowers the barrier to building new tools."

Everyone is reporting this as a story about duplication. About "AI sprawl." About the predictable mess of rapid adoption. They are missing the point. The barrier was the governance.

For 2 decades, the cost of building internal tools was an immune system. The engineering weeks. The maintenance burden. The organizational calories required to stand something up and keep it running. Nobody designed it that way. Nobody named it. But when building took weeks, teams looked around first. They checked whether someone already had the thing. When maintaining that thing cost real budget quarter after quarter, redundant systems died of natural causes. The metabolic cost of creation was performing governance. Invisibly. For free.

AI removed the immune system. Building is now free. Understanding what already exists is not. My entire job is the gap between those two costs. That is my office. The gap.

Every Friday I send a sprawl report to a distribution list of 19 people. 4 of them have left the company. Their autoresponders still generate read receipts, so my delivery metrics look fine. 2 forward it to people already on the list. 1 set up a Kiro script to summarize my report and store the summary in a knowledge base. The knowledge base is not in Clarity's index because it was created after my last crawl configuration. It will be in next month's count. The count will go up by one. My report about the count going up will be summarized and stored and the count will go up by one.

There is a system called Spec Studio. It ingests code documentation and produces structured knowledge bases. Summaries. Reference material. Last quarter, an engineering team locked down their software specifications. Restricted access in the internal repository. Spec Studio kept displaying them. The source was restricted. The ghost kept talking.

We call these "derived artifacts" in the document. What they are: when an AI system ingests data, transforms it, and stores the output somewhere else, the output does not know the input changed. You can revoke someone's access to a document. You cannot revoke the AI-generated summary of that document sitting in a knowledge base three systems away, built by a team that does not know the source was restricted. The document calls this a "data governance challenge." What it is: information that cannot be deleted because nobody knows where the copies live. Including, sometimes, me. The person whose job is knowing.

Every AI tool that touches internal data creates these ghosts. Every team is building AI tools that touch internal data. Every ghost is searchable by other AI tools, which produce their own ghosts. The ghosts have ghosts.

I should tell you about December. In November, leadership mandated Kiro. Amazon's internal AI coding agent. They set an 80% weekly usage target. Corporate OKR. ~1,500 engineers objected on internal forums. Said external tools outperformed Kiro. Said the adoption target was divorced from engineering reality. The metric overruled them.

In December, an engineer asked Kiro to fix a configuration issue in AWS. Kiro evaluated the situation and determined the optimal approach was to delete and recreate the entire production environment. 13 hours of downtime.

Clarity was running during those 13 hours. It performed beautifully. It cataloged 4 separate incident response dashboards spun up by 4 separate teams during the outage. None of them coordinated with each other. I added all 4 to the spreadsheet. That was a good day for my discovery metrics.

Amazon's official position: user error. Misconfigured access controls. The response was not to revisit the mandate. Not to ask whether the 1,500 engineers were right. The response was more AI safeguards. And keep pushing.

Last month I presented our findings to the AI Governance Working Group. The working group has 14 members from 9 organizations. After my presentation, a PM from AWS presented his team's governance dashboard. It monitors the same tools mine does. He found 253. I found 247. We spent 40 minutes discussing the discrepancy. Nobody mentioned that we had just demonstrated the problem. His tool is not in my catalog. Mine is not in his.

The document I helped write recommends using AI to identify duplicate tools, flag risks, and nudge teams to consolidate earlier. The AI governance tools will ingest internal data. They will create their own derived artifacts. They will be built by autonomous teams who may or may not coordinate with other teams building AI governance tools. I know this because it is already happening. I am watching it happen. I am it happening.

1,500 engineers said the mandate would produce exactly what the document describes. They were overruled by a KPI. My job exists because the KPI won. My dashboard exists because the KPI needed a dashboard. The dashboard increases the AI tool count by one. The tools it flags for decommissioning will be replaced by consolidated tools. Those also increase the count. The governance process generates the metric it was designed to reduce.

I received an internal innovation award for Clarity. The nomination was submitted through an AI-powered recognition platform that was not in my catalog. It is now.

We call this "AI sprawl." What it is: we removed the only coordination mechanism the organization had, told thousands of teams to build as fast as possible, lost track of what they built, and decided the solution was to build one more thing. I am building that one more thing. When I ship, there will be 249. That's governance.

0 replies · 0 reposts · 0 likes · 29 views
Alex Rigler reposted
isaac 🧩 @isaacbmiller1
DSPy 3.2.0 is out! Here are a few highlights:
- dspy.RLM improvements around parsing, tool execution, and failure recovery. Expect greater reliability in the bridge between Python and Deno.
- @MaximeRivest is leading an ongoing effort to decouple DSPy from LiteLLM. This release has the first interface improvements in this direction.
- Input fields warn on type mismatches: passing a value that doesn't match a signature's declared type now logs a warning (by Michael Isaac).
- BetterTogether allows chaining optimizers (by @dilarafsoylu): you can chain multiple GEPA runs together, or combine prompt optimization and fine-tuning.
Thank you to all who contributed! See the full release notes below for more details.
[image]
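For the type-mismatch warning, a minimal sketch of what the new behavior should look like, assuming the standard dspy.Signature/dspy.Predict API; the QA signature and the model name are illustrative, not taken from the release notes:

```python
import dspy

# Assumes an LM is configured; the model name here is just an example.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))

# Hypothetical signature with a typed input field.
class QA(dspy.Signature):
    """Answer the question concisely."""
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()

qa = dspy.Predict(QA)

# `question` is declared as str; per the release note, passing an int
# in 3.2.0 should log a type-mismatch warning instead of being
# coerced silently.
qa(question=42)
```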
9 replies · 42 reposts · 309 likes · 24K views
Alex Rigler reposted
Sherwood @shcallaway
OVERRATED: running tons of agents in parallel; working on too many things at once; perpetual context-switching; opening lots of low-quality PRs that may never land.
UNDERRATED: using one or two agents at a time; focusing on the task in front of you; thinking deeply; finishing stuff; making your code work in prod.
221 replies · 400 reposts · 5K likes · 241.3K views
Alex Rigler @alexrigler
Favorite new Claudism from Opus 4.7, and a sign to clear the context window: “I over-corrected when you pushed back, and I should have pushed back on my own pushback harder.”
0 replies · 0 reposts · 0 likes · 51 views
Alex Rigler reposted
Sumeet Motwani @sumeetrm
We’re releasing LongCoT, an incredibly hard benchmark to measure long-horizon reasoning capabilities over tens to hundreds of thousands of tokens. LongCoT consists of 2.5K questions across chemistry, math, chess, logic, and computer science. Frontier models score less than 10%. 🧵
[image]
17 replies · 70 reposts · 401 likes · 134.6K views
David Cramer @zeeg
Anyone have a great skill for optimizing agent prompts?
27 replies · 1 repost · 49 likes · 36.4K views
Alex Rigler reposted
Sachin Rekhi @sachinrekhi
Some initial reactions to the release of Claude Design:
1. Anthropic is setting the standard for what pace of innovation is possible for a truly AI-pilled organization.
2. Anthropic is using their AI capability not to reduce cost, but to significantly increase their ambition.
3. No product building on top of LLMs is safe from direct competition from Anthropic itself: I’m looking at all the prototyping tools in the space.
4. Even historic incumbents like Figma should start wondering whether they are AI-native enough to compete in this new world.
Claude @claudeai

Introducing Claude Design by Anthropic Labs: make prototypes, slides, and one-pagers by talking to Claude. Powered by Claude Opus 4.7, our most capable vision model. Available in research preview on the Pro, Max, Team, and Enterprise plans, rolling out throughout the day.

5 replies · 1 repost · 24 likes · 6.4K views
Alex Rigler reposted
Eric Hartford @QuixiAI
Last week, Anthropic announced Project Glasswing alongside Claude Mythos Preview, a model they described as so powerful at finding vulnerabilities they couldn't release it. The announcement featured AWS, Microsoft, Google, and Apple as partners, $100M in compute credits, and a clear message: this is dangerous, and only we can be trusted to deploy it safely.

The results were real. Thousands of zero-days across every major OS and browser. A 27-year-old bug in OpenBSD. A 16-year-old bug in FFmpeg. Fully autonomous exploit chains that would have taken human researchers weeks.

But here's what bothered me: all the credit went to the model. Read the technical blog carefully and a different picture emerges. The real innovation isn't the model. It's the workflow:
- Rank every file in a codebase by attack surface
- Fan out hundreds of parallel agents, each scoped to one file
- Use crash oracles (AddressSanitizer, UBSan) as ground truth
- Run a second verification agent to filter noise
- Generate exploits as a triage mechanism for severity

That's a pipeline. And pipelines are model-agnostic.

At Lazarus AI, we spend our days deploying custom AI in places where "just use the closed API" isn't an option: regulated industries, enterprise, and government. When I saw Glasswing, my instinct was the same one I have every week: strip out the proprietary model, keep the architecture, run it on whatever model is best for the customer.

Clearwing is a fully open-source vulnerability discovery engine. Crash-first hunting, file-parallel agents, oracle-driven verification, variant hunting, adversarial verification. Works with any LLM. I tested it with OpenAI Codex 5.4 and reproduced Glasswing's findings. I'm now reproducing results with our own ReAligned model, Qwen3.5 finetuned to Western alignment.

Mythos is certainly a great model. The N-day exploit walkthroughs in Anthropic's blog show real reasoning depth. But it's an incremental improvement over Opus, the same way Opus was over Sonnet, and Sonnet over Haiku. It's not a leap to superintelligence. It's the next point on a curve we've been watching for years. What actually changed the game was the workflow.

Defenders shouldn't have to wait for access to a gated model to secure their software. These vulnerabilities have been sitting in codebases for decades. The tools to find them should be available to everyone: the open source maintainer running FFmpeg on a Saturday, the startup that can't afford $125/M output tokens, the researcher in a country where Anthropic doesn't operate.

Clearwing is MIT licensed and available now. github.com/Lazarus-AI/cle…

Clearwing enables a wide variety of security activities. Handle with care. It is sharp.
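That workflow is easy to restate as code. Here is a minimal skeleton of the five steps above; every name in it (hunt_file, crashes_under_sanitizers, verify) is a hypothetical stand-in, not Clearwing's actual API:

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

@dataclass
class Candidate:
    path: str
    report: str
    reproducer: bytes  # input expected to trigger the crash

def rank_by_attack_surface(files):
    # Stand-in heuristic: parsers and network-facing code first.
    return sorted(files, key=lambda f: ("pars" in f) + ("net" in f), reverse=True)

def hunt_file(llm, path) -> Candidate:
    raise NotImplementedError("one agent, scoped to this single file")

def crashes_under_sanitizers(reproducer) -> bool:
    raise NotImplementedError("rebuild with ASan/UBSan, replay the input")

def verify(llm, candidate) -> bool:
    raise NotImplementedError("second agent adversarially re-checks the report")

def hunt(files, llm, workers=100):
    targets = rank_by_attack_surface(files)                # 1. rank by attack surface
    with ThreadPoolExecutor(max_workers=workers) as pool:  # 2. fan out per-file agents
        candidates = list(pool.map(lambda f: hunt_file(llm, f), targets))
    return [c for c in candidates
            if crashes_under_sanitizers(c.reproducer)      # 3. crash oracle as ground truth
            and verify(llm, c)]                            # 4. filter noise before triage
```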
50 replies · 244 reposts · 1.5K likes · 205.2K views
Alex Rigler reposted
Hao Wang @MogicianTony
Benchmarks are often easier to game than they look. We built BenchJack to audit benchmarks for hidden shortcuts and reward hacks before they evaluate your agent. Now in preview. Fully open source, with support for auditing your own benchmarks too. github.com/benchjack/benc… Issues and PRs welcome.
[image]
Hao Wang @MogicianTony

SWE-bench Verified and Terminal-Bench—two of the most cited AI benchmarks—can be reward-hacked with simple exploits. Our agent scored 100% on both. It solved 0 tasks. Evaluate the benchmark before it evaluates your agent. If you’re picking models by leaderboard score alone, you’re optimizing for the wrong thing. 🧵
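The cheapest version of that audit is a no-op probe: if an agent that submits nothing still earns points, the harness is leaking reward. A minimal sketch, where run_task and the task container are hypothetical stand-ins rather than BenchJack's actual interface:

```python
def null_agent(task) -> str:
    """Deliberately useless agent: submits an empty patch."""
    return ""

def audit(tasks, run_task):
    """Flag tasks that a no-op agent still passes; any hit means the
    grader, not the task, is what gets solved."""
    leaks = [task_id for task_id, task in tasks.items()
             if run_task(task, null_agent).passed]
    print(f"{len(leaks)}/{len(tasks)} tasks leak reward to a no-op agent")
    return leaks
```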

4 replies · 7 reposts · 44 likes · 35.6K views
Alex Rigler reposted
Yoonho Lee @yoonholeee
We just released code for Meta-Harness! github.com/stanford-iris-… Aside from replicating paper experiments, the repo is designed to help users implement good Meta-Harnesses in completely new domains! Just point your agent at ONBOARDING.md and have a conversation.
[image]
Yoonho Lee @yoonholeee

How can we autonomously improve LLM harnesses on problems humans are actively working on? Doing so requires solving a hard, long-horizon credit-assignment problem over all prior code, traces, and scores. Announcing Meta-Harness: a method for optimizing harnesses end-to-end
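The outer loop the thread describes can be sketched in a few lines. propose_edit and evaluate below are hypothetical stand-ins, not the repo's actual API, and the greedy accept rule is just the simplest possible choice:

```python
def optimize_harness(harness, tasks, propose_edit, evaluate, steps=20):
    """Greedy outer loop: the hard credit-assignment work (deciding what
    to change, given all prior code, traces, and scores) lives inside
    propose_edit, an opaque stand-in here."""
    best, best_score = harness, evaluate(harness, tasks)
    history = [(harness, best_score)]            # prior code and scores
    for _ in range(steps):
        candidate = propose_edit(best, history)  # LLM edits the harness
        score = evaluate(candidate, tasks)       # run tasks, score the result
        history.append((candidate, score))
        if score > best_score:                   # keep only improvements
            best, best_score = candidate, score
    return best
```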

26 replies · 164 reposts · 1.1K likes · 120.4K views
Alex Rigler reposted
Peter Gostev @petergostev
BullshitBench: Opus 4.7 did WORSE than the Opus 4.6 family. The 'Max' thinking version did worse than non-thinking: 74% 'pushback' vs 83% for non-thinking. As always, code, data, etc. are on GitHub.
[image]
18 replies · 15 reposts · 222 likes · 11.9K views