rob

1.2K posts

@rwitoff

Head of Platform @Coinbase. Co-founder @Unit_410

Austin, TX · Joined June 2008
694 Following · 4.2K Followers
rob @rwitoff
“The future of high-stakes work is not AI replacing judgment. It is AI making judgment scalable, auditable, and continuously improvable” Better, faster core compliance workflows. From the great @dorvonlevi and team making the whole industry safer.
Dor @dorvonlevi

Building an AI-native @Coinbase means rebuilding everything, especially the hardest parts. We've put a lot of time into redefining compliance, where the stakes are incredibly high, and we have to be extremely thoughtful about implementation. We have invested heavily in rebuilding our compliance ops around AI with that reality as our starting constraint, not an afterthought. Here is an overview of what we've learned and what we built.

Most people assume compliance work is mostly checking whether a name appears on a sanctions list. That is the easy 5%. The other 95% is interpretive judgment under uncertainty: a customer claims their wealth came from real estate. Do the property records actually support it? Does the timeline hold? Is the documentation legitimate, or does it feel too polished? You need compliance staff and investigators who understand what “suspicious” actually looks like in context. That's part of why compliance is so hard to automate, and so expensive.

The first obvious AI approach is to hand the model the existing procedures and ask it to run them faster. That approach misunderstands what procedures are for. Good procedures are not bad investigations; they are deliberately incomplete investigations. Their job is to create consistency, auditability, and a minimum standard across thousands of cases. They excel at saying what must happen. They are far worse at capturing everything a strong analyst actually notices: which sources they trust, when they widen the search, when a document feels off, when an explanation technically fits but still does not feel earned.

Procedures also carry the shape of the old operating model: fragmented systems, time pressure, queue pressure, and the hard limit of how much one human analyst can read, cross-reference, and hold in working memory at once. That is not a flaw in the procedure. It is how you design a process for humans.

AI changes the constraint set. Reading, searching, comparing documents, and tracing inconsistencies no longer have to be treated as scarce analyst time. Done carefully, with proper controls and human review, models can explore more context, test more hypotheses, and surface more inconsistencies than any single analyst could reasonably do case by case. So if you simply automate the procedure exactly as written, you may gain efficiency. You will not unlock the full value of AI. You will just make the old bottleneck run faster. The better question is not “Can AI follow the analyst playbook?” It is: once the cost of reading, cross-referencing, and testing hypotheses collapses, what should the investigation become?

A second tempting approach: feed it historical Suspicious Activity Reports (SARs) and let it learn from outcomes. This breaks down too. You rarely have the full state of what the analyst actually saw during the investigation. A case that looks straightforward today might only look that way because information surfaced later: a fraud indictment that didn't exist when the original analyst made the call, or news articles that hadn't been published yet. Hindsight can contaminate your training data. Also, regulators themselves acknowledge that SAR decisions can be subjective.

The architecture has four layers. The first is data: continuously enhancing the coverage, quality, and architecture of the signals the system depends on. The second is classical machine learning models that cluster and classify alerts to determine what type of investigation needs to run. The third is the investigation agent itself: a multi-agent system that orchestrates specialized agents to execute the investigation end to end. The fourth is a safety filter that runs independently of typology, ensuring no risk vector is missed regardless of how the alert is classified. Each layer is independently auditable and learns from the feedback provided by human reviewers.

Inside the investigation agent, specialized sub-agents run across the full case surface: alert context, customer and identity signals, access patterns, risk indicators, transaction behavior, source-of-funds, onchain activity, and public adverse media. Each writes its findings into a shared case memory. A coordinator agent reconciles and challenges them. When sub-agents disagree, such as when source-of-funds marks activity as “explained” while adverse media surfaces a recent indictment, the coordinator attempts to resolve these disagreements knowing the common patterns. The narrative agent prepares the final report with all collected evidence and suggested resolution. The last self-validation agent acts as a guardrail: if the system cannot support its conclusion with sufficient confidence or data quality, the case is routed to manual investigation instead of being surfaced as an automated result.

Before any of this touched a real customer case, we built what we call a “Golden Set” - historical cases with known right answers. "Known right answers" in compliance is harder than it sounds. It meant re-investigating old cases, getting multiple senior analysts to independently agree on what the right call would have been, then debating the disagreements until consensus. Months of work before we could even start measuring.

Here's an important part (for now) - cases currently get BOTH the AI's full investigation AND a senior human review. We didn't reduce scrutiny. In fact, we added more of it, until it no longer proves valuable. Cases resolve significantly faster AND get more eyes than they ever did before. Every human correction feeds back into the model as a training signal. It gets better because it's wrong in front of people who know how to fix it.

None of this would have shipped without clearing structural blockers most financial institutions are still stuck on:
- Security and privacy sign-off to send customer data to LLMs at all.
- Senior compliance officer alignment on AI-assisted human decision making.
- Model Governance team embedded since December - they observed the entire Golden-Set Evaluation process and are running a formal validation review with our Internal Audit team now.

Today this handles roughly 55% of our US fraud case volume with significantly less analyst time per case. Time freed goes to the harder cases AI can't yet handle - and to teaching it.

Our internal compliance and quality teams are the ones who are building this system with the engineers, training it, validating it, and continuing to shape how it improves. In the process, they've developed skills that are incredibly valuable: how to design evals, how to think about model bias, how to think about human bias, how to architect human-in-the-loop systems. These skills are becoming among the most valuable at any company.

This entire project started ~6 months ago with a whiteboarding session between @galpa42 and me, and was built by an AI-pilled cross-functional pod. It’s just the first pod - there's a multi-month roadmap, rebuilding compliance from the ground up with AI. Huge thanks to everyone involved and congratulations to @galpa42 for shipping two babies to production this month :)

The future of high-stakes work is not AI replacing judgment. It is AI making judgment scalable, auditable, and continuously improvable.

English
0
0
12
2.5K
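The quoted post describes sub-agents writing findings into a shared case memory, a coordinator reconciling them, and a self-validation step that routes weakly supported cases to manual investigation. A minimal sketch of that flow follows. Everything in it is an assumption made for illustration and not Coinbase's implementation: the class and function names, the two verdict labels, the "any suspicious finding wins" reconciliation rule, and the 0.8 confidence floor.

```python
from dataclasses import dataclass, field


@dataclass
class Finding:
    agent: str         # hypothetical sub-agent name, e.g. "source_of_funds"
    verdict: str       # "explained" or "suspicious" (assumed labels)
    confidence: float  # 0.0 to 1.0, self-reported by the sub-agent


@dataclass
class CaseMemory:
    """Shared memory that every sub-agent writes its findings into."""
    findings: list[Finding] = field(default_factory=list)

    def write(self, finding: Finding) -> None:
        self.findings.append(finding)


CONFIDENCE_FLOOR = 0.8  # assumed threshold, not stated in the post


def coordinate(memory: CaseMemory) -> str:
    """Reconcile findings. Assumed rule: any 'suspicious' verdict wins a disagreement."""
    verdicts = {f.verdict for f in memory.findings}
    return "suspicious" if "suspicious" in verdicts else "explained"


def self_validate(memory: CaseMemory, conclusion: str) -> str:
    """Guardrail: send the case to manual investigation when support is weak."""
    supporting = [f for f in memory.findings if f.verdict == conclusion]
    if not supporting or min(f.confidence for f in supporting) < CONFIDENCE_FLOOR:
        return "manual_investigation"
    # Per the post, automated results still get a senior human review for now.
    return f"automated_result:{conclusion}"
```

With a confident `source_of_funds` finding of "explained" and a confident `adverse_media` finding of "suspicious", this sketch concludes "suspicious" and surfaces an automated result. A case whose only supporting finding falls below the floor goes to manual investigation instead.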
rob @rwitoff
There are definitely areas we'll improve here. Our spot exchange lives in a single zone (see link) to optimize for low latency. We can typically fail over faster to a warm standby in another zone, and data is stored durably for DR. This outage was particularly bad though, and we saw managed service failures impact multiple zones. We're resilient to that, but not automatically available. Those recoveries take us longer. We posted more details earlier today, but will share a full RCA after we've had more time to investigate. Happy to walk you through what happened if you want to talk live. Big fan of @Pragmatic_Eng!! docs.cdp.coinbase.com/exchange/intro…
English
0
0
4
155
Gergely Orosz @GergelyOrosz
@lukerramsden too bad Coinbase will most likely not explain in any public facing postmortem why they cannot do so...
English
3
0
6
2.3K
Gergely Orosz @GergelyOrosz
Outside of Coinbase, did any other major service have an 8-hour outage? I’ll be honest: did not notice anything else. Want to make sure I didn’t miss anything? (AWS had an outage in a single AZ. This should have… NOT taken down any service with resiliency 101)
English
86
50
1.7K
258.3K
rob @rwitoff
@Chuyqa @coinbase Confirming this should now be fixed. Let us know if you run into anything else 🙏
English
2
0
3
278
rob @rwitoff
Yesterday @coinbase experienced a multi-hour service disruption affecting trading, exchange access, and balance updates. Here's our initial read from Coinbase engineering on what happened, how we recovered, and what we're addressing.

At approximately 23:50 UTC on 2026-05-07, our monitoring detected cascading quote failures from internal services that triggered multiple Sev1 incidents that engineering immediately began investigating. Customer-facing impacts included spot trading, Prime, International and derivative exchanges.

Root cause: a thermal event (cooling system failure) inside a subset of racks within a single building in AWS us-east-1. We run a primary replica of our exchange infrastructure in a single zone, consistent with industry standards to reduce latency. To prepare for failures like this, we maintain a distributed standby, but during this incident, failures in the primary zone that were designed to be isolated were not, extending the duration of our outage.

The failure cascaded down two paths:
1. Multiple hardware components beneath our exchange’s matching engine failed, requiring recovery and failover
2. Distributed Kafka clusters that manage messaging across Coinbase systems failed to remain available, also requiring partition failovers to new hardware brokers with many TiBs of data

After isolating the incident, automated tooling drained ~10 Kubernetes clusters' worth of related workloads out of the affected zone to stabilize internal services. Most services were back to normal within ~30 minutes of diagnosis. The two things we couldn't automatically drain: the exchange (dedicated hardware and storage) and Kafka (managed service that was designed to be resilient to this, with unique problems).

The exchange matching engine is the core system responsible for processing orders and maintaining order books. It is a distributed cluster and requires quorum to safely elect a leader and continue processing trading activity. During the incident, infrastructure-level constraints in the affected datacenter left only a subset of nodes healthy, preventing the cluster from reaching quorum. As a result, trading across Retail, Advanced, and Institutional exchanges was blocked. Recovery required our oncall and engineering teams to execute our disaster recovery plan, restore quorum safely, and validate system health under constrained infrastructure conditions. The team built, tested, deployed, and validated the fix while continuing to manage the broader incident.

Kafka recovery was a much larger scale operation. Our primary managed Kafka partitions process many terabytes of data daily and are designed with resiliency guarantees for uninterrupted operation during a datacenter failure just like this. In this case, those guarantees failed and required manual recovery. We again relied on disaster recovery procedures to recover stuck partitions onto new hardware (brokers) that enabled us to safely bring x-service messaging back online across Coinbase. During the lag, customers saw delayed balance streams which resolved automatically once replication caught up. No data lost.

Once the engine came back up as part of our standard runbooks, we re-opened markets carefully: all products to cancel-only mode first, audited product states, then moved all markets to auction mode, before restoring trading on Coinbase Exchange.

What went right: the team. Incident response across the company came together within minutes, followed well-rehearsed playbooks and used secure automation tooling to recover all services. We have a strong, senior team at Coinbase that worked through rare failure modes to recover all services.

To our customers: losing access to your account, even temporarily, is unacceptable. We know that. We're sorry, and we’ll publish a full root cause analysis in the coming weeks 🙏
English
61
44
373
298.7K
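Two mechanisms in the post above lend themselves to a short illustration: the matching engine needing quorum before it can elect a leader, and the staged re-opening of markets (cancel-only, audit, auction, trading). The sketch below is a rough model under stated assumptions, not Coinbase's code. It assumes a simple majority quorum, which the post does not specify, and the `HALTED` mode, the function names, and the `audit_passed` flag are all hypothetical.

```python
from enum import Enum


def has_quorum(healthy_nodes: int, cluster_size: int) -> bool:
    """Majority quorum (assumed): strictly more than half of all nodes are healthy."""
    return healthy_nodes > cluster_size // 2


class MarketMode(Enum):
    HALTED = "halted"            # hypothetical name for the pre-recovery state
    CANCEL_ONLY = "cancel_only"
    AUCTION = "auction"
    TRADING = "trading"


# Order taken from the post: cancel-only first, then auction, then full trading.
REOPEN_ORDER = [
    MarketMode.HALTED,
    MarketMode.CANCEL_ONLY,
    MarketMode.AUCTION,
    MarketMode.TRADING,
]


def next_mode(current: MarketMode, audit_passed: bool) -> MarketMode:
    """Advance one stage; leaving cancel-only requires a passed product-state audit."""
    if current is MarketMode.TRADING:
        return current
    if current is MarketMode.CANCEL_ONLY and not audit_passed:
        return current
    return REOPEN_ORDER[REOPEN_ORDER.index(current) + 1]
```

Under this model a five-node cluster with two healthy nodes cannot elect a leader, while three healthy nodes can. A market also stays in cancel-only mode until its audit passes, matching the order of steps the post describes.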
rob @rwitoff
Can confirm NASA grade QA standards. I spent 5 years working at @NASAJPL before becoming responsible for infra, security and now eng at Coinbase 12 years ago. Many of our eng + security + quality standards are modeled after, or better than what I grew up on there. wsj.com/articles/BL-DG…
English
0
1
43
3.1K
Architect🛡️ @Architect9000
“As seen on $COIN 2026 Q1 earnings”, me 😎 I wasn’t the only one with this sentiment so I asked what was up with “non-technical people pushing code to prod”. When you’re storing a million bucks on a platform, you wanna be hearing about how their engineers are ex-Mossad International Olympiad gold medalists with NASA-grade QA standards. Insinuating that casuals are vibe-coding your bank is a scary thought. AI is going to create new employee role categories which look far more cross-functional and Socratic in nature. We need to experiment with the many possibilities. But when it comes to our money, I think Brian knows he should have led with reliability instead of leverage. @brian_armstrong’s answer was good and should satisfy anyone who let their imagination run wild over his earlier comments. Listen here (28:08): youtube.com/live/d7BeHWXcL…
English
5
3
38
6.1K
rob @rwitoff
@seslly @coinbase We’ll share more in our full RCA, but we had an appropriate RF to survive a zone outage. The way this hardware failed triggered a bug in the managed cluster that still took the cluster down, which we had to work around to recover with the vendor.
English
1
0
13
4.5K
seslly @seslly
@rwitoff @coinbase you can have kafka in multiple AZs but having a replication factor of 1 would be the only reason AZ outage like this could be so devastating (MSK requires >=2 AZ so it's the config that bites you) i say this as a former employee that wants yall to do better even without me
English
2
1
22
5.2K
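The exchange above turns on how a Kafka replication factor (RF) interacts with the loss of one availability zone. A simplified sketch of that reasoning follows. It assumes every replica was in sync before the failure and that unclean leader election is disabled. The function name and the status labels are hypothetical. It models only the configuration argument and would not capture the managed-cluster bug rob describes.

```python
def partition_status(
    replica_zones: list[str],
    failed_zone: str,
    min_insync_replicas: int,
) -> str:
    """Classify one partition after a zone loss, given the zone of each replica."""
    surviving = [zone for zone in replica_zones if zone != failed_zone]
    if not surviving:
        # No replica left to become leader: the partition is unavailable.
        return "offline"
    if len(surviving) < min_insync_replicas:
        # A leader exists, but producers using acks=all are rejected.
        return "read_only"
    return "available"
```

With RF=1 and the single replica in the failed zone the partition goes offline, which is the case seslly describes. With RF=3 spread over three zones and `min.insync.replicas=2`, the partition stays available, consistent with rob's point that the configured RF should have survived a zone outage.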
rob @rwitoff
@zquestz thanks josh. tbf it is fun to imagine there’s a vibe coding conspiracy afoot, unfortunately that’s just not the case 😆🕵️
English
0
0
9
678
Josh Ellithorpe @zquestz
Actual facts about the Coinbase outage yesterday. As usual, Rob explains clearly what happened, and I am sure will take steps to make the systems more resilient in the future.

Things that didn't happen. If your "influencer" told you these were the reason, they are just baiting you for clicks and engagement.
- No one vibe coded something that failed.
- A "non-engineer" didn't push production code and take out the trading engine.
- It wasn't intentional.
- It wasn't because Coinbase failed to design a fail-over system.

Things happen at scale. Don't let the armchair quarterbacks tell you tall tales.
English
43
13
164
89.6K
rob @rwitoff
@coinbase More about our exchange architecture @ docs.cdp.coinbase.com/exchange/intro…
English
6
3
24
10.7K
Jesus Alvarez @Chuyqa
@rwitoff @coinbase You still have dozens of pairs sending bad data on ticker and l2 as of 2:39 PM Pacific. OP, IOTX, ICP.. list is rather long. Good luck getting this synced.
English
3
0
8
5.5K
james smith @jamessmith57013
@coinbase is there an aws outage going on right now?
English
1
1
0
276
rob @rwitoff
@0x_Osprey agreed. it will take years.
English
0
1
3
997
Joe @0x_Osprey
@rwitoff I still think its gonna be a crazy chasm over the next few months before we ultimately hit the “golden age”😅
English
1
0
5
281
rob @rwitoff
@katie_haun This is great news for everyone. Industries get better when @HaunVentures is driving. Congrats Katie and team!!
English
0
0
6
732
Katie Haun @katie_haun
Today we’re announcing $1 billion in new funds to back the bold founders shaping the next era of finance and technology. I’ve been following the flow of assets my entire career and have never seen a more dynamic time. Financial infrastructure is being rebuilt from the ground up, new assets and markets are emerging, and an agentic economy is developing as AI agents begin to transact on behalf of humans. These areas, among others, are what will define the coming years as we deploy these new funds. We’re excited for what’s ahead, and wrote about our thesis in the post below.
Katie Haun @katie_haun

x.com/i/article/2050…

English
190
79
1.2K
318.7K
rob @rwitoff
My AI coach gave me a B- this week.

Every week, agents check in on my digital life. They send back 4 things:
1. What I'm missing
2. What's changing
3. What's going well
4. What's not

All of this data is sitting there:
- iMessageDB
- ScreentimeDB
- EightSleep API (eightctl)
- Oura MCP
- Google Workspace MCP
- and more

Everyone will have a world class executive coach (agent) this year and it's going to change your life.
English
4
0
43
2.8K
rob @rwitoff
In the last 12 months, we’ve seen a 27x increase in non-engineers using dev tools like Claude, OpenCode and Cursor to build & automate how we work. The goal is to turn everyone into a builder, and safely reduce the distance between idea → execution to near zero. Trust is our most important asset at @coinbase, so this is fueled by a massive effort in quality, guardrails and simplification.
English
16
26
226
147.2K
rob @rwitoff
These will blow you away because Fred & Balaji are wired into Coinbase: github, drive, linear, slack & more. Fred for the expert strategy, Balaji for challenging assumptions. A 10x team. Wait until you see all the subagents + new capabilities we're wiring in now. #BestPlaceToBuild
Brian Armstrong @brian_armstrong

Coinbase is testing AI agents that show up in slack/email at work, just like any human teammate. To start we're shipping two which are modeled after legendary former Coinbase employees, @FEhrsam and @balajis. (Who brutally frame mogged who in this matchup?) Soon, it will be easy for any employee to spin up a new agent for themselves or their team. I suspect we will have more agents than human employees at some point soon.

English
1
1
39
6.2K
rob @rwitoff
@gcockfoster since i’ve called the book’s style “awful” i don’t want to share the name. but you can do for any book!
English
0
0
0
39
rob @rwitoff
The last autobiography I read had a style + verbosity that were awful, but had an important first-person view I wanted to understand. So I had an agent rewrite and repackage the book for my kindle without the fluff, and with more relevant historical context. Just finished the book and loved it. 2026 is awesome.
English
1
0
15
1.9K
rob @rwitoff
Screens are the cigarettes of our generation. We all know we use our devices poorly, but device manufacturers will never be incentivized to optimize for our time.

So Claude and I built a tool that liberates your iOS Screen Time data and lets Claude give you brutally honest advice on your habits. It tells you:
- What's eating your time
- Where you're context-switching too much
- What you're actually doing well
- One concrete thing to change this week

Open source, all data stays on your device, takes 30 seconds to set up. Try it: github.com/witoff/screent…
English
2
0
18
1.5K