rob (@rwitoff) - Twitter Profile

rob@rwitoff·15 May

“The future of high-stakes work is not AI replacing judgment. It is AI making judgment scalable, auditable, and continuously improvable” Better, faster core compliance workflows. From the great @dorvonlevi and team making the whole industry safer.

Dor@dorvonlevi

Building an AI-native @Coinbase means rebuilding everything, especially the hardest parts. We've put a lot of time into redefining compliance, where the stakes are incredibly high, and we have to be extremely thoughtful about implementation. We have invested heavily in rebuilding our compliance ops around AI with that reality as our starting constraint, not an afterthought. Here is an overview of what we've learned and what we built.
Most people assume compliance work is mostly checking whether a name appears on a sanctions list. That is the easy 5%. The other 95% is interpretive judgment under uncertainty: a customer claims their wealth came from real estate. Do the property records actually support it? Does the timeline hold? Is the documentation legitimate, or does it feel too polished? You need compliance staff and investigators who understand what “suspicious” actually looks like in context. That's part of why compliance is so hard to automate, and so expensive.
The first obvious AI approach is to hand the model the existing procedures and ask it to run them faster. That approach misunderstands what procedures are for. Good procedures are not bad investigations; they are deliberately incomplete investigations. Their job is to create consistency, auditability, and a minimum standard across thousands of cases. They excel at saying what must happen. They are far worse at capturing everything a strong analyst actually notices: which sources they trust, when they widen the search, when a document feels off, when an explanation technically fits but still does not feel earned.
Procedures also carry the shape of the old operating model: fragmented systems, time pressure, queue pressure, and the hard limit of how much one human analyst can read, cross-reference, and hold in working memory at once. That is not a flaw in the procedure. It is how you design a process for humans.
AI changes the constraint set.
Reading, searching, comparing documents, and tracing inconsistencies no longer have to be treated as scarce analyst time. Done carefully, with proper controls and human review, models can explore more context, test more hypotheses, and surface more inconsistencies than any single analyst could reasonably do case by case. So if you simply automate the procedure exactly as written, you may gain efficiency. You will not unlock the full value of AI. You will just make the old bottleneck run faster.
The better question is not “Can AI follow the analyst playbook?” It is: once the cost of reading, cross-referencing, and testing hypotheses collapses, what should the investigation become?
A second tempting approach: feed it historical Suspicious Activity Reports (SARs) and let it learn from outcomes. This breaks down too. You rarely have the full state of what the analyst actually saw during the investigation. A case that looks straightforward today might only look that way because information surfaced later: a fraud indictment that didn't exist when the original analyst made the call, or news articles that hadn't been published yet. Hindsight can contaminate your training data. Also, regulators themselves acknowledge that SAR decisions can be subjective.
The architecture has four layers. The first is data: continuously enhancing the coverage, quality, and architecture of the signals the system depends on. The second is classical machine learning models that cluster and classify alerts to determine what type of investigation needs to run. The third is the investigation agent itself: a multi-agent system that orchestrates specialized agents to execute the investigation end to end. The fourth is a safety filter that runs independently of typology, ensuring no risk vector is missed regardless of how the alert is classified. Each layer is independently auditable and learns from the feedback provided by human reviewers.
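The four layers described above can be pictured as a small pipeline. This is a minimal sketch under stated assumptions, not Coinbase's implementation: the signal names, the threshold classifier, and every function name here are hypothetical stand-ins. The point it illustrates is that the safety filter runs on every alert, whatever typology the classifier assigned, and that each layer leaves an audit entry.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    alert_id: str
    signals: dict[str, float]  # layer 1: data signals (hypothetical names)
    audit_log: list[str] = field(default_factory=list)


def classify_typology(alert: Alert) -> str:
    """Layer 2 stand-in: a threshold in place of a trained classifier."""
    score = alert.signals.get("new_device_score", 0.0)
    typology = "account_takeover" if score > 0.8 else "general_fraud"
    alert.audit_log.append(f"classifier: {typology}")
    return typology


def investigate(alert: Alert, typology: str) -> str:
    """Layer 3 stand-in: the multi-agent investigation would run here."""
    finding = f"investigated as {typology}"
    alert.audit_log.append(f"investigation: {finding}")
    return finding


def safety_filter(alert: Alert, threshold: float = 0.9) -> list[str]:
    """Layer 4: checks every risk signal, ignoring the assigned typology."""
    flags = sorted(
        name
        for name, value in alert.signals.items()
        if name.startswith("risk_") and value >= threshold
    )
    alert.audit_log.append(f"safety_filter: {flags}")
    return flags


def run_pipeline(alert: Alert) -> dict:
    typology = classify_typology(alert)
    finding = investigate(alert, typology)
    flags = safety_filter(alert)  # always runs, whatever the typology
    return {"typology": typology, "finding": finding, "safety_flags": flags}


# Hypothetical alert: classified as general fraud, but one risk signal is high.
alert = Alert("a-1", {"new_device_score": 0.2, "risk_sanctions_proximity": 0.95})
result = run_pipeline(alert)
```

In this toy run the classifier picks the general typology, yet the high risk signal is still flagged, which is the behavior the fourth layer is described as guaranteeing.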
Inside the investigation agent, specialized sub-agents run across the full case surface: alert context, customer and identity signals, access patterns, risk indicators, transaction behavior, source-of-funds, onchain activity, and public adverse media. Each writes its findings into a shared case memory. A coordinator agent reconciles and challenges them. When sub-agents disagree, such as when source-of-funds marks activity as “explained” while adverse media surfaces a recent indictment, the coordinator attempts to resolve the disagreement using known common patterns. The narrative agent prepares the final report with all collected evidence and a suggested resolution. The final self-validation agent acts as a guardrail: if the system cannot support its conclusion with sufficient confidence or data quality, the case is routed to manual investigation instead of being surfaced as an automated result.
Before any of this touched a real customer case, we built what we call a “Golden Set” - historical cases with known right answers. "Known right answers" in compliance is harder than it sounds. It meant re-investigating old cases, getting multiple senior analysts to independently agree on what the right call would have been, then debating the disagreements until consensus. Months of work before we could even start measuring.
Here's an important part (for now) - cases currently get BOTH the AI's full investigation AND a senior human review. We didn't reduce scrutiny; in fact, we added more of it until it no longer proves valuable. Cases resolve significantly faster AND get more eyes than they ever did before. Every human correction feeds back into the model as a training signal. It gets better because it's wrong in front of people who know how to fix it.
None of this would have shipped without clearing structural blockers most financial institutions are still stuck on.
Security and privacy sign-off to send customer data to LLMs at all.
Senior compliance officer alignment on AI-assisted human decision making.
Model Governance team embedded since December - they observed the entire Golden-Set Evaluation process and are running a formal validation review with our Internal Audit team now.
Today this handles roughly 55% of our US fraud case volume with significantly less analyst time per case. Time freed goes to the harder cases AI can't yet handle - and to teaching it.
Our internal compliance and quality teams are the ones who are building this system with the engineers, training it, validating it, and continuing to shape how it improves. In the process, they've developed skills that are becoming among the most valuable at any company: how to design evals, how to think about model bias, how to think about human bias, and how to architect human-in-the-loop systems.
This entire project started ~6 months ago with a whiteboarding session between @galpa42 and me, and was built by an AI-pilled cross-functional team. It’s just the first pod - there's a multi-month roadmap, rebuilding compliance from the ground up with AI. Huge thanks to everyone involved and congratulations to @galpa42 for shipping two babies to production this month :)
The future of high-stakes work is not AI replacing judgment. It is AI making judgment scalable, auditable, and continuously improvable.
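The shared case memory and the self-validation guardrail described in the post above can be sketched roughly as follows. This is an illustrative simplification, not the system the post describes: the `Finding` fields, the `0.8` threshold, and the routing labels are all assumptions, and any unresolved disagreement is sent straight to manual investigation here, whereas the post's coordinator first attempts to resolve it.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Finding:
    agent: str         # which sub-agent wrote this entry (hypothetical names)
    verdict: str       # "explained" or "suspicious"
    confidence: float  # 0.0 to 1.0, self-reported by the sub-agent
    evidence: str


def find_conflicts(case_memory: list[Finding]) -> list[tuple[Finding, Finding]]:
    """Return every pair of findings whose verdicts disagree."""
    return [(a, b) for a, b in combinations(case_memory, 2) if a.verdict != b.verdict]


def route_case(case_memory: list[Finding], min_confidence: float = 0.8) -> str:
    """Guardrail: fall back to manual investigation when support is weak."""
    if not case_memory:
        return "manual_investigation"
    if find_conflicts(case_memory):
        return "manual_investigation"
    if min(f.confidence for f in case_memory) < min_confidence:
        return "manual_investigation"
    # Per the post, automated results still receive senior human review.
    return "automated_result_with_human_review"


conflicting = [
    Finding("source_of_funds", "explained", 0.9, "property records match"),
    Finding("adverse_media", "suspicious", 0.95, "recent indictment"),
]
agreeing = [
    Finding("source_of_funds", "explained", 0.9, "property records match"),
    Finding("adverse_media", "explained", 0.85, "no relevant coverage"),
]

conflict_route = route_case(conflicting)
agree_route = route_case(agreeing)
```

The conflicting case mirrors the example in the post (source-of-funds says “explained”, adverse media finds an indictment) and is routed to a human; the agreeing, high-confidence case is surfaced as an automated result.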

0

12

2.5K

rob@rwitoff·9 May

There are definitely areas we'll improve here. Our spot exchange lives in a single zone (see link) to optimize for low latency. We can typically fail over faster to a warm standby in another zone, and data is stored durably for DR. This outage was particularly bad though, and we saw managed service failures impact multiple zones. We're resilient to that, but not automatically available. Those recoveries take us longer. We posted more details earlier today, but will share a full RCA after we've had more time to investigate. Happy to walk you through what happened if you want to talk live. Big fan of @Pragmatic_Eng!! docs.cdp.coinbase.com/exchange/intro…

0

4

155

Gergely Orosz@GergelyOrosz·8 May

@lukerramsden too bad Coinbase will most likely not explain in any public facing postmortem why they cannot do so...

3

0

6

2.3K

Gergely Orosz@GergelyOrosz·8 May

Outside of Coinbase, did any other major service have an 8-hour outage? I’ll be honest: did not notice anything else. Want to make sure I didn’t miss anything? (AWS had an outage in a single AZ. This should have… NOT taken down any service with resiliency 101)

86

50

1.7K

258.3K

rob@rwitoff·9 May

@Chuyqa @coinbase Confirming this should now be fixed. Let us know if you run into anything else 🙏

2

0

3

278

Jesus Alvarez@Chuyqa·9 May

@rwitoff @coinbase Exhibit 1: Crossing spread shows the bid 10x higher than the ask.
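The condition reported here, a best bid above the best ask, is usually called a crossed book, and a consumer of ticker or L2 data can check for it directly. A minimal sketch, with a hypothetical function name and made-up prices; it is not taken from any Coinbase client library.

```python
from decimal import Decimal


def is_crossed(best_bid: Decimal, best_ask: Decimal) -> bool:
    """A book is crossed when the best bid is strictly above the best ask."""
    return best_bid > best_ask


# Hypothetical quotes: the first mimics a bid 10x higher than the ask.
crossed = is_crossed(Decimal("10.00"), Decimal("1.00"))
normal = is_crossed(Decimal("0.99"), Decimal("1.00"))
```

`Decimal` is used so that price comparisons are not affected by binary floating-point rounding.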

1

0

1

344

rob@rwitoff·9 May

Yesterday @coinbase experienced a multi-hour service disruption affecting trading, exchange access, and balance updates. Here's our initial read from Coinbase engineering on what happened, how we recovered, and what we're addressing.
At approximately 23:50 UTC on 2026-05-07, our monitoring detected cascading quote failures from internal services that triggered multiple Sev1 incidents that engineering immediately began investigating. Customer-facing impacts included spot trading, Prime, International and derivative exchanges.
Root cause: a thermal event (cooling system failure) inside a subset of racks within a single building in AWS us-east-1. We run a primary replica of our exchange infrastructure in a single zone, consistent with industry standards to reduce latency. To prepare for failures like this, we maintain a distributed standby, but during this incident, failures in the primary zone that were designed to be isolated were not, extending the duration of our outage.
The failure cascaded down two paths:
1. Multiple hardware components beneath our exchange’s matching engine failed, requiring recovery and failover
2. Distributed Kafka clusters that manage messaging across Coinbase systems failed to remain available, also requiring partition failovers to new hardware brokers with many TiBs of data
After isolating the incident, automated tooling drained ~10 Kubernetes clusters' worth of related workloads out of the affected zone to stabilize internal services. Most services were back to normal within ~30 minutes of diagnosis. The two things we couldn't automatically drain: the exchange (dedicated hardware and storage) and Kafka (managed service that was designed to be resilient to this, with unique problems).
The exchange matching engine is the core system responsible for processing orders and maintaining order books. It is a distributed cluster and requires quorum to safely elect a leader and continue processing trading activity.
During the incident, infrastructure-level constraints in the affected datacenter left only a subset of nodes healthy, preventing the cluster from reaching quorum. As a result, trading across Retail, Advanced, and Institutional exchanges was blocked. Recovery required our oncall and engineering teams to execute our disaster recovery plan, restore quorum safely, and validate system health under constrained infrastructure conditions. The team built, tested, deployed, and validated the fix while continuing to manage the broader incident.
Kafka recovery was a much larger scale operation. Our primary managed Kafka partitions process many terabytes of data daily and are designed with resiliency guarantees for uninterrupted operation during a datacenter failure just like this. In this case, those guarantees failed and required manual recovery. We again relied on disaster recovery procedures to recover stuck partitions onto new hardware (brokers) that enabled us to safely bring cross-service messaging back online across Coinbase. During the lag, customers saw delayed balance streams which resolved automatically once replication caught up. No data lost.
Once the engine came back up as part of our standard runbooks, we re-opened markets carefully: all products to cancel-only mode first, audited product states, then moved all markets to auction mode, before restoring trading on Coinbase Exchange.
What went right: the team. Incident response across the company came together within minutes, followed well-rehearsed playbooks and used secure automation tooling to recover all services. We have a strong, senior team at Coinbase that worked through rare failure modes to recover all services.
To our customers: losing access to your account, even temporarily, is unacceptable. We know that. We're sorry, and we’ll publish a full root cause analysis in the coming weeks 🙏
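The quorum requirement mentioned in the post above can be illustrated with a little arithmetic. This sketch assumes a simple majority quorum, as used by consensus protocols such as Raft; the post does not say which protocol or cluster size the matching engine uses, so the function names and the five-node example are hypothetical.

```python
def quorum_size(cluster_size: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def has_quorum(healthy_nodes: int, cluster_size: int) -> bool:
    """True when enough nodes are healthy to elect a leader safely."""
    return healthy_nodes >= quorum_size(cluster_size)


# Hypothetical five-node cluster: three healthy nodes can proceed, two cannot.
five_node_quorum = quorum_size(5)
can_proceed_with_three = has_quorum(3, 5)
can_proceed_with_two = has_quorum(2, 5)
```

Under this assumption, a cluster that loses more than half its nodes stops processing rather than risk two leaders, which matches the post's description of trading being blocked until quorum was restored.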

61

44

373

298.7K

rob@rwitoff·9 May

Can confirm NASA grade QA standards. I spent 5 years working at @NASAJPL before becoming responsible for infra, security and now eng at Coinbase 12 years ago. Many of our eng + security + quality standards are modeled after, or better than, what I grew up on there. wsj.com/articles/BL-DG…

0

1

43

3.1K

Architect🛡️@Architect9000·8 May

“As seen on $COIN 2026 Q1 earnings”, me 😎 I wasn’t the only one with this sentiment so I asked what was up with “non-technical people pushing code to prod”. When you’re storing a million bucks on a platform, you wanna be hearing about how their engineers are ex-Mossad International Olympiad gold medalists with NASA-grade QA standards. Insinuating that casuals are vibe-coding your bank is a scary thought. AI is going to create new employee role categories which look far more cross-functional and Socratic in nature. We need to experiment with the many possibilities. But when it comes to our money, I think Brian knows he should have led with reliability instead of leverage. @brian_armstrong’s answer was good and should satisfy anyone who let their imagination run wild over his earlier comments. Listen here (28:08): youtube.com/live/d7BeHWXcL…

5

3

38

6.1K

rob@rwitoff·9 May

@seslly @coinbase We’ll share more in our full RCA, but we had an appropriate RF to survive a zone outage. The way this hardware failed triggered a bug in the managed cluster that still took the cluster down, which we had to work around to recover with the vendor.

1

0

13

4.5K

seslly@seslly·9 May

@rwitoff @coinbase you can have kafka in multiple AZs but having a replication factor of 1 would be the only reason AZ outage like this could be so devastating (MSK requires >=2 AZ so it's the config that bites you) i say this as a former employee that wants yall to do better even without me
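The replication-factor argument in this exchange can be made concrete with a small model. It is a sketch under simplifying assumptions: every replica is in sync before the failure, producers use `acks=all`, and a partition accepts writes only while its in-sync replica count is at least `min.insync.replicas`. The zone names and layouts are hypothetical and say nothing about Coinbase's actual configuration; as the reply above notes, the real incident involved a bug in the managed cluster, not a replication factor of 1.

```python
def surviving_replicas(replica_zones: list[str], failed_zone: str) -> int:
    """Count replicas of one partition that sit outside the failed zone."""
    return sum(1 for zone in replica_zones if zone != failed_zone)


def partition_writable(
    replica_zones: list[str], failed_zone: str, min_insync_replicas: int
) -> bool:
    """With acks=all, writes need at least min.insync.replicas in-sync replicas."""
    return surviving_replicas(replica_zones, failed_zone) >= min_insync_replicas


# Replication factor 1: the only replica lives in the failed zone.
rf1_ok = partition_writable(["az-a"], "az-a", min_insync_replicas=1)

# Replication factor 3 across three zones, min.insync.replicas = 2.
rf3_ok = partition_writable(["az-a", "az-b", "az-c"], "az-a", min_insync_replicas=2)

# Replication factor 2 with min.insync.replicas = 2: one zone loss blocks writes.
rf2_ok = partition_writable(["az-a", "az-b"], "az-a", min_insync_replicas=2)
```

The third case shows that replication factor alone is not the whole story: the in-sync replica threshold also decides whether a partition stays writable after a zone is lost.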

2

1

22

5.2K

rob@rwitoff·9 May

@zquestz thanks josh. tbf it is fun to imagine there’s a vibe coding conspiracy afoot, unfortunately that’s just not the case 😆🕵️

0

9

678

Josh Ellithorpe@zquestz·9 May

Actual facts about the Coinbase outage yesterday. As usual, Rob explains clearly what happened, and I am sure will take steps to make the systems more resilient in the future. Things that didn't happen (if your "influencer" told you these were the reason, they are just baiting you for clicks and engagement):
- No one vibe coded something that failed.
- A "non-engineer" didn't push production code and take out the trading engine.
- It wasn't intentional.
- It wasn't because Coinbase failed to design a fail-over system.
Things happen at scale. Don't let the armchair quarterbacks tell you tall tales.

43

13

164

89.6K

rob@rwitoff·9 May

@rei_iku_ @coinbase thank goodness for same day delivery

1

0

29

3.7K

reiiku@rei_iku_·9 May

@rwitoff @coinbase guys you don't have AC backups?!

1

0

2

3.8K

rob@rwitoff·9 May

@coinbase More about our exchange architecture @ docs.cdp.coinbase.com/exchange/intro…

6

3

24

10.7K

rob@rwitoff·9 May

@Chuyqa @coinbase we're taking a look

1

0

3

4.7K

Jesus Alvarez@Chuyqa·9 May

@rwitoff @coinbase You still have dozens of pairs sending bad data on ticker and l2 as of 2:39 PM Pacific. OP, IOTX, ICP.. list is rather long. Good luck getting this synced.

3

0

8

5.5K

rob@rwitoff·8 May

@jamessmith57013 @coinbase yes - health.aws.amazon.com/health/status

0

2

59

james smith@jamessmith57013·8 May

@coinbase is there an aws outage going on right now?

1

0

276

rob@rwitoff·8 May

@0x_Osprey agreed. it will take years.

0

1

3

997

Joe@0x_Osprey·8 May

@rwitoff I still think it's gonna be a crazy chasm over the next few months before we ultimately hit the “golden age”😅

1

0

5

281

rob@rwitoff·8 May

We are entering a golden age of cybersecurity. Even small blue teams are starting to drive the cost per exploit up exponentially. We’re not there yet, but the end state is clear and good for the good guys.

Alex Albert@alexalbert__

With the help of Claude Mythos Preview, the Firefox team fixed more security bugs in April than in the past 15 months combined.

3

1

13

4.4K

rob@rwitoff·4 May

@katie_haun This is great news for everyone. Industries get better when @HaunVentures is driving. Congrats Katie and team!!

0

6

732

Katie Haun@katie_haun·4 May

Today we’re announcing $1 billion in new funds to back the bold founders shaping the next era of finance and technology. I’ve been following the flow of assets my entire career and have never seen a more dynamic time. Financial infrastructure is being rebuilt from the ground up, new assets and markets are emerging, and an agentic economy is developing as AI agents begin to transact on behalf of humans. These areas, among others, are what will define the coming years as we deploy these new funds. We’re excited for what’s ahead, and wrote about our thesis in the post below.

Katie Haun@katie_haun

x.com/i/article/2050…

190

79

1.2K

318.7K

rob@rwitoff·4 May

My AI coach gave me a B- this week. Every week, agents check in on my digital life. They send back 4 things:
1. What I'm missing
2. What's changing
3. What's going well
4. What's not
All of this data is sitting there:
- iMessageDB
- ScreentimeDB
- EightSleep API (eightctl)
- Oura MCP
- Google Workspace MCP
- and more
Everyone will have a world class executive coach (agent) this year and it's going to change your life.

4

0

43

2.8K

rob@rwitoff·22 Apr

In the last 12 months, we’ve seen a 27x increase in non-engineers using dev tools like Claude, OpenCode and Cursor to build & automate how we work. The goal is to turn everyone into a builder, and safely reduce the distance between idea → execution to near zero. Trust is our most important asset at @coinbase, so this is fueled by a massive effort in quality, guardrails and simplification.

16

26

226

147.2K

rob@rwitoff·20 Apr

These will blow you away because Fred & Balaji are wired into Coinbase: github, drive, linear, slack & more. Fred for the expert strategy, Balaji for challenging assumptions. A 10x team. Wait until you see all the subagents + new capabilities we're wiring in now. #BestPlaceToBuild

Brian Armstrong@brian_armstrong

Coinbase is testing AI agents that show up in slack/email at work, just like any human teammate. To start we're shipping two which are modeled after legendary former Coinbase employees, @FEhrsam and @balajis. (Who brutally frame mogged who in this matchup?) Soon, it will be easy for any employee to spin up a new agent for themselves or their team. I suspect we will have more agents than human employees at some point soon.

1

39

6.2K

rob@rwitoff·9 Apr

@gcockfoster since i’ve called the book’s style “awful” i don’t want to share the name. but you can do for any book!

0

39

Griffin Cock Foster@gcockfoster·8 Apr

@rwitoff what book?

1

0

114

rob@rwitoff·8 Apr

The last autobiography I read had a style + verbosity that were awful, but had an important first-person view I wanted to understand. So I had an agent rewrite and repackage the book for my kindle without the fluff, and more relevant historical context. Just finished the book and loved it. 2026 is awesome.

1

0

15

1.9K

rob@rwitoff·5 Apr

Screens are the cigarettes of our generation. We all know we use our devices poorly, but device manufacturers will never be incentivized to optimize for our time. So Claude and I built a tool that liberates your iOS Screen Time data and lets Claude give you brutally honest advice on your habits. It tells you:
- What's eating your time
- Where you're context-switching too much
- What you're actually doing well
- One concrete thing to change this week
Open source, all data stays on your device, takes 30 seconds to set up. Try it: github.com/witoff/screent…

2

0

18

1.5K
