
HelpfulAIGuy
@HelpfulAIGuy
964 posts
I lead software teams. I ship products. I bring clarity to AI’s impact on business, leadership, and life.




Yesterday @coinbase experienced a multi-hour service disruption affecting trading, exchange access, and balance updates. Here's our initial read from Coinbase engineering on what happened, how we recovered, and what we're addressing.

At approximately 23:50 UTC on 2026-05-07, our monitoring detected cascading quote failures from internal services, triggering multiple Sev1 incidents that engineering immediately began investigating. Customer-facing impacts included spot trading, Prime, International, and derivatives exchanges.

Root cause: a thermal event (cooling system failure) inside a subset of racks within a single building in AWS us-east-1. We run a primary replica of our exchange infrastructure in a single zone, consistent with industry standards to reduce latency. To prepare for failures like this, we maintain a distributed standby, but during this incident, failures in the primary zone that were designed to be isolated were not, extending the duration of our outage.

The failure cascaded down two paths:
1. Multiple hardware components beneath our exchange's matching engine failed, requiring recovery and failover
2. Distributed Kafka clusters that manage messaging across Coinbase systems failed to remain available, also requiring partition failovers to new hardware brokers holding many TiBs of data

After isolating the incident, automated tooling drained ~10 Kubernetes clusters' worth of related workloads out of the affected zone to stabilize internal services. Most services were back to normal within ~30 minutes of diagnosis. The two things we couldn't automatically drain: the exchange (dedicated hardware and storage) and Kafka (a managed service designed to be resilient to exactly this scenario, which hit unique problems).

The exchange matching engine is the core system responsible for processing orders and maintaining order books. It is a distributed cluster and requires quorum to safely elect a leader and continue processing trading activity. During the incident, infrastructure-level constraints in the affected datacenter left only a subset of nodes healthy, preventing the cluster from reaching quorum. As a result, trading across Retail, Advanced, and Institutional exchanges was blocked. Recovery required our oncall and engineering teams to execute our disaster recovery plan, restore quorum safely, and validate system health under constrained infrastructure conditions. The team built, tested, deployed, and validated the fix while continuing to manage the broader incident.

Kafka recovery was a much larger-scale operation. Our primary managed Kafka partitions process many terabytes of data daily and are designed with resiliency guarantees for uninterrupted operation during a datacenter failure just like this one. In this case, those guarantees failed and required manual recovery. We again relied on disaster recovery procedures to move stuck partitions onto new hardware (brokers), which let us safely bring cross-service messaging back online across Coinbase. During the lag, customers saw delayed balance streams, which resolved automatically once replication caught up. No data was lost.

Once the engine came back up, we re-opened markets carefully as part of our standard runbooks: all products to cancel-only mode first, then an audit of product states, then all markets to auction mode, before restoring trading on Coinbase Exchange.
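The quorum requirement described above is the standard majority rule for distributed clusters. A minimal sketch, assuming a simple majority-quorum scheme (the post doesn't describe the engine's actual election protocol, so names here are illustrative):

```python
# Illustrative sketch (not Coinbase code): why a distributed matching-engine
# cluster blocks trading when too few nodes are healthy. A cluster of N nodes
# needs a strict majority (floor(N/2) + 1) to safely elect a leader.

def quorum_size(cluster_size: int) -> int:
    """Minimum healthy nodes required for a strict majority."""
    return cluster_size // 2 + 1

def can_elect_leader(healthy_nodes: int, cluster_size: int) -> bool:
    return healthy_nodes >= quorum_size(cluster_size)

# With a hypothetical 5-node cluster, losing 3 nodes to the thermal event
# leaves 2 healthy: 2 < 3, so no leader is elected and order processing halts.
assert can_elect_leader(healthy_nodes=2, cluster_size=5) is False
assert can_elect_leader(healthy_nodes=3, cluster_size=5) is True
```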
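The post doesn't detail the Kafka recovery tooling, but moving stuck partitions onto fresh brokers is what Kafka's stock partition-reassignment mechanism does. A sketch that builds the reassignment plan the standard kafka-reassign-partitions.sh --execute tool accepts; the topic name and broker IDs are hypothetical:

```python
# Illustrative sketch (not Coinbase tooling): assign each stuck partition a
# fresh replica set on replacement brokers via Kafka partition reassignment.
import json

def reassignment_plan(topic: str, partitions: list[int], new_brokers: list[int]) -> str:
    """Build the JSON plan consumed by kafka-reassign-partitions.sh."""
    plan = {
        "version": 1,
        "partitions": [
            {"topic": topic, "partition": p, "replicas": new_brokers}
            for p in partitions
        ],
    }
    return json.dumps(plan, indent=2)

# e.g. move partitions 0-2 of a (made-up) balance-updates topic onto
# brokers 7, 8, 9 provisioned outside the affected zone:
print(reassignment_plan("balance-updates", [0, 1, 2], [7, 8, 9]))
```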
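The staged re-open (cancel-only, then auction, then trading) can be modeled as a one-way state machine; a sketch with assumed state names, not the actual runbook:

```python
# Illustrative sketch: enforce the staged market re-open so no market can
# jump straight from halted to live trading.
from enum import Enum

class MarketState(Enum):
    HALTED = "halted"
    CANCEL_ONLY = "cancel-only"   # cancels accepted, no new orders
    AUCTION = "auction"           # orders collected, no continuous matching
    TRADING = "trading"           # normal continuous trading

ALLOWED = {
    MarketState.HALTED: {MarketState.CANCEL_ONLY},
    MarketState.CANCEL_ONLY: {MarketState.AUCTION},
    MarketState.AUCTION: {MarketState.TRADING},
    MarketState.TRADING: set(),
}

def advance(current: MarketState, target: MarketState) -> MarketState:
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target

# Re-open sequence, auditing product state before each step:
state = MarketState.HALTED
for step in (MarketState.CANCEL_ONLY, MarketState.AUCTION, MarketState.TRADING):
    state = advance(state, step)
```

Encoding the sequence this way makes skipping a stage an error rather than an operator judgment call.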
What went right: the team. Incident response across the company came together within minutes, followed well-rehearsed playbooks, and used secure automation tooling. Our strong, senior team worked through rare failure modes to recover all services. To our customers: losing access to your account, even temporarily, is unacceptable. We know that. We're sorry, and we'll publish a full root cause analysis in the coming weeks 🙏




stop building web apps start building mobile apps trust me




@HelpfulAIGuy presents 'Can We Make $1 Million in the Next Hour? (Live-Coding a Full-Stack App)' at Nebraska.Code() this July. nebraskacode.amegala.com #VibeCode #Claude #Gemini #NotebookLM #OpenArt #AgenticWorkflows #RapidDeployment #AI #Debugging @weavixiow @movevc @TransdTech @DistrictOperat1 @OPPDCares @GPC_updates





the american mind cannot understand this





We’ve agreed to a partnership with @SpaceX that will substantially increase our compute capacity. This, along with our other recent compute deals, means that we’ve been able to increase our usage limits for Claude Code and the Claude API.



JUST IN: Anthropic unveils 10 new AI agents built for banks, insurers, & financial firms.