Eugene Lazin - @lazin.bsky.social
26.7K posts

@Lazin
VP of Undefined Behavior at https://t.co/AQ57e435tA


The previous bucket had intelligently adapted its partitioning to our specific traffic pattern. The new bucket had zero history, and its default partitioning was a terrible fit for our workload. Fun fact: when an S3 server is overloaded, it will return a 503 Slow Down.
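For readers wondering what a client can do about a 503 Slow Down: the usual answer is to retry with jittered exponential backoff. Here is a minimal sketch, not the code from this incident. `SlowDownError` and `call_with_backoff` are hypothetical names made up for illustration, and `request` stands in for whatever function actually issues the S3 call.

```python
import random
import time


class SlowDownError(Exception):
    """Hypothetical stand-in for an HTTP 503 Slow Down response from S3."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call request() and retry with jittered exponential backoff on SlowDownError."""
    for attempt in range(max_attempts):
        try:
            return request()
        except SlowDownError:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the error to the caller
            # Full jitter: wait a random time in [0, cap), where the cap
            # doubles on every failed attempt up to max_delay.
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(rng() * cap)
```

The jitter matters when many clients are throttled at once: without it they all retry on the same schedule and hit the overloaded partition together again.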




Straight from the AWS docs: “Your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix”. We shard our data across extremely granular S3 prefixes, so it’s impossible for us to hit these limits. Confused by this apparent contradiction, we investigated deeper.
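To make "granular prefixes" concrete, here is a rough sketch of one common way to spread keys across many prefixes: hash the object ID and use part of the hash as the leading path segment. This is an assumption for illustration only. The thread does not say how the keys are actually laid out, and `sharded_key` and `num_prefixes` are hypothetical names.

```python
import hashlib


def sharded_key(object_id: str, num_prefixes: int = 4096) -> str:
    """Build an S3 key whose leading segment is a hash-derived shard prefix."""
    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
    # Take the first 4 bytes of the hash and map them onto the shard range.
    shard = int.from_bytes(digest[:4], "big") % num_prefixes
    # Four hex digits are enough for up to 65,536 shards.
    return f"{shard:04x}/{object_id}"
```

Note that the quoted figure is per partitioned prefix, so a layout like this only pays off once S3 has actually split the bucket along those prefixes.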

we had an incident because we migrated traffic to a brand new s3 bucket. our millions of servers instantly crushed the new bucket’s partitions and started getting slammed with 5xx errors























