Jan Pantel @JanPantel
I wanted to take this opportunity to share some background information on what happened during the FACEIT outages over the last 2 days.
My goal here is to provide more information about how the platform responded under record demand.
I hope some of you find these insights interesting or can take something away from them for yourselves.
I’m trying to bridge the gap between non-technical and technical folks, so some parts might be too detailed for some of you, while others might be too high-level for others.
On Wednesday, April 22, the amount of traffic to our website far exceeded even our highest forecast, especially so early in the day at 1 PM CEST. Like most cloud-native software companies, we keep more servers provisioned than strictly necessary, along with auto-scaling mechanisms to handle traffic spikes. Yet your enthusiasm for our platform outgrew even these generous buffers, pushing us past the upper safety threshold of our auto-scaling configuration.
As we recognized the increased traffic, our engineering team began raising the ceiling of our auto-scaling configuration, both in terms of our cloud provider's limits and our Kubernetes configuration; in practice, provisioning more web servers to handle the load. Unfortunately, we had already reached a critical point at which existing servers (Kubernetes Pods, for the techies here) were failing faster than new ones could spawn.
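For the techies, here's a toy model in Python of why that race is so dangerous. The numbers are invented for illustration, not taken from our real fleet; the point is that once pods fail faster than replacements become ready, healthy capacity shrinks no matter how high you raise the ceiling:

```python
# Toy model only: invented numbers, not our real fleet. Each "minute",
# overloaded pods crash and the autoscaler brings up replacements, but
# readiness lags behind the failure rate.

def simulate(healthy=100, spawn_per_min=10, fail_per_min=25, minutes=8):
    for minute in range(1, minutes + 1):
        healthy = max(0, healthy - fail_per_min + spawn_per_min)
        print(f"minute {minute}: {healthy} healthy pods")

simulate()
# Capacity drops by 15 pods a minute and hits zero after about 7
# minutes, which is why we had to intervene instead of waiting it out.
```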
What happened at that moment is that the extremely spiky traffic caused people to get errors, prompting them to refresh continuously, which sent our servers into a death spiral that they couldn’t recover from without intervention.
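This refresh loop is a textbook retry storm. The standard client-side guard against it is exponential backoff with jitter, so thousands of clients don't hammer a struggling server in lockstep. Here's a minimal Python sketch of the general technique (not our actual client code; the URL and parameters are placeholders):

```python
import random
import time

import requests  # plain HTTP client, used here purely for illustration

def fetch_with_backoff(url, max_attempts=5, base=0.5, cap=30.0):
    """Retry with exponential backoff plus "full jitter", so that
    thousands of clients spread their retries out over time instead
    of all hitting the server at the same instant."""
    for attempt in range(max_attempts):
        try:
            resp = requests.get(url, timeout=5)
            if resp.status_code < 500:
                return resp
        except requests.RequestException:
            pass
        # Sleep a random amount between 0 and the capped exponential step.
        time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
    raise RuntimeError(f"giving up on {url} after {max_attempts} attempts")
```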
As we stabilized the website, the spike in players queuing up caused our Matchmaker to fall behind, resulting in much longer queue times than usual.
The surge of new matches, in turn, meant our game server architecture couldn't scale quickly enough either.
Our game servers run on so-called bare-metal servers: not virtual cloud servers but actual physical machines, which give us the best performance and latency. Bare-metal machines take longer to provision and deploy into our fleet, and they sometimes suffer supply issues. Last week, we put in orders that maxed out the available capacity in some of our regions and are waiting on more deliveries. During this period, the fallback is cloud scaling, to ensure players are not waiting 10 minutes for a server. However, this scaling mechanism was unable to keep up with the demand. Since Wednesday, we've received additional deliveries, increasing server numbers to an all-time high.
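To make the fallback concrete, here is a heavily simplified sketch of the idea in Python. This is not our actual allocator; the names and structure are illustrative only:

```python
# Simplified illustration of "prefer bare metal, overflow into cloud".
# All names here are made up for the example.

from dataclasses import dataclass, field

@dataclass
class Region:
    name: str
    bare_metal_free: int
    cloud_servers: list = field(default_factory=list)

def allocate_game_server(region: Region) -> str:
    if region.bare_metal_free > 0:
        # Best case: a physical machine is available right away.
        region.bare_metal_free -= 1
        return f"bare-metal server in {region.name}"
    # Fallback: provision a cloud VM. This keeps queues moving, but
    # spinning up a VM and deploying the game image takes time, which
    # is the lag that hurt during the surge.
    server = f"cloud-vm-{len(region.cloud_servers) + 1}"
    region.cloud_servers.append(server)
    return f"{server} in {region.name} (cloud fallback)"

print(allocate_game_server(Region("EU-West", bare_metal_free=0)))
```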
On Thursday, April 23, as a follow-up to Wednesday's surge, we greatly increased the horizontal scaling of our Matchmaker. However, as we hit peak hours, this extra load pushed our proportionately scaled matchmaking database to its limit. The Matchmaker uses a Redis database, which should be run with about 20% of the machine's memory reserved for the system, so that Redis can perform crucial operations, like cleaning up stale data, while keeping the database performing normally.
Given the increased matchmaking capacity, we had to roughly double the memory allocated to that database. However, a change made to our configuration files about a year ago had tipped the ratio of Redis to system memory below the 20% threshold. This hidden bottleneck never caused an issue during normal operations, but under the extreme pressure of Season 8's launch and our newly expanded queues, the database stalled. Write requests began timing out, which caused a cascading failure across our game queues.
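To put illustrative numbers on that 20% rule (example sizes, not our real instances):

```python
# Example sizes only, not our real instances.

def headroom_ok(system_ram_gb: float, redis_maxmemory_gb: float,
                required_headroom: float = 0.20) -> bool:
    """Return True if at least 20% of the machine's RAM is left over
    for the OS, Redis overhead, and operations like cleaning up
    stale data."""
    headroom = (system_ram_gb - redis_maxmemory_gb) / system_ram_gb
    return headroom >= required_headroom

print(headroom_ok(32, 24))  # True:  25% headroom, healthy
print(headroom_ok(32, 28))  # False: only 12.5%, the database can stall
```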
We eventually managed to scale the database up further while keeping already-active matches from cancelling and the platform as a whole operational. The catch is that zero-downtime scaling of such a system requires a new replica to spawn and replicate all the data from the old instance before connections can be rerouted, which takes time and system resources.
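For the techies, the sequence looks roughly like this using standard Redis replication commands. The hostnames are placeholders, and in reality the procedure is orchestrated by tooling rather than a hand-written script:

```python
import time

import redis  # the redis-py client, used here for illustration

old_host = "matchmaker-db-old"               # placeholder hostname
new = redis.Redis(host="matchmaker-db-new")  # the bigger replacement

# 1. Tell the new, larger instance to replicate everything from the
#    old one. The initial full copy is the slow, resource-hungry part.
new.execute_command("REPLICAOF", old_host, 6379)

# 2. Wait until the replication link is established and the data has
#    been copied over.
while new.info("replication").get("master_link_status") != "up":
    time.sleep(1)

# 3. Promote the new instance to a standalone primary and repoint
#    clients at it; only then can the old instance be retired.
new.execute_command("REPLICAOF", "NO", "ONE")
```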
Once that scaling safely concluded, we fully restored the service. We are now auditing all of our database configurations to ensure similar resource imbalances are not hiding anywhere else, and we have reinforced the Matchmaker database with extra system memory headroom.
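Conceptually, that audit compares each instance's maxmemory setting against the machine's RAM and flags anything with less than 20% headroom. A simplified sketch (hosts and RAM sizes are placeholders):

```python
import redis  # the redis-py client, used here for illustration

# Placeholder fleet inventory: host -> total system RAM in bytes.
FLEET = {
    "matchmaker-db": 64 * 1024**3,
    "sessions-db": 32 * 1024**3,
}

for host, system_ram in FLEET.items():
    r = redis.Redis(host=host)
    maxmemory = int(r.config_get("maxmemory")["maxmemory"])
    if maxmemory == 0:
        # 0 means "no limit", which is its own red flag.
        print(f"{host}: no maxmemory limit set [NEEDS REVIEW]")
        continue
    headroom = 1 - maxmemory / system_ram
    status = "OK" if headroom >= 0.20 else "TOO LOW"
    print(f"{host}: {headroom:.0%} headroom [{status}]")
```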