
Adrian Stoicea


@FACEITcs can you please take a look at my unban ticket (13055340)? I got banned for nothing and have been waiting 6 days for a reply.


@JanPantel can you take a look at my unban ticket (13055340)? I got banned for nothing.

I wanted to take this opportunity to share some background information on what happened during the FACEIT outages over the last 2 days.
My goal here is to provide more information about how the platform responded under record demand.
I hope some of you find these insights interesting or can learn something from them for yourselves.
I’m trying to bridge the gap between non-technical and technical folks, so some things might be too nuanced for some, while other things might be too high-level for others.
On Wednesday, April 22, traffic to our website far exceeded even our highest forecast, especially so early in the day at 1 PM CEST. Like most cloud-native software companies, we provision more servers than necessary and have auto-scaling mechanisms to handle traffic spikes. Yet your enthusiasm for the platform outgrew even these generous buffers, pushing us past the upper safety threshold of our auto-scaling configuration.
As we recognized the increased traffic, our engineering team began raising the ceiling of our auto-scaling configuration, both in terms of cloud resource limits and Kubernetes settings; in practice, this meant provisioning more web servers to handle the load. Unfortunately, we had already reached a critical point at which existing servers (Kubernetes Pods, for the techies here) were failing faster than new ones could spawn.
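For the techies: the post doesn't say which autoscaler FACEIT uses, but the standard Kubernetes Horizontal Pod Autoscaler illustrates why a ceiling matters. It computes its target roughly as ceil(currentReplicas × currentMetric / targetMetric), clamped to a configured maxReplicas. A minimal sketch of that rule (all numbers here are illustrative, not FACEIT's):

```python
import math

def desired_replicas(current: int, current_metric: float,
                     target_metric: float, max_replicas: int) -> int:
    """Simplified form of the Kubernetes HPA scaling rule:
    desired = ceil(current * currentMetric / targetMetric),
    clamped to the configured maxReplicas ceiling."""
    raw = math.ceil(current * current_metric / target_metric)
    return max(1, min(raw, max_replicas))

# With CPU at 180% of target, 40 pods want to become 72 --
# but a maxReplicas of 50 pins capacity below what the load needs.
print(desired_replicas(40, 0.90, 0.50, max_replicas=50))   # 50
print(desired_replicas(40, 0.90, 0.50, max_replicas=200))  # 72
```

"Increasing the ceiling" in the paragraph above corresponds to raising that maxReplicas cap (and the underlying cloud quota) so the autoscaler is allowed to reach the replica count the load actually demands.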
At that point, the extremely spiky traffic meant users were seeing errors and refreshing continuously, which sent our servers into a death spiral they could not recover from without intervention.
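A standard client-side mitigation for this kind of synchronized refresh storm (not something the post says FACEIT ships, just the textbook technique) is exponential backoff with "full jitter": each failed client waits a random amount of time before retrying, so a crowd of failures does not hammer the recovering servers in lockstep. A minimal sketch:

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: wait a random duration between
    0 and min(cap, base * 2**attempt) seconds before retry `attempt`.
    The randomness spreads a crowd of failing clients apart in time."""
    return random.uniform(0, min(cap, base * 2 ** attempt))
```

With these (illustrative) defaults, a client's third retry waits somewhere in [0, 4] seconds, and no retry ever waits more than 30 seconds, so recovering servers see a thin, even trickle of retries instead of periodic stampedes.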
As we stabilized the website, the spike in players queuing up led to our Matchmaker falling behind, resulting in much longer than usual queue times.
The surge of newly created matches also meant our game server architecture could not scale quickly enough.
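Why the queue times ballooned is simple arithmetic: whenever players enter the queue faster than matches can be created and assigned servers, the backlog, and with it the wait, grows steadily until capacity catches up. A toy model with made-up numbers:

```python
def queue_backlog(arrival_rate: float, service_rate: float,
                  minutes: float) -> float:
    """Toy backlog model: extra players waiting after `minutes` when
    players join the queue at `arrival_rate`/min but matches only
    start at `service_rate`/min. Any shortfall accumulates linearly."""
    return max(0.0, arrival_rate - service_rate) * minutes

# Hypothetical: 1,000 players/min queuing against capacity for 800/min
# leaves 3,000 extra players waiting after just 15 minutes.
print(queue_backlog(1000, 800, 15))  # 3000.0
```

The same model shows why the fix works in both directions: once service rate exceeds arrival rate again, the backlog drains at the same linear pace.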
Our game servers are run on so-called bare-metal servers, which are not virtual cloud servers but actual machines, guaranteeing the best performance and latency. Bare-metal machines take longer to provision and deploy into our fleet, and they also sometimes have supply issues. Last week, we put in orders that maxed out available capacity in some of our regions and are waiting on more deliveries. During this period, the fallback is cloud scaling to ensure players are not waiting 10 minutes for a server. However, this scaling mechanism was unable to keep up with the demand. Since Wednesday, we've had additional deliveries, increasing server numbers to an all-time high.
On Thursday, April 23, as a follow-up to Wednesday's surge, we greatly increased the horizontal scaling of our Matchmaker. However, as we hit peak hours, this extra load pushed our proportionately scaled matchmaking database to its limit. The Matchmaker uses a Redis database, which should be operated with roughly 20% of memory capacity reserved for the system, so that it can perform crucial operations like cleaning up stale data while keeping the database performing normally.
Given the increased matchmaking capacity, we had to roughly double the amount of memory allocated to that database. A change made to our configuration files about a year ago had tipped the ratio of Redis to system memory below the 20% threshold. This hidden bottleneck never caused an issue during normal operations, but under the extreme pressure of Season 8's launch and our newly expanded queues, the database stalled. Write requests began timing out, which caused a cascading failure across our game queues.
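The 20% rule above amounts to keeping Redis's memory allocation at no more than about 80% of the host's RAM, leaving the rest for the operating system, persistence forks, and replication buffers. A sketch of the ratio check (the GB figures are invented for illustration, not FACEIT's actual sizes):

```python
def system_headroom(instance_ram_gb: float, redis_alloc_gb: float) -> float:
    """Fraction of instance memory left for the OS and Redis's own
    bookkeeping (fork-based saves, replication buffers, evictions).
    General Redis guidance is to keep this at roughly >= 0.20."""
    return 1.0 - redis_alloc_gb / instance_ram_gb

# Healthy: 48 GB allocated to Redis on a 64 GB box leaves 25% headroom.
print(system_headroom(64, 48))  # 0.25
# A config drift to 56 GB on the same box leaves only 12.5% --
# below the ~20% target, so cleanup and writes start to stall under load.
print(system_headroom(64, 56))  # 0.125
```

This is exactly the kind of ratio that looks fine at rest and only bites under peak write pressure, which is why the post's follow-up, auditing every database configuration for the same imbalance, is the right remediation.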
We eventually managed to scale the database up further while keeping already active matches from cancelling and the platform as a whole operational. The catch is that zero-downtime scaling of such a system essentially requires a new replica to spawn and replicate all data from the old instance before connections can be rerouted, which takes time and system resources.
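The replicate-then-cut-over sequence can be sketched as a small control loop: provision the larger replica, poll replication status until it has fully caught up, and only then point clients at it. This is a generic sketch, not FACEIT's tooling; `get_repl_status` and `reroute` are hypothetical callbacks (in a real Redis setup the status would come from parsing `INFO replication`):

```python
import time

def cut_over_when_synced(get_repl_status, reroute,
                         poll_interval: float = 1.0,
                         timeout: float = 600.0) -> bool:
    """Zero-downtime scale-up sketch: the new, larger replica first
    copies the full dataset from the old primary; only once replication
    reports it is fully caught up do we reroute client connections.
    Both callbacks are injected so the loop stays self-contained."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = get_repl_status()  # hypothetical: link state + byte lag
        if status["link_up"] and status["lag_bytes"] == 0:
            reroute()  # e.g. repoint a proxy at the new instance
            return True
        time.sleep(poll_interval)
    return False  # replica never caught up; leave traffic where it is
```

The "takes time and system resources" caveat in the paragraph above lives in that polling loop: the initial full copy can run for minutes on a large dataset, and the old primary pays the replication cost while still serving live matches.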
Once that scaling safely concluded, we fully restored the service. We are now auditing all of our database configurations to ensure similar resource imbalances are not hiding anywhere else, and we have reinforced the Matchmaking database with extra system memory headroom.

@FACEIT_Darwin Can you look into my unban ticket (13055340), please? I got banned for nothing.


@FACEIT @FACEITcs @FACEIT_Darwin ,
Apologies for tagging you directly. I sent an email two days ago but haven't received a response yet. I've noticed that you're quite active and seem to approach posts in a fair and neutral manner, so I was hoping you might be able to look into this.


130€ Float 0.053★ Gut Knife | Lore (Factory New) (Counter-Strike 2) skinport.com/i/AUVBZV5PA6M via @Skinport

🔥#GeForceSummer of RTX FINALE 🔥
this is the LAST CHANCE to win something from the core prize pool, including:
🟢 RTX 4090 FE
🟢 RTX 4080 SUPER
🟢 RTX 4070 SUPER
🟢 GeForce RTX Laptops
🟢 G-SYNC Displays
Like + Comment + Share for your final chance to win!

the #GeForceSummer of RTX is sizzling to the end, but before it’s all over…
here is another chance to WIN a unique @FalconNW gaming rig featuring a GeForce RTX 4080 SUPER!
want it?
tell us your top PC game of this summer & comment #GeForceSummer! 🔥

Ready, Destined One?
To celebrate the launch of Black Myth: Wukong w/ full ray tracing + DLSS 3, we're giving away an exclusive GeForce RTX 4080 SUPER featuring stunning wrap art of Sun Wukong.
Want it?! Comment #BlackMythRTX + like this post to enter!
