We want to provide you with an update on the January 21st incident that impacted our services. Here's a breakdown of the situation:
Incident Summary: On January 20th, 2024, at around 9 PM EST, an internal maintenance process run by the Azure OSS team made a configuration change to Azure Resource Manager. This change caused Azure Resource Manager nodes to fail repeatedly on startup.
Root Cause: The configuration change triggered a self-reinforcing failure loop that overwhelmed the remaining Azure Resource Manager nodes and caused a rapid drop in availability. This in turn affected our backend storage, leading to sporadic failures on data plane API calls. These failures disrupted our database server, causing intermittent crashes, particularly between 12 AM and 2 AM.
Resolution: The Azure engineering team worked to address the issue, and it was fully resolved at around 4 AM EST on January 21st, 2024.
Preventive Measures: To prevent similar incidents in the future, we are closely reviewing our internal processes and working collaboratively with the Azure OSS team to implement additional safeguards.
We sincerely apologize for any inconvenience this may have caused, and we appreciate your understanding as we continue to enhance our systems to provide you with a more reliable experience.