Incident Timeline
Incident Duration: Approximately 6 hours, from 11:20 UTC to 17:30 UTC on September 13, 2023
Timeline of Events
Incident Identification (September 13, 2023)
- 11:20 UTC / 7:20 ET: Customers began experiencing degraded web and mobile app performance, including higher latency and disconnects.
Incident Response (September 13, 2023)
- 13:00 UTC / 9:00 ET: We began an internal investigation into the performance degradation and suspected connectivity issues with our servers.
Mitigation and Communication (September 13, 2023)
- 17:28 UTC / 13:28 ET: Full platform performance was restored and the intermittent connectivity errors stopped. We continued to monitor the platform.
Incident Closure and Ongoing Investigation (September 13, 2023)
- 19:00 UTC / 15:00 ET: The incident was officially closed as platform performance returned to normal levels.
Root Cause Analysis
The root cause of this incident was a faulty device in Azure Front Door. The device continued transmitting traffic from the edge sites for an extended period, causing congestion and packet drops. These in turn resulted in higher latency, disconnects, and failed service responses.
Mitigation
Microsoft Azure mitigated the issue by routing traffic away from the problematic device to a healthy one. This action restored normal service operations.
Preventive Measures
To prevent future occurrences, we are committed to implementing the following measures:
- Collaboration with Azure: We will work closely with Microsoft Azure's OSS team to identify and address potential issues proactively.
- Traffic Monitoring: We will regularly monitor traffic patterns to detect anomalies and address them quickly.
- Redundancy and Failover: We will explore redundancy options and failover mechanisms to minimize the impact of similar incidents.
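As an illustration of the Traffic Monitoring measure above, the following is a minimal sketch of how latency anomalies could be flagged against a rolling baseline. It is not our production tooling: the class name `LatencyAnomalyDetector`, the window size, and the threshold are hypothetical values chosen only for this example.

```python
from collections import deque
from statistics import mean, stdev


class LatencyAnomalyDetector:
    """Flags latency samples that sit far above a rolling baseline.

    Hypothetical example: the window, threshold, and min_samples
    defaults are illustrative assumptions, not tuned values.
    """

    def __init__(self, window: int = 30, threshold: float = 3.0,
                 min_samples: int = 10):
        self.samples = deque(maxlen=window)  # recent "normal" latencies (ms)
        self.threshold = threshold           # standard deviations above baseline
        self.min_samples = min_samples       # samples needed before judging

    def observe(self, latency_ms: float) -> bool:
        """Record a sample and return True if it looks anomalous."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = mean(self.samples)
            spread = stdev(self.samples)
            if spread > 0:
                # Only upward deviations matter for latency.
                anomalous = (latency_ms - baseline) / spread > self.threshold
            else:
                # Perfectly flat baseline: treat any increase as anomalous.
                anomalous = latency_ms > baseline
        # Keep anomalous samples out of the baseline so a long incident
        # does not gradually become the new "normal".
        if not anomalous:
            self.samples.append(latency_ms)
        return anomalous
```

In this sketch, each measured request latency would be passed to `observe()`, and a run of consecutive `True` results could be used to raise an alert. Excluding anomalous samples from the baseline is a deliberate choice here, since an incident lasting several hours would otherwise shift the baseline upward and mask itself.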
Conclusion
We sincerely apologize for the inconvenience and disruption this incident may have caused our customers during the impact window of 11:20 UTC to 17:30 UTC (7:20 ET to 13:30 ET) on September 13, 2023. We appreciate your patience and understanding throughout the incident resolution process. Our commitment to providing reliable and performant services remains unwavering, and we will continue to work diligently to improve our systems and prevent future incidents.
If you have any further questions or require additional information, please do not hesitate to reach out to us. Thank you for your continued support.