Degraded web and mobile app performance due to server connectivity issues

Incident Report for Cigo

Postmortem

Incident Timeline

Incident Duration: Approximately 6 hours, from 11:20 UTC to 17:30 UTC on September 13, 2023

Timeline of Events

Incident Identification (September 13, 2023)

11:20 UTC / 7:20 ET: Customers started experiencing issues with degraded web and mobile app performance, including higher latency and disconnects.

Incident Response (September 13, 2023)

13:00 UTC / 9:00 ET: We opened an internal investigation into the performance degradation and suspected connectivity issues with our servers.

Mitigation and Communication (September 13, 2023)

17:28 UTC / 13:28 ET: Full platform performance was restored, and intermittent connectivity errors disappeared. However, we continued monitoring the situation.

Incident Closure and Ongoing Investigation (September 13, 2023)

19:00 UTC / 15:00 ET: The incident was officially closed as platform performance returned to normal levels.

Root Cause Analysis

The root cause of this incident was a faulty device in Azure Front Door. The device continued transmitting traffic from the edge sites for an extended period, which led to congestion and packet drops. These in turn caused higher latency, disconnects, and failed service responses.

Mitigation

Microsoft Azure mitigated the issue by routing traffic away from the problematic device to a healthy one. This action restored normal service operations.
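As an illustration only, the idea of routing traffic away from an unhealthy device can be sketched as follows. This is not Azure's mechanism, which is internal to Microsoft; the `Backend` type, the `pick_healthy` function, the device names, and the thresholds are all hypothetical and chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Hypothetical view of one device, built from recent measurements."""
    name: str
    packet_loss: float  # fraction of packets dropped, 0.0 to 1.0
    latency_ms: float   # recent median latency in milliseconds


def pick_healthy(backends, max_loss=0.02, max_latency_ms=500.0):
    """Return the lowest-latency backend within both thresholds, or None.

    The thresholds are illustrative assumptions, not values from the incident.
    """
    healthy = [
        b for b in backends
        if b.packet_loss <= max_loss and b.latency_ms <= max_latency_ms
    ]
    if not healthy:
        return None
    return min(healthy, key=lambda b: b.latency_ms)


if __name__ == "__main__":
    devices = [
        Backend("edge-a", packet_loss=0.30, latency_ms=1200.0),  # congested
        Backend("edge-b", packet_loss=0.001, latency_ms=45.0),
        Backend("edge-c", packet_loss=0.0, latency_ms=80.0),
    ]
    chosen = pick_healthy(devices)
    print(chosen.name if chosen else "no healthy device")
```

In this sketch a device with heavy packet loss is simply excluded from selection, so new traffic goes to the remaining healthy device with the lowest latency.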

Preventive Measures

To prevent future occurrences, we are committed to implementing the following measures:

Collaboration with Azure: We will maintain a strong collaboration with Microsoft Azure's Operations Support System (OSS) team so that potential issues are identified and addressed promptly.
Traffic Monitoring: We will monitor traffic patterns regularly to detect anomalies and address them swiftly.
Redundancy and Failover: We will explore redundancy options and failover mechanisms to minimize the impact of similar incidents.
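
As a rough illustration of the traffic monitoring measure above, the sketch below flags latency samples that rise well above a rolling baseline. It is a minimal example under stated assumptions, not a description of our production tooling: the `LatencyMonitor` class, the window size, and the 3x threshold are all hypothetical.

```python
from collections import deque
from statistics import median


class LatencyMonitor:
    """Flag latency samples far above a rolling median baseline (hypothetical)."""

    def __init__(self, window=30, factor=3.0, min_samples=5):
        self.samples = deque(maxlen=window)  # recent normal samples only
        self.factor = factor                 # assumed multiple of the baseline
        self.min_samples = min_samples       # samples needed before judging

    def observe(self, latency_ms):
        """Record a sample; return True if it is anomalous vs. the baseline."""
        anomalous = False
        if len(self.samples) >= self.min_samples:
            baseline = median(self.samples)
            anomalous = latency_ms > self.factor * baseline
        if not anomalous:
            # Keep anomalous samples out of the baseline so that a sustained
            # degradation keeps being flagged instead of becoming "normal".
            self.samples.append(latency_ms)
        return anomalous


if __name__ == "__main__":
    monitor = LatencyMonitor()
    for sample in [50, 52, 48, 51, 49, 400, 55]:
        if monitor.observe(sample):
            print(f"anomaly: {sample} ms")
```

A median baseline is used here because a single extreme sample shifts it less than it would shift a mean; a production system would more likely rely on the monitoring provider's own alerting rules.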

Conclusion

We sincerely apologize for the inconvenience and disruption this incident may have caused our customers during the impact window of 11:20 UTC to 17:30 UTC (7:20 ET to 13:30 ET) on September 13, 2023. We appreciate your patience and understanding throughout the incident resolution process. Our commitment to providing reliable and performant services remains unwavering, and we will continue to work diligently to improve our systems and prevent future incidents.

If you have any further questions or require additional information, please do not hesitate to reach out to us. Thank you for your continued support.

Posted Sep 14, 2023 - 10:06 EDT

Resolved

We are pleased to inform you that we are closing this incident with the following important notes:

1. Platform Performance: Our platform's performance has returned to its normal levels since approximately 1:28 PM (Eastern Time). We've closely monitored the situation, and the intermittent connectivity errors that were affecting our services have now disappeared.

2. Ongoing Investigation: While the immediate issue has been resolved, we continue to work closely with the Azure Operations Support System (OSS) team to conduct a comprehensive root cause analysis. Our joint efforts aim to identify the underlying reasons for the incident.

3. Future Updates: As soon as we gather more information and insights from our collaboration with the Azure OSS team, we will provide a post-mortem update on this incident. This update will offer a detailed account of our investigation results and outline our plan of action to prevent a recurrence of this issue.

We sincerely apologize for any inconvenience or disruption this incident may have caused to our customers' operations. Our team is committed to ensuring the reliability and performance of our services, and we appreciate your patience and understanding throughout this process.

If you have any further questions or require additional information, please do not hesitate to reach out to us.

Posted Sep 13, 2023 - 16:42 EDT

Monitoring

Here is an update on the recent server connectivity issues that were impacting our web and mobile app performance. Based on our observations, the server connectivity issues appear to be resolved, and our systems are now showing signs of stability.

However, we are still actively monitoring our systems to ensure that everything remains in good working order. We understand the importance of a comprehensive analysis to prevent future occurrences, and to that end, we are eagerly awaiting further information from the Microsoft Azure Operations Support System (OSS) team. We hope that their expertise will help us pinpoint the root cause of the issue, allowing us to take any necessary preventive measures going forward.

We appreciate your patience and understanding as we continue to work on this matter, and we will keep you updated as soon as we receive more information from the Microsoft Azure OSS team. If you have any questions or concerns in the meantime, please don't hesitate to reach out to us.

Posted Sep 13, 2023 - 13:39 EDT

Investigating

We are investigating the degraded performance of our web and mobile applications, which appears to stem from connectivity problems with our servers. Our initial examination has not uncovered any issues originating from our side. We have therefore contacted our cloud hosting provider, Microsoft Azure, and are working with their Operations Support System (OSS) team to investigate further.

Posted Sep 13, 2023 - 10:59 EDT

This incident affected: Web applications (Dispatch Web Platform, Customer Tracker), APIs (Public API, Operator API), and Services (Routing and Itinerary Optimization, Notifications, Outbound Email Service, Outbound SMS Service).