Unplanned Platform Downtime (Cloud Vendor Outage)

Incident Report for Cigo

Postmortem

We want to provide you with an update on the January 21st incident that impacted our services. Here's a breakdown of the situation:

Incident Summary: On January 20th, 2024, at around 9 PM EST, an internal maintenance process run by the Azure OSS team made a configuration change to Azure Resource Manager. This change caused Azure Resource Manager nodes to fail repeatedly on startup.

Root Cause: The configuration change triggered a self-reinforcing feedback loop: as nodes failed, their load shifted to the remaining Azure Resource Manager nodes and overwhelmed them, causing a rapid drop in availability. This, in turn, affected our backend storage, leading to random failures on data plane API calls. These failures disrupted our database server and caused intermittent crashes, particularly between 12 AM and 2 AM EST.

Resolution: The Azure engineering team worked to address the issue and was able to fully resolve it around 4 AM EST on January 21st, 2024.

Preventive Measures: To prevent similar incidents in the future, we are closely reviewing our internal processes and working collaboratively with the Azure OSS team to implement additional safeguards.

We sincerely apologize for any inconvenience this may have caused, and we appreciate your understanding as we continue to enhance our systems to provide you with a more reliable experience.

Posted Jan 29, 2024 - 10:52 EST

Resolved

The incident has been successfully resolved. We're now awaiting the Root-Cause Analysis report from Microsoft Azure's team. Once received, we'll compile a post-mortem of the event to provide you with a comprehensive overview.

Posted Jan 21, 2024 - 08:24 EST

Update

Our team is actively monitoring our database server instances to guarantee service availability. While things are looking positive with the recent improvements, we are awaiting official confirmation from the Azure team to ensure that the problem has been fully resolved.

Posted Jan 21, 2024 - 03:30 EST

Monitoring

It appears that Microsoft Azure has successfully implemented a fix, and our database server connections are now operational.

However, we are currently awaiting official confirmation from the Microsoft Operational Systems Support team to validate that the issue has been fully mitigated.

Thank you for your continued understanding.

Posted Jan 21, 2024 - 02:57 EST

Identified

Our ongoing investigation into the connectivity disruption impacting a subset of our database servers and various Azure services has identified an issue on Microsoft's end. The Microsoft Operational Systems Support team has acknowledged the problem and is actively addressing it.

We are awaiting further updates from their team and will share new information as soon as it becomes available. Your patience during this time is sincerely appreciated.

Posted Jan 21, 2024 - 02:23 EST

Investigating

We are currently experiencing extended database downtime stemming from Microsoft Azure's database instances. The downtime has occurred intermittently since 9:45 PM (EST) and has recurred more frequently since 12 AM (EST). Our team is actively investigating the root cause of this service disruption. We appreciate your patience as we work to resolve this issue promptly.

Posted Jan 21, 2024 - 01:40 EST

This incident affected: Web applications (Dispatch Web Platform, Customer Tracker), APIs (Public API, Operator API), Mobile applications (iOS, Android), and Services (Routing and Itinerary Optimization, Maps, Notifications, Outbound Email Service, Outbound SMS Service).