Network Degradation Impacting Cigo Services

Incident Report for Cigo

Postmortem

📣 Incident Summary – March 18–19, 2025

Service Impact on Cigo Tracker due to Azure Regional Outage

On March 18–19, 2025 (UTC), Cigo Tracker experienced intermittent service disruption caused by a regional outage in Microsoft Azure’s East US region. Below is a summary of the root cause, the impact, and the steps being taken to prevent a recurrence.

🕒 What Happened?

Azure’s East US region suffered two separate impact windows:

March 18, 13:37 to 16:52 UTC
March 18, 23:20 to March 19, 00:30 UTC
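
For reference, the length of each window can be worked out directly from the timestamps above. The short sketch below is illustrative only; the variable names are our own and it uses nothing beyond the two windows listed.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc

# The two impact windows exactly as listed above (all times UTC).
windows = [
    (datetime(2025, 3, 18, 13, 37, tzinfo=UTC), datetime(2025, 3, 18, 16, 52, tzinfo=UTC)),
    (datetime(2025, 3, 18, 23, 20, tzinfo=UTC), datetime(2025, 3, 19, 0, 30, tzinfo=UTC)),
]

# Duration of each window, and the combined time across both.
durations = [end - start for start, end in windows]
total = sum(durations, timedelta())

for (start, end), duration in zip(windows, durations):
    print(f"{start:%b %d %H:%M} to {end:%b %d %H:%M} UTC: {duration}")
print(f"Total: {total}")
```

This works out to 3 h 15 min for the first window and 1 h 10 min for the second, or 4 h 25 min in total.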

The incident was triggered by a third-party fiber cut during external drilling work, which caused reduced network capacity in one of Azure’s Availability Zones. A tooling failure during Azure's recovery efforts later reintroduced traffic prematurely, leading to congestion and a second round of intermittent connectivity issues.

🔍 Root Cause

Fiber Cut: A construction-related accident physically damaged fiber cabling serving the East US datacenter, degrading network capacity.
Router Maintenance: A key router in the same zone was already under repair, limiting redundancy.
Tooling Error: Azure’s automated recovery system failed to fully isolate damaged infrastructure, inadvertently reintroducing traffic too early.
Congestion Spillover: The unexpected traffic load caused congestion to spread beyond the affected Availability Zone (AZ03) into neighboring zones.

🎯 Impact on Cigo Tracker

While the Azure issue only affected a subset of inter-zone traffic in East US, this included infrastructure we rely on, resulting in intermittent connectivity issues for some customers during the incident windows. Core services were restored once Azure manually completed isolation and fiber recovery work.
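
Intermittent connectivity of this kind is the situation client-side retries are designed to absorb. The sketch below is a generic illustration of retrying with exponential backoff and jitter. It is not Cigo's implementation or an official client: `retry_with_backoff`, its parameters, and the simulated `flaky` call are all hypothetical names chosen for the example.

```python
import random
import time


def retry_with_backoff(call, attempts=5, base_delay=0.5, max_delay=8.0, sleep=time.sleep):
    """Run call(); on ConnectionError or TimeoutError, wait and try again.

    All names and default values here are illustrative assumptions.
    """
    for attempt in range(attempts):
        try:
            return call()
        except (ConnectionError, TimeoutError):
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Exponential backoff capped at max_delay, with full jitter so that
            # many clients do not all retry at the same instant.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))


# Hypothetical stand-in for a network call that fails twice, then succeeds.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("simulated connectivity drop")
    return "ok"


# sleep is stubbed out so the demonstration returns immediately.
result = retry_with_backoff(flaky, sleep=lambda seconds: None)
print(result, state["calls"])
```

Retries like this help with short drops of seconds to a few minutes. They do not replace the provider-side recovery described in this report.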

🛠 Resolution Timeline

13:37 UTC, Mar 18 – Outage begins due to fiber cut
13:55 UTC – Initial mitigation starts; traffic rerouted
16:52 UTC – First impact window ends
23:20 UTC – Second outage begins due to tooling error during recovery
00:30 UTC, Mar 19 – Final mitigation complete
06:50 UTC – Full restoration of all infrastructure

✅ What Azure is Doing to Prevent Recurrence

Fixing tooling failures that allowed reintroduction of unready capacity (by May 2025)
Accelerating a capacity upgrade for the East US datacenter (by July 2025)
Architecting better safeguards to prevent impact from spreading across zones (by February 2026)

We apologize for the inconvenience caused. Please rest assured that our team is working closely with Azure and continuing to invest in the resiliency of our platform.

If you have any questions or would like help designing a more resilient setup, feel free to reach out to our support team.

Thank you for your continued trust.

Posted Apr 11, 2025 - 11:41 EDT

Resolved

After monitoring the situation over the past few hours, we can confirm that the immediate impact of the outage has been fully mitigated. Our services are stable, and network performance has returned to normal.

We will provide a more detailed post-mortem once we receive a conclusive report from Microsoft's Azure Operations Support (OSS) team.

Thank you for your patience and understanding. We sincerely apologize for any inconvenience this may have caused.

Posted Mar 18, 2025 - 23:35 EDT

Monitoring

Our services recently experienced network degradation caused by an ongoing issue within Microsoft's Azure infrastructure in the East US region.

What Happened?
According to Azure, between 13:09 UTC and 18:51 UTC, a fiber cut impacted network capacity in the region, leading to intermittent connectivity loss and increased latency. While Azure has since mitigated the issue, we observed disruptions in our own services between 7:20 PM and 8:27 PM (Eastern Time), specifically affecting connections between the Cigo Tracker web app and our Redis service.
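
Because Azure reports times in UTC and our observations are in Eastern Time, it can help to put both on the same clock. The sketch below does that conversion. It assumes Eastern Time on this date means EDT (UTC-4), in line with the EDT timestamps on these updates; the variable names are illustrative.

```python
from datetime import datetime, timedelta, timezone

# Assumption: Eastern Time on March 18, 2025 is EDT, i.e. UTC-4.
EDT = timezone(timedelta(hours=-4), "EDT")
UTC = timezone.utc

# Disruption window observed in our own services (Eastern Time).
observed_start = datetime(2025, 3, 18, 19, 20, tzinfo=EDT)
observed_end = datetime(2025, 3, 18, 20, 27, tzinfo=EDT)

# The same two instants expressed in UTC.
start_utc = observed_start.astimezone(UTC)
end_utc = observed_end.astimezone(UTC)

print(f"{start_utc:%b %d %H:%M} UTC to {end_utc:%b %d %H:%M} UTC")
```

Under that assumption, 7:20 PM to 8:27 PM Eastern corresponds to 23:20 UTC on March 18 through 00:27 UTC on March 19.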

Current Status
As of 8:27 PM (Eastern Time), network latencies have returned to normal, and service stability has been restored. However, to ensure a prompt and complete resolution, we have escalated this matter to Azure's Operations Support with critical priority.

We appreciate your patience and will continue monitoring the situation closely. If you experience any further issues, please reach out to our support team.

Posted Mar 18, 2025 - 20:44 EDT

This incident affected: Web applications (Dispatch Web Platform, Customer Tracker) and APIs (Public API, Operator API).