Email Consumer Disruption and Queue Management Issue Impacting Notification Service

Incident Report for Cigo

Resolved

Incident Summary:
On August 11th, the Email Consumer in the Notification Service stopped functioning at approximately 8:00 PM EDT. The Email queue subsequently filled to capacity, which halted both email and SMS message publishing from 11:00 AM to 2:08 PM EDT. This incident did not impact the rest of the Cigo Tracker service.

Timeline of Events:
- 8:00 PM EDT: The Email Consumer ceased functioning.
- 8:00 PM - 7:00 AM EDT: Messages accumulated in the queue at a significantly reduced rate.
- 7:00 AM - 10:00 AM EDT: The rate at which messages were queued increased substantially.
- 11:00 AM - 2:08 PM EDT: The Email queue reached its capacity and began rejecting new messages. During this period, both email and SMS message publishing were halted, although only email publishing was expected to be affected.

Impact:
- Both email and SMS message publishing were completely halted between 11:00 AM and 2:08 PM EDT. New messages were rejected during this window, which may have delayed notification delivery.

Root Cause:
- The root cause has not yet been confirmed. A suspected code-logic error in the queue checks appears to stop both email and SMS publishing when the Email queue reaches its capacity. The cause of the initial Email Consumer failure is still under investigation.
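
The report does not include the queue-checking code itself. As a minimal sketch of the expected behavior, assuming a hypothetical in-memory publisher (the `ChannelQueues` class and its method names are illustrative and are not taken from the Notification Service), a capacity check scoped to the target channel leaves SMS publishing unaffected when the Email queue is full:

```python
from collections import deque


class ChannelQueues:
    """Hypothetical publisher with one bounded queue per channel."""

    def __init__(self, capacities):
        # capacities maps a channel name to its maximum queue length.
        self._capacities = dict(capacities)
        self._queues = {channel: deque() for channel in self._capacities}

    def is_full(self, channel):
        return len(self._queues[channel]) >= self._capacities[channel]

    def publish(self, channel, message):
        # Check only the queue of the channel being published to.
        # A check across every queue (for example, rejecting when any
        # queue is full) would couple the channels together.
        if self.is_full(channel):
            return False
        self._queues[channel].append(message)
        return True


queues = ChannelQueues({"email": 2, "sms": 2})
results = [
    queues.publish("email", "e1"),
    queues.publish("email", "e2"),
    queues.publish("email", "e3"),  # Email queue is full: rejected
    queues.publish("sms", "s1"),    # SMS queue has room: accepted
]
```

Whether the actual defect has this shape is an assumption. The sketch only illustrates the isolation property that the timeline says was expected.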

Actions Taken:
- An alert was sent to the appropriate internal communications channel when the Email Consumer failed. However, due to a configuration issue, the notification was not received by the intended recipients at the time of the failure.
- The issue was remediated at around 2:00 PM EDT by restarting the Email Consumer and fixing the Email queue.
- The channel configuration has been adjusted to include additional team members and ensure quicker responses in the future.

Next Steps:
- Investigate the root cause of the Email Consumer failure.
- Determine why reaching the maximum queue size for emails also impacted the SMS queue.
- Ensure that the monitoring and alerting systems are fully operational and that all team members are notified promptly of any critical failures.
- Conduct a thorough review of the queue management logic to identify and rectify any underlying issues.
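
For the monitoring item above, one possible check is sketched here. It assumes the queue exposes its current depth and a running count of consumed messages. The `QueueSample` type, the function names, and the thresholds are all hypothetical, not part of the existing alerting setup. The idea is to flag a consumer that has made no progress while a backlog persists, and to warn before a queue reaches capacity:

```python
from dataclasses import dataclass


@dataclass
class QueueSample:
    depth: int           # messages waiting in the queue at sample time
    consumed_total: int  # cumulative messages consumed so far


def consumer_stalled(samples, window=3):
    """Return True if nothing was consumed over the last `window`
    samples while the backlog was non-empty and not shrinking."""
    if len(samples) < window:
        return False
    recent = samples[-window:]
    no_progress = recent[-1].consumed_total == recent[0].consumed_total
    backlog_held = recent[-1].depth > 0 and all(
        later.depth >= earlier.depth
        for earlier, later in zip(recent, recent[1:])
    )
    return no_progress and backlog_held


def capacity_warning(depth, capacity, threshold=0.8):
    """Return True once the queue is at or above `threshold` of capacity."""
    return depth >= capacity * threshold


stalled = [QueueSample(10, 500), QueueSample(25, 500), QueueSample(40, 500)]
healthy = [QueueSample(10, 500), QueueSample(8, 520), QueueSample(12, 540)]
```

With these sample values, `consumer_stalled(stalled)` is true and `consumer_stalled(healthy)` is false. The window size and the 80% threshold are placeholders that would need tuning against real traffic.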

Conclusion:
We are taking appropriate actions to prevent this issue from recurring. Our team is committed to ensuring the reliability of the Notification Service, and we will continue to monitor and improve our systems to provide the best possible service to our customers.
Posted Aug 11, 2024 - 11:00 EDT