S2 - Slowness in Access Control Systems
Incident Report for Proxyclick
Postmortem

We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

 

Type of Event:
S2 - All on-premise ACS Extenders reporting as offline, preventing customer activity.

 

Services/Modules Impacted:
Access Control Systems

 

Root Cause:
A trigger action had been pending since the outage on October 18th, causing the queue to accumulate messages over the following 10 days. When the trigger was identified and updated during working hours on October 28th, processing of this backlog drove up database consumption and slowed the Access Control System.

 

Remediation:
The team proactively investigated the alert and identified that the messaging queue was growing and database consumption was approaching its limits. To address this, the database DTUs were increased, and the queue was actively monitored to ensure messages were processed correctly. Services were restarted as necessary.

This collaborative effort demonstrated the team’s expertise and commitment to promptly resolving issues.
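The report does not state how the DTU increase was applied. As a minimal sketch, assuming an Azure SQL single database managed with the azure-mgmt-sql Python SDK (the report only mentions DTUs, which are an Azure SQL concept), a scale-up of this kind might look like the following; the subscription, resource group, server, database, and target service objective below are placeholders, not values from this incident.

```python
# Minimal sketch: scale an Azure SQL database to a higher DTU service objective.
# All resource names and the target tier are placeholders for illustration only.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import DatabaseUpdate, Sku

SUBSCRIPTION_ID = "<subscription-id>"

client = SqlManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Request a larger service objective (e.g. Standard S4 = 200 DTUs) so the
# backlog of queued messages can be processed faster.
poller = client.databases.begin_update(
    resource_group_name="acs-prod-rg",      # placeholder
    server_name="acs-sql-server",           # placeholder
    database_name="acs-db",                 # placeholder
    parameters=DatabaseUpdate(sku=Sku(name="S4", tier="Standard")),
)
database = poller.result()  # blocks until the scale operation completes
print("Current service objective:", database.current_service_objective_name)
```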

 

Timeline:

All times listed in BST

28 Oct 10:00 - The Eptura CloudOps team implemented a change requested by the Engineering team to update the triggers affected by the outage on 18 Oct.

28 Oct 10:08 - The Eptura CloudOps team received a monitoring alert, initiated troubleshooting, and identified high compute consumption on the database.

28 Oct 10:24 - An issue was reported by the Harbour Exchange Team about the ACS system performing slowly and QR readers not working.

28 Oct 11:00 - The database DTUs were increased so the queue could be processed faster and more efficiently.

28 Oct 13:43 - It was confirmed that messages had been queuing since 18 Oct, waiting for the service to process them, which was making the system slower. The decision was made not to roll back the change, as the messages were already in the queue and still needed to be processed.

28 Oct 15:26 - The status page was updated with the affected component as Access Control Systems.

28 Oct 15:30 - The Eptura CloudOps team continued to monitor the queue and restarted the service at intervals as needed.

28 Oct 16:18 - Incident status updated to Identified, noting that the message queues were returning to normal.

28 Oct 22:36 - Incident status updated to Monitoring after CloudOps restarted the background service. The Eptura CloudOps team monitored the queue until it was completely cleared.

29 Oct 08:23 - The status page was set to Resolved after successful monitoring.

 

Total Duration of Event:
12 hours 28 minutes

 

Preventive Action:

Our focus is on enhancing monitoring, refining management processes, and elevating team training. The Eptura CloudOps team is already actively engaged in several proactive monitoring initiatives, including:

  • Regularly checking the status of the service queue (a sketch of such a check follows this list).
  • Monitoring SEQ error logs to swiftly identify and address any anomalies.
  • Utilizing a dedicated Microsoft Teams channel for real-time updates on system status and alerts.
  • Training the team and implementing checks so that trigger updates are only actioned during maintenance windows and not during weekdays.
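As an illustration of the first item above, the sketch below shows one way a routine queue-status check might look. The report does not name the queueing technology; this example assumes an Azure Service Bus queue and the azure-servicebus Python SDK, and the connection string, queue name, and threshold are hypothetical.

```python
# Minimal sketch of a routine queue-status check, assuming an Azure Service Bus
# queue (the queueing technology is not named in the report). The connection
# string, queue name, and threshold below are placeholders.
from azure.servicebus.management import ServiceBusAdministrationClient

CONNECTION_STRING = "<service-bus-connection-string>"
QUEUE_NAME = "acs-extender-events"      # hypothetical queue name
BACKLOG_THRESHOLD = 10_000              # hypothetical alerting threshold

with ServiceBusAdministrationClient.from_connection_string(CONNECTION_STRING) as admin:
    props = admin.get_queue_runtime_properties(QUEUE_NAME)
    print(f"active={props.active_message_count}, "
          f"dead-lettered={props.dead_letter_message_count}")
    if props.active_message_count > BACKLOG_THRESHOLD:
        # In practice this would raise an alert (see the notification sketch
        # further below) rather than just print.
        print("Queue backlog above threshold - investigate message processing.")
```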

By implementing these proactive measures, we are fortifying our systems and ensuring a reliable experience for all customers. Thank you for your continued trust as we strive to enhance our services.

In addition to these proactive measures, we are also working on long-term preventive strategies to further enhance system reliability:

  • Enhanced Monitoring and Alerting: We are strengthening our alerting mechanisms by automating and implementing more checks for real-time notifications and self-healing. The Eptura CloudOps team is also scheduling routine health checks of our systems to proactively identify and mitigate potential issues, thereby enhancing overall reliability. A sketch of one such automated notification check follows this list.
  • Architecture Review Planning: The Eptura Engineering team is planning an Architecture Review in 2025, focusing on improving job behaviors and notification processes. This review will ensure our systems remain robust and efficient.
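To illustrate the kind of automated check and real-time notification described above, the sketch below posts an alert into a Microsoft Teams channel via an incoming webhook when a monitored value crosses a threshold. The webhook URL, the check, and the threshold are placeholders for illustration, not details from this incident.

```python
# Minimal sketch: push a real-time alert into a dedicated Microsoft Teams
# channel via an incoming webhook when a check fails. The webhook URL and the
# check itself are placeholders for illustration only.
import requests

TEAMS_WEBHOOK_URL = "<teams-incoming-webhook-url>"  # placeholder

def notify_teams(message: str) -> None:
    """Post a simple text alert to the Teams channel incoming webhook."""
    response = requests.post(TEAMS_WEBHOOK_URL, json={"text": message}, timeout=10)
    response.raise_for_status()

def check_queue_backlog(active_messages: int, threshold: int = 10_000) -> None:
    """Example check: alert the channel if the queue backlog exceeds a threshold."""
    if active_messages > threshold:
        notify_teams(
            f"ALERT: service queue backlog is {active_messages} messages "
            f"(threshold {threshold}). Please investigate message processing."
        )

if __name__ == "__main__":
    check_queue_backlog(active_messages=12_345)  # hypothetical reading
```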
Posted Dec 03, 2024 - 11:07 CET

Resolved
We are pleased to inform you that the issue with Access Control Systems has been resolved. Our CloudOps team has completed the necessary actions and verified that the service is functioning as expected.

A Root Cause Analysis (RCA) will be conducted to understand the incident in detail. It will be available on our Status Page within 10 days.

Thank you for your patience and cooperation throughout this process.
Posted Oct 29, 2024 - 09:23 CET
Monitoring
We have implemented a solution for the issue affecting the Access Control system and are currently monitoring the situation to ensure stability and performance. Our CloudOps team is overseeing the process to confirm that the issue has been fully resolved.

Next update: 29 October 10:00 CEST
Posted Oct 28, 2024 - 23:36 CET
Update
We continue to investigate the service interruption, which may cause delays in access control operations.

Our Engineering team is actively working to determine the root cause of the disruption and assess its impact.

We will provide our next update in 4 hours.

Thank you for your patience as we work to resolve this issue.
Posted Oct 28, 2024 - 21:16 CET
Identified
We have identified the issue causing the recent service outage and are pleased to inform you that the message queues are now clearing. Due to the high volume of messages, it will take 3-4 hours for everything to return to normal. Rest assured, we are actively monitoring the situation to ensure a smooth resolution.

We will provide the next update in 4 hours.

Thank you for your patience and understanding.
Posted Oct 28, 2024 - 17:18 CET
Investigating
We have noticed a spike and high utilization in our backend service queues; as a result, you may experience delays in access control operations. Our Engineering and CloudOps teams are investigating the issue further.
Apologies for any inconvenience this has caused.
Posted Oct 28, 2024 - 16:26 CET
This incident affected: Access Control.