We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Type of Event:
S2 - All on-premise ACS Extenders reporting as offline, preventing customer activity.
Services/Modules Impacted:
Access Control Systems
Root Cause:
The trigger action, pending since the outage on October 18th, was identified and updated during working hours. As a result, the messages that had accumulated over the preceding 10 days were queued for processing, which slowed down the Access Control System.
Remediation:
The team proactively investigated the alert and identified that message queue and database consumption were increasing. To address this, the database DTUs were increased, and active monitoring of the queue was implemented to ensure proper message processing. Services were restarted as necessary.
This collaborative effort demonstrated the team’s expertise and commitment to promptly resolving issues.
Timeline:
All times listed in BST
28 Oct 10:00 - The Eptura CloudOps team implemented a change requested by the Engineering team to update the triggers affected by the outage on 18th Oct.
28 Oct 10:08 - The Eptura CloudOps team received a monitoring alert, initiated troubleshooting, and identified high compute consumption on the database.
28 Oct 10:24 - The Harbour Exchange Team reported that the ACS system was performing slowly and QR readers were not working.
28 Oct 11:00 - The Database DTUs were increased to handle the processing of the queue faster and more efficiently.
28 Oct 13:43 - Messages accumulated since 18th Oct were queuing for the service to process, making the system slower. It was decided not to roll back the changes, as the messages were already in the queue and needed to be processed.
28 Oct 15:26 - The status page was updated with the affected component as Access Control Systems.
28 Oct 15:30 - The Eptura CloudOps team continued to monitor the queue and restarted the service at intervals as needed.
28 Oct 16:18 - Incident status updated to Identified with the message that the message queues are returning to normal.
28 Oct 22:36 - Incident status updated to Monitoring as the background service was restarted by CloudOps. The Eptura CloudOps team monitored the queue until it was completely cleared.
29 Oct 08:23 - The status page was set to Resolved after successful monitoring.
Total Duration of Event:
12 hours 28 minutes
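The report does not state which timeline entries bound this figure. As a quick check, the sketch below assumes the duration runs from the 10:08 monitoring alert to the 22:36 Monitoring status update on 28 Oct; the helper name `elapsed` is illustrative only.

```python
from datetime import datetime

def elapsed(start_hhmm, end_hhmm):
    """Return (hours, minutes) between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end_hhmm, fmt) - datetime.strptime(start_hhmm, fmt)
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)

# Assumed bounds: first alert (10:08) to Monitoring status (22:36), both on 28 Oct.
hours, minutes = elapsed("10:08", "22:36")
print(f"{hours} hours {minutes} minutes")
```

Under that assumption, the result agrees with the duration stated above.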
Preventive Action:
Our focus is on enhancing monitoring, refining management processes, and elevating team training. The Eptura CloudOps team is already actively engaged in several proactive monitoring initiatives, including:
By implementing these proactive measures, we are fortifying our systems and ensuring a reliable experience for all customers. Thank you for your continued trust as we strive to enhance our services.
In addition to these proactive measures, we are also working on long-term preventive strategies to further enhance system reliability: