S2 - All On Premise Access Control Systems are down

Incident Report for Eptura Visitor

Postmortem

We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

Type of Event:
S2 - All on-premise ACS Extenders reported as offline, preventing customer activity.

Services/Modules Impacted:
Access Control Systems

Root Cause:
The Halibut connection experienced a temporary disruption, which paused event synchronization (visitors, visits, check-ins, check-outs). An earlier certificate outage on September 15 complicated the investigation, but it was not the root cause of this incident. We found that the check client status handler, which is designed to update every minute, had paused because of a discrepancy in a Quartz.NET scheduling parameter. Comparing production with staging showed that scheduled job timing updates were not occurring in production, which led to the disconnection.
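
The failure mode described above, a recurring job that silently stops firing, can be caught by a staleness check on the job's last-run timestamp. The following is a minimal Python sketch, not Eptura's implementation: the function name, the grace factor, and the sample timestamps are illustrative assumptions, and only the one-minute cadence comes from this report.

```python
from datetime import datetime, timedelta, timezone

def job_is_stale(last_run, now, interval=timedelta(minutes=1), grace=3):
    """Return True when a recurring job has missed more than `grace` intervals.

    All names and thresholds are hypothetical and chosen for illustration.
    """
    return now - last_run > interval * grace

now = datetime(2024, 10, 21, 12, 0, tzinfo=timezone.utc)
# Last ran 30 seconds ago: within the allowed window.
print(job_is_stale(now - timedelta(seconds=30), now))   # False
# Last ran 10 minutes ago: more than three one-minute intervals missed.
print(job_is_stale(now - timedelta(minutes=10), now))   # True
```

A check of this kind would need to run outside the scheduler it watches, so that a stalled scheduler cannot also stall its own alarm.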

Remediation:
The team investigated the Halibut client configuration and validated the certificate thumbprints. The Eptura CloudOps and Engineering teams then reviewed other system components and the message processing mechanisms, and confirmed that all pending tasks were addressed. Further investigation traced the problem to the scheduling system, specifically to timing parameters that were not being updated.

The Eptura CloudOps team then updated the parameters directly in the database, which restored the Halibut connection.

Timeline:

All times listed in EST

18 Oct 12:05 - Brookfield reported that the ACS system was not working.

18 Oct 15:00 - The support team identified that all ACS extenders were showing offline in Blackgate and started investigating.

18 Oct 15:12 - The support team confirmed that the API and certificates were in good order and continued investigating. Requested a screenshot of the SCA software from the customer.

18 Oct 15:55 - More locations from Brookfield started reporting issues.

18 Oct 18:00 - Customer confirmed that ACS is connected without any issues on their side.

19 Oct 08:55 - Support logged a high-priority bug with the Engineering team.

20 Oct 15:43 - Support raised a Fire Alarm for Eptura CloudOps and the Engineering team to investigate the issue. S2 for Eptura CloudOps was created.

21 Oct 05:53 - The status page was updated indicating the S2 outage for customer notification.

21 Oct 14:00 - Issue identified, the fix was implemented, and the status page was updated to monitoring.

22 Oct 05:17 - The status page was set to Resolved after successful monitoring.

Total Duration of Event:
86 hours
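
As a cross-check against the timeline, the 86-hour figure matches the span from the 18 Oct 15:00 entry to the 22 Oct 05:17 entry. Which endpoints were actually used is an assumption, since the report does not state them. A short Python sketch of the arithmetic, with illustrative variable names:

```python
from datetime import datetime

def hours_between(start, end):
    """Elapsed time between two timeline entries, in hours."""
    return (end - start).total_seconds() / 3600

# Timestamps copied from the timeline above (same time zone throughout).
first_report = datetime(2024, 10, 18, 12, 5)
extenders_seen_offline = datetime(2024, 10, 18, 15, 0)
status_resolved = datetime(2024, 10, 22, 5, 17)

print(round(hours_between(extenders_seen_offline, status_resolved), 1))  # 86.3
print(round(hours_between(first_report, status_resolved), 1))            # 89.2
```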

Preventive Action:

Our focus is on enhancing monitoring, refining management processes, and elevating team training. The Eptura CloudOps team is already actively engaged in several proactive monitoring initiatives, including:

Regularly checking the status of the Blackgate dashboard to ensure optimal extender performance.
Monitoring SEQ error logs to swiftly identify and address any anomalies.
Utilizing a dedicated Microsoft Teams channel for real-time updates on system status and alerts.

By implementing these proactive measures, we are fortifying our systems and ensuring a reliable experience for all customers. Thank you for your continued trust as we strive to enhance our services.

In addition to these proactive measures, we are also working on long-term preventive strategies to further enhance system reliability:

Enhanced Monitoring and Alerting: We are strengthening our alerting mechanisms by automating and implementing more checks for real-time notifications and self-healing. The Eptura CloudOps team is also scheduling routine health checks of our systems to proactively identify and mitigate potential issues, thereby enhancing overall reliability.
Architecture Review Planning: The Eptura Engineering team is planning an Architecture Review in 2025, focusing on improving job behaviors and notification processes. This review will ensure our systems remain robust and efficient.
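
One form the automated checks described above could take is a fleet-wide rule that raises an alert when most extenders report offline at the same time, as happened in this incident. The Python sketch below is illustrative only: the function name, the 90% threshold, the extender identifiers, and the shape of the status data are assumptions, and it does not represent an actual Blackgate interface.

```python
def should_alert(extender_online, offline_threshold=0.9):
    """Return True when the offline share of extenders meets the threshold.

    `extender_online` maps a (hypothetical) extender id to True if online.
    """
    if not extender_online:
        return False  # no data: nothing to evaluate
    offline = sum(1 for online in extender_online.values() if not online)
    return offline / len(extender_online) >= offline_threshold

# One extender down out of four: likely a local problem, no fleet alert.
print(should_alert({"ext-a": True, "ext-b": True, "ext-c": False, "ext-d": True}))  # False
# Every extender down: fleet-wide alert.
print(should_alert({"ext-a": False, "ext-b": False, "ext-c": False, "ext-d": False}))  # True
```

Evaluating the fleet as a whole, rather than each extender alone, is a design choice intended to separate a single-site outage from a platform-wide one.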

Posted Nov 05, 2024 - 11:34 UTC

Resolved

We are pleased to inform you that the issue has now been resolved. Our teams have completed the necessary actions and verified that the service is now functioning normally.

A Root Cause Analysis (RCA) will be conducted to understand the incident in detail and will be made available on our Status Page within 10 days.

Thank you for your patience and cooperation throughout this process. If you have any further questions or concerns, please feel free to reach out.

Posted Oct 22, 2024 - 08:17 UTC

Monitoring

We have implemented a solution for the issue affecting the Access Control system and are currently monitoring the situation to ensure stability and performance. Our CloudOps team is overseeing the process to confirm that the issue has been fully resolved.

Next Update: 11 am CEST

Posted Oct 21, 2024 - 17:00 UTC

Update

We are continuing to investigate the issue. We have ruled out any certificate errors and are now investigating connection logs. We will keep you updated on the progress.

Next Update: 7 pm CEST

Posted Oct 21, 2024 - 14:50 UTC

Update

We are continuing to investigate the issue. We have ruled out any certificate errors and are now investigating connection logs. We will keep you updated on the progress.

Next Update: 7 pm CEST

Posted Oct 21, 2024 - 14:49 UTC

Update

We are continuing to investigate the issue. We appreciate your patience.

Next Update: 05:00 PM CEST

Posted Oct 21, 2024 - 12:01 UTC

Investigating

All On-premise Access Control Extenders are offline. Our teams are actively investigating the issue and we will update you on the progress.

Next Update: 02:00 PM CEST

Posted Oct 21, 2024 - 08:53 UTC

This incident affected: Access Control.