We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Type of Event:
S2 - All On-premise, ACS Extenders reporting as offline preventing customer activity.
Services/Modules Impacted:
Access Control Systems
Root Cause:
The Halibut connection experienced a temporary disruption, which paused event synchronization (visitors, visits, check-ins, check-outs). Despite a previous certificate outage on September 15 adding complexity, it was not the root cause of the current issue. Through diligent investigation, we discovered that the check client status handler, designed to update every minute, had paused due to a Quartz .NET library scheduling parameter discrepancy. Comparing production with staging revealed that scheduled job timing updates were not occurring in production, leading to the disconnection.
Remediation:
The team proactively investigated the Halibut client configuration and successfully validated the certificate thumbprints. Eptura CloudOps and Engineering teams diligently explored various system components and message processing mechanisms, ensuring all pending tasks were addressed. Further investigation revealed an opportunity for improvement with the scheduling system, specifically in updating timing parameters.
Taking swift action, the Eptura CloudOps team updated the parameters directly in the database, effectively restoring the Halibut connection. This collaborative effort showcased the team's expertise and commitment to resolving issues promptly.
Timeline:
All times listed in EST
18 Oct 12:05 - Issue Reported by Brookfield about the ACS system not working.
18 Oct 15:00 - The support team identified that all ACS extenders are showing offline in Blackgate and started investigating.
18 Oct 15:12 - Support Team confirmed API and Certifcates are all good, continue investiagting. Requested screenshot from SCA software form the customer.
18 Oct 15:55 - More locations from Brookfield started reporting issues.
18 Oct 18:00 - Customer confirmed that ACS is connected without any issues on their side.
19 Oct 08:55 - Support logged a high-priority bug with the Engineering team.
20 Oct 15:43 - Support raised a Fire Alarm for Eptura CloudOps and the Engineering team to investigate the issue. S2 for Eptura CloudOps was created.
21 Oct 05:53 - The status page was updated indicating the S2 outage for customer notification.
21 Oct 14:00 - Issue identified, the fix was implemented, and the status page was updated to monitoring.
22 Oct 05:17 - The status page was set to resolve after successful monitoring.
Total Duration of Event:
86 hours
Preventive Action:
Our focus is on enhancing monitoring, refining management processes, and elevating team training. The Eptura CloudOps team is already actively engaged in several proactive monitoring initiatives, including:
By implementing these proactive measures, we are fortifying our systems and ensuring a reliable experience for all customers. Thank you for your continued trust as we strive to enhance our services.
In addition to these proactive measures, we are also working on long-term preventive strategies to further enhance system reliability: