Proxyclick platform authentication failures

Incident Report for Eptura Visitor

Postmortem

We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

Type of Event:
S1 – Visitor not accessible

Services/Modules Impacted:
Visitor API not responsive, login (forms-based and SAML SSO) blocked

Root Cause:
The storage cluster surpassed its capacity, preventing the creation of new producers. Consequently, the Visitor messaging system experienced a failure, rendering Eptura Visitor inaccessible to all clients.

Remediation:
Eptura first increased the disk size so that the service could accept additional messages. This temporary fix gave us time to investigate a more permanent resolution. We pinpointed a particular issue that had previously been handled through manual cleanup as part of our monitoring efforts. Following the cleanup, we restored the message service to operational status, and activity normalized as Visitor services were restored.

Timeline:

All times listed in U.S. Central Time
7:28 a.m.: Monitoring alerts indicated that Eptura exceeded its storage capacity.
7:34 a.m.: An incident was reported, prompting Eptura to start an investigation.
7:53 a.m.: Eptura updated the Visitor status page to reflect the incident and investigation.
11:03 a.m.: Eptura identified the root cause and implemented measures to mitigate the service disruption.
12:10 p.m.: Eptura redeployed API nodes to allocate additional server resources.
12:18 p.m.: System performance remained impacted due to a backlog of pending API requests.
1:11 p.m.: Eptura confirmed that the backlog cleared and resumed monitoring.
4:33 p.m.: Following a successful monitoring period, Eptura marked the incident as resolved.

Total Duration of Event:
9 hours, 5 minutes
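
As a cross-check, the duration follows from the first and last timeline entries. The short Python sketch below assumes the incident date of Aug 01, 2024 taken from the status updates and a fixed UTC-5 offset for U.S. Central Daylight Time; the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: U.S. Central Daylight Time (UTC-5) was in effect on Aug 01, 2024.
CDT = timezone(timedelta(hours=-5), "CDT")

first_alert = datetime(2024, 8, 1, 7, 28, tzinfo=CDT)  # 7:28 a.m. monitoring alert
resolved = datetime(2024, 8, 1, 16, 33, tzinfo=CDT)    # 4:33 p.m. marked as resolved

# Elapsed time between the first alert and resolution.
duration = resolved - first_alert
hours, remainder = divmod(int(duration.total_seconds()), 3600)
minutes = remainder // 60

print(f"{hours} hours, {minutes} minutes")
# The same resolution time expressed in UTC, for comparison with the "Posted" stamps.
print(resolved.astimezone(timezone.utc).strftime("%H:%M UTC"))
```

The UTC value printed on the second line can be compared with the timestamp on the "Resolved" status update.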

Preventive Action:
We have implemented automated storage cluster cleanup processes and enhanced monitoring, and will continue to examine both manual and automated enhancements to optimize cluster sizing and management moving forward.
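
For readers who want a picture of what threshold-based storage monitoring can look like, the following is a minimal, generic sketch. It is not Eptura's tooling: the function names and the 75% and 90% thresholds are hypothetical, chosen only for illustration.

```python
import shutil


def storage_alert_level(used_bytes, total_bytes, warn_at=0.75, critical_at=0.90):
    """Classify storage usage as 'ok', 'warning', or 'critical'.

    The thresholds are hypothetical defaults, not values from the incident.
    """
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    ratio = used_bytes / total_bytes
    if ratio >= critical_at:
        return "critical"
    if ratio >= warn_at:
        return "warning"
    return "ok"


def check_path(path):
    """Check the volume that holds `path` using the standard library."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return storage_alert_level(usage.used, usage.total)


if __name__ == "__main__":
    # Example: report the alert level for the current working directory's volume.
    print(check_path("."))
```

Alerting at a warning level well below full capacity leaves time for cleanup or resizing before new messages can no longer be accepted.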

Posted Sep 04, 2024 - 16:05 UTC

Resolved

We have completed remediation and confirmed that all queues have returned to normal volumes and performance has stabilized. We will continue to closely monitor the status during our root cause analysis and will release our findings as a final update to this incident within 10 business days.

Posted Aug 01, 2024 - 21:33 UTC

Update

We are continuing to monitor the processing of API requests. The next update will be at 23:00 CEST.

Posted Aug 01, 2024 - 19:14 UTC

Monitoring

We have allocated additional server resources to expedite recovery and are redeploying API nodes to initiate this change. We will continue to monitor the incident until further notice. The next update will be at 21:00 CEST.

Posted Aug 01, 2024 - 18:10 UTC

Update

While we have addressed the initial root cause of the disruption, performance continues to be affected by the backlog of pending API requests. We have allocated additional server resources to expedite recovery and are redeploying API nodes to initiate this change. We continue to monitor progress and will post our next update at 20:00 CEST.

Posted Aug 01, 2024 - 17:18 UTC

Update

We are pleased to report that we have identified the root cause and are implementing mitigations to address the service disruption. Please allow time for these updates to propagate through the infrastructure. As service stabilizes, you may still experience intermittent disruption. We will provide updates as we continue to monitor; the next update will be issued at 19:00 CEST.

Posted Aug 01, 2024 - 16:03 UTC

Identified

A potential root cause of the issue has been identified. Our engineering and infrastructure teams are working towards implementing a solution to restore the service as soon as possible.

Posted Aug 01, 2024 - 14:19 UTC

Update

We are continuing to investigate this issue.

Posted Aug 01, 2024 - 13:13 UTC

Investigating

We are currently investigating reports of user authentication failures across the platform.

Posted Aug 01, 2024 - 12:53 UTC

This incident affected: Dashboard, iPad app, Browser-based Kiosk, SMS, Mail, Slack, Skype for Business, API, Webhooks, Box integration, Dropbox integration, OneDrive integration, Calendar integration, Salesforce integration, Exports, and Proovr.