We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Type of Event:
S1 – Visitor not accessible
Services/Modules Impacted:
Visitor API not responsive, login (forms-based and SAML SSO) blocked
Root Cause:
The storage cluster surpassed its capacity, preventing the creation of new producers. Consequently, the Visitor messaging system experienced a failure, rendering Eptura Visitor inaccessible to all clients.
Remediation:
Eptura first increased the disk size so that the messaging service could accept additional messages. This temporary fix gave us time to investigate a more permanent resolution. We identified a specific issue that had previously been handled through manual cleanup as part of our monitoring efforts. After completing the cleanup, we returned the messaging service to operational status, which normalized activity and restored Visitor services.
Timeline:
All times are listed in U.S. Central Time.
7:28 a.m.: Monitoring alerts indicated that the storage cluster had exceeded its capacity.
7:34 a.m.: An incident was reported, prompting Eptura to begin an investigation.
7:53 a.m.: Eptura updated the Visitor status page to reflect the incident and investigation.
11:03 a.m.: Eptura identified the root cause and implemented measures to mitigate the service disruption.
12:10 p.m.: Eptura redeployed API nodes to allocate additional server resources.
12:18 p.m.: System performance remained impacted due to a backlog of pending API requests.
1:11 p.m.: Eptura confirmed that the backlog had cleared and resumed monitoring.
4:33 p.m.: Following a successful monitoring period, Eptura marked the incident as resolved.
Total Duration of Event:
9 hours, 5 minutes
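The total duration follows from the first and last timeline entries (7:28 a.m. to 4:33 p.m., same day). As a minimal sketch of that arithmetic, assuming both timestamps fall on the same calendar day and using a hypothetical helper name `event_duration`:

```python
from datetime import datetime


def event_duration(start: str, end: str) -> tuple[int, int]:
    """Return (hours, minutes) between two same-day 24-hour "HH:MM" times."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    total_minutes = int(delta.total_seconds()) // 60
    return divmod(total_minutes, 60)


# 7:28 a.m. alert to 4:33 p.m. resolution, written in 24-hour form.
hours, minutes = event_duration("07:28", "16:33")
print(f"{hours} hours, {minutes} minutes")
```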
Preventive Action:
We have implemented automated storage cluster cleanup processes and enhanced monitoring. We will continue to evaluate both manual and automated improvements to cluster sizing and management.
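For readers who want a concrete picture of what a storage capacity check can look like, the following is an illustrative sketch only and does not describe Eptura's actual tooling. The function names `classify_usage` and `check_path` and the 75% and 90% thresholds are assumptions chosen for the example.

```python
import shutil
from typing import NamedTuple


class CapacityStatus(NamedTuple):
    used_fraction: float
    level: str  # "ok", "warn", or "critical"


def classify_usage(used_bytes: int, total_bytes: int,
                   warn_at: float = 0.75,
                   critical_at: float = 0.90) -> CapacityStatus:
    """Classify storage usage against assumed warning and critical thresholds."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    fraction = used_bytes / total_bytes
    if fraction >= critical_at:
        level = "critical"
    elif fraction >= warn_at:
        level = "warn"
    else:
        level = "ok"
    return CapacityStatus(fraction, level)


def check_path(path: str = "/") -> CapacityStatus:
    """Read usage for a local filesystem path and classify it."""
    usage = shutil.disk_usage(path)
    return classify_usage(usage.used, usage.total)
```

Alerting at a warning level well below full capacity leaves time to run cleanup or add storage before the service stops accepting new messages.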