S2 - Elevated response times in the admin dashboard

Incident Report for Eptura Visitor

Postmortem

We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

Type of Event:
S2 – Visitor (PXC) dashboard intermittent performance issues

Services/Modules Impacted:
Visitor Dashboard

Root Cause:
Both API servers picked up the same service job at the same time while the servers were already operating at full capacity, which increased server traffic. This temporarily saturated the servers, causing them to drop incoming connections. The extra load also lengthened the completion time of existing requests, which intermittently degraded dashboard performance.

Remediation:
Eptura restarted the API services to address the intermittent issue, which provided immediate relief and gave us time to conduct an in-depth investigation and devise a more permanent solution. We identified a specific configuration issue within the API servers. Our CloudOps team addressed it by adjusting the service configurations so that jobs are no longer picked up simultaneously by the API servers. This change improves operational stability and reliability.
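The report does not describe the actual configuration change. As an illustration only, one common way to keep two servers from picking up the same job is an atomic claim step: each server tries to claim the job, and only the first claim succeeds. The sketch below simulates this in a single process; `JobClaims`, `try_claim`, `simulate`, and the server names `api-1` and `api-2` are hypothetical and are not taken from Eptura's systems.

```python
import threading


class JobClaims:
    """Hypothetical in-memory claim registry (illustration only).

    A real deployment would need a store shared by all servers, such as a
    database row lock or a queue lease; that part is an assumption here.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._owners = {}  # job_id -> name of the server that claimed it

    def try_claim(self, job_id, server):
        """Atomically claim job_id; return True only for the first caller."""
        with self._lock:
            if job_id in self._owners:
                return False
            self._owners[job_id] = server
            return True


def run_job_once(claims, job_id, server, executed):
    # Only the server that wins the claim records itself as the executor.
    if claims.try_claim(job_id, server):
        executed.append(server)


def simulate(servers=("api-1", "api-2"), job_id="service-job"):
    """Let every server race for the same job; return who executed it."""
    claims = JobClaims()
    executed = []
    threads = [
        threading.Thread(target=run_job_once, args=(claims, job_id, s, executed))
        for s in servers
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return executed
```

In this sketch, `simulate()` always returns a list with exactly one server name, whichever server reached the claim first.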

Timeline:

All times listed in UTC
9:40 a.m.: Monitoring alerts indicated that API services were not responding.

10:15 a.m.: Eptura restarted the API services and restored functionality.

12:06 p.m.: An incident was reported, prompting Eptura to start the investigation.

1:33 p.m.: Eptura updated the Visitor status page to reflect the incident and investigation.

3:00 p.m.: Eptura restarted the services, but the issue persisted. Eptura engaged Microsoft’s Network team to aid the investigation.

4:00 p.m.: Eptura deployed additional API nodes and restarted services to resolve the issue.

4:18 p.m.: System performance remained impacted due to a backlog of pending API requests.

6:30 p.m.: Eptura confirmed that the backlog had cleared and resumed monitoring.

5:30 a.m.: Monitoring alerts indicated that API services were not responding.

7:15 a.m.: The Eptura team restarted the services on the API nodes to resolve the issue and resumed monitoring.

7:44 p.m.: Following a successful monitoring period, Eptura marked the incident as resolved.

Total Duration of Event:
8 hours 35 minutes

Preventive Action:
We have enhanced our infrastructure by provisioning additional servers to efficiently manage the daily requests. Furthermore, the Eptura CloudOps team has implemented advanced internal monitoring systems. These systems are designed to better monitor errors and dropped connections from a load balancer perspective, enabling us to take swift and effective action whenever necessary. This proactive approach ensures a smoother and more reliable service experience.
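The report does not say how the new monitoring is implemented. As a rough sketch under stated assumptions, a check on dropped connections from a load balancer perspective could track the share of dropped connections over a sliding window of recent samples and flag when it crosses a threshold. `DropRateMonitor`, the window size, and the 5% threshold are hypothetical values chosen for illustration.

```python
from collections import deque


class DropRateMonitor:
    """Hypothetical sliding-window check on dropped connections."""

    def __init__(self, window=5, threshold=0.05):
        # Keep only the most recent `window` samples.
        self.samples = deque(maxlen=window)
        self.threshold = threshold

    def record(self, total, dropped):
        """Record one sample: total and dropped connections in an interval."""
        self.samples.append((total, dropped))

    def drop_rate(self):
        """Return dropped / total across the window (0.0 if no traffic)."""
        total = sum(t for t, _ in self.samples)
        dropped = sum(d for _, d in self.samples)
        return dropped / total if total else 0.0

    def should_alert(self):
        # Alert only when the windowed drop rate exceeds the threshold.
        return self.drop_rate() > self.threshold
```

Averaging over a window rather than alerting on a single interval is a design choice that reduces noise from brief spikes, at the cost of reacting slightly later.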

Posted Sep 11, 2024 - 16:24 UTC

Resolved

This incident has been resolved.

Posted Aug 14, 2024 - 17:44 UTC

Update

Standard application monitoring identified that one of the queues within the application was growing from 5:10 UTC to 7:20 UTC. During this time, some customers may have observed elevated dashboard response times.

Eptura Cloud Operations implemented mitigations and we continue to monitor.

Posted Aug 14, 2024 - 11:42 UTC

Update

We will continue to monitor until 8 a.m. CST to ensure all logs and processing are clear. Thank you for your understanding.

Posted Aug 13, 2024 - 23:25 UTC

Update

We are continuing to monitor to ensure the fix has mitigated the impact experienced. We will leave the monitoring window open for another 2 hours.

Posted Aug 13, 2024 - 21:02 UTC

Monitoring

Eptura Engineering has identified and implemented a fix as of 12:30pm CST. As we have not seen further service disruptions, we are moving to a Monitoring status for the next 2 hours.

Posted Aug 13, 2024 - 18:44 UTC

Update

Our Engineering & Infrastructure teams are investigating an issue related to intermittent gateway timeouts, which is impacting response times in the admin dashboard. Investigation into the underlying root cause of the timeouts is ongoing, with both Eptura & Microsoft teams engaged.

Posted Aug 13, 2024 - 15:47 UTC

Investigating

We are currently investigating an issue with the admin dashboard.

Posted Aug 13, 2024 - 12:33 UTC

This incident affected: Dashboard.