We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.
Type of Event:
S2 – Visitor (PXC) API and Dashboard Performance Issues
Services/Modules Impacted:
Visitor Dashboard, web application, and API
Root Cause:
As part of routine maintenance, we run a weekly cleanup job that removes unused documents. The job processes up to 10,000 documents at a time and is managed by a scheduler that communicates with our system via an event.
On 06 Oct 2024, the job completed successfully and was marked as completed in our database. However, the acknowledgment of the scheduler's event was not delivered, so the event was retried repeatedly. The repeated retries caused database deadlocks, which persisted until our team identified the issue and resolved it with a database restart.
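The failure mode described above can be illustrated with a small toy model. This is a sketch under stated assumptions, not Eptura's implementation: `ToyScheduler`, `CleanupJob`, and the `skip_if_completed` flag are hypothetical names, and the model assumes that each redelivered event re-runs the job. It shows how a missing acknowledgment multiplies executions, and how a completion check in the handler would limit the work to a single run.

```python
from dataclasses import dataclass


@dataclass
class CleanupJob:
    """Hypothetical stand-in for the weekly cleanup job."""
    skip_if_completed: bool = False  # assumed safeguard, not described in the report
    completed: bool = False          # models the "completed" marker in the database
    runs: int = 0                    # how many times the delete work actually ran

    def __call__(self) -> None:
        if self.skip_if_completed and self.completed:
            return  # already done: a redelivered event does no further work
        self.runs += 1  # stands in for deleting a batch of up to 10,000 documents
        self.completed = True


@dataclass
class ToyScheduler:
    """Hypothetical at-least-once scheduler: redelivers until it sees an ack."""
    max_deliveries: int = 5  # cap exists only so this toy loop terminates
    deliveries: int = 0

    def dispatch(self, handler, ack_delivered: bool) -> None:
        while self.deliveries < self.max_deliveries:
            self.deliveries += 1
            handler()
            if ack_delivered:
                return  # acknowledgment received, stop redelivering
```

With `ack_delivered=False`, a plain `CleanupJob()` runs on every delivery, while `CleanupJob(skip_if_completed=True)` runs once and ignores the redeliveries. The scheduler still retries in both cases, which is why a retry limit is a separate concern.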
Remediation:
To address the immediate problem, customers were notified and the database was restarted, which cleared the deadlocks and stopped the retry loop.
Timeline:
All times listed in CEST
06 Oct 2024
10:00 p.m.: HardDeleteDocuments job executed
07 Oct 2024
07:23 a.m.: Deadlock errors were logged in the database. A total of 10 deadlock errors were logged throughout the business day, a number that was not considered alarming.
08 Oct 2024
02:18 a.m.: Deadlock errors started to show in the database again.
07:40 a.m.: An incident was reported, prompting Eptura to initiate the investigation.
09:46 a.m.: Eptura updated the Visitor status page to reflect the incident and investigation.
03:00 p.m.: The Eptura Infra team identified the root cause of the issue and recommended restarting the database.
03:27 p.m.: The Eptura team updated the status page to notify customers of the root cause, the planned database restart, and the associated 30-minute downtime.
03:41 p.m.: The Eptura team restarted the database and performed sanity checks on the app.
04:00 p.m.: The Eptura team updated cases to ask customers for initial feedback on the issue.
04:27 p.m.: The Eptura team updated the status page to confirm that the issue was resolved and that the incident had moved to monitoring.
09 Oct 2024
03:48 p.m.: Following a successful monitoring period, Eptura marked the incident as resolved.
Total Duration of Event:
6 hours 47 minutes
Preventive Action:
Eptura is enhancing its monitoring to better track job executions. We are setting up alerting for these jobs and reviewing our retry mechanism so that we can implement stronger safeguards, such as limited retries or a dead letter queue, to prevent a recurrence.
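To make the idea of limited retries with a dead letter queue concrete, here is a minimal sketch. It is an illustration under stated assumptions, not the design Eptura will ship: the `deliver` function, the `MAX_ATTEMPTS` value of 3, and the in-memory `deque` standing in for a dead letter queue are all hypothetical.

```python
from collections import deque
from typing import Callable

MAX_ATTEMPTS = 3  # assumed retry limit, chosen only for illustration


def deliver(
    event: dict,
    handler: Callable[[dict], bool],
    dead_letters: deque,
    max_attempts: int = MAX_ATTEMPTS,
) -> bool:
    """Deliver an event at most max_attempts times.

    The handler returns True when it acknowledges the event. If no
    acknowledgment arrives within the limit, the event is parked in
    dead_letters for inspection instead of being retried indefinitely.
    """
    for _ in range(max_attempts):
        if handler(event):
            return True  # acknowledged, no further delivery needed
    # Retry budget exhausted: record the event rather than looping forever.
    dead_letters.append({"event": event, "attempts": max_attempts})
    return False
```

For example, calling `deliver({"job": "HardDeleteDocuments"}, lambda e: False, dlq)` with an empty `dlq = deque()` stops after three attempts and leaves one entry in `dlq`. An alert on a non-empty dead letter queue would then surface the problem without the retries continuing to load the database.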
Additionally, we're planning a broader re-architecture in early 2025 to ensure smooth scaling of our application. We're committed to continuous improvement and excited to deliver an even better experience for you!