Elevated response times in Proxyclick admin dashboard

Incident Report for Eptura Visitor

Postmortem

We are grateful for your continued support and loyalty. We value your feedback and appreciate your patience as we worked to resolve this incident.

Type of Event:
S2 – Visitor (PXC) API and Dashboard Performance Issues

Services/Modules Impacted:
Visitor Dashboard, web application, and API

Root Cause:
We're always working to keep our systems running smoothly, and part of that includes a weekly cleanup job that removes unused documents. This job processes up to 10,000 documents at a time and is triggered by a scheduler, which sends an event to our application and expects an acknowledgment in return.

During the run on 06 Oct 2024 (listed in the timeline as HardDeleteDocuments), the job completed successfully and was marked as completed in our database. However, the acknowledgment for the triggering event was not delivered, so the event was retried repeatedly. The repeated retries caused intermittent database deadlocks, which slowed the dashboard and API. Our team identified the issue and resolved it by restarting the database.
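For readers who want to see the mechanism, the following is a minimal Python sketch of how a lost acknowledgment can cause a completed job to run again under at-least-once event delivery. It is an illustration only and not Eptura's implementation: the function name `run_cleanup_event`, the `max_deliveries` bound, and the status values are all assumptions made for the example.

```python
def run_cleanup_event(ack_delivered: bool, max_deliveries: int = 5) -> dict:
    """Simulate one scheduler event under at-least-once delivery.

    max_deliveries only bounds this simulation (hypothetical parameter);
    the retry loop described in the report had no such limit.
    """
    job_status = None      # status the job records when it finishes
    executions = 0         # how many times the cleanup actually ran
    acknowledged = False   # whether the scheduler received the ack
    deliveries = 0

    while not acknowledged and deliveries < max_deliveries:
        deliveries += 1
        executions += 1               # the job runs again on every redelivery
        job_status = "completed"      # and is marked completed each time
        acknowledged = ack_delivered  # a lost ack leaves the event pending

    return {
        "status": job_status,
        "executions": executions,
        "acknowledged": acknowledged,
    }


if __name__ == "__main__":
    print(run_cleanup_event(ack_delivered=True))
    print(run_cleanup_event(ack_delivered=False))
```

With the acknowledgment delivered, the job runs once. With the acknowledgment lost, the job is marked completed but keeps running until the simulated delivery limit is reached, which mirrors the retry loop described above.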

Remediation:
To address the immediate problem, customers were notified and the database was restarted, which cleared the deadlocks and stopped the retry loop.

Timeline:

All times listed are in CEST (UTC+2).

06 Oct 2024

10:00 p.m.: HardDeleteDocuments job executed

07 Oct 2024

07:23 a.m.: Deadlock errors were first logged in the database. Ten deadlock errors were logged over the business day, a number that was not considered alarming.

08 Oct 2024

02:18 a.m.: Deadlock errors started to show in the database again.

07:40 a.m.: An incident was reported, prompting Eptura to begin its investigation.

09:46 a.m.: Eptura updated the Visitor status page to reflect the incident and investigation.

03:00 p.m.: The Eptura Infra team identified the root cause of the issue and recommended restarting the database.

03:27 p.m.: The Eptura team updated the status page to notify customers of the root cause and the planned database restart, including the expected 30-minute downtime.

03:41 p.m.: The Eptura team restarted the database and performed sanity checks on the app.

04:00 p.m.: The Eptura team updated cases to ask customers for initial feedback on the issue.

04:27 p.m.: The Eptura team updated the status page to confirm that the issue was resolved and moved the incident to monitoring.

09 Oct 2024

03:48 p.m.: Following a successful monitoring period, Eptura marked the incident as resolved.

Total Duration of Event:
6 hours 47 minutes

Preventive Action:

Eptura is enhancing its monitoring to better track job executions. We are setting up alerts on job executions and reviewing our retry mechanism to implement stronger safeguards, such as limited retries or a dead letter queue, to prevent a recurrence.
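As an illustration of the "limited retries or a dead letter queue" idea, here is a minimal Python sketch. It is not the design Eptura has chosen: `process_with_dlq`, `MAX_ATTEMPTS`, `flaky_handler`, and the event names are hypothetical, and a production system would more likely rely on the retry and dead-letter features of its message broker.

```python
from collections import deque

MAX_ATTEMPTS = 3  # hypothetical retry limit chosen for the example


def process_with_dlq(events, handler, max_attempts=MAX_ATTEMPTS):
    """Run handler on each event, retrying failures a limited number of times.

    Events that still fail after max_attempts are moved to a dead letter
    list for manual review instead of being retried forever.
    Returns (processed, dead_letters).
    """
    queue = deque((event, 1) for event in events)
    processed, dead_letters = [], []

    while queue:
        event, attempt = queue.popleft()
        try:
            handler(event)
            processed.append(event)
        except Exception as exc:
            if attempt >= max_attempts:
                # Give up: park the event so it cannot loop indefinitely.
                dead_letters.append(
                    {"event": event, "attempts": attempt, "error": str(exc)}
                )
            else:
                # Requeue at the back with an incremented attempt counter.
                queue.append((event, attempt + 1))

    return processed, dead_letters


def flaky_handler(event):
    """Hypothetical handler that always fails for one specific event."""
    if event == "cleanup-ack-lost":
        raise RuntimeError("acknowledgment not delivered")


if __name__ == "__main__":
    done, dead = process_with_dlq(["cleanup-ok", "cleanup-ack-lost"], flaky_handler)
    print(done)
    print(dead)
```

The key design choice is that the attempt counter travels with the event, so a persistent failure ends in the dead letter list after a fixed number of tries rather than generating load on the database indefinitely.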

Additionally, we're planning a broader re-architecture in early 2025 to ensure smooth scaling of our application. We're committed to continuous improvement and excited to deliver an even better experience for you!

Posted Oct 25, 2024 - 09:32 UTC

Resolved

We are pleased to share that this incident is resolved as we have confirmed full restoration of admin response time performance.
We will publish our root cause analysis findings on this incident within 10 business days.

Posted Oct 09, 2024 - 05:18 UTC

Update

The systems are functioning normally. We shall continue to monitor the situation for an extended period of time.

Next update: 03:30 US Central Time on October 9, 2024.

Posted Oct 08, 2024 - 16:01 UTC

Monitoring

The system is operational now. We shall continue to monitor the situation.

Next update: 11:00 US Central Time.

Posted Oct 08, 2024 - 14:27 UTC

Update

We believe we have identified the root cause of the issue. The impact is currently intermittent. To resolve the issue, we will be restarting the database service at 15:30 CEST. This will result in an outage of approximately 30 minutes. We expect the service to be fully restored after the database restart.

During this period, our services will be temporarily unavailable. We apologize for any inconvenience this may cause and appreciate your understanding and patience as we work to improve our systems.

Posted Oct 08, 2024 - 13:27 UTC

Update

We are continuing to investigate this issue.

Posted Oct 08, 2024 - 13:23 UTC

Update

Our Infra team is actively working to determine the root cause of the disruption and assess its impact.

We will provide the next update in 4 hours.

Thank you for your patience as we work to resolve this issue.

Posted Oct 08, 2024 - 11:04 UTC

Update

Our Infra team is currently investigating the issue to identify the cause and take remedial actions.

Next update in 2 hours.

Posted Oct 08, 2024 - 08:57 UTC

Investigating

We are currently investigating reports of increased loading times and slow response in the Proxyclick admin dashboard.

Posted Oct 08, 2024 - 07:46 UTC

This incident affected: Dashboard, iPad app, Browser-based Kiosk, SMS, Mail, Slack, Skype for Business, API, Webhooks, Box integration, Dropbox integration, OneDrive integration, Calendar integration, Salesforce integration, Exports, and Proovr.