Proxyclick by Eptura Detailed Root Cause Analysis (RCA) – S1 Event 2023-03-28
On March 28, 2023, Proxyclick monitoring detected a degradation in service, which grew into a full API outage. Engineering and DevOps teams traced the issue to an update deployed earlier in the day and issued a hotfix, restoring service.
Type of Event:
Proxyclick Web Application
DevOps and Engineering created and deployed a hotfix, restoring normal operation.
Timeline of Events:
09:00 UTC – Scheduled platform update deployed
09:25 UTC – Automated monitoring detects S2 performance degradation and alerts Engineering and DevOps teams
09:31 UTC – Engineering and DevOps confirm the monitoring alert and begin incident management
09:34 UTC – First customer reports received by Support team
09:49 UTC – Incident impact reclassified as S1
10:53 UTC – Status Page updated to track incident
11:30 UTC – Engineering and DevOps teams identify the true cause of the incident and begin remediation
11:39 UTC – Engineering and DevOps teams complete deployment of the hotfix; services begin returning to normal operation
13:08 UTC – All impacts to service are mitigated and the incident is closed
Total Duration: 4 hours, 8 minutes
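The intervals implied by the timeline can be checked with a short calculation. The sketch below is illustrative only: the timestamps are copied from the timeline above, while the dictionary keys and helper names are hypothetical labels introduced for this example. It assumes the Total Duration is measured from the 09:00 UTC deployment to the 13:08 UTC closure, which matches the stated figure.

```python
from datetime import datetime

# Timestamps (UTC) copied from the timeline; the keys are shorthand labels
# invented for this sketch, not terms used by the incident process.
TIMELINE = {
    "deployed": "09:00",
    "detected": "09:25",
    "cause_identified": "11:30",
    "hotfix_deployed": "11:39",
    "closed": "13:08",
}


def minutes_between(start, end):
    """Return whole minutes elapsed between two labeled timeline events."""
    fmt = "%H:%M"
    delta = datetime.strptime(TIMELINE[end], fmt) - datetime.strptime(TIMELINE[start], fmt)
    return int(delta.total_seconds() // 60)


def as_hours_minutes(minutes):
    """Format a minute count as hours and minutes."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins:02d} min"


if __name__ == "__main__":
    for start, end in [
        ("deployed", "closed"),                  # total duration
        ("deployed", "detected"),                # time to detect
        ("detected", "cause_identified"),        # time to diagnose
        ("cause_identified", "hotfix_deployed"), # time to deploy the fix
        ("hotfix_deployed", "closed"),           # recovery period
    ]:
        print(f"{start} -> {end}: {as_hours_minutes(minutes_between(start, end))}")
```

On these figures, diagnosis (detection to identification of the cause) accounts for 2 h 05 min of the 4 h 08 min total, while deploying the hotfix once the cause was known took 9 minutes.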
Groups Involved in the Event:
Root Cause Analysis:
An update to the authentication logic for accessing the Proxyclick Web Application and APIs had been mistakenly marked as deployed to production in a previous release; it was in fact included for the first time in the scheduled deployment on March 28, 2023. Although this update worked without issue in all pre-production environments, it contained a code defect that failed under the higher load of the production environment.
The investigation into the cause of the outage was delayed by the incorrect deployment record: because the record showed the authentication update as already in production, the team initially excluded the authentication components of the application from the investigation and spent significant time reviewing the wrong logs and service components. Once the record was corrected, the teams found the underlying issue, then created and deployed a hotfix to bring the outage to a close.
Preventative Action and Analysis:
Engineering and QA teams will build better load scaling into pre-production testing to reduce the potential for similar issues in the future. Engineering and Product teams will review the deployment process to improve the accuracy of release records. Incident management processes will be updated to ensure all logging sources are checked earlier in the investigation.