Proxyclick by Eptura Detailed Root Cause Analysis (RCA) – S2 Event 2023-02-13

At 16:43 UTC on February 13, 2023, Proxyclick started to receive reports that some clients were unable to load the Visitor Dashboard, either partially or totally. Engineers quickly identified the issue as originating at one API node within our service cluster and restarted the impacted node. However, the issue persisted. Further investigation showed the underlying issue to be a front-end server which was not routing through our Dashboard load balancer due to misconfigured DNS entries. The DNS record was updated and impacted nodes restarted again, which restored normal operations for users being served by the misconfigured application server.
Type of Event: Service Disruption
Services Impacted: Webapp Dashboard
Remediation: Impacted service nodes were restarted after a configuration fix was applied.
Timeline of Events:
16:40 UTC: Internal routing from the Dashboard Application Servers was restricted to pass only through the API Load Balancer.
16:43 UTC: First reports received from client users.
16:50 UTC: DevOps identified the issue as limited to one Application Server and restarted the node, briefly alleviating the symptoms.
17:02 UTC: Support reported that the issue persisted. DevOps and Engineering began a series of restarts while researching the underlying cause.
17:36 UTC: DevOps identified the root cause as a hardcoded DNS entry that bypassed the API Load Balancer and had not been removed from one Application Server after the service migration on January 15, 2023, combined with the internal routing change made at 16:40. The DNS entry was removed and the Application Server restarted, restoring full service.
Duration: 56 Minutes
Groups Involved in the Event: Support, DevOps, Engineering
Root Cause Analysis: The root cause of this Service Disruption, identified during the incident, was a hardcoded DNS entry put in place for testing during the January 15, 2023 service migration. The entry forced one Application Server to bypass the API Load Balancer when processing user requests. When internal routing was changed to permit traffic to reach the API only through the Load Balancer, that Application Server could no longer reach the Proxyclick API.
Preventative Action and Analysis: Additional checks have been added to our procedures for hardcoded changes to traffic routing. These checks ensure that every such change is either rolled back or centrally documented for as long as it is needed, so that subsequent configuration updates are not made without awareness of the original change.
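As an illustration only, a check of this kind could be automated along the following lines. This is a minimal sketch, not Proxyclick's actual procedure. It assumes the hardcoded DNS entry lived in a hosts-style file, which the report does not state, and the function name, hostnames, and addresses are all hypothetical. It flags every non-loopback override that is not on a centrally documented list.

```python
import ipaddress


def find_undocumented_overrides(hosts_text, documented_hostnames):
    """Return (ip, hostname) pairs from hosts-style text that are not documented.

    Assumption: overrides are stored in hosts-file format ("IP name [name...]").
    Loopback entries are ignored because they are standard defaults.
    """
    findings = []
    for raw_line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) < 2:
            continue
        ip_text, hostnames = fields[0], fields[1:]
        try:
            ip = ipaddress.ip_address(ip_text)
        except ValueError:
            continue  # Not a parseable address; skip the malformed line.
        if ip.is_loopback:
            continue
        for hostname in hostnames:
            if hostname.lower() not in documented_hostnames:
                findings.append((ip_text, hostname))
    return findings


# Hypothetical sample data, not taken from the incident.
SAMPLE = """\
127.0.0.1   localhost
::1         localhost
10.0.0.12   api.internal.example   # left over from migration testing
10.0.0.20   metrics.internal.example
"""

if __name__ == "__main__":
    documented = {"metrics.internal.example"}
    for ip, host in find_undocumented_overrides(SAMPLE, documented):
        print(f"undocumented override: {host} -> {ip}")
```

Run before a routing change, a check like this would surface leftover test entries on any server while the change can still be adjusted.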