Proxyclick by Eptura Detailed Root Cause Analysis (RCA) – S1 Event 2023-02-15

On February 15, 2023, at 15:33 UTC, Proxyclick began receiving reports that users were unable to access the application. The Engineering and DevOps teams isolated an issue with the Web Application servers and restarted them, restoring service.
Type of Event: Service Disruption
Services Impacted: Proxyclick Web Application
Remediation: DevOps and Engineering restarted the impacted service hosts, restoring normal operation.
Timeline of Events:
15:33 UTC - First reports received by Support
15:35 UTC - DevOps and Engineering begin investigation
15:42 UTC - DevOps restarts impacted service hosts
15:44 UTC - Service host restart completes and normal operations resume
Total Duration: 11 Minutes
Groups Involved in the Event: Support, DevOps, Engineering
Root Cause Analysis: A primary service host for the Proxyclick Web Application crashed due to an out-of-memory (OOM) exception and was not automatically restarted. The cause of the OOM exception was identified as a memory leak in the application. The leak had previously escaped notice because the service hosts are restarted frequently during regular product update deployments. After the service migration event on January 15, 2023, the gap between releases was longer than normal, which gave the leak enough time to consume all available memory on the service host.
Preventative Action and Analysis: DevOps has added health monitoring to the Proxyclick Load Balancer infrastructure to detect service hosts that fail due to memory limits and remove them from the pool. In addition, a self-healing trigger attached to this health check automatically brings the failed host back into service, maintaining high availability (HA) and load capacity.
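The health check and self-healing behavior described above can be illustrated with a minimal sketch. It is not Proxyclick's implementation: the `Host` type, the 90% memory threshold, and the `simulated_restart` helper are all hypothetical, and the report does not name the load balancer or the tooling used.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical threshold: fraction of host memory in use at which a host is
# treated as failing. The report does not state the real limit.
MEMORY_LIMIT_FRACTION = 0.90


@dataclass
class Host:
    name: str
    memory_used_fraction: float  # 0.0 to 1.0, from a hypothetical metrics agent
    responding: bool = True
    in_pool: bool = True


def is_healthy(host: Host, limit: float = MEMORY_LIMIT_FRACTION) -> bool:
    """A host is healthy if it responds and is below the memory limit."""
    return host.responding and host.memory_used_fraction < limit


def run_health_check(hosts: List[Host], restart: Callable[[Host], bool]) -> List[str]:
    """Remove failing hosts from the pool, restart them, and return them
    to the pool only if they are healthy after the restart."""
    actions = []
    for host in hosts:
        if is_healthy(host):
            continue
        host.in_pool = False
        actions.append(f"removed {host.name}")
        if restart(host) and is_healthy(host):
            host.in_pool = True
            actions.append(f"restored {host.name}")
    return actions


def simulated_restart(host: Host) -> bool:
    """Stand-in for a real restart: memory usage drops and the host responds."""
    host.memory_used_fraction = 0.10
    host.responding = True
    return True


if __name__ == "__main__":
    pool = [Host("web-1", 0.97), Host("web-2", 0.40)]
    for action in run_health_check(pool, simulated_restart):
        print(action)
```

Passing the restart step in as a callable keeps the pool logic separate from whatever mechanism actually restarts a host, so the decision logic can be exercised without touching real infrastructure.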
Engineering will investigate the memory leak to produce a patch that resolves the root issue permanently in a future release.