Incident Description
Proxyclick experienced a service disruption on March 20th, 2019, from 15:00 UTC until 16:35 UTC.
Although it was possible to access the Dashboard, it was impossible to perform actions such as creating a new visit or checking visitors in and out. Some customers reported that visitors who registered their arrival or departure on the iPad app during the outage were not synchronized to the dashboard after the outage was resolved.
Once the outage was resolved, the auto-recovery mode brought the impacted systems back to a healthy state. No events from successful actions performed in our application before or after the disruption were lost.
Incident Window
The first alerts from our monitoring arrived at 15:00 UTC.
Full access to our services was restored at 16:35 UTC.
Root Cause
The trigger for the incident was a mass storage outage on the infrastructure supporting our messaging system. Actions performed in our application are persisted in the messaging system in order to guarantee their delivery to the underlying components. Without access to its storage, the messaging system started to reject new events, which caused our back-end to return errors.
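The failure mode described above can be illustrated with a minimal sketch. It is not Proxyclick's actual code: the class names, the `handle_action` function, and the HTTP-style status codes are all hypothetical, chosen only to show how a messaging layer that persists each event before acknowledging it will reject new events as soon as its storage becomes unavailable.

```python
class StorageUnavailable(Exception):
    """Raised when the backing store cannot accept writes."""


class InMemoryStore:
    """Hypothetical stand-in for the mass storage behind the messaging system."""

    def __init__(self):
        self.available = True
        self.records = []

    def append(self, record):
        if not self.available:
            raise StorageUnavailable("mass storage is offline")
        self.records.append(record)


class Broker:
    """Hypothetical messaging system that persists events before acknowledging them."""

    def __init__(self, store):
        self.store = store

    def publish(self, event):
        # The event is only accepted once it has been written to storage,
        # so a storage failure propagates to the caller as a rejection.
        self.store.append(event)
        return True


def handle_action(broker, event):
    """Hypothetical back-end handler: returns an HTTP-style status code."""
    try:
        broker.publish(event)
    except StorageUnavailable:
        return 503  # the action fails instead of being silently dropped
    return 201
```

With the store available, `handle_action` returns 201 and the event is recorded. Once `available` is set to `False`, the same call returns 503 and nothing new is written, which matches the behaviour of rejecting actions rather than losing accepted ones.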
Moving Forward
While our architecture is designed to be resilient against failures in individual host data centers, in this case an underlying flaw in the provider's high availability infrastructure design reduced the redundancy.
Proxyclick will liaise with its provider to follow up on the measures they will implement in order to prevent such a mass storage outage from happening again.
We will also explore other options to improve the resilience of our systems.
Finally, we will review the code of the iPad app to understand why it did not always synchronize the activity performed during this outage with the dashboard once the problem was fixed.
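For context on the synchronization behaviour under review, one common pattern for clients that must survive a back-end outage is a local outbox that is replayed in order once the back-end is reachable again. The sketch below is purely illustrative and assumes nothing about how the iPad app is implemented: the `Outbox` class and the `send` callable are hypothetical names, and `ConnectionError` stands in for whatever error the real client would see.

```python
from collections import deque


class Outbox:
    """Hypothetical client-side queue that keeps events until they are delivered."""

    def __init__(self, send):
        self._send = send        # callable that raises ConnectionError on failure
        self._pending = deque()  # events recorded but not yet delivered

    def record(self, event):
        """Store an event locally, whether or not the back-end is reachable."""
        self._pending.append(event)

    def flush(self):
        """Try to deliver pending events in order; return how many were delivered."""
        delivered = 0
        while self._pending:
            event = self._pending[0]
            try:
                self._send(event)
            except ConnectionError:
                break  # keep the event and everything after it for a later retry
            # Remove the event only after the send has succeeded.
            self._pending.popleft()
            delivered += 1
        return delivered

    def __len__(self):
        return len(self._pending)
```

In this sketch an event leaves the queue only after a successful send, so check-ins recorded during an outage remain pending until a later `flush` delivers them.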