Issues creating visits
Incident Report for Proxyclick
Postmortem

Incident Description

Proxyclick experienced a service disruption on March 20th, 2019 from 15:00 UTC until 16:35 UTC.

Although the Dashboard remained accessible, it was not possible to perform actions such as creating a new visit or checking visitors in and out. Some customers also reported that arrivals and departures registered on the iPad app during the outage were not synchronized back to the dashboard after the outage was resolved.

Once the outage was resolved, the auto-recovery mode brought the impacted systems back to a healthy state. No events from successful actions performed in our application before or after the disruption were lost.

Incident Window

The first alerts from our monitoring arrived at 15:00 UTC.

Full access to our services was restored at 16:35 UTC.

Root Cause

The incident was triggered by a mass storage outage on the infrastructure supporting our messaging system. Actions performed in our application are persisted in the messaging system to guarantee their delivery to the underlying components. Without access to its storage, the messaging system started rejecting new events, which caused our back-end to return errors.
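As a rough illustration of this failure mode, the sketch below models a back-end that reports success only once the messaging system has persisted the event. It is a simplified sketch and not Proxyclick's actual code: `DurableBroker`, `StorageUnavailable`, `create_visit`, and the status codes are hypothetical names chosen for the example.

```python
class StorageUnavailable(Exception):
    """Raised when the broker cannot persist an event (hypothetical)."""


class DurableBroker:
    """Hypothetical broker that accepts an event only once it is stored."""

    def __init__(self):
        self.storage_online = True
        self.persisted = []

    def publish(self, event):
        # Reject the event outright if it cannot be written to storage.
        if not self.storage_online:
            raise StorageUnavailable("cannot persist event")
        self.persisted.append(event)


def create_visit(broker, visitor):
    """Hypothetical back-end handler: succeeds only if the event was persisted."""
    try:
        broker.publish({"type": "visit.created", "visitor": visitor})
    except StorageUnavailable:
        # Surface the failure to the caller instead of silently dropping it.
        return 503
    return 201


broker = DurableBroker()
before = create_visit(broker, "Alice")   # storage healthy
broker.storage_online = False
during = create_visit(broker, "Bob")     # storage outage: request is rejected
broker.storage_online = True
after = create_visit(broker, "Carol")    # storage restored
```

In this model, requests made during the storage outage fail visibly, while every request that reported success remains persisted. That matches the behavior described above under the stated assumptions.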

Moving forward

While our architecture is designed to be resilient against failures in individual host data centers, in this case an underlying flaw in the provider's high availability infrastructure design reduced the redundancy.

Proxyclick will liaise with its provider to follow up on the measures they will implement in order to prevent such a mass storage outage from happening again.

We will also explore other options to improve the resilience of our systems.

Finally, we will review the code of the iPad app to understand why it did not always synchronize activity performed during the outage with the dashboard once the problem was fixed.
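For readers wondering what reliable resynchronization might look like in general, one common pattern is a client-side outbox: events are kept locally until the server acknowledges them, then replayed in order once connectivity returns. The sketch below is a generic illustration of that pattern under assumed names (`Outbox`, `send`, `server`). It does not describe how the iPad app is implemented, and it is written in Python only for brevity.

```python
import uuid


class Outbox:
    """Hypothetical client-side queue: events stay queued until acknowledged."""

    def __init__(self, send):
        self._send = send      # callable returning True when the server acknowledges
        self._pending = []

    def record(self, kind, visitor):
        # A client-generated id lets the server discard duplicates on replay.
        event = {"id": str(uuid.uuid4()), "kind": kind, "visitor": visitor}
        self._pending.append(event)
        return event

    def flush(self):
        """Deliver pending events in order; stop at the first failure."""
        while self._pending:
            if not self._send(self._pending[0]):
                break
            self._pending.pop(0)
        return len(self._pending)


server = {"up": False}   # simulated server availability
received = {}            # events the simulated server has stored, keyed by id


def send(event):
    if not server["up"]:
        return False
    received.setdefault(event["id"], event)  # idempotent on the event id
    return True


outbox = Outbox(send)
outbox.record("check_in", "Alice")
outbox.record("check_out", "Alice")
left_during_outage = outbox.flush()    # nothing delivered, both events kept
server["up"] = True
left_after_recovery = outbox.flush()   # both events replayed in order
```

The key design choice in this sketch is that an event leaves the queue only after an acknowledgement, so a failed attempt during an outage never discards it.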

Posted Mar 21, 2019 - 17:11 CET

Resolved
We continue to see success after monitoring for the last 3 hours. This incident is resolved. A Root Cause Analysis will be posted.
Posted Mar 20, 2019 - 20:53 CET
Monitoring
The issue at our provider has been fixed, and visits appear to have been created normally since then. We are monitoring the situation before closing this incident.
Posted Mar 20, 2019 - 17:39 CET
Identified
We're in contact with one of our providers to resolve the issue. Some visits are now created correctly, but others still fail.
Posted Mar 20, 2019 - 16:44 CET
Investigating
It is currently not possible to create visits. We're investigating and will post updates on this page.
Posted Mar 20, 2019 - 16:12 CET
This incident affected: Dashboard, iPad app, Browser-based Kiosk, and API.