Incident Description (user perspective)
Proxyclick experienced a service disruption on March 21st, 2019 from 17:45 UTC until 21:55 UTC. This was a highly unusual event, and the first time we have experienced an outage of this length.
Although it was possible to access and navigate the Dashboard, it was impossible to perform actions such as creating a new visit or checking your visitors in and out. The iPad app allowed visitors to complete check-ins, whether they had been preregistered or arrived as new, unexpected visitors, and badges were printed as normal. However, for the duration of the outage, no visitor data was successfully submitted to Proxyclick from the iPad App.
Although the user-facing impact of this incident was the same as that of the previous incident on March 20th, the root causes of the two incidents are different, as explained below.
The first alerts from our monitoring arrived at 17:45 UTC (1:45 pm EDT).
Full access to our services was restored at 21:55 UTC (5:55 pm EDT).
Root Cause Analysis
We use a messaging system as part of our infrastructure. Actions performed in our application are retained in the messaging system until receipt is confirmed to ensure their delivery to the underlying components.
Unlike most of our infrastructure which utilizes dedicated private servers, our messaging system is built on a hosted cloud infrastructure.
The messaging system is also geographically redundant across three separate data centers.
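To illustrate the retention behavior described above, here is a minimal in-memory sketch of acknowledgment-based delivery: a message is kept by the broker until the consumer confirms receipt, so a consumer failure cannot lose it. This is purely illustrative; names like `MessageBroker` and its methods are hypothetical and do not reflect our actual Apache Pulsar deployment.

```python
from collections import deque
import itertools

class MessageBroker:
    """Retains every message until the consumer confirms receipt."""

    def __init__(self):
        self._ids = itertools.count()
        self._pending = deque()   # published, not yet delivered
        self._unacked = {}        # delivered, awaiting acknowledgment

    def publish(self, payload):
        msg_id = next(self._ids)
        self._pending.append((msg_id, payload))
        return msg_id

    def receive(self):
        msg_id, payload = self._pending.popleft()
        self._unacked[msg_id] = payload   # retained until acknowledged
        return msg_id, payload

    def acknowledge(self, msg_id):
        del self._unacked[msg_id]         # receipt confirmed: safe to discard

    def redeliver_unacked(self):
        """If a consumer fails before acknowledging, requeue its messages."""
        for msg_id, payload in list(self._unacked.items()):
            self._pending.appendleft((msg_id, payload))
            del self._unacked[msg_id]

broker = MessageBroker()
broker.publish({"action": "check-in", "visitor": "Jane"})
msg_id, payload = broker.receive()
# Suppose the consumer crashes before acknowledging: the message is not lost.
broker.redeliver_unacked()
msg_id, payload = broker.receive()
broker.acknowledge(msg_id)   # only now may the broker drop the message
```

The key property is that the `acknowledge` step, not delivery itself, is what allows the broker to discard a message: until then, every action remains recoverable.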
Chronology of the events
As described in the incident report for that event, the messaging system experienced an initial outage on March 20th at 15:00 UTC due to a technical issue at our hosting provider that affected key components of the messaging system in all three data centers.
Later on the 20th, in parallel with our hosting partner’s efforts to restore the affected components, we began standing up a new instance of our messaging infrastructure to ensure we could resume functionality as quickly as possible. Before this new infrastructure was complete, engineers at our hosting partner restored the affected components and full access to our services returned. Efforts to stand up the new cluster were halted at that point, and we asked the hosting provider to remove the incomplete deployment.
On March 21st, while fulfilling that request, the hosting provider made an unfortunate human error and removed the production messaging system across all three data centers. This action bypassed all protections designed for normal technical issues and removed every portion of the redundant systems simultaneously. This was only possible because of the hosted nature of these systems, which used centralized project management spanning all three data centers.
The loss of the messaging system caused the core infrastructure to return communication errors to the front-end components. Users of the Dashboard were presented with error messages when attempting to complete any actions involving new data, while visitors checking in on the iPad were given no indication of faults.
Our engineers were notified promptly by our monitoring, and began following our documented recovery steps to deploy a new messaging system without delay.
Acquiring new hosted cloud infrastructure, provisioning new external IP addresses, updating DNS, and deploying from source all went according to plan. After testing and validating the redeployed cluster, we brought the messaging system back up in all three data centers approximately four hours after being notified.
Immediate Next Steps
1. Migration of the messaging system from hosted cloud infrastructure to dedicated private servers (as with nearly all of our remaining infrastructure). We expect to complete this over the next few days, which will significantly improve our protection against this type of incident.
2. Discussions with our hosting provider to understand how this error was allowed to happen, and ensure protections are in place to prevent any future such incidents. While dedicated servers mitigate a significant portion of the risk, we want to be absolutely certain we have accounted for any further possibilities.
3. Global review of all infrastructure to determine if any other hosting adjustments need to be made to improve reliability and robustness of our systems. Our initial reviews indicate that there are no further components that will need to be relocated.
4. Analysis of the iPad App to improve its detection of communication issues with our infrastructure. This will ensure that any future communication issues are treated similarly to when local internet connectivity is lost at the Kiosk, and visit updates (Check-Ins, Check-Outs, Deliveries) are cached locally until connectivity is restored.
5. Review of our long-term infrastructure plans, adjusting them based on the lessons and insights gained from these experiences.
6. Continued focus on security and availability of our systems, as demonstrated by our recent SOC 2 Type II certification. This includes utilization of modern, high performance platforms such as the Apache Pulsar engine used to build the messaging system that suffered outages this week, and investment in redundant, geographically separated infrastructure.
7. Pursuit of further improvements to our hosting design and disaster plans to better prevent outages, whether from human error or technological failure. This incident reminds us that data security and system availability are a continuous mindset and require ongoing investment, one we remain fully committed to making in order to better serve our customers.
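The local caching behavior described in step 4 above can be sketched as a simple offline queue: updates that fail to reach the backend are cached on the device and flushed, in order, once connectivity returns. This is a hypothetical outline only; the actual iPad App is not written in Python, and names such as `OfflineQueue` are illustrative.

```python
from collections import deque

class OfflineQueue:
    def __init__(self, send):
        self._send = send     # callable that submits one update upstream
        self._cache = deque() # updates waiting for connectivity

    def submit(self, update):
        """Try to submit immediately; cache locally on a communication error."""
        try:
            self._send(update)
        except ConnectionError:
            self._cache.append(update)

    def flush(self):
        """Retry cached updates in order once connectivity is restored."""
        while self._cache:
            try:
                self._send(self._cache[0])
            except ConnectionError:
                break         # still offline; keep the remaining updates
            self._cache.popleft()

# Usage: simulate an outage, then recovery.
online = False
sent = []

def send(update):
    if not online:
        raise ConnectionError("backend unreachable")
    sent.append(update)

queue = OfflineQueue(send)
queue.submit({"type": "check-in", "visitor": "Jane"})  # cached during the outage
online = True
queue.flush()                                          # delivered afterwards
```

Treating a backend communication failure the same way as lost local internet connectivity means a check-in is never discarded; it simply waits on the device until it can be delivered.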