Storage nodes on our internal service bus failed past our load-balancing threshold, causing severely degraded performance across all components, most visibly on our Dashboard and API. Storage was recycled and restored within a few minutes of the initial outage; however, the backlog of requests on the service bus concentrated on a small subset of our API nodes and caused a secondary overflow to our SQL cluster, which extended the incident duration. All storage and API nodes have been recycled and the queues processed, fully restoring service.
Our engineering teams will implement improved fail-over and queuing changes to further harden our infrastructure against this type of failure.
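To illustrate the kind of queuing change described above, the sketch below shows one way a per-node backlog cap with load shedding can keep a recovering subset of nodes from absorbing an entire backlog and overflowing downstream dependencies. This is a minimal, hypothetical example only; the names (handle_request, enqueue, QUEUE_LIMIT) and the Python producer/worker model are assumptions for illustration and do not reflect our actual service bus implementation.

```python
# Minimal sketch: bounded queuing with load shedding (hypothetical names).
import queue
import threading
import time

QUEUE_LIMIT = 100                    # hypothetical per-node backlog cap
work_queue = queue.Queue(maxsize=QUEUE_LIMIT)

def handle_request(item):
    """Placeholder for real request processing (e.g. a downstream SQL write)."""
    time.sleep(0.01)

def worker():
    while True:
        item = work_queue.get()
        if item is None:             # sentinel: shut this worker down
            work_queue.task_done()
            break
        try:
            handle_request(item)
        finally:
            work_queue.task_done()

def enqueue(item):
    """Reject work instead of letting the local backlog grow without bound.

    Shedding at the node pushes retries back to the load balancer, so a
    small subset of healthy nodes cannot be saturated by the full backlog.
    """
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        return False                 # caller retries against another node

if __name__ == "__main__":
    workers = [threading.Thread(target=worker, daemon=True) for _ in range(4)]
    for w in workers:
        w.start()

    accepted = sum(enqueue(i) for i in range(500))
    print(f"accepted {accepted} of 500 requests; the rest were shed for retry elsewhere")

    work_queue.join()
    for _ in workers:
        work_queue.put(None)
```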
Posted Jun 07, 2022 - 14:47 UTC
Monitoring
A fix has been implemented and we are monitoring the results.