Outage Postmortem – 18th October 2016
Starting on Friday, the 14th of October 2016, we experienced a succession of increasingly severe network failures that culminated in a near-total service outage by 01h30 UTC on Tuesday the 18th.
As a result many of our customers were unable to access the Trigger.IO website, Toolkit or Build servers during this time.
The failure first became visible on Friday afternoon when we received a support email from a customer who was unable to perform builds. At the time our monitoring system showed no problems and the customer replied to my initial queries that the problem had resolved itself.
I was puzzled but ended up writing the incident off as a brief routing problem on the wider Internet and did not think about it again until the emails started flooding in late Monday night.
At this point in time our monitoring system was still reporting 100% uptime and I had no difficulty accessing any of our infrastructure from the office.
This was about to change.
At 01h30 UTC I observed a rapid rise in the failover graph followed by most of the individual server monitors turning red.
By 03h00 UTC we discovered that a network router responsible for one of our server instances had failed and we took that region offline.
Recovery was almost immediate but we still had no clear picture of what had happened or the full extent of downtime experienced by our customers.
- We were contacted by 12 different customers who experienced an outage during this period.
- The shortest period reported was 1 hour starting around 02h00 UTC.
- The longest period reported was an estimated 3-4 days starting late Saturday or early Sunday.
- 5 of the affected customers are based in California, USA.
- 3 of the affected customers are based in other parts of the USA.
- 1 of the affected customers is based in Germany.
- 3 of the affected customers are based in the United Kingdom.
1) For an unknown period of time, a network router in our West Coast (USA) data center region had been experiencing intermittent hardware failures.
2) Specifically, the router's BGP table was being corrupted, which prevented it from routing traffic originating from some regions of the Internet.
3) Around Friday the router's failure rate began to increase, and by 01h30 UTC on Tuesday morning it was cycling rapidly between failure and recovery.
4) This fed our failover system conflicting information about server availability, triggering a cascade of increasingly rapid traffic switchovers between server regions and, ultimately, a global service outage.
5) Because our monitoring system was not on a network affected by the router in question, we did not detect the failure until it had already become a global outage.
6) Because our monitoring system checked host uptime rather than the availability of individual API endpoints, we were missing the metrics that could have surfaced the routing failures earlier.
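To make the distinction in point 6 concrete, here is an illustrative sketch (not our actual monitoring code): a host-level check only proves the machine accepts connections, while an endpoint check exercises the full routing and application path, so an upstream routing failure shows up as an error.

```python
import socket
import urllib.request

def host_is_up(host, port=443, timeout=5):
    # Host-level check: only proves the machine accepts TCP connections.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def api_is_healthy(url, timeout=5):
    # Endpoint check: exercises DNS, routing, and the application itself,
    # so a BGP-style routing failure upstream surfaces as an error here.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:  # URLError and HTTPError are both OSError subclasses
        return False
```

A monitor built only on the first function can report 100% uptime while the second function is failing for entire regions of the Internet, which is exactly the blind spot we had.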
A big thank you to all the Trigger.IO customers who provided information that helped us to reconstruct what happened.
The following steps have been taken to ensure this does not happen again:
1) The faulty router hardware has been replaced.
2) We’ve added API availability checks to our failover system in addition to host checks.
3) We’ve amended our failover system to take a hysteresis factor into account before performing switchovers.
4) We’ve deployed a second monitoring service, not hosted on our infrastructure provider’s networks, that monitors our API endpoints from multiple international regions.
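To illustrate the hysteresis change in point 3, here is a minimal sketch (the class name and threshold are illustrative, not our production implementation): requiring several consecutive consistent readings before switching means a flapping router can no longer trigger a cascade of rapid switchovers.

```python
class FailoverDecider:
    """Only fail over after `threshold` consecutive failed checks,
    and only switch back after `threshold` consecutive successes.
    This damps the flip-flopping a flapping router can cause."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failed_over = False
        self._streak = 0  # consecutive readings disagreeing with current state

    def record(self, healthy):
        # A reading that agrees with the current state resets the streak;
        # a disagreeing reading extends it.
        disagrees = healthy == self.failed_over
        self._streak = self._streak + 1 if disagrees else 0
        if self._streak >= self.threshold:
            self.failed_over = not self.failed_over
            self._streak = 0
        return self.failed_over
```

With a threshold of 3, a single spurious "down" reading (or a single spurious "up" during an outage) changes nothing; only a sustained, consistent signal moves traffic.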
I know you rely on Trigger.IO to get work done and I’m sorry for the time you lost while we were down.
A strong uptime record is no excuse for ignoring the hard lessons other companies in our field have already learnt.
If you have any questions about the outage or would like to talk about what happened, please get in touch with me directly at firstname.lastname@example.org.
All the best,
General Manager @ Trigger.IO