When the network fails
Problems with storage systems, particularly cloud network storage like Amazon EBS or a shared NAS, are notorious for causing complex outages which take a long time to resolve. I liken this to commercial airline failures – because of the levels of redundancy, it’s never a single, simple cause but rather a combination of cascading system failures which cause the actual outage, slow down initial diagnosis and delay complete recovery.
However, at least with storage failures you can clearly see that something is broken and eventually gather enough information to work out what happened. That's often not the case with networking issues. They can be transient, which makes it difficult to collect sufficient diagnostic information before the condition resolves itself. We often see customers e-mailing us after receiving "no data received" alerts which have already cleared by the time we can request traceroutes (part of the reason we set a minimum trigger threshold of 5 minutes for these kinds of alerts).
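To illustrate the idea behind that threshold, here is a minimal sketch (not our actual implementation, and the class and constant names are made up for the example): an alert only fires once the "no data received" condition has persisted for 5 minutes, so a transient network blip that clears on its own never pages anyone.

```python
import time

TRIGGER_THRESHOLD = 5 * 60  # seconds the condition must persist before alerting

class NoDataAlert:
    """Hypothetical sketch of a minimum trigger threshold for transient conditions."""

    def __init__(self, threshold=TRIGGER_THRESHOLD):
        self.threshold = threshold
        self.condition_started = None  # when we first stopped receiving data

    def update(self, last_payload_age, now=None):
        """last_payload_age: seconds since the monitored device last reported in."""
        now = now if now is not None else time.time()
        if last_payload_age < 60:
            # Data is flowing again; clear any pending alert.
            self.condition_started = None
            return False
        if self.condition_started is None:
            # Condition just appeared; start the clock rather than alerting immediately.
            self.condition_started = now
            return False
        # Only alert once the condition has persisted past the threshold.
        return (now - self.condition_started) >= self.threshold
```

The trade-off is slightly slower notification of a real outage in exchange for far fewer false alarms from conditions that resolve themselves within a few minutes.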
Human error can also cause major problems. Publishing incorrect routing announcements which are then blindly accepted and propagated by other networks can happen accidentally or deliberately, blocking access to whole IP ranges worldwide (e.g. Pakistan Telecom knocking YouTube offline globally in 2008, or a Chinese ISP hijacking a large number of routes in 2010).
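The reason a single bad announcement can spread so far is that routers prefer the most specific matching prefix. The toy sketch below (not real router code) shows longest-prefix matching choosing a more specific /24 over the legitimate /22 that covers the same addresses, which is essentially what happened in the 2008 YouTube incident; the prefixes here are illustrative.

```python
import ipaddress

# Each prefix maps to the AS announcing it. The /24 is more specific than the /22.
routes = {
    ipaddress.ip_network("208.65.152.0/22"): "legitimate origin AS",
    ipaddress.ip_network("208.65.153.0/24"): "hijacking AS (more specific)",
}

def best_route(addr):
    """Return the announcement a router would follow for this address."""
    addr = ipaddress.ip_address(addr)
    matches = [net for net in routes if addr in net]
    if not matches:
        return None
    # Longest-prefix match: the most specific network wins.
    return routes[max(matches, key=lambda net: net.prefixlen)]

print(best_route("208.65.153.10"))  # -> "hijacking AS (more specific)"
```

Because both announcements cover the same addresses, traffic simply follows the more specific one until the bogus route is withdrawn or filtered.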
In November last year we saw an issue where certain EU traffic to our service was being routed to our Washington data centre via Asia and the US West coast instead of across the Atlantic. This was caused by incorrect routing announcements to certain ISPs, which took our hosting provider, Softlayer, 5 hours to track down.
GitHub had significant downtime just before Christmas due to a complex networking failure scenario. Around the same time, a frontend aggregate switch pair got stuck in a looping condition at our Softlayer San Jose data centre, causing all our servers there to lose public network connectivity (this is our failover data centre, so there was no customer impact).
Networking is, of course, just another component you need to ensure has no single point of failure, but the level of complexity and the reliance we place on networking mean that when something does break, it will likely cause major disruption. This is where a multi-data centre deployment strategy comes in, and we'll be writing about our own implementation in the coming months.
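As a rough sketch of the principle (the hostnames and ports below are hypothetical, and in practice failover is usually driven by DNS or anycast rather than application code), the idea is simply to health-check each location independently and send traffic to the first one that responds:

```python
import socket

DATA_CENTRES = [
    ("wdc.example.com", 443),  # primary (Washington)
    ("sjc.example.com", 443),  # failover (San Jose)
]

def healthy(host, port, timeout=2):
    """A location counts as healthy if we can open a TCP connection to it."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def pick_data_centre():
    """Return the first data centre that passes the health check."""
    for host, port in DATA_CENTRES:
        if healthy(host, port):
            return host
    raise RuntimeError("no healthy data centre available")
```

The important property is that no single network path, switch or routing announcement sits between all users and all copies of the service.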