Photo via labsji/Flickr
Sunday afternoon, around 4 p.m. Eastern, Instagram, Vine, Netflix, and Airbnb all snagged, slowed, and left their users waiting for log-in pages that never seemed to load. It sure ruined my plan to wrap up the weekend by posting a video of myself sending pictures to someone I might stay with in Seattle while watching Sleepless in Seattle.
How could such disparate services, owned by the Internet's heaviest hitters, all get hit at once? A preemptive strike from the Syrian Electronic Army? Anonymous putting its foot (their feet?) down to end brunch-related photography? Holiday Inn striking back?
It was none of these. It was an Amazon data center in Northern Virginia having a very public connectivity glitch. Amazon Web Services provides online storage and computing power for high- and low-profile companies, from Reddit to Maine’s Kennebec Journal. With five times the utilized computing capacity of its next 14 competitors combined, AWS also dominates public cloud computing.
According to Amazon updates during the glitch, the problem was with Elastic Block Store (EBS) volumes and EC2 instances, which were experiencing launch errors and network packet loss. There was also a glitch with the load balancers, but on the whole the problems were solved over the next four to five hours. Or as an AWS spokesperson said in an email:
Yesterday, from 12:51 PDT to 1:42 PM PDT, we experienced network packet loss, which caused a small number of EBS volumes in a single Availability Zone ("AZ") in US-East-1 to experience degraded performance, and a small number of EC2 instances to become unreachable in that same single AZ. The root cause was a "gray" (partial) failure with a networking device that caused a portion of the AZ to experience packet loss. The network issue was resolved and most volumes and instances returned to normal.
It was the second service disruption in as many weeks at Amazon. The server hiccup on August 19 felled Amazon.com, its Cloud Player, and Audible.com for half an hour, but left Netflix unaffected.
Amazon’s US-East region is its oldest and largest, and also, seemingly, its spottiest. In 2012, there were outages in June, and downtime in October and on Christmas Eve.
It raises the obvious question: Why would all these companies go through a giant, fallible, consolidated location? Mostly because companies believe it’s cheaper than doing all of that stuff in-house. You’re not paying to hire and train people to maintain that equipment for you; you don’t have to worry about maintaining the systems. You just write a check, and in exchange your data is supposed to flow safely, and without interruption, through one of these centers. Well-placed centers can even have natural, more energy-efficient cooling systems for the green-minded Silicon Valley company.
As the oldest, Amazon’s US-East has been its default data region, even though other facilities are newer and have less traffic. “If you’ve been in US-East for a while, chances are you’ve built up a substantial amount of data in that region. It’s not always easy to move data around depending on how the applications are constructed,” an industry exec, who’s put a lot of workloads in Amazon and did not want to be identified, told GigaOM in October 2012.
So it’s an iffy relationship, one the companies are bound to by habit. It’s not quite the decentralized, dynamic web of utopian visions. Instead it’s more like traffic: aging infrastructure getting snared in Northern Virginia.