In an exclusive interview with Motherboard, Twitter explains how its internal Command Center ensures heavy traffic and bugs don’t take down the site.
Two things were crystal clear coming out of the 2010 FIFA World Cup: one, Spain was absolutely ridiculous, and two, Twitter needed to get its act together.
"Twitter's performance has been shaky today, with many users complaining about fail whales and Twitter's status page acknowledging 'periodic high rates of errors,'" said veteran tech blogger Om Malik on June 11, 2010, just hours after the tournament kicked off in Johannesburg. "Those issues are likely caused by countless soccer fans weighing in about the World Cup."
I guess I was part of the problem that day.
"The World Cup in 2010—we saw the site do less than great during that time," said Twitter Staff Reliability Engineer Max Michaels in an interview earlier this week. Michaels, speaking exclusively to Motherboard, was explaining the genesis of the Twitter Command Center, the internal team that's responsible for identifying and fixing technical problems that may degrade the site's performance—ideally before any of the site's 320 million users ever see those problems in the first place.
Michaels is based in San Francisco. But the Command Center, which Twitter is speaking about publicly for the first time today, is made up of engineers who are stationed all over the world, working 24 hours per day, seven days per week in rotating shifts. "We're the mission control of Twitter," said Michaels. "Our job is to keep the lights on for all of our users."
It's the Command Center that helps ensure that your timeline refreshes the instant you pull-to-refresh, that your personalized Trends accurately reflect the pulse of the platform, and that your thoughts, however trivial, are published for all the world to see as soon as you tap "Tweet."
"Obviously there's the lunch tweets and pictures of food and stuff like that, but if you look at some of the bigger things—all the stuff in Egypt was kind of driven by Twitter," said Michaels, describing the sense of urgency that he and his fellow Command Center engineers feel every day. And not that Twitter prioritizes one piece of content on its platform over another, but Michaels noted that he comes from a background in advertising, "and keeping people's voices available during a revolution in Egypt sometimes feels just a little bit more valuable than serving the correct ad at the right time."
The Command Center was formed in response to events like that World Cup in South Africa, in which Twitter struggled under the pressure of a barrage of tweets celebrating beautiful goals and debating controversial referee decisions. Until the Center's creation, if something broke—if a huge, unexpected spike of traffic suddenly threw Twitter for a loop—there was no single, dedicated unit inside the company empowered to put out the flames. "Instead of having everyone in one place to watch over the whole thing," said Michaels, "it was individual members of each team watching over their piece of the pie."
And as Twitter grew in popularity (the company more than doubled its number of monthly active users in the year following the World Cup, from 40 million to 85 million), it recognized that this ad hoc, siloed approach to tackling the site's technical problems was no longer appropriate.
"After a while we realized we needed someone in the middle that was kind of watching the whole service, every piece of it at once," said Michaels. "Because seeing how your piece is working doesn't necessarily reflect how it's impacting Twitter as a whole. That's where we came in."
Beyond the obvious structural advantage of having a one-stop shop designed to put out server fires before they rage out of control, a key component of the Command Center's success, Michaels explained, is the regular audits and war games simulations it performs to better understand how Twitter responds to unexpected events that the team cannot prepare for in advance. "After these type of events," Michaels said, "we look at the traffic and the pattern so that in the future we can figure out what needs fixing. We try to give ourselves a second chance at these events so we can actually dig in deeper while they're happening."
One of these unexpected events occurred in March 2015, when Zayn Malik, of the popular boy band One Direction, announced on social media that he was leaving the group so he could be a "normal 22-year-old who is able to relax and have some private time out of the spotlight." Malik's announcement quickly led to a mournful hashtag, #AlwaysInOurHeartsZaynMalik, and an outpouring of tweets, retweets, and digital tears.
"It's very odd that as a grown man I know this," Micheals joked, referring to Malik's high-profile announcement and the subsequent tide of tweets. But there was an upside to the sad day since it gave Michaels and the Command Center the opportunity to pore over the data to better prepare Twitter's infrastructure for the next boy band breakup or furious international debate surrounding the color of a dress.
While no online service is invulnerable to the occasional technical glitch, the Command Center appears to be working as intended. In late 2013, Twitter retired the cutesy Fail Whale error message partly in recognition of the progress it had made in terms of stability, and service outages are now treated as major news stories and not merely as fodder for dank memes. "From an outside perspective Twitter can seem like, 'Well how complex can it be?'" Michaels said. "But if you think of all of the different moving parts of it, from keeping our search fresh to keeping our trending topics fresh, it's actually incredibly complex."
"You want to do your job well at any time," he continued, "but when you're supporting world-impacting events it definitely adds a little bit more to it."