2012 brought us some of the worst website outages and downtime in recent memory. Here’s the list that made our top 15.
15. Google App Engine
When: Friday, October 26th
Cause: Traffic Spike
For four hours on October 26th, from 10:30AM to 2:30PM EST, Google App Engine failed to serve about 50% of its requests. Because hundreds of thousands of developers use the service to build applications, the outage was felt heavily across the web. The downtime was caused by increased load on traffic routers.
14. Tumblr
When: Thursday, October 18th
Cause: Network Issue
Starting at 8:30AM EST, Tumblr experienced an outage due to "network problems following an issue with one of [their] uplink providers." The problem persisted for nearly six hours until service was finally restored around 2:15PM EST.
13. Salesforce
When: Tuesday, July 10th
Cause: Power Outage
Salesforce suffered a significant outage in the early morning that affected six of the company's regions. The cause was identified as a power failure at an Equinix data center in Silicon Valley. Though the power outage lasted only one minute, it took over nine hours to fully restore service. This outage came just weeks after a smaller incident.
12. Twitter
When: Thursday, June 21st
Cause: Cascaded Bug
Twitter, notorious for severe outages, went down at around noon on June 21st. The disruption lasted for three hours, and Twitter later identified the problem as a "cascaded bug in one of our infrastructure components." The outage was so severe that even the infamous "Fail Whale" error page couldn't load; the site simply timed out. It marked the longest and worst crash for Twitter in eight months.
11. GitHub
When: Tuesday, October 16th - Thursday, October 18th
Cause: Distributed Denial of Service (DDoS) Attack
On Tuesday and Wednesday, GitHub experienced partial outages of 26 minutes due to a network issue and 24 minutes due to errors in its search service, respectively. Then, on Thursday, GitHub was hit by a DDoS attack that lasted for 5 hours. Developers at companies and startups across the world were brought to a standstill, as they could not pull or push any of their code. Overall, it was a rough week for GitHub.
10. Kohl's
When: Thursday, November 22nd
Cause: Traffic Spike
Kohl's ran a massive online special for Black Friday shoppers, offering over 500 early bird specials, 20% off sales prices, and free shipping for orders over $50. The bargains started the day before Thanksgiving and ran until 3pm on Black Friday. However, given the surge in traffic, the Kohl's website experienced an outage for several hours on Thanksgiving evening. This is the heaviest online traffic week of the year, so even a few hours of downtime can be incredibly costly for online retailers.
9. Super Bowl (Coke, Acura, Act of Valor)
When: Sunday, February 5th
Cause: Traffic Spike
The Super Bowl is the largest advertising event of the year. Advertisers spend millions on precious seconds to capture the eyeballs of millions watching the big game. Some advertisers' websites, however, buckle under the massive influx of traffic their ads generate. The websites for Coke, Acura, and Act of Valor all experienced severe outages directly after their ads aired during the Super Bowl.
8. Facebook
When: Thursday, June 1st - Friday, June 2nd
Cause: The Like Button
Facebook slowed down or was completely unavailable for most users for three hours between June 1st and June 2nd. With over 1 billion users worldwide, an outage of any kind is damaging to a web property the size of Facebook. What's worse, though, is that the outage affected thousands of retail and content sites across the web as well. How? The Like button. Third party widgets, such as the Like button, depend on the servers and performance of the third party that provides them (third party widgets are one of the biggest culprits of poor performance). So when Facebook experienced problems, websites that had the Like button embedded on their pages saw delays of between 5 and 20 seconds!
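One common way to limit this kind of dependency is to put a deadline on the third party and fall back to a placeholder when it is missed. Below is a minimal sketch in Python, assuming a server-side page that embeds a fragment fetched from another service. The names `render_with_deadline`, `fetch_widget`, and `slow_widget`, as well as the fallback markup, are hypothetical and only for illustration; in the browser the same idea is usually applied by loading the widget script asynchronously.

```python
import concurrent.futures
import time


def render_with_deadline(fetch_widget, timeout_s, fallback_html):
    """Return the widget HTML, or fallback_html if the third party is slow or fails."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fetch_widget)
    try:
        # Wait at most timeout_s seconds for the third party to answer.
        return future.result(timeout=timeout_s)
    except Exception:
        # Covers both a missed deadline and any error raised by fetch_widget.
        return fallback_html
    finally:
        # Do not block the page on the slow call; let it finish in the background.
        executor.shutdown(wait=False)


def slow_widget():
    """Hypothetical stand-in for a third party that is having a bad day."""
    time.sleep(0.3)
    return "<button>Like</button>"


if __name__ == "__main__":
    html = render_with_deadline(slow_widget, 0.05, "<!-- widget unavailable -->")
    print(html)
```

With a deadline in place, the page's worst-case delay is bounded by `timeout_s` rather than by however long the third party takes to respond.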
7. Bank of America
When: Friday, September 14th - Wednesday, September 19th
Cause: Service Upgrade / Traffic Spike
On September 14th, Bank of America's website began having problems, and the homepage displayed the message "some of our pages are temporarily unavailable." The issues were sporadic on Saturday, but became prominent again on Monday, when more pages were unavailable. Starting at 10AM on Tuesday, the majority of users were unable to connect to Bank of America's website due to slowness and time-out failures. The homepage displayed the message "We're sorry, our site is running slowly." The problems were not resolved until Wednesday morning. Some speculated the issues were caused by a DDoS attack, but Bank of America denied the claims. The bank attributed the outages to end of the month traffic along with a code release that migrated older customers to its new platform.
6. Hosting.com
When: Friday, July 27th
Cause: Power Outage
Hosting.com suffered an outage in the early morning, which caused more than 1,100 customer websites to experience downtime for as many as five hours. According to Hosting.com CEO Art Zeile, the outage was caused by human error: an engineer performing maintenance on servers mistakenly cut the power to the facility. The power loss only lasted for a couple of minutes, but all of the servers needed to restart, which prolonged website downtime for customers. The majority of website owners did not have backup hosting and were not prepared for such an outage, leaving them at the mercy of a single data center.
5. Hurricane Sandy
When: Monday, October 29th - Monday, November 5th
Cause: Natural Disaster
When Hurricane Sandy hit the East Coast, it took down some major data centers in New York and New Jersey that host popular websites such as Gawker Media, Huffington Post, and BuzzFeed. The hurricane caused sporadic outages for an entire week before the data centers were able to restore power and reboot.
We have to give major props to Squarespace for literally carrying fuel up 17 floors for 3 days -- all to provide 100% uptime to over 1 million websites. That's dedication.
4. Leap Second Bug
When: Sunday, July 1st
Cause: Leap second added to Coordinated Universal Time (UTC)
The Leap Second Bug caused outages for many popular services such as Reddit, LinkedIn, Yelp, Gawker Media, Foursquare, StumbleUpon, Mozilla, and Microsoft Windows Azure. What is the Leap Second Bug? Every so often (roughly every 18 months on average, though the schedule is irregular), a leap second is added to keep our atomic clocks in step with the Earth's slowing rotation. This one was the 25th leap second added since 1972! One small second threw Java and digital certificates for a loop with an unexpected timestamp and thus caused problems for these services. Google, however, was prepared for the leap second. Its servers gradually added milliseconds to their clocks ahead of time, a technique Google calls a "leap smear," so the extra second had already been absorbed by the time the leap second occurred.
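To make the smearing idea concrete, here is a minimal sketch in Python. It assumes a simple linear smear over a hypothetical 20-hour window; the window length, the linear shape, and the function name `smear_offset_ms` are illustrative assumptions, not details of Google's actual implementation.

```python
def smear_offset_ms(elapsed_s: float, window_s: float = 20 * 3600) -> float:
    """Milliseconds of the leap second absorbed `elapsed_s` seconds into the window.

    Assumes a linear smear: the extra 1000 ms is spread evenly across the window.
    """
    if window_s <= 0:
        raise ValueError("window_s must be positive")
    # Clamp so the offset is 0 ms before the window starts and 1000 ms after it ends.
    fraction = min(max(elapsed_s / window_s, 0.0), 1.0)
    return 1000.0 * fraction


if __name__ == "__main__":
    for hours in (0, 5, 10, 20):
        absorbed = smear_offset_ms(hours * 3600)
        print(f"{hours:>2} h into the window: {absorbed:.0f} ms absorbed")
```

Under these assumed numbers, each second on the clock is stretched by roughly 0.014 milliseconds, so no application ever sees a timestamp jump or a repeated second.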
3. Royal Bank of Scotland
When: Tuesday, June 19th - Thursday, August 2nd
Cause: Batch processing backlog
An error by IT staff led to system failures that affected 17 million customers of RBS, NatWest and Ulster Bank. The problem occurred during system maintenance, which caused an error in the banks' automated batch scheduler and processor. This prevented millions of customers from receiving or making payments, and lasted for more than a week! The outage cost RBS a whopping £125 million!
2. GoDaddy
When: Monday, September 10th
Cause: Domain Name System (DNS) Failure
Around 11AM PST, GoDaddy announced they were experiencing intermittent outages and later attributed the issue to a DNS failure. The infamous hacker group Anonymous originally took credit for the outage by way of a DDoS attack, but later rescinded this claim. GoDaddy hosts more than 5 million websites, so thousands - and possibly millions - of websites experienced downtime due to this issue. Service was restored for the majority of users by 8PM PST, but the sheer magnitude and scale of GoDaddy's reach online made this one of the biggest and most publicized outages of the year.
1. Amazon Web Services (AWS)
When: Friday, June 29th / Monday, October 22nd / Monday, December 24th
Cause: Natural Disaster / Memory Leak / Elastic Load Balancing Failure
AWS had a rough year for uptime, as it experienced three major outages. The first outage happened on June 29th due to a major storm that impacted popular services such as Instagram, Pinterest, and Netflix until the following day. On October 22nd, a memory leak and failed monitoring system caused Reddit, Foursquare, Minecraft, Airbnb, Heroku, GitHub, imgur, Pocket, HipChat, Coursera and a number of others to go down. The outage lasted for six hours until service was restored. Finally, on Christmas Eve, Netflix went down until Christmas morning due to an elastic load balancing failure in AWS.
What are the biggest outages you remember from last year? Do you think anyone on this list will turn into a repeat offender in 2013?