Every day brings a new set of outages on the Internet. Websites go down, online services run into trouble, networks have glitches, and so on. When a lot of users are affected, these outages make the news and set the blogosphere abuzz. We here at Pingdom work with downtime-related issues every day and probably spend more time reading about these things than most, so we decided to sum up the year so far for your convenience, and add some analysis of our own in the process.
These are 14 (not 13, that would be bad luck! 😉) of the more notable Web- and Internet-related outages and incidents so far in 2008. We chose outages that have either affected a lot of people, or have other implications that we deemed important to highlight.
One thing that the following examples clearly show is that no one is immune to downtime. Not Google, not Microsoft, and not Apple. In addition to this, sometimes whole parts of the Internet itself simply break.
SaaS availability problems
- Gmail trouble (Mar 1, Mar 12, Mar 27, Aug 11) – Google has had numerous difficulties with its Gmail service this year. Often only a subset of users has been affected, but they have made themselves heard, and there has been plenty of news coverage, which should not be surprising considering how many people use the service.
- Google Apps trouble (Jul 8, Aug 7) – Problems with Google Apps (including Google Docs) caused grievances for users on at least two occasions. The noise generated by these outages, although they were relatively short-lived, gives an indication of what would happen if Google Apps went down for a longer period of time, say a day. We suspect the roar would be heard all over the Internet.
Glitches in the cloud
- Amazon S3 outages (Feb 15, Jul 20) – These outages are significant because AWS has become something of a poster child for cloud computing, so every time S3 (or EC2) has a problem, “the cloud” is called into question. Another reason they get so much attention is that many services rely on Amazon S3, so, just as when a hosting company or data center has an outage, a lot of sites are affected at once. The one on July 20 lasted eight hours.
- Apple’s MobileMe outages (Jul 10-11) – When Apple migrated .Mac accounts to the new MobileMe, things did not go as smoothly as the company would have wished. Steve Jobs later admitted (in a leaked email) that Apple made a mistake launching MobileMe, the iPhone 3G, the iPhone 2.0 software and the App Store all on the same day, and that MobileMe should have been given more time and testing.
- iPhone activation woes (Jul 11) – Not only did Apple have problems with the MobileMe launch, but many who bought the brand new iPhone 3G couldn’t activate their phones because the iTunes activation servers were overloaded.
- Cuil launch trouble (Jul 28) – When Cuil launched its much-hyped “Google killer,” it crashed. They had built up a big buzz, and then weren’t able to handle the number of visitors they got. Now there are also rumors that Cuil itself is causing downtime for some sites due to excessive crawling by its indexing bot.
- Microsoft Photosynth launch trouble (Aug 21) – The website of Microsoft’s new 3D photo-stitching service couldn’t handle the load on its launch day. Microsoft seems to consider the initial downtime a badge of honor and promised to quickly add more horsepower, but as Michael Arrington pointed out over at TechCrunch: “We see similar optimistic responses to server failure all the time from startups. Except they’re startups.”
Internet and data center issues
- Mediterranean submarine cable cuts (Jan 30) – A pair of cut submarine telecom cables in the Mediterranean just north of Egypt caused severe Internet outages and disruptions in the Middle East, Pakistan and India. This incident reminded us all that the Internet ultimately depends on the physical cables that interconnect its networks. (Renesys has a detailed description of the effects of the outage that you may want to check out, including a map of the most affected areas.) Further cable incidents in the same region followed, sparking various conspiracy theories.
- Explosion and fire at The Planet (Jun 1) – Probably the most massive data center outage of the year, an explosion and electrical fire at one of The Planet’s data centers in Houston affected thousands of sites (around 9,000 servers), some for several days. The fire department’s initial refusal to let The Planet activate its backup power generators didn’t exactly help. In addition to this, services that depended on DNS servers located in that data center were also affected.
- BlackBerry service outages (Feb 11, Jun 18) – BlackBerry addicts (the device has been nicknamed “CrackBerry”) have had two major service outages to fret about so far this year. Considering that many BlackBerry users are business users, this could have had a real effect on some businesses’ ability to communicate.
- Netflix problems (Mar 24, Aug 11-15) – In March the online video-rental service (with seven million subscribers) suffered from a technical glitch that took out its website and logistics system for about 12 hours. In August a system problem prevented the service from delivering DVDs for several days.
- The YouTube IP hijacking (Feb 24) – YouTube was unavailable for roughly two hours because an ISP, Pakistan Telecom, had mistakenly claimed IP address space belonging to YouTube (including the IP addresses used by YouTube’s DNS servers). This effectively took YouTube offline in a matter of minutes. This is interesting because it proved that a single ISP can, under the right (or wrong!) circumstances, inadvertently sabotage parts of the entire Internet.
- Revision3 taken down by MediaDefender DDoS attack (May 24-26) – Anti-piracy company MediaDefender crippled the Revision3 Internet TV service for most of the Memorial Day weekend with an attack on the service’s (legal) BitTorrent tracking server, bombarding Revision3’s network with up to 8,000 SYN packets a second. Revision3 has posted a lengthy explanation of the incident on their blog.
- SiteMeter crashing blogs (Aug 2) – An update to SiteMeter’s script (websites can have it included on their pages to get visitor statistics) started crashing popular blogs like Gawker, Lifehacker, Gizmodo and Valleywag for Internet Explorer users. Presumably every single website using SiteMeter had this problem. This is significant because it shows that third-party apps and scripts can quite easily stop a whole site from working.
Here is a summary of the key points that we feel were highlighted by the outages listed above.
- Capacity problems during launches – There seems to be a trend of new services building up hype and then launching only to be unable to handle the traffic they get. These are often pure scaling issues, for example when people couldn’t activate their new iPhones or when Microsoft’s Photosynth proved to be “more popular than expected.” That this happens to small startups with limited resources is not surprising, but for a big company like Apple to stumble as it did with MobileMe and the iPhone activation is a bit unexpected.
- The importance of DNS – Several of these issues also underline how critical DNS servers are on the Internet. Lose control of your DNS servers, and you lose control of your site, so spread them out. Don’t have them all in one place! For example, after the incident described above, YouTube added DNS servers on another network to prevent something similar from happening again.
- Infrastructure problems are affecting “the cloud” – It may not be fair to single out Amazon, but its S3 service is the most-used “cloud” utility service out there, so it always comes up when cloud computing is discussed. The problems Amazon has had so far have been various issues with its infrastructure, and hopefully these will diminish as the technology matures. Arguably you could also cite the problems that Google has had with Gmail and Google Apps as “issues with the cloud.”
- Third-party interference – Several of these outages show how vulnerable services and websites are to “external influence.” The most recent incident with the SiteMeter script is highly interesting, because in the mashup world of Web 2.0, websites are stacked with third-party scripts and applications. Any one of these may be the link that breaks the chain. The more you have, the higher the risk of them affecting your site in some way (performance or downtime).
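To put a rough number on the third-party point, here is a minimal sketch of how the risk compounds. It assumes that each third-party script fails independently and that any single failure breaks the page, which will not hold for every site. The 99.9% availability figure and the function names are purely illustrative and are not measurements of SiteMeter or any other service.

```python
def combined_availability(availabilities):
    """Chance that every dependency is up at the same time.

    Assumes failures are independent and that one failing
    dependency is enough to break the page (an assumption).
    """
    result = 1.0
    for availability in availabilities:
        if not 0.0 <= availability <= 1.0:
            raise ValueError("availability must be between 0 and 1")
        result *= availability
    return result


def expected_downtime_hours(availability, period_hours=24 * 365):
    """Expected hours of downtime over a period (default: one year)."""
    return (1.0 - availability) * period_hours


if __name__ == "__main__":
    # Hypothetical scenario: each script is up 99.9% of the time.
    for count in (1, 5, 10):
        combined = combined_availability([0.999] * count)
        hours = expected_downtime_hours(combined)
        print(f"{count} scripts: {combined:.4%} up, about {hours:.1f} h down per year")
```

Under these assumptions, every script you add multiplies in another chance of failure, so the expected downtime grows roughly in proportion to the number of scripts on the page.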
Downtime happens to everyone sooner or later. It’s simply a fact of life on the Internet. However, by learning from the mistakes (or bad luck) of others and being well aware of what can go wrong, it’s possible to take a proactive approach and minimize the risk of future downtime, and when it does happen, at least keep it short.
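As one example of a proactive check, the DNS advice above (don’t keep all your name servers in one place) can be roughly sketched in code. This is only an approximation: it assumes IPv4 addresses and treats a shared /24 block as “the same place,” whereas a real check would look at data centers or upstream networks. The function names are hypothetical, and the addresses are reserved documentation ranges, not real name servers.

```python
import ipaddress
from collections import defaultdict


def group_by_network(addresses, prefix_len=24):
    """Group IPv4 addresses by the network block they fall into.

    A shared /24 is used as a crude stand-in for "same location";
    that is an assumption, not a rule.
    """
    groups = defaultdict(list)
    for address in addresses:
        network = ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)
        groups[str(network)].append(address)
    return dict(groups)


def is_single_point_of_failure(addresses, prefix_len=24):
    """True if all name servers sit in fewer than two network blocks."""
    return len(group_by_network(addresses, prefix_len)) < 2


if __name__ == "__main__":
    # Hypothetical name server addresses (documentation ranges).
    name_servers = ["192.0.2.10", "192.0.2.53", "198.51.100.53"]
    print(group_by_network(name_servers))
    print("Single point of failure:", is_single_point_of_failure(name_servers))
```

A check like this only tells you where the addresses sit, not whether the servers answer, so it complements regular monitoring instead of replacing it.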
By the way, if you feel that we missed some major (or interesting) outages, please let us know in the comments!