As you may have noticed if you’re a Pingdom user, we had problems yesterday. In this post we will do our best to explain what happened and why, and how we will learn from this going forward.
And let us also take this opportunity to sincerely apologize for any inconvenience this may have caused you. It’s super-important for us to provide you with a quality service, and those of you who have been with us for a long time hopefully know that we deliver on that promise. This was an extreme, highly unusual scenario.
Yesterday at 3:29 p.m. CET (GMT+1) we had a critical hardware failure at our main data center, located in Stockholm, Sweden. This is where the Pingdom website, control panel and backend are hosted, as well as the Stockholm monitoring location. The hardware failure effectively cut off those servers from the Internet.
We determined the root cause relatively quickly, but it took time to get replacement hardware into place and get it back up and running properly. In the end we were offline for just over four hours.
That doesn’t mean monitoring stopped working during this time. The actual monitoring from our 30+ monitoring locations is designed to continue working independently of our backend if it can’t be reached.
Once the backend was reachable again, the system started catching up on the monitoring results from those past few hours, retrieving them from our 30+ (32 right now, to be exact) monitoring locations. The locations cache results until they can be sent to and processed by the backend, at which point they show up in reports, etc.
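To illustrate the catch-up mechanism described above, here is a minimal sketch in Python. All names here are our own invention, not Pingdom's actual code: a probe queues results locally whenever the backend is unreachable and flushes them oldest-first once it comes back.

```python
import collections
import time


class ProbeResultQueue:
    """Illustrative sketch of a monitoring probe's local result cache.

    Results are queued whenever the backend is unreachable and flushed
    oldest-first once it becomes reachable again, so no monitoring data
    is lost during a backend outage.
    """

    def __init__(self, send_to_backend):
        # send_to_backend is assumed to raise ConnectionError on failure.
        self._send = send_to_backend
        self._pending = collections.deque()

    def record(self, check_id, status, timestamp=None):
        """Queue a new monitoring result and attempt to deliver it."""
        self._pending.append({
            "check_id": check_id,
            "status": status,
            "timestamp": timestamp if timestamp is not None else time.time(),
        })
        self.flush()

    def flush(self):
        """Send queued results oldest-first; stop (keeping the rest)
        as soon as the backend turns out to be unreachable."""
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                return  # Backend still down; retry on the next flush.
            self._pending.popleft()

    def pending_count(self):
        return len(self._pending)
```

Flushing oldest-first preserves the original order of results, which is what lets reports become 100% correct once the backlog has been processed.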
Aside from the site being unavailable for four hours, which was bad enough, there were some other unfortunate consequences of the outage.
- Delayed data in reports. Since our backend was unreachable for such a long time, it took a couple of hours before all the monitoring data from our monitoring network had been sent in and processed and the report data was 100% correct and up to date.
- Delayed alerts. Those whose outages started while our backend was unreachable got their alerts late, which of course is unacceptable. We plan to implement a distributed solution for alerts that will completely eliminate any risk of this happening in the future.
- Some incorrect alerts. Unfortunately, in addition to the regular but delayed alerts, a number of incorrect ones were triggered as the system came back up. We did not discover this until we had time to investigate the system in more detail. We had designed a fail-safe for this, but it turns out it was circumvented in this case. After the fact, we can only apologize. We have found the problem and will correct it so it can't happen again.
- Overloaded site. To make matters worse, hours' worth of delayed alerts were sent out in a short amount of time as the results flooded back into the system. Naturally, a ton of people then tried to connect to our site and control panel to investigate what was going on, creating almost a kind of DDoS attack that made our site and control panel slow, and at times unavailable.
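As a rough illustration of the kind of fail-safe mentioned above, here is a sketch in Python. The names and thresholds are our own assumptions, not Pingdom's actual logic: when replaying a backlog of results, collapse it to each check's most recent state and suppress alerts that are stale or were already resolved by a later "up" result.

```python
def alerts_to_send(backlog, now, max_age_seconds=3600):
    """Filter a replayed backlog of (check_id, timestamp, status)
    results down to the alerts still worth sending.

    Only each check's most recent state counts, so outages that were
    resolved during the backend outage never trigger an alert, and
    anything older than max_age_seconds is suppressed as stale.
    """
    latest = {}
    for check_id, timestamp, status in backlog:
        prev = latest.get(check_id)
        if prev is None or timestamp > prev[0]:
            latest[check_id] = (timestamp, status)
    return [
        check_id
        for check_id, (timestamp, status) in latest.items()
        if status == "down" and now - timestamp <= max_age_seconds
    ]
```

The design choice here is to treat the backlog as a snapshot rather than a timeline: replaying every state change one-by-one is exactly what produces the flood of delayed and incorrect alerts.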
Note regarding SMS refunds: Any lost SMS credits due to late or incorrect alerts in connection with yesterday’s outage will of course be refunded.
What we have learned going forward
We tend to say that no service is immune to downtime, and that includes us. What matters is how you resolve it, and what you learn from it.
This four-hour downtime was, by quite some margin, the single largest issue we’ve ever had with the Pingdom service during the four years we’ve been around. Basically every single thing that could possibly go wrong, did. To name just one example, getting the replacement hardware into place easily took an hour longer than necessary simply because roadwork had created a huge traffic jam on the freeway to our data center.
We’ll use the experience and knowledge we’ve gained from this to make our system stronger and better, with additional fail-safes and redundancy. We’re using an excellent data center, but we need to add more, especially for backend-related functionality.
What we plan to do to avoid similar incidents in the future:
- Add more locations with backend functionality to handle alerts and other time-critical resources.
- Add more redundant hardware, and better failover capability.
- Have more spare parts on location.
- Add more fail-safes in the system to gracefully recover from longer site issues.
- Make alert code modifications to better handle extreme situations without side effects.
That’s actually the only upside to this incident: it will make Pingdom an even better service. You should be able to trust us 100%, and yesterday, during and for a while after the outage, we stumbled in that regard. We hope your trust in us has not been too damaged by this incident.
And please remember, if you have any questions you can always get in touch with the Pingdom team at support (at) pingdom.com.