About yesterday’s Pingdom outage

As you may have noticed if you’re a Pingdom user, we had problems yesterday. In this post we will do our best to explain what happened and why, and how we will learn from this going forward.

And let us also take this opportunity to sincerely apologize for any inconvenience this may have caused you. It’s super-important for us to provide you with a quality service, and those of you who have been with us for a long time hopefully know that we deliver on that promise. This was an extreme, highly unusual scenario.

What happened

Yesterday at 3:29 p.m. CET (GMT+1) we had a critical hardware failure at our main data center, located in Stockholm, Sweden. This is where the Pingdom website, control panel and backend are hosted, as well as the Stockholm monitoring location. The hardware failure effectively cut off those servers from the Internet.

We determined the root cause relatively quickly, but it took time to get replacement hardware into place and get it back up and running properly. In the end we were offline for just over four hours.

That doesn’t mean monitoring stopped working during this time. The actual monitoring from our 30+ monitoring locations is designed to continue working independently of our backend if it can’t be reached.

Once the backend was reachable again, the system started the process of catching up with the monitoring results for those past few hours, i.e. getting the results from those 30+ (32 right now, to be exact) monitoring locations. They cache results until they can be sent to and processed by the backend and show up in reports, etc.
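To make the caching behavior described above more concrete, here is a minimal sketch of a store-and-forward buffer for a single monitoring location. It is an illustration only, not Pingdom's actual code: the `ProbeBuffer` class, the `send` callback, and the shape of a result are all hypothetical, and the sketch assumes an unreachable backend surfaces as a `ConnectionError`.

```python
from collections import deque
from typing import Callable, Deque, Dict

# Hypothetical shape of one monitoring result, e.g. {"check": ..., "ts": ..., "up": ...}
CheckResult = Dict[str, object]


class ProbeBuffer:
    """Hypothetical store-and-forward buffer for one monitoring location."""

    def __init__(self, send: Callable[[CheckResult], None]) -> None:
        # `send` delivers one result to the backend and is assumed to raise
        # ConnectionError while the backend is unreachable.
        self._send = send
        self._pending: Deque[CheckResult] = deque()

    def record(self, result: CheckResult) -> None:
        """Queue a new result, then try to deliver everything queued so far."""
        self._pending.append(result)
        self.flush()

    def flush(self) -> int:
        """Send queued results oldest-first and return how many were delivered."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break  # backend still unreachable: keep the rest cached
            self._pending.popleft()  # drop a result only after a successful send
            sent += 1
        return sent

    @property
    def pending(self) -> int:
        """Number of results still waiting to be delivered."""
        return len(self._pending)
```

In this sketch the location keeps monitoring and queuing results while the backend is down, and a later call to `flush()` delivers the backlog in the order it was recorded. A real system would also need persistence across restarts and a limit on how much it buffers.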

The consequences

Aside from the site being unavailable for four hours, which was bad enough, there were some other unfortunate consequences of the outage.

  • Delayed data in reports. Since our backend was unreachable for such a long time, it took a couple of hours until all the monitoring data from our monitoring network had been sent in and processed so the report data was 100% correct and up to date.
  • Delayed alerts. Those who had outages that started during the time our backend was unreachable got their alerts late, which of course is unacceptable. We plan to implement a distributed solution for alerts that will completely eliminate any risk of this in the future.
  • Some incorrect alerts. Unfortunately, in addition to regular but delayed alerts, a number of incorrect ones were triggered as the system came back up. We did not discover this until we had time to investigate the system in more detail. We had designed a fail-safe for this, but it turns out it was circumvented in this case. After the fact we can only apologize. We have found the problem and will correct it so it can’t happen again.
  • Overloaded site. To make matters worse, hours’ worth of delayed alerts were sent out in a short amount of time as the results flooded back into the system. A ton of people then tried to connect to our site and control panel to investigate what was going on, creating something close to a DDoS attack that made our site and control panel slow and at times even unavailable.

Note regarding SMS refunds: Any lost SMS credits due to late or incorrect alerts in connection with yesterday’s outage will of course be refunded.

What we have learned going forward

We tend to say that no service is immune to downtime, and that includes us. What matters is that you resolve it, and learn from it.

This four-hour downtime was, by quite some margin, the single largest issue we’ve ever had with the Pingdom service during the four years we’ve been around. Basically every single thing that could possibly go wrong, did. To name just one example, getting the replacement hardware into place easily took an hour longer than necessary simply because roadwork had created a huge traffic jam on the freeway to our data center.

We’ll use the experience and knowledge we’ve gained from this to make our system stronger and better, with additional fail-safes and redundancy. We’re using an excellent data center, but we need to add more, especially for backend-related functionality.

What we plan to do to avoid similar incidents in the future:

  • Add more locations with backend functionality to handle alerts and other time-critical resources.
  • Add more redundant hardware, and better failover capability.
  • Have more spare parts on location.
  • Add more fail-safes in the system to gracefully recover from longer site issues.
  • Make alert code modifications to better handle extreme situations without side effects.

That’s actually the only upside to this incident. It will make Pingdom an even better service. You should be able to trust us 100%, and yesterday during and for a while after the outage, we stumbled in that regard. We hope your trust in us has not been too damaged by this incident.

And please remember, if you have any questions you can always get in touch with the Pingdom team at support (at) pingdom.com.
