Synthetic Monitoring

Simulate visitor interaction with your site to monitor the end user experience.

View Product Info

FEATURES

Simulate visitor interaction

Identify bottlenecks and speed up your website.

Learn More

Real User Monitoring

Enhance your site performance with data from actual site visitors

View Product Info

FEATURES

Real user insights in real time

Know how your site or web app is performing with real user insights

Learn More

Infrastructure Monitoring Powered by SolarWinds AppOptics

Instant visibility into servers, virtual hosts, and containerized environments

View Infrastructure Monitoring Info
Comprehensive set of turnkey infrastructure integrations

Including dozens of AWS and Azure services, container orchestrations like Docker and Kubernetes, and more 

Learn More

Application Performance Monitoring Powered by SolarWinds AppOptics

Comprehensive, full-stack visibility, and troubleshooting

View Application Performance Monitoring Info
Complete visibility into application issues

Pinpoint the root cause down to a poor-performing line of code

Learn More

Log Management and Analytics Powered by SolarWinds Loggly

Integrated, cost-effective, hosted, and scalable full-stack, multi-source log management

 View Log Management and Analytics Info
Collect, search, and analyze log data

Quickly jump into the relevant logs to accelerate troubleshooting

Learn More

The Roblox Outage

A Halloween Story

Just before Halloween 2021, Roblox engineers experienced a horror story: a service outage that also took down critical monitoring systems. It seemed like the issue was a hardware problem, but it wasn’t. Users were frustrated, and the clock was ticking. After three full days of downtime, service was finally restored on Halloween day.

While the incident itself was an IT nightmare, Roblox’s detailed technical post-mortem several months later was an excellent way to bounce back. The post-mortem was a pleasant, late Christmas gift to users and SysAdmins everywhere.

Thanks to transparency from Roblox, we have an outage case study to learn from. In this post, we’ll summarize the scope and root cause of the outage, explain what other ITOps teams can learn from it, and consider how SolarWinds® Pingdom® can help you reduce the risk of extended downtime in your environment.

The scope of the outage

The outage began on October 28, 2021, and was resolved 73 hours later on October 31. The outage affected 50 million Roblox users. While we don’t have a specific dollar cost for the outage, it was a significant incident for Roblox and HashiCorp.

Roblox uses HashiCorp’s HashiStack to manage its global infrastructure. The stack consists of these components:

  • Nomad: Schedules containers on specific hardware nodes and checks container health.
  • Vault: A secrets management solution for securing sensitive data like credentials.
  • Consul: An identity-based networking solution that provides service discovery, health checks, and session locking.

Roblox was one of HashiCorp’s hallmark customers. Therefore, the Roblox outage impacted the reputation of HashiCorp as well. Fortunately, HashiCorp engineers worked with Roblox to troubleshoot and triage the issue, demonstrating their commitment to customer success even when the going gets tough.

The root cause of the outage

From a technical perspective, two issues together formed the root cause of the Roblox outage:

  1. Roblox enabled a new streaming feature on Consul at a time when database reads and writes were unusually high.
  2. The BoltDB database, used by Consul, experienced a performance issue.

While the above two factors made up the root cause, several other factors contributed to the length of the outage. For example, because Roblox’s monitoring systems depended on the same systems that were down, visibility was limited.

Additionally, the issues were technically nuanced and took time to debug. Several initial attempts to restore service failed, and one of the recovery attempts (adding faster hardware) may have worsened things.

What can you learn from the outage?

Roblox did a great job making the details of the outage public, which gives other ITOps teams plenty to study. The top four lessons we can learn from the Roblox outage include:

Visibility is vital.

In the Roblox outage, there was an unfortunate chicken-and-egg problem. Roblox had monitoring in place, but the monitoring tooling depended on the systems that were down. Therefore, they didn’t have deep visibility into the problem when they began troubleshooting.

For IT operations, there are two takeaways from this. First, make sure you’re monitoring all your critical infrastructure. Second, mitigate single points of failure that could knock your monitoring offline along with your production systems.

Limit single points of failure.

Roblox was susceptible to this outage because they ran their backend services on a single Consul cluster. They’ve learned from the outage and have since built infrastructure for an additional backend data center. Other ITOps teams should consider doing the same for their critical systems. If it’s mission-critical, aim for N+1 at a minimum. Of course, everything is ultimately a business decision. If you can’t justify the cost of N+1 (or better), ensure you’re comfortable with the downtime risk.

Know your tradeoffs.

Roblox chose to avoid the public cloud for its core infrastructure. After the outage, some questioned if that was a wise choice, but Roblox has made a purposeful decision to stay the course. Public cloud infrastructure isn’t immune to outages, and Roblox considered the tradeoffs. They feel that maintaining control of their core infrastructure is more important than a public cloud’s benefits.

For other organizations, the calculations on the tradeoffs will be different. Weighing the tradeoffs to make a decision based on your preferences, expertise, resources, and risk appetite is what’s important.

Learn from your mistakes.

Continuous improvement leads to long-term success. After the outage, Roblox was transparent about what happened and implemented solutions (like the additional backend infrastructure) to prevent repeat failures. Mistakes happen, and no system is perfect. Teams that recognize this and adopt practices like blameless postmortems tend to do better in the long run.

How Pingdom can help

Pingdom is a simple but robust website and web application monitoring platform that provides ITOps teams visibility into availability and performance. As a cloud platform, Pingdom helps you avoid the chicken-and-egg problem Roblox experienced during its outage. Because Pingdom is decoupled from your critical infrastructure, you can limit single points of failure that could take out your servers and monitoring tools at the same time.

Additionally, Pingdom can monitor site availability from over 100 locations across the globe. That means your team can detect performance issues and outages that may only affect users in specific geographic areas. It can also make troubleshooting tough-to-diagnose issues easier because you can immediately isolate symptoms based on region. To see what SolarWinds Pingdom can do for you, sign up for a free 30-day trial today

Web API Monitoring Explained: A Helpful Introductory Guide

An API, application programming interface, is a collection of tools, protocols, [...]

User Experience for Observability

Modern software applications involve multiple layers of code and services, work [...]

The Roblox Outage

Just before Halloween 2021, Roblox engineers experienced a horror story: a serv [...]

Internet Availability Threats Following the Russian invasion of Ukraine

Since Russia’s invasion of Ukraine on February 24, 2022, denial-of-service (D [...]

User Monitoring on Heroku-Based Apps

Introduction In today's fast-paced business environment, tech startups are o [...]

Monitor your website’s uptime and performance

With Pingdom's website monitoring you are always the first to know when your site is in trouble, and as a result you are making the Internet faster and more reliable. Nice, huh?

START YOUR FREE 30-DAY TRIAL

MONITOR YOUR WEB APPLICATION PERFORMANCE

Gain availability and performance insights with Pingdom – a comprehensive web application performance and digital experience monitoring tool.

START YOUR FREE 30-DAY TRIAL
Start monitoring for free