Google European Point of Presence Outage (August 2024) Recap

03 Feb 2025 | Pingdom Team

On Aug 12, 2024, a point of presence (POP) serving the Google europe-west2 region experienced a power outage that caused service disruptions for multiple Google products. Google published an excellent incident report explaining the outage’s timeline and details.

In this article, we’ll unpack the specifics of the POP outage and share three key takeaways for IT teams of all sizes.

Scope of the outage

The Google Cloud and Google Workspace outage began around 13:20 UTC on Aug 12, 2024, and lasted approximately 37 minutes. Power was restored to the affected equipment at 16:43 UTC, and networking gear was fully operational by 16:57 UTC. Network traffic to and from the Google europe-west2 region was degraded, timed out, or failed to connect during the outage.

The affected services included:

  • Google Cloud CDN
  • Google Cloud Load Balancing
  • Google Cloud Networking
  • Hybrid connectivity
  • Virtual Private Cloud
  • YouTube
  • Google Workspace (Gmail, Calendar, Chat, Docs, Drive, Meet, and Tasks for connections in the U.K.)
  • Google Cloud APIs
  • Google Cloud Interconnect

The outage only impacted services if:

  • Ingress traffic from the internet routed to europe-west2 depended on offline or unreachable equipment
  • Egress traffic to the internet routed from europe-west2 depended on offline or unreachable equipment

Root cause

What was the root cause of the outage? Ultimately, it was an electrical power outage at a London data center. The primary and secondary power lines feeding the Google POP failed.

Electrical power is a tricky topic for IT and operations professionals because it’s not typically their primary area of expertise; most simply assume that functioning power is available. Google appears to have followed many data center and incident response design best practices:

✅ Dual power feeds for critical equipment 
✅ Backup generator in case of power failure
✅ Stored onsite power backup, such as an uninterruptible power supply (UPS)
✅ Automated recovery tooling 

However, despite these measures, the power loss still impacted Google’s services. Both power feeds failed, and the backup power solutions did not support the critical networking equipment. Additionally, Google Front End (GFE) experienced overloads during the incident that automation did not resolve, so restoring service required manual adjustments.

Lessons to learn

Google’s incident report for the europe-west2 outage has plenty to explore, from the outage’s impact on route replacement and Border Gateway Protocol (BGP) to data center redundancy.

The sections below review the three biggest takeaways from an IT operations and monitoring perspective.

Takeaway #1: Create, test, and verify an incident response plan.

Even Google, the author of one of the most respected site reliability engineering books in the industry, couldn’t automate everything. Fortunately, the organization is large enough to staff incident response teams with engineers to address gaps quickly. Smaller IT organizations often have limited automation coverage and understaffed incident response teams. This IT Ops double whammy can create a vicious cycle: engineers stay focused on firefighting, automation goes largely untested, and little time is left to invest in automation improvements.

Getting to Google-level incident response isn’t practical for many organizations. However, there are some practical steps IT Ops teams can take to improve incident response and uptime. Here are three tips to help you get started:

  1. Test your existing automation flows. Whether it’s backup and restore or outage alerts, regularly testing your automation is essential. Aim to test key automation scenarios at least once every quarter.
  2. Create an incident response team (IRT). If something goes wrong, who gets the first call? Who is responsible for resolving issues within the different components of your tech stack? If your IT service management maturity is low, ensure your IT Ops team has a well-defined group of incident responders.
  3. Map out your incident response process. A corollary to creating an IRT is defining an incident response process. Make sure your team can answer questions such as: 
    • What is (and is not) an incident?
    • How does first-line support escalate an incident? 
    • How are incidents communicated to stakeholders?
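
One way to make the answers to these questions concrete is to write them down as data the whole team can review. The Python sketch below is a hypothetical example, not a recommended policy: the severity names, thresholds, roles, and notification channels are all assumptions to replace with your own.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    name: str
    is_incident: bool    # answers "what is (and is not) an incident?"
    escalate_to: tuple   # roles paged, in order, by first-line support
    notify: tuple        # channels used to update stakeholders


# Hypothetical policy: every value here is an assumption for illustration.
POLICY = {
    "sev1": Severity("sev1", True,
                     ("on-call engineer", "incident commander"),
                     ("status page", "stakeholder email")),
    "sev2": Severity("sev2", True, ("on-call engineer",), ("status page",)),
    "ticket": Severity("ticket", False, (), ()),
}


def classify(users_affected_pct: float, key_workflow_broken: bool) -> Severity:
    """Map observed impact to a severity using the assumed thresholds."""
    if key_workflow_broken or users_affected_pct >= 25:
        return POLICY["sev1"]
    if users_affected_pct >= 1:
        return POLICY["sev2"]
    return POLICY["ticket"]
```

Keeping the policy in one small, reviewable structure makes it easier to walk through during a quarterly test and to spot roles or channels nobody owns.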

Takeaway #2: Focus on end-user outcomes.

Many IT pros dismiss electrical power as outside their responsibility. This is often reasonable, especially for teams running most or all of their workloads in the cloud. However, IT Ops is still typically accountable for service availability, regardless of the root cause of an outage.

That’s why monitoring from the perspective of the end user is critical. Transaction monitoring can help IT Ops monitor end-to-end performance for key workflows and raise an alert when something breaks. Similarly, real user monitoring (RUM) allows IT teams to monitor real user problems and detect when an incident impacts multiple users. If IT is responsible for the broken dependency, they can take action to resolve it. If they’re not, they can escalate to the appropriate team.
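
To illustrate the idea, here is a minimal sketch of a synthetic check in Python. It is not how Pingdom or any other product implements monitoring; the function names, the 10-second timeout, and the 3-second "slow" threshold are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request
from dataclasses import dataclass


@dataclass
class CheckResult:
    url: str
    ok: bool
    reason: str
    elapsed_s: float


def http_fetch(url, timeout_s):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx/5xx responses raise HTTPError; the status code is still useful.
        return err.code


def run_check(url, fetch=http_fetch, timeout_s=10.0, slow_after_s=3.0,
              clock=time.monotonic):
    """Judge one request the way an end user would: did it work, and fast enough?"""
    start = clock()
    try:
        status = fetch(url, timeout_s)
    except OSError as err:
        # URLError, timeouts, and connection errors all derive from OSError.
        return CheckResult(url, False, f"unreachable: {err}", clock() - start)
    elapsed = clock() - start
    if status >= 400:
        return CheckResult(url, False, f"HTTP {status}", elapsed)
    if elapsed > slow_after_s:
        return CheckResult(url, False, f"slow response ({elapsed:.1f}s)", elapsed)
    return CheckResult(url, True, "ok", elapsed)
```

A scheduler could call run_check("https://oursite.example.com/") every minute and raise an alert after several consecutive failures. The check reports what the user experienced and says nothing about the cause, which is the point: it fires whether the broken dependency is yours or someone else's.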

Takeaway #3: Use tabletop exercises for proactive learning.

It’s not practical to predict every possible failure mode. However, the people closest to the work can often identify gaps by discussing theoretical emergencies. Even if you’re not responsible for physical infrastructure, tabletop exercises can help identify gaps in your incident response plans.

Here’s a simplified example of how a tabletop exercise run by an IT Ops team responsible for a medium-sized e-commerce site might begin:

  • Facilitator: “On Thursday at 8:30 p.m., users contact support, indicating they can’t access the site. Our website monitoring tool shows that HTTP GET requests to oursite.example.com are timing out. Pings to the IP address succeed.”
  • Network engineer: “First, I’d check the status of the DNS servers.”
  • Facilitator: “How?”
  • Network engineer: “I’d manually attempt to poll the servers.”
  • Facilitator: “OK, assume they’re unresponsive. What next?”
  • Network engineer: “I’d escalate to our DNS provider, Acme DNS.”
  • Facilitator: “Is there anything you can do while you wait for Acme DNS to respond?”
  • Network engineer: “Work on standing up a secondary DNS provider.”

In this simplified example, we can already start to spot areas for improvement, such as directly monitoring the DNS servers and improving stakeholder communication (for example, by using a public status page). Additionally, standing up a secondary DNS provider in the middle of an incident may create more problems than it solves. Teams working through tabletop exercises can identify weak spots in their monitoring and incident response processes before a real-world event exposes them.
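
The "How?" question in the exercise is easier to answer when the manual checks are already scripted. The Python sketch below is one hypothetical way to find the first failing layer (DNS, TCP, or HTTP) for a hostname. The function names, port 443, and the timeouts are assumptions, and it only tests the resolver used by the machine running it, not the provider's authoritative DNS servers.

```python
import socket
import urllib.request


def resolve(hostname):
    """Return the IP addresses the local resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def tcp_connect(ip, port=443, timeout_s=5.0):
    """Open and immediately close a TCP connection to ip:port."""
    with socket.create_connection((ip, port), timeout=timeout_s):
        pass


def http_get(url, timeout_s=10.0):
    """Return the HTTP status code for a GET request to url."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def first_failing_layer(hostname, resolve_fn=resolve, connect_fn=tcp_connect,
                        get_fn=http_get):
    """Check DNS, then TCP, then HTTP, and report the first layer that fails."""
    try:
        addresses = resolve_fn(hostname)
    except OSError as err:  # socket.gaierror is a subclass of OSError
        return f"dns: {err}"
    try:
        connect_fn(addresses[0])
    except OSError as err:  # refused, unreachable, or timed out
        return f"tcp: {err}"
    try:
        get_fn(f"https://{hostname}/")
    except OSError as err:  # URLError and HTTPError (4xx/5xx) are OSError too
        return f"http: {err}"
    return "none"
```

In the scenario above, a result starting with "dns:" would support escalating to the DNS provider, while "tcp:" or "http:" would point the team back at their own servers or load balancers.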

Pingdom improves your website monitoring

Pingdom is a powerful but straightforward website monitoring tool designed to help IT teams monitor website performance and end-to-end workflows. With Pingdom, IT teams can monitor websites from multiple regions across the globe and build more proactive incident response processes.
