AWS Outage October 20th 2025 Recap: Latest News, Updates, and Status 

21 Oct 2025 | Pingdom Team

What is AWS? 

Amazon Web Services (AWS) is Amazon’s cloud computing platform, providing on-demand infrastructure, storage, databases, and higher-level services used by thousands of enterprises and consumer apps worldwide. Its US-EAST-1 region in Northern Virginia is one of the most critical hubs on the internet, hosting a dense concentration of AWS services and third-party workloads.

On October 20, 2025, a major incident in US‑EAST‑1 triggered a large-scale outage that took down or degraded many of the world’s most popular apps and services. 

What Happened on October 20, 2025? 

Monitoring data and customer reports show that issues began in the early hours of October 20, 2025, when users started seeing login failures, timeouts, and elevated error rates for applications hosted in AWS’s US-EAST-1 region. ThousandEyes observed initial packet loss at AWS edge nodes around 06:49 UTC, which evolved into widespread connection timeouts and HTTP 503 “Service Unavailable” errors as backend services became overloaded.

AWS acknowledged “increased error rates and latencies” across multiple services. Partial recovery was visible later in the morning, but residual impact on some workloads lasted into October 21 as backlogs were drained and dependent services recovered.

Scope of the Outage 

The incident was concentrated in the US-EAST-1 (Northern Virginia) region but had a global impact because so many critical applications depend on that region for primary or control-plane services.

Consumer and enterprise services affected included Snapchat, Reddit, Fortnite, Coinbase, Robinhood, Netflix, Starbucks, Disney+, and various banking, payments, and smart-home/IoT platforms.

Amazon’s own properties, including Amazon.com retail, Prime Video, Alexa, Ring, and internal warehouse systems, also experienced disruption, with some employees unable to use logistics and payroll tools during the event.

Early correlated symptoms included packet loss at AWS edge routers, application-layer timeouts, and spikes in HTTP 5xx error rates, indicating the problem was inside AWS’s service architecture rather than in customer networks or the public internet.
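As a rough illustration of how a team might spot the 5xx spikes and timeouts described above, the following Python sketch tracks the share of failed responses over a sliding window of recent checks. The class name, window size, and threshold are assumptions chosen for the example, not values taken from AWS, ThousandEyes, or Pingdom.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the share of failed responses over the most recent checks."""

    def __init__(self, window_size=20, threshold=0.25):
        # window_size and threshold are illustrative values, not recommendations.
        self.statuses = deque(maxlen=window_size)
        self.threshold = threshold

    def record(self, status_code):
        """Record one response; pass None for a timeout or connection failure."""
        self.statuses.append(status_code)

    def error_rate(self):
        """Return the fraction of recent checks that timed out or returned 5xx."""
        if not self.statuses:
            return 0.0
        failures = sum(
            1 for status in self.statuses
            if status is None or 500 <= status <= 599
        )
        return failures / len(self.statuses)

    def is_spiking(self):
        """Return True when the failure share meets or exceeds the threshold."""
        return self.error_rate() >= self.threshold
```

Feeding the monitor with results from several vantage points and comparing their error rates can help separate a provider-side event from a local network problem, although a single window like this is only a starting point.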

What Was the Root Cause? 

Amazon’s post-incident analysis attributes the outage to a race condition in Amazon DynamoDB’s automated DNS management system. Two independent automation components attempted to update the same internal DNS data concurrently, resulting in an invalid or effectively empty DNS entry for critical DynamoDB endpoints in US-EAST-1.

Because many AWS services, including control-plane operations and data-plane dependencies, rely on DynamoDB, the DNS failure cascaded into widespread service failures, EC2 launch issues, and throttling or unavailability in higher-level services such as Amazon Connect, STS, and Redshift. AWS has stated it is adding stronger safeguards around DNS automation, improving validation to prevent corrupt records, and enhancing regional isolation and recovery procedures to reduce the blast radius from similar bugs.
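The general class of bug can be shown with a deliberately simplified Python sketch. This is not AWS’s code or its actual fix: the `DnsRecordStore` class, its method names, and the version numbers are hypothetical, and the real system is far more involved. The sketch only shows how a delayed writer carrying an older plan can overwrite a newer one, and how a version check plus an emptiness check would reject that write.

```python
class DnsRecordStore:
    """Hypothetical in-memory stand-in for a DNS record set (illustration only)."""

    def __init__(self):
        self.version = 0
        self.addresses = []

    def apply_unguarded(self, version, addresses):
        # Last writer wins, even if its plan is older or empty.
        self.version = version
        self.addresses = list(addresses)

    def apply_guarded(self, version, addresses):
        # Reject plans that are stale or that would leave the name unresolvable.
        if version <= self.version:
            return False
        if not addresses:
            return False
        self.version = version
        self.addresses = list(addresses)
        return True


# Two writers race: the newer plan lands first, then a delayed, stale plan arrives.
unguarded = DnsRecordStore()
unguarded.apply_unguarded(2, ["10.0.0.2"])
unguarded.apply_unguarded(1, [])  # stale, empty plan silently wins

guarded = DnsRecordStore()
guarded.apply_guarded(2, ["10.0.0.2"])
stale_accepted = guarded.apply_guarded(1, [])  # rejected; record stays intact
```

After the race, the unguarded store holds no addresses at all, while the guarded store still serves the newer record. In a real distributed system the check and the write would also need to be atomic, for example through a conditional write or a lock, which this single-threaded sketch does not model.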

Lessons for Web Teams: Building Resilience 

Major regional outages at hyperscale providers are rare but inevitable, and this event reinforces the importance of designing for provider, region, and dependency failure. Teams that relied heavily on US-EAST-1 as a single point of control or data experienced the longest downtime, while architectures with multi-region or multi-cloud failover generally degraded more gracefully.

  • Automate monitoring for DNS, configuration, and data‑store health, with explicit checks for anomalous name resolution, timeouts, and 5xx spikes in critical dependencies like DynamoDB. 
  • Maintain region‑independent core services (identity, configuration, and control planes) and support rapid failover to alternative AWS regions or secondary providers where business‑critical SLAs require it. 
  • Regularly test failure scenarios (including regional control‑plane loss) through game days and chaos experiments to validate that applications can operate in degraded modes. 
  • Audit reliance on a single “anchor” region (often US-EAST-1) and reduce hidden couplings that could turn a regional event into a global outage for your users.
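The first two recommendations can be sketched in a few lines of Python: an explicit check for names that stop resolving, and a helper that falls back to the next endpoint in a priority list. The function names and the example hostnames used in the note below are assumptions for illustration. The resolver is passed in as a parameter so the logic can be exercised without network access.

```python
import socket


def resolve_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted, de-duplicated IP addresses for hostname, or [] on failure."""
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    # Each getaddrinfo result is (family, type, proto, canonname, sockaddr);
    # the first element of sockaddr is the IP address.
    return sorted({info[4][0] for info in results})


def pick_endpoint(endpoints, resolver=socket.getaddrinfo):
    """Return the first endpoint, in priority order, whose name resolves."""
    for hostname in endpoints:
        if resolve_addresses(hostname, resolver):
            return hostname
    return None
```

A caller might pass a list such as a primary regional hostname followed by a secondary one (for example, the hypothetical `api-primary.example.com` and `api-secondary.example.com`) and alert whenever the primary stops resolving. A successful lookup shows only that the name resolves, not that the service behind it is healthy, so a check like this belongs alongside HTTP-level probes rather than in place of them.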

How SolarWinds Pingdom Can Help Improve Your Website Monitoring 

SolarWinds Pingdom® is a simple yet powerful website availability and performance monitoring platform designed to help IT and website teams take a more proactive approach to incident response and improve uptime. Teams can use Pingdom to monitor websites and APIs from multiple regions worldwide, correlate real user experience with synthetic checks, and quickly distinguish between application issues and upstream cloud provider incidents.

By combining detailed alerting, historical performance data, and multi‑region vantage points, Pingdom helps organizations detect provider‑level disruptions faster, communicate accurately with stakeholders, and validate recovery once services begin to stabilize. 
