What is AWS?
Amazon Web Services (AWS) is Amazon’s cloud computing platform, providing on-demand infrastructure, storage, databases, and higher-level services used by thousands of enterprises and consumer apps worldwide. Its US‑EAST‑1 region in Northern Virginia is one of the most critical hubs on the internet, hosting a dense concentration of AWS services and third‑party workloads.
On October 20, 2025, a major incident in US‑EAST‑1 triggered a large-scale outage that took down or degraded many of the world’s most popular apps and services.
What Happened on October 20, 2025?
Monitoring data and customer reports show that issues began in the early hours of October 20, 2025 (U.S. Eastern time), when users started seeing login failures, timeouts, and elevated error rates for applications hosted in AWS’s US‑EAST‑1 region. ThousandEyes observed initial packet loss at AWS edge nodes around 06:49 UTC (02:49 EDT), which evolved into widespread connection timeouts and HTTP 503 “Service Unavailable” errors as backend services became overloaded.
AWS acknowledged “increased error rates and latencies” across multiple services. Partial recovery was visible later in the morning, but residual impact on some workloads lasted into October 21 as backlogs were drained and dependent services recovered.
Scope of the Outage
The incident was concentrated in the US‑EAST‑1 (Northern Virginia) region but had a global impact because so many critical applications depend on that region for primary or control‑plane services.
Consumer and enterprise services affected included Snapchat, Reddit, Fortnite, Coinbase, Robinhood, Netflix, Starbucks, Disney+, and various banking, payments, and smart‑home/IoT platforms.
Amazon’s own properties, including Amazon.com retail, Prime Video, Alexa, Ring, and internal warehouse systems, also experienced disruption, with some employees unable to use logistics and payroll tools during the event.
Early correlated symptoms included packet loss at AWS edge routers, application‑layer timeouts, and spikes in HTTP 5xx error rates, indicating the problem was inside AWS’s service architecture rather than in customer networks or the public internet.
What Was the Root Cause?
Amazon’s post‑incident analysis attributes the outage to a race condition in Amazon DynamoDB’s automated DNS management system. Two independent automation components attempted to update the same internal DNS data concurrently, and the result was an invalid, effectively empty DNS record for the regional DynamoDB endpoint in US-EAST-1 that the automation did not repair on its own.
Many AWS services rely on DynamoDB for both control-plane operations and data-plane dependencies, so the DNS failure cascaded into widespread service failures, EC2 launch issues, and throttling or unavailability in higher-level services such as Amazon Connect, AWS Security Token Service (STS), and Amazon Redshift. AWS has stated it is adding stronger safeguards around DNS automation, improving validation to prevent corrupt records, and enhancing regional isolation and recovery procedures to reduce the blast radius from similar bugs.
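To make the failure class concrete, here is a minimal Python sketch of two safeguards of the kind described above: a version check that stops a delayed writer from overwriting a newer DNS plan, and a validation step that refuses to publish an empty record. It is an illustration only and not AWS's implementation; `DnsRecordStore`, `apply_plan`, and both exception names are hypothetical, and the record store is a simple in-memory stand-in.

```python
import threading
from dataclasses import dataclass, field


class StalePlanError(Exception):
    """Raised when a writer tries to apply a plan that is not newer than the current one."""


class EmptyRecordError(Exception):
    """Raised when a plan would leave an endpoint with no addresses."""


@dataclass
class DnsRecordStore:
    """Hypothetical in-memory stand-in for an internal DNS record store."""

    addresses: list = field(default_factory=list)
    plan_version: int = 0
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def apply_plan(self, plan_version: int, addresses: list) -> None:
        # Validation: never publish an empty record set for a live endpoint.
        if not addresses:
            raise EmptyRecordError("refusing to publish an empty record set")
        # The version check and the write happen under one lock, so a slow
        # writer holding an older plan cannot overwrite a newer one.
        with self._lock:
            if plan_version <= self.plan_version:
                raise StalePlanError(
                    f"plan {plan_version} is not newer than {self.plan_version}"
                )
            self.addresses = list(addresses)
            self.plan_version = plan_version
```

In this sketch, a writer that arrives late with plan 1 after plan 2 has been applied gets a `StalePlanError` and the published addresses stay unchanged. A production system would also need the check to hold across processes, for example with a conditional write in the backing store, which this single-process example does not attempt.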
Lessons for Web Teams: Building Resilience
Major regional outages at hyperscale providers are rare but inevitable, and this event reinforces the importance of designing for provider, region, and dependency failure. Teams that relied heavily on US-EAST-1 as a single point of control or data were the most exposed to prolonged downtime, while architectures with multi-region or multi-cloud failover were better positioned to degrade gracefully.
- Automate monitoring for DNS, configuration, and data‑store health, with explicit checks for anomalous name resolution, timeouts, and 5xx spikes in critical dependencies like DynamoDB.
- Maintain region‑independent core services (identity, configuration, and control planes) and support rapid failover to alternative AWS regions or secondary providers where business‑critical SLAs require it.
- Regularly test failure scenarios (including regional control‑plane loss) through game days and chaos experiments to validate that applications can operate in degraded modes.
- Audit reliance on a single “anchor” region (often US‑EAST‑1) and reduce hidden couplings that could turn a regional event into a global outage for your users.
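As a rough illustration of the first two recommendations, the Python sketch below combines a DNS resolution check with a sliding-window 5xx error rate and then picks the first healthy endpoint from a preference list. The function and class names, the 5% threshold, and the 100-request window are assumptions chosen for the example, not recommended values, and the resolver is injectable so the logic can be exercised without network access.

```python
import socket
from collections import deque


def resolve(hostname, resolver=socket.getaddrinfo):
    """Return the sorted IP addresses for hostname, or [] if resolution fails."""
    try:
        infos = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except OSError:  # socket.gaierror is a subclass of OSError
        return []
    return sorted({info[4][0] for info in infos})


class ErrorRateWindow:
    """Tracks the share of HTTP 5xx responses over the last `size` requests."""

    def __init__(self, size=100):
        self.statuses = deque(maxlen=size)

    def record(self, status_code):
        self.statuses.append(status_code)

    def rate(self):
        if not self.statuses:
            return 0.0
        errors = sum(1 for status in self.statuses if 500 <= status <= 599)
        return errors / len(self.statuses)


def dependency_health(hostname, window, threshold=0.05,
                      resolver=socket.getaddrinfo):
    """Classify a dependency as 'dns_failure', 'elevated_5xx', or 'healthy'."""
    if not resolve(hostname, resolver):
        return "dns_failure"
    if window.rate() > threshold:
        return "elevated_5xx"
    return "healthy"


def pick_endpoint(endpoints, windows, threshold=0.05,
                  resolver=socket.getaddrinfo):
    """Return the first healthy endpoint in preference order, or None."""
    for hostname in endpoints:
        window = windows.get(hostname, ErrorRateWindow())
        if dependency_health(hostname, window, threshold, resolver) == "healthy":
            return hostname
    return None
```

A caller would record each response status into the window for the endpoint it used, then call `pick_endpoint` with a list such as `["dynamodb.us-east-1.amazonaws.com", "dynamodb.us-west-2.amazonaws.com"]` before issuing requests. Treating an empty resolution result as a failure is the key point: an endpoint whose DNS record has gone missing is skipped instead of being retried until it times out.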
How SolarWinds Pingdom Can Help Improve Your Website Monitoring
SolarWinds Pingdom® is a simple yet powerful website availability and performance monitoring platform designed to help IT and website teams take a more proactive approach to incident response and improve uptime. Teams can use Pingdom to monitor websites and APIs from multiple regions worldwide, correlate real user experience with synthetic checks, and quickly distinguish between application issues and upstream cloud provider incidents.
By combining detailed alerting, historical performance data, and multi‑region vantage points, Pingdom helps organizations detect provider‑level disruptions faster, communicate accurately with stakeholders, and validate recovery once services begin to stabilize.