Microsoft 365 Outage (Nov 2024): A Recap

29 May 2025 | Pingdom Team

In November 2024, Microsoft 365 users experienced a two-day outage that affected several popular productivity and collaboration tools, including Outlook and Teams. While some users jokingly welcomed the pre-Thanksgiving break, the incident significantly disrupted Microsoft’s large user base.

In this article, we’ll use Microsoft’s incident report (Issue ID MO941162) and other sources to explore the incident in detail. We’ll review three key takeaways that IT, site reliability engineering (SRE), and website administration teams can use to learn from the incident and improve their response and site uptime.  

Scope of the Outage 

The incident began just before 2:00 UTC on Monday, November 25, 2024, and lasted until approximately 11:00 UTC on Tuesday, November 26, 2024. Microsoft services impacted by the outage—and some associated issues users may have encountered—included:

  • Exchange Online: Mail transport delays, access issues across multiple interfaces (web, desktop client, REST API, and Exchange ActiveSync)
  • Microsoft Teams: Significant functionality issues (users unable to search, modify events, edit meetings, etc.)
  • Microsoft Purview: Purview Portal and Purview Solutions inaccessibility, delays with Adaptive Scope functions
  • Microsoft Fabric: Inability to export content and manage labels, issues exporting select artifacts with sensitivity labels
  • SharePoint Online: Search functionality issues
  • Microsoft Defender for Office 365 and XDR portal: Various access, reporting, notification, and management issues
  • Universal Print: Unable to print, view printers or printer shares, or register printers
  • Power Automate for Desktop: Errors with cloud connections
  • Microsoft Bookings: Inaccessible bookings
  • Microsoft Viva Engage: Issues loading topics, searching, loading home feeds, and seeing usernames in Q&A experiences
  • Microsoft Copilot: Functionality issues related to meetings and post-meetings, unable to load Copilot in Viva Engage, issues submitting queries from OneNote chat panes

According to Microsoft, issues could have impacted any Microsoft 365 user worldwide. The impact would have varied depending on how the user was routed across the affected infrastructure and the dependence of that product or service on the affected infrastructure. 

What Was the Root Cause of the Outage?

Ultimately, the root cause was a process issue during the decommissioning of an internal Microsoft 365 backend service.

Traffic to the service was not disabled as expected before the service was decommissioned. As a result, once the service was removed, other services continued to send traffic to it.
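
One way to guard against this failure mode is to have the decommissioning workflow refuse to proceed while callers are still sending traffic. The sketch below is a hypothetical illustration, not Microsoft's workflow: the `TrafficSample` type, the `safe_to_decommission` function, and the caller names are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class TrafficSample:
    """Requests seen from one caller in one observation window (hypothetical shape)."""
    caller: str
    requests: int


def active_callers(samples: list[TrafficSample]) -> dict[str, int]:
    """Sum requests per caller and keep only callers that sent any traffic."""
    totals: dict[str, int] = {}
    for sample in samples:
        totals[sample.caller] = totals.get(sample.caller, 0) + sample.requests
    return {caller: count for caller, count in totals.items() if count > 0}


def safe_to_decommission(samples: list[TrafficSample]) -> tuple[bool, list[str]]:
    """Return (True, []) only when no caller sent traffic in the observed windows.

    Otherwise return False plus the sorted names of callers that still need draining.
    """
    callers = active_callers(samples)
    return (not callers, sorted(callers))


# Assumed sample data: one caller has been drained, another has not.
observed = [
    TrafficSample("service-a", 0),
    TrafficSample("service-b", 12),
    TrafficSample("service-b", 3),
]
ok, remaining = safe_to_decommission(observed)
```

With the sample data above, `ok` is `False` and `remaining` names `service-b`, so the workflow would stop before removing the service. A real check would read from your own traffic metrics over a window long enough to cover peak usage.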

Logic existed to route requests to a backup endpoint if the now-decommissioned service was unreachable. However, as the Microsoft Client Access Front End (CAFE) components handled these requests, their attempts to resolve requests bound for the decommissioned backend service included a synchronous call in an asynchronous code path. As a result, threads were held for an extended time.
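
To see why a synchronous call in an asynchronous code path is costly, consider the minimal Python `asyncio` sketch below. It is an analogy only and does not represent the CAFE code: the `TIMEOUT_S` value, the handler names, and the `backup-endpoint` string are assumptions. Both handlers wait the same amount of time for an unreachable backend before falling back, but one holds its thread while it waits.

```python
import asyncio
import time

TIMEOUT_S = 0.05  # assumed time spent waiting on the unreachable backend


async def resolve_blocking() -> str:
    """Synchronous wait inside an async path: the thread is held the whole time."""
    time.sleep(TIMEOUT_S)
    return "backup-endpoint"


async def resolve_non_blocking() -> str:
    """Awaited wait: the thread is free to serve other requests meanwhile."""
    await asyncio.sleep(TIMEOUT_S)
    return "backup-endpoint"


async def time_requests(handler, count: int) -> float:
    """Run `count` concurrent requests through `handler` and return elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler() for _ in range(count)))
    return time.perf_counter() - start


def compare(count: int = 5) -> tuple[float, float]:
    """Return (blocking_seconds, non_blocking_seconds) for the same request count."""
    blocking = asyncio.run(time_requests(resolve_blocking, count))
    non_blocking = asyncio.run(time_requests(resolve_non_blocking, count))
    return blocking, non_blocking


if __name__ == "__main__":
    blocking, non_blocking = compare()
    print(f"blocking: {blocking:.2f}s, non-blocking: {non_blocking:.2f}s")
```

In the blocking version, the requests run one after another, so total time grows with the number of requests. In the non-blocking version, the waits overlap and total time stays close to a single timeout. Under real load, the blocking pattern is what allows a pool of threads to run out.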

As traffic ramped up at the beginning of the workday on Monday, available threads became exhausted, causing service availability issues. The extent of availability issues varied depending on how much a Microsoft product or service depended on CAFE routing. For example, Outlook on the web was more dependent on CAFE routing than Outlook desktop. Therefore, the incident impacted the web version of Microsoft 365 more significantly.

What Can You Learn From the Outage?

Even enterprises like Microsoft can experience incidents that lead to extended outages. Modern applications are often a web of complex dependencies, and an issue in one component can have a significant cascading effect on the rest of the system. Even if your infrastructure isn’t as complex as that of Microsoft 365, you can still learn from this case study.

Below are our three biggest takeaways from the Microsoft 365 outage in November 2024.

Takeaway #1: Implement change management practices commensurate with risk

If you’ve been following our outage case studies, you’ve likely noticed that change can lead to unexpected downtime. The Microsoft 365 outage was related to a change (the decommissioning of a service), and other notable incidents were change-induced as well.

Of course, completely avoiding change isn’t practical or recommended. Changes are necessary to patch security issues, add features, and maintain performance. The key is striking the right balance of change control, risk, and speed. There’s no one-size-fits-all answer to how much rigor website change management requires. However, teams must be intentional about their approach to change control.

If you don’t have any change management in place for your services or you’d like to learn more about change management, check out ITSM Change Management Best Practices.

Takeaway #2: Monitor from multiple perspectives 

Microsoft users were affected differently depending on how they were routed across Microsoft infrastructure. While most teams will have an internal infrastructure less complex than Microsoft’s, it’s still often true that users will take different routes—and therefore have different experiences—when accessing sites and services. 

Suppose you only focus on monitoring responses or specific metrics (CPU, memory, and disk) from a single region. In this case, you can miss out on problems directly impacting your users in other regions. Two great ways to help avoid these common site-availability monitoring missteps are:

  • Implement real user monitoring (RUM): RUM uses lightweight client-side JavaScript code to automatically monitor and report what happens in your users’ browsers. This helps teams learn about user experience problems before a customer needs to open a ticket.
  • Monitor from multiple regions: The public internet is big and messy. Just because traffic flows fine for users in one region doesn’t mean everything works well globally. Suppose users typically only access your site during business hours, and you’re not monitoring from their region. Your team might take hours (or days) to learn about issues impacting a specific geographical location. To avoid this problem, implement uptime monitoring from multiple points of presence around the globe.
A Pingdom dashboard with multiple checks, including those from different regions.
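
To illustrate why a single vantage point can hide a problem, here is a minimal Python sketch that groups probe results by region. It is not the Pingdom API: the `CheckResult` type, the `classify` function, the region names, and the "more than half of probes failed" rule are all assumptions chosen for the example.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one uptime probe (hypothetical shape, not a Pingdom type)."""
    region: str
    ok: bool


def failing_regions(results: list[CheckResult]) -> list[str]:
    """Return the regions where more than half of the probes failed."""
    by_region: dict[str, list[bool]] = {}
    for result in results:
        by_region.setdefault(result.region, []).append(result.ok)
    return sorted(
        region
        for region, outcomes in by_region.items()
        if outcomes.count(False) * 2 > len(outcomes)
    )


def classify(results: list[CheckResult]) -> str:
    """Summarize probes as healthy, a regional outage, or a global outage."""
    regions = {result.region for result in results}
    failing = failing_regions(results)
    if not failing:
        return "healthy"
    if len(failing) == len(regions):
        return "global outage"
    return "regional outage: " + ", ".join(failing)


# Assumed sample data: one region is fine, the other is failing.
probes = [
    CheckResult("us-east", True),
    CheckResult("us-east", True),
    CheckResult("eu-west", False),
    CheckResult("eu-west", False),
]
us_only = [probe for probe in probes if probe.region == "us-east"]
```

With this sample data, `classify(probes)` reports a regional outage in `eu-west`, while `classify(us_only)` reports `healthy`. A team watching only the `us-east` probes would not see the problem at all.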

Takeaway #3: Establish a plan for single points of failure

The Microsoft 365 outage had a broad impact because many apps depended on the same internal services. As much as practical, teams should try to architect to avoid single points of failure (SPOFs) by designing with redundancy and resilience in mind.

Granted, redundancy and resilience aren’t always cost-effective measures. In cases where you have a SPOF, ensure you have a well-defined disaster recovery plan in case it goes down. For example, if all your infrastructure runs in a single cloud provider region, consider how you could restore your site in a different region or with a different provider. The depth of your disaster recovery plan will vary depending on your ITSM maturity, but even small teams can identify SPOFs and decide whether they are willing to either:

  1. Accept the risk of a single dependency leading to an extended outage; or
  2. Make an intentional plan to restore service without it.
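
A simple first step toward identifying SPOFs is to list which services several of your apps share. The Python sketch below assumes you can write down each app's direct dependencies as a mapping. The app and service names are hypothetical, and the `min_dependents` threshold is an assumption you would tune for your own environment.

```python
from __future__ import annotations


def shared_dependencies(
    dependencies: dict[str, set[str]], min_dependents: int = 2
) -> dict[str, list[str]]:
    """Map each service used by at least `min_dependents` apps to those apps."""
    dependents: dict[str, list[str]] = {}
    for app, services in dependencies.items():
        for service in services:
            dependents.setdefault(service, []).append(app)
    return {
        service: sorted(apps)
        for service, apps in dependents.items()
        if len(apps) >= min_dependents
    }


# Hypothetical inventory: every app routes through the same front-end service.
inventory = {
    "webmail": {"front-end-router", "mail-store"},
    "chat": {"front-end-router", "presence"},
    "search": {"front-end-router", "index"},
}
candidates = shared_dependencies(inventory)
```

Here, `candidates` contains only `front-end-router`, along with the three apps that depend on it. Each service flagged this way is a candidate for one of the two decisions above: accept the risk, or plan how to restore service without it.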

How Pingdom Can Help Improve Your Site Uptime

SolarWinds® Pingdom® empowers teams to monitor websites from multiple regions across the globe so they can quickly detect downtime and performance issues. Teams can use Pingdom to take a more proactive approach to resolving site reliability issues, uncover bottlenecks, and address problems before users complain.

Load times and file sizes displayed in Pingdom.

If you’d like to try SolarWinds Pingdom yourself, sign up for a free (no credit card required) 30-day trial today. 
