Microsoft 365 Outage (Nov 2024): A Recap

In November 2024, Microsoft 365 users experienced a two-day outage that affected several popular productivity and collaboration tools, including Outlook and Teams. While some users jokingly welcomed the pre-Thanksgiving break, the incident significantly disrupted Microsoft’s large user base.

In this article, we’ll use Microsoft’s incident report (Issue ID MO941162) and other sources to explore the incident in detail. We’ll review three key takeaways that IT, site reliability engineering (SRE), and website administration teams can use to learn from the incident and improve their response and site uptime.

Scope of the Outage

The incident began just before 2:00 UTC on Monday, November 25, 2024, and lasted until approximately 11:00 UTC on Tuesday, November 26, 2024. Microsoft services impacted by the outage—and some associated issues users may have encountered—included:

Exchange Online: Mail transport delays, access issues across multiple interfaces (web, desktop client, REST API, and Exchange ActiveSync)
Microsoft Teams: Significant functionality issues (users unable to search, modify events, edit meetings, etc.)
Microsoft Purview: Purview Portal and Purview Solutions inaccessibility, delays with Adaptive Scope functions
Microsoft Fabric: Inability to export content and manage labels, issues exporting select artifacts with sensitivity labels
SharePoint Online: Search functionality issues
Microsoft Defender for Office 365 and XDR portal: Various access, reporting, notification, and management issues
Universal Print: Unable to print, view printers or printer shares, or register printers
Power Automate for Desktop: Errors with cloud connections
Microsoft Bookings: Inaccessible bookings
Microsoft Viva Engage: Issues loading topics, searching, loading home feeds, and seeing usernames in Q&A experiences
Microsoft Copilot: Functionality issues related to meetings and post-meetings, unable to load Copilot in Viva Engage, issues submitting queries from OneNote chat panes

According to Microsoft, issues could have impacted any Microsoft 365 user worldwide. The impact would have varied depending on how the user was routed across the affected infrastructure and the dependence of that product or service on the affected infrastructure.

What Was the Root Cause of the Outage?

Ultimately, the root cause was a process issue during the decommissioning of an internal Microsoft 365 backend service.

The incident was triggered when the decommissioning workflow ran. Before the service was decommissioned, traffic to it was not disabled as expected. As a result, once the service was removed, other services continued to send traffic to it.
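One way to guard against this failure mode is to check for live traffic before removing a service. The following is a minimal sketch of such a pre-decommission check, not a description of Microsoft's tooling. The `TrafficSample` type, the caller names, and the threshold are all hypothetical and stand in for whatever traffic data your own monitoring provides.

```python
from dataclasses import dataclass


@dataclass
class TrafficSample:
    caller: str                 # hypothetical name of an upstream service
    requests_per_minute: float  # recent request rate observed from that caller


def callers_still_sending(samples, threshold=0.0):
    """Return the callers whose recent traffic exceeds the threshold."""
    return sorted({s.caller for s in samples if s.requests_per_minute > threshold})


def safe_to_decommission(samples, threshold=0.0):
    """Treat a service as safe to remove only when no caller still sends traffic."""
    return not callers_still_sending(samples, threshold)


# Illustrative data only: one caller has drained, one has not.
samples = [
    TrafficSample("frontend-a", 0.0),
    TrafficSample("frontend-b", 42.0),
]

blockers = callers_still_sending(samples)
print(blockers, safe_to_decommission(samples))
```

In this sketch, the check fails because `frontend-b` is still sending requests, so the decommissioning step would be blocked until that traffic is redirected.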

Routing logic existed to send requests to a backup endpoint if the now-decommissioned service was unreachable. However, when the Microsoft Client Access Front End (CAFE) components handled these requests, their attempts to resolve requests destined for the decommissioned backend service included a synchronous call in an asynchronous code path. As a result, threads were held for an extended time.

As traffic ramped up at the beginning of the workday on Monday, available threads became exhausted, causing service availability issues. The extent of availability issues varied depending on how much a Microsoft product or service depended on CAFE routing. For example, Outlook on the web was more dependent on CAFE routing than Outlook desktop. Therefore, the incident impacted the web version of Microsoft 365 more significantly.
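To illustrate why a synchronous call in an asynchronous code path is costly, here is a small Python analogy. It is not Microsoft's code, and CAFE's actual implementation and threading model are not described in the incident report. The function names are hypothetical, and `time.sleep` stands in for a call that waits on an unreachable backend.

```python
import asyncio
import time


def blocking_lookup(delay):
    """Stand-in for a synchronous call that waits on an unreachable backend."""
    time.sleep(delay)
    return "fallback"


async def handle_blocking(delay):
    # Anti-pattern: the synchronous call holds the event loop thread,
    # so no other request can make progress until it returns.
    return blocking_lookup(delay)


async def handle_offloaded(delay):
    # The blocking call runs in a worker thread, leaving the loop free.
    return await asyncio.to_thread(blocking_lookup, delay)


async def timed(handler, requests, delay):
    """Run `requests` concurrent handlers and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(handler(delay) for _ in range(requests)))
    return time.perf_counter() - start


def compare(requests=5, delay=0.05):
    """Return (blocked_seconds, offloaded_seconds) for the two handlers."""
    blocked = asyncio.run(timed(handle_blocking, requests, delay))
    offloaded = asyncio.run(timed(handle_offloaded, requests, delay))
    return blocked, offloaded


if __name__ == "__main__":
    blocked, offloaded = compare()
    print(f"blocking handler: {blocked:.2f}s, offloaded handler: {offloaded:.2f}s")
```

With the blocking handler, the five requests are served one after another, so total time grows with the number of requests. Under real load, the same effect shows up as threads that stay busy waiting, which is consistent with the exhaustion described above.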

What Can You Learn From the Outage?

Enterprises like Microsoft can experience incidents that lead to extended outages. Modern applications are often a web of complex dependencies; an issue in one component can have a significant cascading effect on the rest of the system. Even if your infrastructure isn’t as complex as that of Microsoft 365, you can still learn from this case study.

Below are our three biggest takeaways from the Microsoft 365 outage in November 2024.

Takeaway #1: Implement change management practices commensurate with risk

If you’ve been following our outage case studies, you’ve likely noticed that change can lead to unexpected downtime. For example, not only was the Microsoft 365 outage related to a change (decommissioning of a service), but these other notable incidents were all change-induced as well:

Roblox outage: Occurred after a new streaming feature was enabled
Azure China outage: Triggered by a nameserver configuration change
Teams and other Microsoft services outage: Caused by a WAN update

Of course, completely avoiding change isn’t practical or recommended. Changes are necessary to patch security issues, add features, and maintain performance. The key is striking the right balance of change control, risk, and speed. There’s no one-size-fits-all answer to how much rigor website change management requires. However, teams must be intentional about their approach to change control.

If you don’t have any change management in place for your services or you’d like to learn more about change management, check out ITSM Change Management Best Practices.

Takeaway #2: Monitor from multiple perspectives

Microsoft users were affected differently depending on how they were routed across Microsoft infrastructure. While most teams will have an internal infrastructure less complex than Microsoft’s, it’s still often true that users will take different routes—and therefore have different experiences—when accessing sites and services.

If you only monitor responses or specific metrics (CPU, memory, and disk) from a single region, you can miss problems directly affecting your users in other regions. Two great ways to help avoid these common site-availability monitoring missteps are:

Implement real user monitoring (RUM): RUM uses lightweight client-side JavaScript code to automatically monitor and report what happens in your users’ browsers. This helps teams learn about user experience problems before a customer needs to open a ticket.
Monitor from multiple regions: The public internet is big and messy. Just because traffic flows fine for users in one region doesn’t mean everything works well globally. If users typically only access your site during business hours and you’re not monitoring from their region, your team might take hours (or days) to learn about issues affecting a specific geographical location. To avoid this problem, implement uptime monitoring from multiple points of presence around the globe.

*A Pingdom dashboard with multiple checks, including those from different regions.* (Source)
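The value of probing from several regions can be sketched in a few lines. The example below is illustrative only and does not use the Pingdom API. The region names, the `(region, ok)` result format, and the 50% failure threshold are assumptions chosen for the example.

```python
from collections import defaultdict


def failure_rate_by_region(results):
    """Group (region, ok) probe results and compute each region's failure rate."""
    counts = defaultdict(lambda: [0, 0])  # region -> [failures, total]
    for region, ok in results:
        counts[region][1] += 1
        if not ok:
            counts[region][0] += 1
    return {region: fails / total for region, (fails, total) in counts.items()}


def classify(results, threshold=0.5):
    """Label the overall state and list regions at or above the failure threshold."""
    rates = failure_rate_by_region(results)
    failing = sorted(r for r, rate in rates.items() if rate >= threshold)
    if not failing:
        return "healthy", failing
    if len(failing) == len(rates):
        return "global outage", failing
    return "regional outage", failing


# Hypothetical probe results: only the eu-west probes are failing.
results = [
    ("us-east", True), ("us-east", True),
    ("eu-west", False), ("eu-west", False),
    ("ap-south", True),
]

print(classify(results))
```

A team that probed only from `us-east` in this scenario would see every check pass, while users routed through `eu-west` would be unable to reach the site.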

Takeaway #3: Establish a plan for single points of failure

The Microsoft 365 outage had a broad impact because many apps depended on the same internal services. Where practical, teams should design systems to avoid single points of failure (SPOFs) by building in redundancy and resilience.

Granted, redundancy and resilience aren’t always cost-effective measures. In cases where you have a SPOF, ensure you have a well-defined disaster recovery plan in case it goes down. For example, if all your infrastructure runs in a single cloud provider region, consider how you could restore your site in a different region or with a different provider. The depth of your disaster recovery plan will vary depending on your ITSM maturity, but even small teams can identify SPOFs and decide whether they are willing to either:

Accept the risk of a single dependency leading to an extended outage; or
Make an intentional plan to restore service without it.
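A simple starting point for identifying SPOFs is to list the services that most or all of your applications depend on, directly or indirectly. The sketch below assumes you can express dependencies as a plain dictionary. The service names are hypothetical, and a shared dependency is only a SPOF candidate, since this example does not model whether the service itself is redundant.

```python
def transitive_dependencies(app, depends_on):
    """Return every service that `app` depends on, directly or indirectly."""
    seen = set()
    stack = list(depends_on.get(app, ()))
    while stack:
        service = stack.pop()
        if service not in seen:
            seen.add(service)
            stack.extend(depends_on.get(service, ()))
    return seen


def shared_dependencies(apps, depends_on, min_share=1.0):
    """Return services that at least `min_share` of the apps depend on."""
    counts = {}
    for app in apps:
        for service in transitive_dependencies(app, depends_on):
            counts[service] = counts.get(service, 0) + 1
    return sorted(s for s, c in counts.items() if c / len(apps) >= min_share)


# Hypothetical dependency map: every app ultimately relies on "front-door".
depends_on = {
    "mail": ["front-door"],
    "chat": ["front-door", "search"],
    "search": ["front-door"],
    "front-door": [],
}
apps = ["mail", "chat"]

print(shared_dependencies(apps, depends_on))
```

Each service in the result is a candidate for one of the two decisions above: accept the risk, or plan how to restore service without it.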

How Pingdom Can Help Improve Your Site Uptime

SolarWinds® Pingdom® empowers teams to monitor websites from multiple regions across the globe so they can quickly detect downtime and performance issues. Teams can use Pingdom to take a more proactive approach to resolving site reliability issues, uncover bottlenecks, and address problems before users complain. For example, with Pingdom, teams benefit from:

Transaction monitoring: Ensure key user journeys are functioning as expected
Page speed monitoring: Detect performance bottlenecks and degradations that can frustrate users
Public status pages: Keep users informed about issues and reduce ticket load

*Load times and file sizes displayed in Pingdom.* (Source)

If you’d like to try SolarWinds Pingdom yourself, sign up for a free (no credit card required) 30-day trial today.
