Microsoft recently resolved an incident that temporarily prevented customers from accessing or using some capabilities in the Microsoft Defender XDR portal. The disruption was flagged in an admin center service alert and affected features such as access to threat hunting alerts and the visibility of devices in the portal. Microsoft tagged the outage as a critical incident because of its noticeable user impact. It also highlights how difficult it is to maintain high availability for complex security services that depend on real-time data processing and access.
The root cause was identified as a spike in traffic directed at the portal’s backend components. This unexpected surge drove up Central Processing Unit (CPU) utilization on the systems that support core Microsoft Defender portal functionality. In this context, high CPU utilization means the servers were saturated and could not process incoming requests efficiently, which users experienced as blocked access and missing data in the security interface. Identifying the cause was the first step toward remediation.
After acknowledging the service degradation, Microsoft began applying mitigation measures to restore normal operations. These included increasing the processing throughput of the affected components, in effect scaling up capacity to handle the elevated traffic. Initial telemetry soon indicated that service availability had begun to recover for some affected customers, a sign that the changes were having the intended effect on system performance.
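The relationship between a traffic spike, CPU saturation, and added capacity can be illustrated with a back-of-the-envelope calculation. The sketch below is not Microsoft's model: the function names and every number in it are hypothetical, chosen only to show why utilization above 1.0 means requests arrive faster than they can be processed, and how adding capacity brings it back down.

```python
import math


def cpu_utilization(requests_per_sec: float, cpu_ms_per_request: float, cores: int) -> float:
    """Fraction of available CPU time consumed; above 1.0 means overload."""
    cpu_seconds_demanded = requests_per_sec * cpu_ms_per_request / 1000
    return cpu_seconds_demanded / cores


def cores_needed(requests_per_sec: float, cpu_ms_per_request: float, target_utilization: float) -> int:
    """Smallest core count that keeps utilization at or below the target."""
    cpu_seconds_demanded = requests_per_sec * cpu_ms_per_request / 1000
    return math.ceil(cpu_seconds_demanded / target_utilization)


# Hypothetical figures, for illustration only.
baseline = cpu_utilization(2000, 10, cores=32)   # normal traffic
spike = cpu_utilization(4000, 10, cores=32)      # traffic doubles
scaled = cpu_utilization(4000, 10, cores=64)     # capacity doubled in response

print(f"baseline={baseline:.3f} spike={spike:.3f} scaled={scaled:.3f}")
print("cores for 70% target during spike:", cores_needed(4000, 10, 0.7))
```

With these assumed inputs, doubling traffic on fixed capacity pushes demand past what the cores can supply, and doubling capacity returns utilization to its baseline level.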
As the situation stabilized, Microsoft continued to work with a small number of customers who reported persistent issues, an indication that the fix was neither instantaneous nor universal. The company coordinated with these organizations to collect additional diagnostics, specifically HTTP Archive (HAR) traces, which record the browser's network requests and responses in detail. This client-side data supported a deeper analysis of the remaining impact, which included not only blocked access but also missing advanced threat-hunting alerts and devices not displaying correctly.
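A HAR trace is a JSON document whose `log.entries` array holds one record per request, including the request URL, the response status, and the total elapsed time in milliseconds. As a minimal sketch of how such a trace might be triaged, the Python below flags failed or slow requests. The helper names, the 5-second threshold, and the sample URLs are hypothetical, and this is not the analysis Microsoft performed.

```python
import json
from collections import Counter


def load_har(path: str) -> dict:
    """Read a HAR file from disk; utf-8-sig tolerates a leading byte-order mark."""
    with open(path, encoding="utf-8-sig") as f:
        return json.load(f)


def summarize_har(har: dict, slow_ms: float = 5000.0) -> dict:
    """Count responses by status and list requests that failed or were slow."""
    entries = har.get("log", {}).get("entries", [])
    status_counts = Counter()
    problems = []
    for entry in entries:
        status = entry.get("response", {}).get("status", 0)
        elapsed_ms = entry.get("time", 0) or 0
        url = entry.get("request", {}).get("url", "")
        status_counts[status] += 1
        # Status 0 usually means the browser received no response at all.
        if status == 0 or status >= 400 or elapsed_ms >= slow_ms:
            problems.append({"url": url, "status": status, "time_ms": elapsed_ms})
    return {
        "total": len(entries),
        "status_counts": dict(status_counts),
        "problems": problems,
    }


# Hypothetical in-memory trace standing in for a real capture.
sample = {"log": {"entries": [
    {"request": {"url": "https://portal.example.invalid/api/alerts"},
     "response": {"status": 503}, "time": 1200.0},
    {"request": {"url": "https://portal.example.invalid/api/devices"},
     "response": {"status": 200}, "time": 8000.0},
    {"request": {"url": "https://portal.example.invalid/static/app.js"},
     "response": {"status": 200}, "time": 90.0},
]}}

summary = summarize_har(sample)
for problem in summary["problems"]:
    print(problem["status"], problem["time_ms"], problem["url"])
```

For a real capture, `summarize_har(load_har("trace.har"))` would apply the same checks to a file exported from the browser's developer tools. HAR files can contain cookies and tokens, so they should be reviewed before being shared.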
Ultimately, Microsoft mitigated the incident for all affected customers. The incident was closed after the remaining impacted organizations confirmed that service was fully restored and monitoring telemetry showed that the service had remained stable for an extended period. Microsoft is scheduled to provide a preliminary Post-Incident Report, followed by a final Post-Incident Report, within a few business days, detailing the incident, its resolution, and lessons learned to help prevent a recurrence.