On 29 May 2026, a severe thunderstorm rolled through the western United States and struck Microsoft’s West US 2 cloud region. The result was a service disruption that started at 04:27 UTC and was not fully resolved until 19:34 UTC – just over fifteen hours of degraded or unavailable services across one of Azure’s primary American regions.
What happened
Lightning strikes caused a loss of utility power across multiple datacentre buildings simultaneously. That breadth is important: Azure datacentres are designed with redundant power feeds and backup generation precisely to survive single points of failure. What the West US 2 event demonstrated is that a simultaneous loss of utility power across multiple buildings can overwhelm those redundant systems.
Backup generators activated as designed when utility power dropped. The problem emerged in what came next. During the transition to sustained generator operation, a subset of generator systems failed to fully synchronise under the sudden facility-wide load. Others subsequently shut down due to thermal protection mechanisms as cooling systems were disrupted by the broader power failure. Cooling and generator capacity are tightly coupled: when cooling goes down, generator thermal protections trigger shutdowns, which reduces the power available to cooling and to everything else.
The cascade left Microsoft’s engineering teams unable to rely on the generators that were supposed to bridge the gap until utility power was restored. Initial signs of power restoration came at 10:00 UTC. Full restoration was confirmed at 18:15 UTC, with the last affected services coming back online just after 19:30 UTC.
What was affected
The list of impacted services covers a significant portion of a typical production architecture:
- Azure Kubernetes Service
- Azure Functions
- Azure SQL
- Azure Database for MySQL Flexible Server
- Azure Database for PostgreSQL Flexible Server
- Azure Databricks
- Redis Cache
- Azure Managed Grafana
- Virtual Machines and Virtual Machine Scale Sets
- Storage
- Application Insights
Any application running primarily in West US 2 and relying on these services would have faced connectivity failures, timeouts, and elevated error rates for the majority of the day. For organisations operating production workloads in a single region, the impact would have ranged from severe degradation to complete unavailability.
Why this keeps happening
Power-related cloud outages are not new. AWS US-EAST-1 had a thermal event in 2026 that disrupted multiple services. Azure has had prior power incidents in West US and East US. Google Cloud has had similar events. The pattern is consistent: physical infrastructure fails, backup systems partially compensate, and customers absorb the remaining impact.
What makes the West US 2 incident particularly instructive is the generator synchronisation failure. Organisations often reason that because their cloud provider runs datacentres with diesel backup generation, their workloads are safe from utility disruptions. The West US 2 incident is evidence that backup power systems can themselves fail under the conditions they exist to handle.
No amount of software-level redundancy inside a single region – load balancers, health checks, auto-scaling groups – protects against a situation where the physical infrastructure underpinning that region is offline.
What multi-region architecture actually requires
The standard recommendation is also the correct one: critical production workloads should be distributed across at least two regions, with traffic routing capable of failing over automatically when one region becomes degraded. That sounds straightforward but has real implementation complexity:
Data replication. Databases that are not actively replicated across regions will not be available in a failover scenario. Azure SQL active geo-replication and PostgreSQL read replicas in secondary regions need to be configured and tested before an incident, not during one.
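One way to make "tested before an incident" concrete is an automated check that the secondary's replication lag stays within the recovery point objective (RPO). The following is a minimal sketch, not a definitive implementation: the function names, the timestamps, and the 30-second RPO are assumptions for illustration, and reading the real timestamps from the database's replication status is not shown.

```python
from datetime import datetime, timedelta, timezone


def replication_lag(primary_commit: datetime, replica_replay: datetime) -> timedelta:
    """Return how far the replica trails the primary (never negative)."""
    return max(primary_commit - replica_replay, timedelta(0))


def within_rpo(lag: timedelta, rpo: timedelta) -> bool:
    """True if promoting the replica now would lose at most `rpo` of writes."""
    return lag <= rpo


if __name__ == "__main__":
    # Hypothetical timestamps; in practice these would come from the
    # database's own replication status views.
    primary = datetime(2026, 5, 29, 4, 27, 0, tzinfo=timezone.utc)
    replica = datetime(2026, 5, 29, 4, 26, 48, tzinfo=timezone.utc)
    lag = replication_lag(primary, replica)
    rpo = timedelta(seconds=30)  # assumed objective, not a recommendation
    print(f"lag={lag.total_seconds()}s within_rpo={within_rpo(lag, rpo)}")
```

Run on a schedule, a check like this turns a silent replication problem into an alert long before a failover depends on it.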
State management. Stateful services – sessions, queues, caches – need cross-region equivalents or graceful degradation paths. Redis data that exists only in West US 2 is unavailable when West US 2 is unavailable.
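A graceful degradation path for cached state can be sketched as a read that tries each regional cache in turn and falls back to the source of truth when none responds. Everything named here is hypothetical: `DictCache` is an in-memory stand-in rather than a real Redis client, and `CacheUnavailable` represents whatever error a real client raises when its region is unreachable.

```python
from typing import Callable, Optional, Sequence


class CacheUnavailable(Exception):
    """Raised by a cache client when its region cannot be reached."""


class DictCache:
    """In-memory stand-in for a regional cache client (hypothetical)."""

    def __init__(self, data: dict, available: bool = True) -> None:
        self.data = data
        self.available = available

    def get(self, key: str) -> Optional[str]:
        if not self.available:
            raise CacheUnavailable(key)
        return self.data.get(key)


def read_with_fallback(
    key: str,
    caches: Sequence[DictCache],
    load_from_source: Callable[[str], str],
) -> str:
    """Try each regional cache in order, then fall back to the source of truth."""
    for cache in caches:
        try:
            value = cache.get(key)
        except CacheUnavailable:
            continue  # this region is down; try the next cache
        if value is not None:
            return value
    # Every cache missed or was unreachable: degrade to the slower source.
    return load_from_source(key)


if __name__ == "__main__":
    primary = DictCache({"session:1": "alice"}, available=False)  # region offline
    secondary = DictCache({"session:1": "alice"})
    print(read_with_fallback("session:1", [primary, secondary], lambda k: "from-db"))
```

The trade-off in this design is latency for availability: requests get slower when the primary cache is gone, but they still succeed.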
DNS and routing. Failover only works if DNS and load balancing are configured to detect regional failure and route traffic. Azure Traffic Manager and Azure Front Door provide this capability, but the configuration needs to be tested against realistic failure scenarios, not just assumed to work.
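The decision logic behind priority-based failover can be modelled in a few lines, which also makes it easy to test against simulated regional failures. This sketch is not the Traffic Manager or Front Door API: the `Endpoint` class, the `tolerated_failures` threshold, and the endpoint names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Endpoint:
    name: str
    priority: int            # lower value = preferred
    failed_probes: int = 0   # consecutive failed health probes

    def record_probe(self, healthy: bool) -> None:
        """Reset the failure streak on success, extend it on failure."""
        self.failed_probes = 0 if healthy else self.failed_probes + 1


def pick_endpoint(
    endpoints: Sequence[Endpoint], tolerated_failures: int = 3
) -> Optional[Endpoint]:
    """Return the preferred endpoint whose failure streak is still tolerated."""
    healthy = [e for e in endpoints if e.failed_probes <= tolerated_failures]
    if not healthy:
        return None  # every region is considered down
    return min(healthy, key=lambda e: e.priority)


if __name__ == "__main__":
    primary = Endpoint("app-westus2", priority=1)     # hypothetical names
    secondary = Endpoint("app-westus3", priority=2)
    for _ in range(4):                                # simulate a regional outage
        primary.record_probe(healthy=False)
    chosen = pick_endpoint([primary, secondary])
    print(chosen.name if chosen else "no healthy endpoint")
```

Tolerating a few failed probes before failing over avoids flapping on a single dropped health check, at the cost of a slightly slower reaction to a real outage.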
Cost. Running active infrastructure in two regions costs more than running it in one. The calculation is whether the cost of that redundancy is less than the business impact of a fifteen-hour outage.
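That comparison can be written down as a simple expected-loss estimate: outages per year, multiplied by hours per outage, multiplied by the business loss per hour, set against the yearly cost of the second region. The sketch below uses entirely hypothetical figures; only the fifteen-hour duration comes from this incident.

```python
def expected_annual_outage_cost(
    outages_per_year: float, hours_per_outage: float, loss_per_hour: float
) -> float:
    """Expected yearly loss from regional outages of the given size."""
    return outages_per_year * hours_per_outage * loss_per_hour


def redundancy_pays_off(annual_redundancy_cost: float, expected_loss: float) -> bool:
    """True if the second region costs less than the loss it is meant to avoid."""
    return annual_redundancy_cost < expected_loss


if __name__ == "__main__":
    # Assumed inputs: one fifteen-hour regional outage every five years,
    # 20,000 per hour of lost business, 48,000 per year for a second region.
    loss = expected_annual_outage_cost(
        outages_per_year=0.2, hours_per_outage=15, loss_per_hour=20_000
    )
    print(f"expected annual loss: {loss:,.0f}")
    print(f"redundancy pays off: {redundancy_pays_off(48_000, loss)}")
```

The model is deliberately crude. It ignores reputational damage, contractual penalties, and partial degradation, all of which tend to push the real cost of an outage higher.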
For many organisations, the honest answer is that they have not run that calculation, and their single-region deployments exist because nobody has formally assessed the risk.
If you want to review your Azure or multi-cloud architecture for regional resilience gaps – whether that means designing an active-active setup, configuring proper failover, or just understanding what your current single-region exposure looks like – contact Excello Digital. We help engineering teams build infrastructure that survives the kind of failure that happened in West US 2 yesterday.
