AWS ‘Thermal Event’ in Northern Virginia Took Down a Full Availability Zone and Disrupted Coinbase, FanDuel, and CME Group for Hours

Source: IT Pro

Amazon Web Services confirmed that a cooling system shortfall at one of its Northern Virginia facilities caused elevated temperatures inside a data centre, forcing the company to throttle and reroute traffic away from Availability Zone use1-az4 in the us-east-1 region. EC2 instances and EBS volumes hosted on the affected hardware lost power during what AWS described as a thermal event. Recovery took longer than initially expected as the team worked to bring additional cooling capacity online safely before restoring the remaining infrastructure.
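To check whether your own resources sat in the affected zone, note that the identifier above is an AZ ID, not an AZ name. In us-east-1, AZ names such as us-east-1b are mapped to physical zones independently per account, while AZ IDs such as use1-az4 refer to the same zone for everyone. The sketch below shows the lookup on a hand-written sample. The function name and the sample mapping are hypothetical; in a real account the list would come from the EC2 DescribeAvailabilityZones call, for example via boto3.

```python
from typing import Iterable, Mapping, Optional


def zone_name_for_id(zones: Iterable[Mapping[str, str]], zone_id: str) -> Optional[str]:
    """Return this account's zone name for a given zone ID, or None if absent."""
    for zone in zones:
        if zone.get("ZoneId") == zone_id:
            return zone.get("ZoneName")
    return None


# Hypothetical mapping for one account. Real values would come from:
#   boto3.client("ec2", region_name="us-east-1")
#        .describe_availability_zones()["AvailabilityZones"]
sample_zones = [
    {"ZoneName": "us-east-1a", "ZoneId": "use1-az6"},
    {"ZoneName": "us-east-1b", "ZoneId": "use1-az4"},
]

print(zone_name_for_id(sample_zones, "use1-az4"))  # prints us-east-1b
```

In another account the same zone ID may well resolve to a different zone name, which is why incident notices are best matched on the ID.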

The disruption hit Coinbase, which reported core exchange functions offline for more than five hours. FanDuel and the CME Group trading platform were also affected. For a cryptocurrency exchange that operates around the clock, a five-hour outage means significant financial and reputational exposure. For trading platforms more broadly, unavailability during market hours has direct commercial consequences.

us-east-1 is AWS’s oldest and most densely used region. It is also the region where a large number of organisations ran their first cloud workloads years ago and never moved them. Single-region and single-AZ architectures are common in environments that were built before multi-region resilience became a standard design requirement, or where the cost of redundancy was deferred.

The underlying cause is worth examining separately from the outage itself. Cooling failures in data centres are not exotic. As density increases with AI and GPU workloads, thermal management becomes a harder engineering problem. AWS and its peers are all expanding capacity rapidly, and the margin for error in thermal design at high density is narrow. This incident is unlikely to be the last of its kind across the industry.

The architecture lesson remains straightforward even if the infrastructure problem is complex: workloads that cannot tolerate a full AZ outage need to be distributed across at least two AZs, and workloads that cannot tolerate a regional outage need a secondary region with automated failover. Both add cost, but for a revenue-generating service that cost is usually small next to five hours of downtime.
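The two-AZ rule can be turned into a simple audit. The following is a minimal sketch, not a complete resilience check: it assumes you can export an inventory of (service, AZ ID) pairs from your own tooling, and the service names and function names are hypothetical. It reports which services run in a single AZ and which would disappear entirely if a given AZ failed.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

Inventory = Iterable[Tuple[str, str]]  # (service name, AZ ID) per instance


def zones_by_service(instances: Inventory) -> Dict[str, Set[str]]:
    """Group the AZ IDs each service currently runs in."""
    zones: Dict[str, Set[str]] = defaultdict(set)
    for service, zone_id in instances:
        zones[service].add(zone_id)
    return dict(zones)


def single_az_services(instances: Inventory) -> List[str]:
    """Services whose instances all sit in one AZ."""
    spread = zones_by_service(instances)
    return sorted(s for s, z in spread.items() if len(z) < 2)


def services_lost_if_zone_fails(instances: Inventory, failed_zone: str) -> List[str]:
    """Services with no instance outside the failed AZ."""
    spread = zones_by_service(instances)
    return sorted(s for s, z in spread.items() if z <= {failed_zone})


# Hypothetical inventory for illustration only.
inventory = [
    ("api", "use1-az4"),
    ("api", "use1-az6"),
    ("ledger", "use1-az4"),
    ("ledger", "use1-az4"),
    ("reports", "use1-az1"),
]

print(single_az_services(inventory))                       # prints ['ledger', 'reports']
print(services_lost_if_zone_fails(inventory, "use1-az4"))  # prints ['ledger']
```

This checks placement only. It says nothing about whether the surviving zones have enough capacity to absorb the load, or about regional failover, both of which need separate testing.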

If your business runs on AWS and you are not confident that your architecture would survive a full AZ or regional outage, contact Excello Digital. We will assess your current setup and design a resilience strategy that fits both your risk tolerance and your budget.
