When the Foundation Shakes – What the Recent Cloud Outages Reveal About Infrastructure

24 Nov 2025 · infrastructure, cloud, ha, architecture, strategy, core2code

In recent weeks, several major outages at AWS, Azure and Cloudflare have disrupted critical internet services, including identity platforms, storage systems, global routing layers and content delivery infrastructure. These incidents demonstrate the vulnerability of modern digital ecosystems, even when they are supported by the world's largest cloud providers.

Rather than focusing solely on the symptoms, we should examine the strategic implications. Who is responsible for availability? How should redundancy be planned? And why are hyperscalers neither saviours nor scapegoats?

High availability always comes at a cost

The fundamental truth is this: if you want high availability, you have to pay for it. Availability is the result of redundancy, and redundancy means maintaining a complete second infrastructure. Whether it runs in the public cloud or on your own premises, failover capacity effectively doubles the cost.

Cloud computing doesn’t eliminate these costs. It simply changes how they are billed. It does not replace responsibility. It only shifts it.
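To put rough numbers on what redundancy buys, here is a minimal sketch of the standard availability arithmetic. It assumes that the redundant sites fail independently, which real outages often violate, and the 99.5% figure in the note below is purely illustrative rather than taken from any provider's SLA. The function names are my own.

```python
HOURS_PER_YEAR = 24 * 365


def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant sites, assuming independent failures."""
    # The service is down only when every copy is down at the same time.
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected yearly downtime implied by an availability fraction."""
    return (1.0 - availability) * HOURS_PER_YEAR
```

Under these assumptions, a single site at 99.5% implies about 43.8 hours of downtime per year, while a second identical site reduces that to roughly 13 minutes. That improvement is what the additional infrastructure bill pays for.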

Strategy First: Define RPO and RTO before building architecture

The most important step is not technical. It’s strategic and business-driven.

How much data loss is acceptable (RPO)?
How quickly must a service recover (RTO)?
Which business processes depend on it?
What is the financial impact of downtime?

Without clear answers to these questions, architectural decisions will inevitably be either too expensive or too risky.
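One way to keep these answers from staying abstract is to write them down as data and check a proposed design against them. The following is a sketch only: the class names, fields and the flat hourly cost model are assumptions for illustration, not a standard API.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rpo: timedelta  # maximum tolerable data loss, expressed as time
    rto: timedelta  # maximum tolerable time to restore the service


@dataclass(frozen=True)
class RecoveryDesign:
    backup_interval: timedelta   # worst-case data loss equals this interval
    measured_restore: timedelta  # restore time observed in a recovery drill


def gaps(objectives: RecoveryObjectives, design: RecoveryDesign) -> list[str]:
    """Return human-readable violations; an empty list means the design fits."""
    problems = []
    if design.backup_interval > objectives.rpo:
        problems.append("RPO violated: backups are too infrequent")
    if design.measured_restore > objectives.rto:
        problems.append("RTO violated: restore takes too long")
    return problems


def downtime_cost(cost_per_hour: float, outage: timedelta) -> float:
    """Financial impact of an outage at a flat hourly cost (a simplification)."""
    return cost_per_hour * outage.total_seconds() / 3600
```

For example, a service with an RPO of 15 minutes and an RTO of one hour that relies on nightly backups would fail the RPO check even if a restore takes only 30 minutes. The design choice here is deliberate: the objectives come from the business, and the architecture is tested against them, not the other way round.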

Availability is a strategic decision. Architecture is how that decision is realised.

Recent outages show that scale does not prevent failure

The recent outages experienced by major cloud providers highlight the fragility of even the most automated global systems.

AWS outage on 20 October 2025

An empty DNS record produced by the automation that manages DynamoDB's DNS entries triggered a cascading failure. Load balancers, DNS resolution and other base services were affected. It was a small error with a significant impact, which is typical of highly automated environments.

Azure outage on 29 October 2025

A misconfiguration in the global routing layer, specifically Azure Front Door and the CDN infrastructure, caused a worldwide outage that also affected Microsoft 365. Multiple redundancy layers were ineffective because the shared global routing path was itself a single point of failure.
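Why redundancy behind a shared component helps so little can be shown with simple series and parallel availability arithmetic. The sketch below assumes independent failures and uses hypothetical figures (two regions at 99.9% each, one global routing layer at 99.95%); none of these numbers describe Azure's actual architecture.

```python
from math import prod


def series_availability(components: list[float]) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components)


def parallel_availability(replicas: list[float]) -> float:
    """Availability of redundant replicas, assuming independent failures."""
    return 1.0 - prod(1.0 - a for a in replicas)


# Hypothetical figures: two redundant regions behind one shared routing layer.
regions = parallel_availability([0.999, 0.999])
end_to_end = series_availability([0.9995, regions])
```

However many regions are added, the end-to-end result can never exceed the availability of the shared routing layer that sits in front of them.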

Cloudflare outage on 18 November 2025

An oversized configuration file in the Bot Management engine triggered software faults across Cloudflare’s infrastructure, taking down numerous major websites, including X and ChatGPT. The more central an infrastructure provider becomes, the wider the blast radius when things go wrong.
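One common defence against this class of failure is to validate a newly generated configuration against hard limits before activating it, and to keep serving the last known-good version if validation fails. The sketch below illustrates the idea only. The limit of 200 entries, the JSON list format and all function names are assumptions for illustration, not Cloudflare's implementation.

```python
import json

MAX_ENTRIES = 200  # hypothetical hard limit the consumer was sized for


class ConfigRejected(Exception):
    """Raised when a candidate configuration fails validation."""


def validate(raw: str) -> list:
    """Parse a JSON list of entries and enforce basic sanity limits."""
    try:
        entries = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ConfigRejected(f"not valid JSON: {exc}") from exc
    if not isinstance(entries, list) or not entries:
        raise ConfigRejected("expected a non-empty list")
    if len(entries) > MAX_ENTRIES:
        raise ConfigRejected(f"{len(entries)} entries exceeds limit {MAX_ENTRIES}")
    return entries


def activate(candidate: str, last_known_good: list) -> list:
    """Return the config to serve: the candidate if valid, else the previous one."""
    try:
        return validate(candidate)
    except ConfigRejected:
        # Fail safe: keep serving the old configuration instead of crashing.
        return last_known_good
```

The important property is that a bad input degrades to stale behaviour instead of taking the consumer down. In a real system the rejection would also raise an alert, since silently running on old configuration is a risk of its own.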

These incidents demonstrate that hyperscalers don’t just scale availability — they also scale failure.

A Hybrid Approach: Operate the baseline yourself, burst into the cloud

Once RPO and RTO have been defined, architectures can be designed to balance risk and cost. One proven model is to operate the stable baseline workload in-house and use the cloud only for elastic peak load.

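This reduces dependency on any single provider, avoids overprovisioning in-house capacity for rare peaks, and keeps the baseline running even when the cloud provider has an outage.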
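The baseline-plus-burst model comes down to a simple split of demand. The following sketch shows that split and how to estimate what share of total load ends up in the cloud; the function names and the demand figures in the note are hypothetical, and real capacity planning would also account for headroom and failover.

```python
def split_load(demand: float, baseline_capacity: float) -> tuple[float, float]:
    """Split demand into (served in-house, burst to cloud)."""
    if demand < 0 or baseline_capacity < 0:
        raise ValueError("demand and capacity must be non-negative")
    in_house = min(demand, baseline_capacity)
    return in_house, demand - in_house


def cloud_share(hourly_demand: list[float], baseline_capacity: float) -> float:
    """Fraction of total demand that ends up in the cloud."""
    total = sum(hourly_demand)
    if total == 0:
        return 0.0
    burst = sum(split_load(d, baseline_capacity)[1] for d in hourly_demand)
    return burst / total
```

With an in-house capacity of 100 units and hourly demand of 80, 100, 150 and 70 units, only the 50 units above the baseline in the third hour go to the cloud, which is 12.5% of the total. Running this over historical demand is one way to decide where to set the baseline.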

While having one provider is convenient, monoculture remains risky

Many companies appreciate the simplicity of working with one infrastructure provider. One contract, one support channel, no vendor ping-pong.

This works — until it doesn't. Developments at a major virtualisation vendor in recent years have shown how quickly strategic dependence can become a liability.

Consolidation is good. Monoculture, however, is dangerous. Finding the right balance is critical.

Availability is a process, not a product

Neither cloud providers nor internal IT teams are infallible. Availability is created through:

clear requirements and objectives
robust, pragmatic architecture
disciplined operations
organisational culture
continuous improvement

A tool cannot produce availability. Architecture enables it. Operations preserve it.

My own perspective, critically examined

This perspective also requires scrutiny:

Operating baseline workloads yourself only works if the necessary expertise and operational discipline are in place.
Redundancy does not necessarily double the cost; some cloud-native architectures provide efficient redundancy models.
Multi-cloud can reduce risk, but poor integration can dramatically increase it.
Consolidation improves efficiency, but dependency can become a strategic liability.
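The point that redundancy need not double the cost can be illustrated with N+1 sizing: if the workload is spread across several equal units and only one spare is kept, the overhead shrinks as the number of units grows. This is a sketch under simplifying assumptions (equal-sized units, spares sized to cover a fixed number of simultaneous failures), and the function name is my own.

```python
def redundancy_overhead(needed_units: int, spare_units: int) -> float:
    """Extra capacity as a fraction of the capacity actually required."""
    if needed_units <= 0 or spare_units < 0:
        raise ValueError("needed_units must be positive, spare_units non-negative")
    return spare_units / needed_units
```

A full standby for a single site (1+1) gives an overhead of 1.0, meaning the cost doubles. Four units with one spare (4+1) give 0.25, or 25% extra, provided the workload can actually be partitioned that way.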

The right architecture always depends on context — business priorities, organisational culture, team capabilities, regulatory requirements and risk appetite.

In the end, the core principle of spotlight core²code applies: Strategy to life – Architecture to code.

Sources

AWS outage, 20 October 2025. The Guardian: “Amazon reveals cause of AWS outage”. https://www.theguardian.com/technology/2025/oct/24/amazon-reveals-cause-of-aws-outage
Azure outage, 29 October 2025. Wursta: “When Azure Went Dark: Why Last Week’s Microsoft Outage Is a Wake-Up Call for IT Leaders”. https://wursta.com/when-azure-went-dark-why-last-weeks-microsoft-outage-is-a-wake-up-call-for-it-leaders
Cloudflare outage, 18 November 2025. AP News: “Cloudflare outage disrupts major websites including X and ChatGPT”. https://apnews.com/article/9335e8e0da2a0027d1fbac5eb97d11ae
