When the Foundation Shakes – What the Recent Cloud Outages Reveal About Infrastructure
24 Nov 2025 · infrastructure, cloud, ha, architecture, strategy, core2code
In recent weeks, several major outages at AWS, Azure and Cloudflare have disrupted critical internet services, including identity platforms, storage systems, global routing layers and content delivery infrastructure. These incidents demonstrate the vulnerability of modern digital ecosystems, even when they are supported by the world's largest cloud providers.
Rather than focusing solely on the symptoms, we should examine the strategic implications. Who is responsible for availability? How should redundancy be planned? And why are hyperscalers neither saviours nor scapegoats?
High availability always comes at a cost
The fundamental truth is this: if you want high availability, you have to pay for it. Availability is the result of redundancy, and redundancy means maintaining a complete second infrastructure. Whether it runs in the public cloud or on your own premises, failover capacity effectively doubles the cost.
Cloud computing doesn’t eliminate these costs. It simply changes how they are billed. It does not replace responsibility. It only shifts it.
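To put rough numbers on what redundancy buys, here is a minimal sketch. It assumes that redundant instances fail independently, which the outages discussed in this article show is not guaranteed. The 99% figure and the function names are illustrative assumptions, not measurements.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant instances where any one is enough.

    Assumes failures are independent, which is an idealisation.
    """
    return 1.0 - (1.0 - single) ** copies


def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year for a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


single = 0.99  # hypothetical availability of one standalone instance
for copies in (1, 2):
    a = combined_availability(single, copies)
    print(f"{copies} instance(s): availability {a:.4%}, "
          f"about {downtime_hours_per_year(a):.2f} h downtime per year")
```

Under these assumptions, the second instance takes the expected downtime from roughly 88 hours to under one hour per year. That is the benefit the additional spend pays for.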
Strategy First: Define RPO and RTO before building architecture
The most important step is not technical. It’s strategic and business-driven.
- How much data loss is acceptable (RPO)?
- How quickly must a service recover (RTO)?
- Which business processes depend on it?
- What is the financial impact of downtime?
Without clear answers to these questions, architectural decisions will inevitably be either too expensive or too risky.
Availability is a strategic decision. Architecture is how that decision is realised.
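Once the questions above have answers, candidate designs can be checked against them mechanically. The following sketch is one way to do that. The design names, intervals, recovery times and costs are all hypothetical, and worst-case data loss is approximated by the interval between backups or replication runs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Objectives:
    rpo_minutes: float  # maximum tolerable data loss
    rto_minutes: float  # maximum tolerable time to recover


@dataclass(frozen=True)
class Design:
    name: str
    backup_interval_minutes: float  # approximates worst-case data loss
    recovery_minutes: float         # estimated time to restore service
    monthly_cost: float


def meets(design: Design, objectives: Objectives) -> bool:
    """True if the design satisfies both the RPO and the RTO."""
    return (design.backup_interval_minutes <= objectives.rpo_minutes
            and design.recovery_minutes <= objectives.rto_minutes)


def cheapest_compliant(designs: Sequence[Design],
                       objectives: Objectives) -> Optional[Design]:
    """Pick the lowest-cost design that meets the objectives, if any."""
    compliant = [d for d in designs if meets(d, objectives)]
    return min(compliant, key=lambda d: d.monthly_cost) if compliant else None


# Hypothetical options, from cheap and slow to expensive and fast.
designs = [
    Design("nightly backup", 1440, 480, 1_000),
    Design("warm standby", 60, 30, 4_000),
    Design("active-active", 0, 1, 9_000),
]
objectives = Objectives(rpo_minutes=60, rto_minutes=60)
chosen = cheapest_compliant(designs, objectives)
print(chosen.name if chosen else "no design meets the objectives")
```

With these invented figures the nightly backup is too risky and the active-active setup is more than the objectives require, so the warm standby is selected.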
Recent outages show that scale does not prevent failure
The recent outages experienced by major cloud providers highlight the fragility of even the most automated global systems.
AWS outage on 20 October 2025
A faulty, empty DNS record produced by the automation that manages DynamoDB's endpoints triggered a cascading failure. Load balancers, DNS resolution and other base services were affected. It was a small error with a significant impact, which is typical of highly automated environments.
Azure outage on 29 October 2025
A misconfiguration in the global routing layer (specifically Azure Front Door and the CDN infrastructure) caused a worldwide outage that also affected Microsoft 365. Multiple redundancy layers were ineffective because they all sat behind one shared global routing path, which was itself a single point of failure.
Cloudflare outage on 18 November 2025
An oversized configuration file in the Bot Management engine triggered software faults across Cloudflare’s infrastructure, taking down numerous major websites, including X and ChatGPT. The more central an infrastructure provider becomes, the wider the blast radius when things go wrong.
These incidents demonstrate that hyperscalers don’t just scale availability — they also scale failure.
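The effect of a shared layer in front of redundant systems can be illustrated with simple availability arithmetic. This is a sketch under the assumption of independent failures, and every number in it is hypothetical. Components that must all work are multiplied together (serial), while redundant components fail only if all of them fail (parallel).

```python
from typing import Iterable


def parallel(availabilities: Iterable[float]) -> float:
    """Redundant components: the group is down only if all are down."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= (1.0 - a)
    return 1.0 - unavailable


def serial(availabilities: Iterable[float]) -> float:
    """Chained components: the chain is up only if all are up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


# Hypothetical figures: three redundant regions behind one routing layer.
backends = parallel([0.999, 0.999, 0.999])
routing = 0.9995
end_to_end = serial([routing, backends])

print(f"redundant backends: {backends:.7%}")
print(f"end to end:         {end_to_end:.7%}")
```

However many regions are added, the end-to-end figure in this model cannot exceed the availability of the single routing layer they all depend on.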
A Hybrid Approach: Operate the baseline yourself, burst into the cloud
Once RPO and RTO have been defined, architectures can be designed to balance risk and cost. One proven model is to operate the stable baseline workload in-house and use the cloud only for elastic peak load.
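This reduces dependency on a single provider, avoids overprovisioning in-house capacity for rare peaks, and keeps the baseline workload running even when the cloud provider has an outage.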
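A rough cost comparison shows why this model can pay off. The sketch below assumes that in-house capacity is paid for every hour whether it is used or not, and that cloud capacity is paid only for the load above the baseline. The load profile and both unit prices are invented for illustration.

```python
from typing import Sequence


def hybrid_cost(hourly_load: Sequence[float],
                baseline_capacity: float,
                own_cost_per_unit_hour: float,
                cloud_cost_per_unit_hour: float) -> float:
    """Total cost of serving `hourly_load` with a fixed in-house baseline
    and cloud capacity for everything above it."""
    own = baseline_capacity * own_cost_per_unit_hour * len(hourly_load)
    overflow = sum(max(0, load - baseline_capacity) for load in hourly_load)
    return own + overflow * cloud_cost_per_unit_hour


# Hypothetical day: 20 hours at 40 units, 4 peak hours at 100 units.
load = [40] * 20 + [100] * 4
OWN, CLOUD = 1.0, 3.0  # assumed cost per unit-hour

for label, baseline in [("cloud only", 0),
                        ("hybrid", 40),
                        ("in-house sized for peak", 100)]:
    print(f"{label:>24}: {hybrid_cost(load, baseline, OWN, CLOUD):.0f}")
```

With these assumed prices the hybrid option is the cheapest of the three. The result shifts with the price ratio and the shape of the load curve, so the calculation should be repeated with real figures.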
While having one provider is convenient, monoculture remains risky
Many companies appreciate the simplicity of working with one infrastructure provider. One contract, one support channel, no vendor ping-pong.
This works — until it doesn't. Developments at a major virtualisation vendor in recent years have shown how quickly strategic dependence can become a liability.
Consolidation is good. Monoculture, however, is dangerous. Finding the right balance is critical.
Availability is a process, not a product
Neither cloud providers nor internal IT teams are infallible. Availability is created through:
- clear requirements and objectives
- robust, pragmatic architecture
- disciplined operations
- organisational culture
- continuous improvement
A tool cannot produce availability. Architecture enables it. Operations preserve it.
My own perspective, critically examined
This perspective also requires scrutiny:
- Operating baseline workloads yourself only works if the necessary expertise and operational discipline are in place.
- Redundancy does not necessarily double the cost; some cloud-native architectures provide efficient redundancy models.
- Multi-cloud can reduce risk, but poor integration can dramatically increase it.
- Consolidation improves efficiency, but dependency can become a strategic liability.
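The point about redundancy cost can be made concrete by comparing a full second copy (often called 2N) with a single spare for a pool of nodes (N+1). This is a simplified sketch with a hypothetical load and node size. It ignores that N+1 tolerates only one node failure and does not protect against losing an entire site.

```python
import math


def nodes_needed(load: float, node_capacity: float) -> int:
    """Number of active nodes required to carry the load."""
    return math.ceil(load / node_capacity)


def redundancy_overhead(active_nodes: int, spare_nodes: int) -> float:
    """Extra capacity as a fraction of the active capacity."""
    return spare_nodes / active_nodes


active = nodes_needed(load=400, node_capacity=100)  # hypothetical sizing

full_mirror = redundancy_overhead(active, spare_nodes=active)  # 2N
one_spare = redundancy_overhead(active, spare_nodes=1)         # N+1

print(f"2N overhead:  {full_mirror:.0%}")
print(f"N+1 overhead: {one_spare:.0%}")
```

For four active nodes, a full mirror doubles the capacity to be paid for, while one spare adds a quarter. Which of the two is acceptable follows from the RPO and RTO, not from the cost alone.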
The right architecture always depends on context — business priorities, organisational culture, team capabilities, regulatory requirements and risk appetite.
In the end, the core principle of spotlight core²code applies: Strategy to life – Architecture to code.
Sources
- AWS Outage – October 20, 2025. The Guardian: “Amazon reveals cause of AWS outage” https://www.theguardian.com/technology/2025/oct/24/amazon-reveals-cause-of-aws-outage
- Azure Outage – October 29, 2025. Wursta: “When Azure Went Dark: Why Last Week’s Microsoft Outage Is a Wake-Up Call for IT Leaders” https://wursta.com/when-azure-went-dark-why-last-weeks-microsoft-outage-is-a-wake-up-call-for-it-leaders
- Cloudflare Outage – November 18, 2025. AP News: “Cloudflare outage disrupts major websites including X and ChatGPT” https://apnews.com/article/9335e8e0da2a0027d1fbac5eb97d11ae