Spotlight core²code

When One Failure Breaks Everything – Why Building Blocks Are the Foundation of Resilient Systems

09 Dec 2025 · architecture, infrastructure, resilience, cloud, strategy, core2code



The Cloudflare outage on December 5 once again revealed how fragile even the most advanced Internet infrastructure can be. A single configuration error left the services behind roughly a quarter of global HTTP traffic temporarily unreachable. Waves of 500 errors, broken dashboards, malfunctioning firewall logic: within minutes, large parts of the web were effectively offline.

Much has already been said about the specific bug. But from an architectural perspective, the real issue lies elsewhere: the problem was not the error itself, but the size of the unit in which it operated. The impact was so severe because the failure occurred on a globally shared layer — a so-called “building block” that, in reality, wasn’t one. It was a central component whose responsibilities were far too broad and far too interconnected.

This raises a fundamental question: How should modern systems be designed so that a single failure does not take down the entire platform?

The answer points directly to an architectural approach that is still far from standard in many organizations: a disciplined decomposition into independent building blocks with strict boundaries, minimal shared dependencies and fully isolated operational domains.


Building Blocks Are Not a Metaphor — They Are a Technical Requirement

In many architecture discussions, terms like “modules,” “services,” or “domains” are used loosely, often allowing for overlaps. From a resilience standpoint, that is insufficient.

A building block only deserves its name if it has real boundaries:

- its own deployment and release pipeline, so changes to one block never ship with another
- its own configuration, owned and distributed independently
- its own data and state, shared with no other block
- its own operational domain, including monitoring, alerting and on-call responsibility

Only when all of these elements come together does true isolation emerge: a failure remains contained within its own block.

A real-world example: Many providers — Cloudflare included — operate proxy layers responsible for handling all inbound traffic. Even if the services behind them are distributed and redundant, the proxy layer itself often behaves like a global monolith. A configuration error in that layer has immediate, widespread consequences — because it functions as a single global block when it should be divided into multiple independent units.

The core insight is clear: Stability is not created through redundancy but through correct system boundaries.


Why Overlapping Building Blocks Are Dangerous

Architectures often fail not because components individually malfunction, but because of their silent overlaps. These may take the form of shared configuration sets, shared state, shared CI/CD systems or shared observability pipelines.

As long as building blocks rely on — or are managed through — shared operational structures, they remain implicitly coupled. In day-to-day operation, this coupling is invisible. In failure scenarios, it becomes catastrophic.

Typical anti-patterns include:

- a single configuration service that pushes changes to every block at once
- one shared CI/CD pipeline through which all services are deployed
- a central observability stack that every block depends on to stay operable
- databases or state stores written to by more than one block

These shared operational surfaces create hidden coupling points. They remain unnoticed until the exact moment a failure spreads across them — producing a blast radius far larger than the originating fault.

This is why demanding operational and organizational separation of building blocks is not overengineering but fundamental architectural hygiene. A block that cannot be operated independently is not a block — it is a pseudo-module within a larger monolith.


Fault Domains, Cells and Bulkheads — The Theoretical Foundation

Fault Domains

A fault domain is a bounded area within which a failure is allowed to propagate. The architectural goal is to size these domains so that any single failure affects only a minimal part of the system. True building blocks map directly onto independent fault domains.

Bulkhead Pattern

Borrowed from marine engineering: ships are divided into sealed compartments so that a leak does not sink the entire vessel.

Applied to software:

- each block gets its own resource pools (threads, connections, queues)
- a saturated or failing dependency can exhaust only its own compartment
- callers fail fast instead of queueing behind a broken block
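
To make this concrete, here is a minimal Go sketch of a bulkhead. The dependency names, pool sizes and simulated calls are illustrative assumptions, not taken from any real system: each dependency is guarded by its own bounded slot pool, and callers fail fast when a compartment is full.

```go
// bulkhead.go: each downstream dependency gets its own fixed-size slot
// pool, so a slow or failing dependency can only exhaust its own slots,
// never the caller's entire capacity. Names and sizes are illustrative.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Bulkhead guards one dependency with a bounded pool of slots.
type Bulkhead struct {
	name  string
	slots chan struct{}
}

func NewBulkhead(name string, size int) *Bulkhead {
	return &Bulkhead{name: name, slots: make(chan struct{}, size)}
}

var ErrBulkheadFull = errors.New("bulkhead full")

// Do runs fn only if a slot is free; otherwise it fails fast, so the
// failure stays inside this compartment instead of stacking up callers.
func (b *Bulkhead) Do(ctx context.Context, fn func(context.Context) error) error {
	select {
	case b.slots <- struct{}{}:
		defer func() { <-b.slots }()
		return fn(ctx)
	default:
		return fmt.Errorf("%s: %w", b.name, ErrBulkheadFull)
	}
}

func main() {
	// Two separate compartments: a hanging payments dependency cannot
	// consume the slots reserved for the search dependency.
	payments := NewBulkhead("payments", 2)
	search := NewBulkhead("search", 2)

	slow := func(ctx context.Context) error { time.Sleep(time.Second); return nil }
	fast := func(ctx context.Context) error { return nil }

	for i := 0; i < 4; i++ {
		go payments.Do(context.Background(), slow) // saturates payments only
	}
	time.Sleep(10 * time.Millisecond)
	fmt.Println("payments:", payments.Do(context.Background(), fast)) // rejected fast
	fmt.Println("search:  ", search.Do(context.Background(), fast))   // unaffected
}
```

Failing fast rather than queueing is the essential design choice here: a queue in front of a broken dependency simply moves the outage into the caller.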

Cell-based Architecture

Hyperscalers such as AWS, and large platforms like Netflix, operate using “cells”: each cell is a fully isolated stack, including compute, network, data, routing and observability. Users or tenants are distributed across cells. If one cell fails, only a fraction of users are affected.
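
A minimal sketch of the routing idea, assuming hypothetical cell names: each tenant is pinned deterministically to a single cell, so a failing cell affects only the tenants that live in it.

```go
// cellrouter.go: deterministic tenant-to-cell routing. A tenant always
// lands in the same cell, so a cell failure touches only its own tenants.
package main

import (
	"fmt"
	"hash/fnv"
)

// Cells are fully isolated stacks; these names are placeholders.
var cells = []string{"cell-eu-1", "cell-eu-2", "cell-us-1", "cell-us-2"}

// cellFor maps a tenant ID to exactly one cell via a stable hash.
// Stability matters: re-routing tenants on every lookup would smear
// a single tenant's state across fault domains.
func cellFor(tenantID string) string {
	h := fnv.New32a()
	h.Write([]byte(tenantID))
	return cells[h.Sum32()%uint32(len(cells))]
}

func main() {
	for _, t := range []string{"acme", "globex", "initech"} {
		fmt.Printf("%-8s -> %s\n", t, cellFor(t))
	}
}
```

Real systems add capacity-aware placement and controlled tenant migration on top, but the property that matters for resilience is that the mapping is stable and strictly partitioned.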

These are not theoretical exercises — they are proven strategies in organizations that demand extremely high availability.


Why Separation Must Include the Management Layer

Building blocks must not overlap — not even in management or dependent services.

This implies:

- each block or cell has its own control plane and its own configuration store
- configuration changes roll out in stages, never globally in a single step
- management tooling for one block has no ability to touch another
- even observability and deployment systems are partitioned along block boundaries

The Cloudflare outage makes this painfully clear. The failure did not originate in the data plane but in the configuration distribution layer. The faulty configuration was deployed globally, which effectively meant that the management plane was a single fault domain.

When the operational control layer of a global system is built as one monolithic unit, the distribution of the underlying services becomes irrelevant. A distributed system with a centralized control plane is not a distributed system. It is a global monolith running on many machines.
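
What the alternative looks like in practice is staged rather than global rollout. The following sketch, with hypothetical cell names and a placeholder health check, advances a configuration change wave by wave and halts at the first unhealthy cell, capping the blast radius of a bad config.

```go
// rollout.go: staged configuration rollout. Instead of pushing a change
// to every cell at once, it advances wave by wave and stops at the first
// unhealthy cell. Cell names and hooks are illustrative placeholders.
package main

import (
	"errors"
	"fmt"
)

// Hypothetical rollout order: smallest fault domain first.
var waves = [][]string{
	{"canary-cell"},
	{"cell-eu-1", "cell-eu-2"},
	{"cell-us-1", "cell-us-2"},
}

// apply stands in for a real deployment hook.
func apply(cell, config string) error {
	fmt.Println("applied", config, "to", cell)
	return nil
}

// healthy stands in for real monitoring; here it simulates a
// regression surfacing in one cell.
func healthy(cell string) bool { return cell != "cell-us-1" }

func rollout(config string) error {
	for i, wave := range waves {
		for _, cell := range wave {
			if err := apply(cell, config); err != nil {
				return fmt.Errorf("wave %d, cell %s: %w", i, cell, err)
			}
			if !healthy(cell) {
				// Halt: the bad config never reaches the remaining cells.
				return errors.New("halting rollout, unhealthy cell: " + cell)
			}
		}
	}
	return nil
}

func main() {
	if err := rollout("firewall-rules-v42"); err != nil {
		fmt.Println("rollout stopped:", err)
	}
}
```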


How Realistic Is a System Without Overlapping Blocks?

Technically, it is absolutely realistic. Economically, it requires prioritization.

The architectural challenge is not to isolate everything maximally, but to identify the right blocks and apply the right degree of separation.


Architectural Principles We Can Draw from This Incident

  1. Centralization is the biggest risk in modern platforms. Outages become severe not because components fail, but because control planes do.

  2. The size of a fault domain determines the impact of a failure. Smaller, well-defined building blocks lead to smaller outages.

  3. Independence has multiple dimensions. Technical separation without organizational separation is ineffective. Organizational separation without operational independence is equally insufficient.

  4. “Shared-Nothing” is not a purist ideal — it is a resilience principle. Duplication of critical elements costs far less than a global outage.

  5. Global configuration is more dangerous than global code. Config errors move faster, remain unnoticed longer and often take effect immediately. The sketch below shows one way to gate them.
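
As a sketch of that last principle: configuration artifacts deserve the same gatekeeping as code. The schema and limits below are illustrative assumptions, showing a config being parsed and sanity-checked before it is allowed to propagate anywhere.

```go
// validate.go: treat configuration like code. Parse and sanity-check a
// config artifact before distribution instead of trusting it because it
// "is just data". The schema and limits here are illustrative only.
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

type Config struct {
	Rules []string `json:"rules"`
}

// validate enforces invariants that a broken generator could violate.
func validate(raw []byte) (*Config, error) {
	var c Config
	if err := json.Unmarshal(raw, &c); err != nil {
		return nil, fmt.Errorf("config does not parse: %w", err)
	}
	if len(c.Rules) == 0 {
		return nil, errors.New("empty rule set: refusing to distribute")
	}
	if len(c.Rules) > 10000 {
		return nil, errors.New("rule set larger than any legitimate build")
	}
	return &c, nil
}

func main() {
	good := []byte(`{"rules":["allow:health"]}`)
	bad := []byte(`{"rules":[]}`)
	for _, raw := range [][]byte{good, bad} {
		if _, err := validate(raw); err != nil {
			fmt.Println("rejected:", err)
			continue
		}
		fmt.Println("accepted")
	}
}
```

Rejecting an implausible artifact at the door turns a potential global outage into a failed build.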


Conclusion

The December 5 Cloudflare incident was not primarily a technical mishap but an architectural signal. The unusual part was not the existence of a bug — bugs occur daily — but where it executed: inside a global block that should never have been global.

Architectures built around genuine building blocks inherently limit the damage such failures can cause. The Internet is too large and too critical for global monoliths, regardless of their efficiency or performance.

The future belongs to systems composed of many small, independent cells, each capable of operating autonomously and safely — without being taken down by failures in neighboring domains.

Such architectures are not a luxury. They are the foundation for ensuring that a single error never again takes down a quarter of the web.


© 2025 by spotlight core²code