Resilience and Recovery
Core Idea
Examples and diagrams on this page follow the shared Hypothetical Scenario.
Resilience is the ability of a system to preserve acceptable behavior during faults. Recovery is the controlled return to normal operating state after disruption. Resilience reduces incident impact. Recovery reduces incident duration. Both are architecture responsibilities from the first design iteration.
In the scenario platform, users depend on timely vehicle recommendation and marketplace data. Temporary dependency outages should degrade capability in predictable ways. Critical flows need clear fallback or queue-based continuation paths. Architecture should define these paths before incidents happen.
Conceptual Overview
Fault Model and Failure Domains
Resilience starts with explicit fault assumptions. A service may fail. A network path may partition. A dependency may return stale data. An operator may ship incorrect configuration. Each fault class has distinct mitigation strategies.
Failure domain mapping is essential. A shared cache cluster is one domain. A payment integration path is another domain. An internal recommendation model service is another domain. Teams should map which user journeys depend on each domain. That map supports blast-radius analysis and runbook quality.
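A minimal sketch of such a map, assuming three hypothetical user journeys and four hypothetical failure domains (the names are illustrative and do not come from the scenario itself). Inverting the journey-to-domain map gives a first-pass blast radius per domain:

```python
from collections import defaultdict

# Hypothetical journey -> failure-domain dependencies (illustrative names only).
JOURNEY_DEPENDENCIES = {
    "browse_listings": {"shared_cache", "marketplace_db"},
    "get_recommendation": {"shared_cache", "recommendation_model"},
    "checkout": {"payment_integration", "marketplace_db"},
}


def blast_radius(dependencies):
    """Invert the map: for each failure domain, list the journeys it can affect."""
    affected = defaultdict(set)
    for journey, domains in dependencies.items():
        for domain in domains:
            affected[domain].add(journey)
    return {domain: sorted(journeys) for domain, journeys in affected.items()}


radius = blast_radius(JOURNEY_DEPENDENCIES)
for domain in sorted(radius):
    print(f"{domain}: {radius[domain]}")
```

A map like this is only as good as its inputs. In practice the dependency sets would likely be derived from tracing data or service contracts rather than maintained by hand.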
Steady State and Degraded State
Systems rarely switch from healthy to failed in one step. Many incidents start in a degraded state: latency rises, error rates rise in one traffic segment, and queues grow. A resilient system detects this phase early and sheds load or isolates the affected fault domains.
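One simple way to shed load is to cap concurrent work and reject the excess quickly. The sketch below assumes a fixed in-flight limit. The class name and the limit are hypothetical, and a production limiter would more likely adapt the limit to observed latency:

```python
import threading


class AdmissionController:
    """Rejects new work once in-flight requests reach a fixed limit."""

    def __init__(self, max_in_flight):
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_admit(self):
        """Return True and reserve a slot, or return False to shed the request."""
        with self._lock:
            if self._in_flight >= self._max:
                return False
            self._in_flight += 1
            return True

    def release(self):
        """Free a slot when an admitted request finishes."""
        with self._lock:
            self._in_flight = max(0, self._in_flight - 1)


controller = AdmissionController(max_in_flight=2)
decisions = [controller.try_admit() for _ in range(3)]
print(decisions)  # the third request is shed while two are still in flight
```

Rejecting early keeps queues short, so the requests that are admitted can still finish within their timeout budgets.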
Common controls:
- timeout budgets at every network boundary
- retry with bounded backoff and idempotency safeguards
- circuit breakers for repeated dependency failure
- bulkheads to isolate resource pools by capability
- admission control for overload conditions
These controls need policy values aligned with product SLO targets.
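As an illustration of one control from the list, here is a minimal circuit breaker sketch. The threshold, the reset timeout, and the class and exception names are assumptions chosen for the example, not values prescribed by this page. The clock is injectable so the behavior can be tested without waiting:

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the breaker is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half_open"  # allow a trial call through
        return "open"

    def call(self, operation):
        if self.state == "open":
            raise CircuitOpenError("dependency calls are suspended")
        try:
            result = operation()
        except Exception:
            self._failures += 1
            # A failed trial call, or too many consecutive failures, (re)opens the breaker.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Any success closes the breaker and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

The failure threshold and reset timeout are the policy values that should be derived from the SLO of the capability behind the breaker.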
Recovery Engineering
Recovery should not be manual improvisation. It is an engineered capability with predefined goals. Two goals are central.
- RTO (recovery time objective): target time to restore acceptable service behavior
- RPO (recovery point objective): maximum acceptable window of data loss after a failure
RTO and RPO should be set per capability. Recommendation read models and marketplace transaction logs may require different targets. This is a business decision with technical consequences.
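Per-capability targets can be recorded as data and checked against drill results. The sketch below uses invented target values purely for illustration. Real values come from the business decision described above:

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    rto: timedelta  # maximum time to restore acceptable behavior
    rpo: timedelta  # maximum tolerated window of lost data


# Hypothetical targets, not values taken from the scenario.
TARGETS = {
    "recommendation_read_model": RecoveryTarget(rto=timedelta(minutes=30), rpo=timedelta(hours=1)),
    "marketplace_transaction_log": RecoveryTarget(rto=timedelta(minutes=5), rpo=timedelta(0)),
}


def meets_targets(capability, restore_time, data_loss_window):
    """Compare a measured drill or incident result against the capability's targets."""
    target = TARGETS[capability]
    return restore_time <= target.rto and data_loss_window <= target.rpo


print(meets_targets("recommendation_read_model", timedelta(minutes=20), timedelta(minutes=45)))
print(meets_targets("marketplace_transaction_log", timedelta(minutes=4), timedelta(seconds=1)))
```

In this example the read model passes, while the transaction log fails because any data loss exceeds its zero RPO, even though it was restored within its RTO.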
Recovery plans should include:
- automated dependency health checks
- state reconciliation workflows
- replay mechanisms for message-backed workflows
- checkpoint and restore procedures
- clear incident command roles
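Checkpoint-and-restore combined with replay can be sketched in a few lines. The example assumes an in-memory event log and a hypothetical price-update event shape. A real system would read from a durable log and persist checkpoints outside the process:

```python
import copy


def take_checkpoint(state, next_offset):
    """Snapshot the state together with the offset of the next unprocessed event."""
    return {"state": copy.deepcopy(state), "next_offset": next_offset}


def restore_and_replay(checkpoint, log, apply_event):
    """Rebuild state from a checkpoint, then re-apply every later event in order."""
    state = copy.deepcopy(checkpoint["state"])
    for event in log[checkpoint["next_offset"]:]:
        apply_event(state, event)
    return state


def apply_price_update(state, event):
    # Hypothetical event handler: the last write per listing wins.
    state[event["listing_id"]] = event["price"]


log = [
    {"listing_id": "car-1", "price": 18000},
    {"listing_id": "car-2", "price": 22500},
    {"listing_id": "car-1", "price": 17500},
]

state = {}
apply_price_update(state, log[0])
checkpoint = take_checkpoint(state, next_offset=1)

# Simulated crash: the in-memory state is lost and rebuilt from checkpoint plus log.
recovered = restore_and_replay(checkpoint, log, apply_price_update)
print(recovered)
```

Checkpoint frequency bounds how much of the log must be replayed, which ties this procedure directly to the RTO of the capability.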
Observability for Resilience Decisions
Operational decisions depend on signal quality. Teams need real-time visibility into queue depth, timeout rates, retry amplification, and circuit breaker state. Distributed tracing should expose dependency path and latency contribution by hop.
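Two of these signals reduce to simple ratios. The sketch below assumes counters collected per dependency over a fixed time window. The field names are hypothetical, and "retry amplification" is taken here to mean total attempts divided by logical requests:

```python
from dataclasses import dataclass


@dataclass
class DependencyWindow:
    requests: int  # logical requests issued by callers
    attempts: int  # total attempts, including retries
    timeouts: int  # attempts that ended in a timeout

    def retry_amplification(self):
        """Attempts per logical request; 1.0 means no retries were made."""
        return self.attempts / self.requests if self.requests else 0.0

    def timeout_rate(self):
        """Fraction of attempts that timed out."""
        return self.timeouts / self.attempts if self.attempts else 0.0


window = DependencyWindow(requests=100, attempts=250, timeouts=50)
print(window.retry_amplification(), window.timeout_rate())
```

An amplification well above 1.0 during an incident suggests that retries are adding load to a dependency that is already struggling.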
Post-incident analysis should connect observations to architecture adjustments. A recurring timeout on one dependency is not only a paging issue. It is an architecture boundary issue that needs redesign or isolation.
Relationship to Data and Messaging
Resilience design links directly to State and Data Modeling and Event-Driven Messaging. If state transitions are not idempotent, retry policies can corrupt durable state. If compensation paths are weak, distributed workflows can stall in partial completion states. Recovery quality depends on these design choices.
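The interaction between retries and idempotency can be shown with a small sketch. It assumes a hypothetical in-memory `Ledger` standing in for durable state, and simulates the awkward case where a write succeeds but its acknowledgement is lost:

```python
import time


class Ledger:
    """Stand-in for durable state that remembers which idempotency keys it has applied."""

    def __init__(self):
        self.balance = 0
        self._applied = {}

    def credit(self, idempotency_key, amount):
        if idempotency_key in self._applied:
            return self._applied[idempotency_key]  # duplicate: return the earlier result
        self.balance += amount
        self._applied[idempotency_key] = self.balance
        return self.balance


def retry(operation, max_attempts=3, base_delay_s=0.1, max_delay_s=1.0, sleep=time.sleep):
    """Retry on TimeoutError with capped exponential backoff and a bounded attempt count."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts:
                raise
            sleep(min(max_delay_s, base_delay_s * 2 ** (attempt - 1)))


ledger = Ledger()
calls = {"n": 0}


def credit_with_lost_ack():
    calls["n"] += 1
    result = ledger.credit("order-42", 100)
    if calls["n"] == 1:
        raise TimeoutError("response lost after the write was applied")
    return result


final = retry(credit_with_lost_ack, sleep=lambda _seconds: None)
print(final, ledger.balance)
```

Without the key check in `credit`, the retry would apply the credit twice. With it, the second attempt returns the original result and the stored balance is unchanged.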
Computing History
Fault tolerance became a central systems topic through work on highly available transaction systems in the 1970s and 1980s. Modern SRE practice formalized reliability planning with error budgets, service level objectives, and controlled incident response methods. This shifted reliability from ad hoc operations to measurable engineering policy.
Sources: Gray and Reuter (1992) and Beyer et al. (2016)
Quote
"A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable."
Source: Leslie Lamport, 1987
Practice Checklist
- Define explicit fault assumptions for each critical service boundary.
- Map failure domains and user journey dependencies.
- Set RTO and RPO targets by capability and review quarterly.
- Use bounded retries with idempotency keys and timeout budgets.
- Add circuit breakers and bulkheads for high-risk dependencies.
- Validate runbooks through scheduled recovery drills.
- Track error budget burn rate and escalation thresholds.
- Instrument queue depth, retry volume, timeout rate, and breaker state.
- Require post-incident actions with owner, due date, and verification test.
- Feed incident lessons into architecture and contract revisions.
Written by: Pedro Guzmán
See References for complete APA-style bibliographic entries used on this page.