Disaster Recovery (DR): RTO, RPO, and Multi-Region failover
Disaster Recovery (DR)
What if an entire AWS region goes offline? It happens. A global application must be able to fail over to another region, possibly on a different continent, within minutes and without losing user data.
1. Critical Metrics: RTO & RPO
- RTO (Recovery Time Objective): How long can the site be down? (e.g., "We must be back in 15 minutes").
- RPO (Recovery Point Objective): How much data can we lose? (e.g., "We can lose the last 5 minutes of data").
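Both metrics are durations, so they can be checked after a drill or a real outage. The following is a minimal sketch, not a standard tool: the `Incident` class, its field names, and the helper functions are hypothetical names chosen for illustration. It measures downtime against the RTO and the window of unreplicated data against the RPO.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical record of one outage (all names are illustrative)."""
    last_replicated_at: datetime   # newest write that reached the recovery region
    outage_started_at: datetime    # moment the primary region went offline
    service_restored_at: datetime  # moment users could reach the app again


def measure(incident: Incident) -> tuple[timedelta, timedelta]:
    """Return (downtime, data-loss window) for the incident."""
    downtime = incident.service_restored_at - incident.outage_started_at
    data_loss = incident.outage_started_at - incident.last_replicated_at
    return downtime, data_loss


def meets_objectives(incident: Incident, rto: timedelta, rpo: timedelta) -> bool:
    """True only if downtime is within the RTO and data loss is within the RPO."""
    downtime, data_loss = measure(incident)
    return downtime <= rto and data_loss <= rpo


# Example: 12 minutes of downtime, 3 minutes of lost writes.
drill = Incident(
    last_replicated_at=datetime(2024, 1, 1, 11, 57),
    outage_started_at=datetime(2024, 1, 1, 12, 0),
    service_restored_at=datetime(2024, 1, 1, 12, 12),
)
ok = meets_objectives(drill, rto=timedelta(minutes=15), rpo=timedelta(minutes=5))
```

With the example targets from the list above (15-minute RTO, 5-minute RPO), this drill passes on both counts.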
2. DR Strategies
- Pilot Light: The database is replicated to another region, but the app servers there stay switched off until they are needed. (Cheap, slow recovery.)
- Warm Standby: A small-scale version of the app is always running in Region B and is scaled up on failover. (More expensive, faster recovery.)
- Active-Active: The app runs at full capacity in both regions simultaneously. (Most expensive, near-instant recovery.)
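The trade-off in this list can be framed as "pick the cheapest strategy that still meets the RTO". The sketch below illustrates that selection. The recovery times attached to each strategy are placeholder assumptions invented for the example, not measured or vendor-published figures, and `cheapest_strategy` is a hypothetical helper.

```python
from datetime import timedelta
from typing import Optional

# Ordered cheapest first, matching the list above.
# The durations are ILLUSTRATIVE ASSUMPTIONS; replace them with numbers
# measured in your own failover drills.
STRATEGIES = [
    ("Pilot Light", timedelta(hours=1)),       # app servers must be started
    ("Warm Standby", timedelta(minutes=10)),   # small fleet must be scaled up
    ("Active-Active", timedelta(seconds=30)),  # traffic only needs rerouting
]


def cheapest_strategy(rto: timedelta) -> Optional[str]:
    """Return the cheapest strategy whose assumed recovery time fits the RTO."""
    for name, assumed_recovery in STRATEGIES:
        if assumed_recovery <= rto:
            return name
    return None  # no listed strategy is fast enough for this RTO


choice = cheapest_strategy(timedelta(minutes=15))
```

Under these assumed numbers, a 15-minute RTO rules out Pilot Light and selects Warm Standby, the next cheapest option.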
4. Interview Mastery
Q: "How do you handle Database failover across regions?"
Architect Answer: "We use **Cross-Region Replication**. The primary in Region A streams its logs to a standby replica in Region B. During a disaster, we promote the replica in Region B to be the new primary. The biggest challenge is **Consistency**: cross-region replication is typically asynchronous, so if the network was severed, the replica in B might be missing the last few seconds of data. We must have a reconciliation process for those writes once the old primary comes back online."
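The answer above can be made concrete with a toy in-memory model. This is a simplified sketch under stated assumptions, not how any particular database implements replication: `Node`, `replicate`, `promote`, and `unreplicated_writes` are hypothetical names, and the replication log is modelled as a plain Python list.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Toy database node; the log is an append-only list of writes."""
    name: str
    role: str  # "primary" or "replica"
    log: list = field(default_factory=list)

    def write(self, record: str) -> None:
        if self.role != "primary":
            raise RuntimeError(f"{self.name} is not the primary")
        self.log.append(record)


def replicate(primary: Node, replica: Node, upto: Optional[int] = None) -> None:
    """Ship log entries the replica lacks; `upto` simulates replication lag."""
    end = len(primary.log) if upto is None else upto
    replica.log.extend(primary.log[len(replica.log):end])


def promote(replica: Node) -> None:
    """Failover step: the replica starts accepting writes."""
    replica.role = "primary"


def unreplicated_writes(old_primary: Node, fork_point: int) -> list:
    """Writes the old primary accepted that never reached the replica."""
    return old_primary.log[fork_point:]


region_a = Node("region-a", "primary")
region_b = Node("region-b", "replica")
for record in ["w1", "w2", "w3"]:
    region_a.write(record)

replicate(region_a, region_b, upto=2)  # the link is severed before w3 ships
fork_point = len(region_b.log)         # replica's log position at promotion
promote(region_b)
region_b.write("w4")                   # new primary accepts fresh traffic

# Once Region A is reachable again, these writes need reconciliation.
to_reconcile = unreplicated_writes(region_a, fork_point)
```

Here `to_reconcile` holds only `w3`, the write that Region A accepted but never shipped. Deciding whether to replay, merge, or discard such writes is the reconciliation policy, and it depends on the application.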