Disaster Recovery (DR): RTO, RPO, and Multi-Region Failover

Updated 6/26/2026

Disaster Recovery (DR)

What if an entire AWS region goes offline? (It happens.) A global application must be able to fail over to a region on a different continent within minutes, without losing user data.

1. Critical Metrics: RTO & RPO

RTO (Recovery Time Objective): The maximum acceptable downtime. How long can the site be down? (e.g., "We must be back in 15 minutes").
RPO (Recovery Point Objective): The maximum acceptable data loss, measured as a window of time. How much data can we lose? (e.g., "We can lose the last 5 minutes of data").
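
As a rough illustration of how the two metrics are measured against an incident, here is a minimal sketch in Python. The `Objectives` class, the `evaluate_incident` function, and all timestamps are hypothetical names and values made up for this example; the 15-minute and 5-minute targets are the ones quoted above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable window of lost data


def evaluate_incident(last_replicated_write, outage_start, service_restored, objectives):
    """Compare one (hypothetical) incident against the RTO and RPO targets."""
    # Downtime: from the moment the service went down until it was restored.
    downtime = service_restored - outage_start
    # Data loss window: writes made after the last replicated write are gone.
    data_loss_window = outage_start - last_replicated_write
    return {
        "downtime": downtime,
        "data_loss_window": data_loss_window,
        "rto_met": downtime <= objectives.rto,
        "rpo_met": data_loss_window <= objectives.rpo,
    }


objectives = Objectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=5))
result = evaluate_incident(
    last_replicated_write=datetime(2026, 1, 1, 11, 57),  # made-up timestamps
    outage_start=datetime(2026, 1, 1, 12, 0),
    service_restored=datetime(2026, 1, 1, 12, 20),
    objectives=objectives,
)
print(result["rto_met"], result["rpo_met"])  # False True
```

In this made-up incident the site was down for 20 minutes (RTO missed) but only 3 minutes of writes were lost (RPO met), which shows that the two objectives are independent of each other.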

2. DR Strategies

Pilot Light: The database is replicated to another region, but the app servers there are switched off until needed. (Cheap, slow recovery).
Warm Standby: A scaled-down version of the app is always running in Region B. (More expensive, faster recovery).
Active-Active: The app runs at full capacity in both regions simultaneously. (Most expensive, near-instant recovery).
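
The cost versus recovery-speed trade-off above can be sketched as a simple selection rule: pick the cheapest strategy whose recovery time fits the RTO. The sketch below is only an illustration. The `Strategy` class and `cheapest_strategy` function are hypothetical, and the recovery times are placeholder assumptions, not figures from this lesson or from any benchmark; only the cost ordering comes from the list above.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    relative_cost: int           # 1 = cheapest; ordering taken from the list above
    assumed_recovery: timedelta  # ASSUMPTION: placeholder value for illustration only


STRATEGIES = [
    Strategy("Pilot Light", 1, timedelta(hours=1)),
    Strategy("Warm Standby", 2, timedelta(minutes=10)),
    Strategy("Active-Active", 3, timedelta(seconds=30)),
]


def cheapest_strategy(rto: timedelta) -> Optional[Strategy]:
    """Return the cheapest strategy whose assumed recovery time fits the RTO."""
    candidates = [s for s in STRATEGIES if s.assumed_recovery <= rto]
    # default=None covers the case where no strategy is fast enough.
    return min(candidates, key=lambda s: s.relative_cost, default=None)


print(cheapest_strategy(timedelta(minutes=15)).name)  # Warm Standby
```

With a real system, the recovery times would come from your own failover drills rather than from constants like these.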

4. Interview Mastery

Q: "How do you handle Database failover across regions?"

Architect Answer: "We use **Cross-Region Replication**. The primary in Region A streams its logs to a standby replica in Region B. During a disaster, we promote the replica in Region B to be the new primary. The biggest challenge is **Consistency**: cross-region replication is usually asynchronous, so if the network was severed, the replica in B might be missing the last few seconds of data. We must have a reconciliation process once the old primary comes back online."
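
The promotion and reconciliation steps in that answer can be sketched with a toy in-memory model. This is not a real database API: `Node`, `replicate`, `promote`, and `unreplicated_writes` are hypothetical names, and the log is just a list of (sequence number, payload) pairs standing in for the replication log sent from the primary (Master) in Region A to the replica in Region B.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    region: str
    role: str                                # "primary" or "replica"
    log: list = field(default_factory=list)  # ordered (sequence_number, payload) pairs


def replicate(primary, replica, up_to):
    """Ship entries with sequence number <= up_to; later entries model replication lag."""
    shipped = {seq for seq, _ in replica.log}
    for seq, payload in primary.log:
        if seq <= up_to and seq not in shipped:
            replica.log.append((seq, payload))


def promote(replica):
    """Failover step: the replica starts accepting writes as the new primary."""
    replica.role = "primary"


def unreplicated_writes(old_primary, new_primary):
    """Reconciliation input: entries the old primary holds that never reached the new one."""
    seen = {seq for seq, _ in new_primary.log}
    return [entry for entry in old_primary.log if entry[0] not in seen]


region_a = Node("A", "primary", [(1, "order-1"), (2, "order-2"), (3, "order-3")])
region_b = Node("B", "replica")

replicate(region_a, region_b, up_to=2)  # entry 3 had not shipped when the link dropped
promote(region_b)                       # Region B takes over
region_a.role = "replica"               # old primary is demoted when it comes back

missing = unreplicated_writes(region_a, region_b)
print(missing)  # [(3, 'order-3')]
```

The sketch only finds the missing writes. Deciding what to do with them (replay, merge, or discard) depends on the application, especially if the new primary has accepted conflicting writes in the meantime.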
