Disaster Recovery (DR): RTO, RPO, and Multi-Region Failover


What happens if an entire AWS region goes offline? It does happen. A global application must be able to fail over to another region, possibly on a different continent, within minutes and with minimal loss of user data.

1. Critical Metrics: RTO & RPO

  • RTO (Recovery Time Objective): The maximum time the site may stay down before service is restored (e.g., "We must be back in 15 minutes").
  • RPO (Recovery Point Objective): The maximum amount of recent data, measured in time, that we can afford to lose (e.g., "We can lose the last 5 minutes of data").
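
To make the two metrics concrete, here is a minimal Python sketch that checks a single incident against both objectives. The timestamps, the `Objectives` class, and the `evaluate_incident` helper are hypothetical names invented for this illustration; the 15-minute and 5-minute targets are the example values from the bullets above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Objectives:
    rto: timedelta  # maximum tolerated downtime
    rpo: timedelta  # maximum tolerated window of lost writes


def evaluate_incident(outage_start, last_replicated_write, service_restored, objectives):
    """Return (downtime, data_loss_window, rto_met, rpo_met) for one incident."""
    downtime = service_restored - outage_start
    # Writes made after the last replicated one never reached the recovery site.
    data_loss = max(outage_start - last_replicated_write, timedelta(0))
    return downtime, data_loss, downtime <= objectives.rto, data_loss <= objectives.rpo


# Example targets from the text: back within 15 minutes, lose at most 5 minutes.
objectives = Objectives(rto=timedelta(minutes=15), rpo=timedelta(minutes=5))

outage = datetime(2024, 1, 1, 12, 0, 0)  # made-up incident time
downtime, data_loss, rto_met, rpo_met = evaluate_incident(
    outage_start=outage,
    last_replicated_write=outage - timedelta(minutes=2),
    service_restored=outage + timedelta(minutes=12),
    objectives=objectives,
)
print(f"downtime={downtime}, rto_met={rto_met}")
print(f"data_loss={data_loss}, rpo_met={rpo_met}")
```

In this made-up incident the site is down for 12 minutes and the last 2 minutes of writes are lost, so both objectives are met. RTO is measured forward from the outage, while RPO is measured backward from it.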

2. DR Strategies

  • Pilot Light: The database is replicated to another region, but the app servers there stay switched off until needed. (Cheap, slow recovery).
  • Warm Standby: A scaled-down version of the app is always running in Region B. (More expensive, faster recovery).
  • Active-Active: The app runs at full capacity in both regions simultaneously. (Most expensive, near-instant recovery).
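
The cost versus recovery-speed trade-off above can be sketched as a small selection helper: pick the cheapest strategy whose recovery time still fits the RTO. The recovery times and cost ranks below are assumed placeholder figures for illustration only, not measurements. Real numbers depend on the workload, and `cheapest_strategy` is a hypothetical helper.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class Strategy:
    name: str
    typical_recovery: timedelta  # ASSUMED figure, for illustration only
    relative_cost: int           # ASSUMED ranking: 1 = cheapest


STRATEGIES = [
    Strategy("Pilot Light", timedelta(minutes=45), 1),
    Strategy("Warm Standby", timedelta(minutes=10), 2),
    Strategy("Active-Active", timedelta(seconds=30), 3),
]


def cheapest_strategy(rto: timedelta) -> Optional[Strategy]:
    """Return the lowest-cost strategy that recovers within the RTO, if any."""
    candidates = [s for s in STRATEGIES if s.typical_recovery <= rto]
    return min(candidates, key=lambda s: s.relative_cost) if candidates else None


choice = cheapest_strategy(timedelta(minutes=15))
print(choice.name if choice else "no strategy meets this RTO")
```

With these assumed figures, a 15-minute RTO rules out Pilot Light and selects Warm Standby. A looser RTO would let the cheaper option through.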

3. Interview Mastery

Q: "How do you handle Database failover across regions?"

Architect Answer: "We use **Cross-Region Replication**. The primary in Region A streams its replication log to a standby replica in Region B. During a disaster, we promote the replica in Region B to be the new primary. The biggest challenge is **Consistency**: if replication is asynchronous and the network was severed, the replica in B might be missing the last few seconds of writes. We must have a reconciliation process for those writes once the old primary comes back online."
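
The promotion and reconciliation steps in this answer can be illustrated with a toy in-memory model. It is a sketch under simplifying assumptions, not how any particular database implements failover: the `Node` class, the integer sequence numbers, and the helper functions are all hypothetical, and real systems track log positions in engine-specific ways.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Entry = Tuple[int, str]  # (sequence number, payload)


@dataclass
class Node:
    name: str
    role: str  # "primary" or "replica"
    log: List[Entry] = field(default_factory=list)


def replicate(primary: Node, replica: Node, up_to: int) -> None:
    """Copy entries with sequence <= up_to; anything later is still in flight."""
    already = {seq for seq, _ in replica.log}
    for seq, payload in primary.log:
        if seq <= up_to and seq not in already:
            replica.log.append((seq, payload))


def promote(replica: Node) -> int:
    """Make the replica the new primary and return the last sequence it holds."""
    replica.role = "primary"
    return replica.log[-1][0] if replica.log else 0


def divergent_writes(old_primary: Node, promotion_point: int) -> List[Entry]:
    """Writes the old primary accepted that the promoted node never received."""
    return [entry for entry in old_primary.log if entry[0] > promotion_point]


region_a = Node("region-a", "primary", [(i, f"write-{i}") for i in range(1, 6)])
region_b = Node("region-b", "replica")

replicate(region_a, region_b, up_to=3)  # the link is severed after entry 3
promotion_point = promote(region_b)     # disaster: Region B takes over

region_a.role = "replica"               # the old primary rejoins as a replica
to_reconcile = divergent_writes(region_a, promotion_point)
print(to_reconcile)
```

Here Region A accepted five writes but only three reached Region B before the link failed, so entries 4 and 5 are the ones a reconciliation process would have to replay, merge, or discard. Which of those is correct is an application-level decision that this sketch leaves open.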
