Chaos Engineering: Breaking things on purpose to stay strong
Chaos Engineering
The most reliable way to find out whether your system is truly resilient is to break it on purpose, ultimately in production. Popularized by Netflix's Chaos Monkey, this discipline proactively injects failures into a system to find weaknesses before they become disasters.
1. Types of Chaos
- Node Killer: Randomly shut down a healthy server. Does the load balancer detect the failure and route traffic to the remaining nodes?
- Latency Injector: Add 2 seconds of delay to an internal API. Does the circuit breaker trip?
- Region Blackout: Simulate the failure of an entire region. Does Global Server Load Balancing (GSLB) redirect traffic to a healthy region?
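The Latency Injector case can be sketched as a small, deterministic simulation. This is a minimal illustration, not a real chaos tool: `CircuitBreaker`, `healthy_api`, `inject_latency`, and `run_latency_experiment` are hypothetical names, the 1-second timeout and 3-failure threshold are assumed values, and latency is reported as a number rather than actually slept so the example runs instantly.

```python
from dataclasses import dataclass


@dataclass
class CircuitBreaker:
    """Opens after `failure_threshold` consecutive calls slower than `timeout_s`.

    Hypothetical, simplified breaker: it never closes again once open.
    """
    timeout_s: float = 1.0        # assumed timeout
    failure_threshold: int = 3    # assumed threshold
    consecutive_failures: int = 0
    is_open: bool = False

    def call(self, service, *args):
        if self.is_open:
            # Fail fast: do not touch the struggling dependency at all.
            return "fallback"
        latency_s, result = service(*args)
        if latency_s > self.timeout_s:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.is_open = True
            return "fallback"
        self.consecutive_failures = 0
        return result


def healthy_api(request):
    """Simulated internal API: returns (latency in seconds, response)."""
    return 0.05, f"ok:{request}"


def inject_latency(service, extra_s):
    """Wrap a service so every call reports `extra_s` more latency."""
    def slowed(*args):
        latency_s, result = service(*args)
        return latency_s + extra_s, result
    return slowed


def run_latency_experiment(calls=5):
    """Inject 2 seconds of delay and report whether the breaker tripped."""
    breaker = CircuitBreaker()
    slow_api = inject_latency(healthy_api, extra_s=2.0)
    responses = [breaker.call(slow_api, i) for i in range(calls)]
    return breaker.is_open, responses
```

The experiment passes if the breaker ends up open and callers receive the fallback instead of waiting on the slow dependency. The same pattern (wrap the dependency, inject the fault, assert on the protective mechanism) carries over to the Node Killer and Region Blackout cases.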
2. The Goal: Confidence
Chaos Engineering isn't about creating outages; it's about proving that your **Self-Healing** mechanisms actually work. If you are afraid to run Chaos Monkey, that fear is itself a sign that your system is fragile.
3. Interview Mastery
Q: "Should you run Chaos Experiments during peak traffic?"
Architect Answer: "Not at first. You start in a Staging environment. Once you are confident, you run experiments in Production during **Off-peak hours** with a 'Kill Switch' ready to stop them instantly. The goal is to build resilience, not to torture your users or your SRE team."
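The 'Kill Switch' idea from the answer above can be sketched as an experiment runner that always rolls back the injected fault. This is an assumption-laden illustration: `ChaosExperiment`, its `inject` and `rollback` callbacks, and the 5% error-rate guardrail are all hypothetical, and the error-rate samples are passed in as a list rather than read from a real monitoring system.

```python
import threading


class ChaosExperiment:
    """Hypothetical runner: inject a fault, watch a metric, always roll back."""

    def __init__(self, inject, rollback, max_error_rate=0.05):
        self.inject = inject                  # callable that starts the fault
        self.rollback = rollback              # callable that removes the fault
        self.max_error_rate = max_error_rate  # assumed guardrail (5%)
        self.kill_switch = threading.Event()  # an operator can set this at any time

    def run(self, error_rate_samples):
        """Return the reason the experiment ended."""
        self.inject()
        try:
            for rate in error_rate_samples:
                if self.kill_switch.is_set():
                    return "killed by operator"
                if rate > self.max_error_rate:
                    return "aborted: error rate above guardrail"
            return "completed"
        finally:
            # Runs on every exit path, so the fault never outlives the experiment.
            self.rollback()
```

A short usage example, with a list standing in for the real fault injection:

```python
events = []
experiment = ChaosExperiment(
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
)
outcome = experiment.run([0.01, 0.02, 0.20, 0.01])
```

Putting the rollback in a `finally` block is the key design choice: whether the experiment completes, trips the guardrail, or is stopped by hand, the injected fault is removed.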