Observability: Metrics (Prometheus), Logging (ELK), and Traces (Jaeger)
Mastering Observability
In a complex system, "Monitoring" (knowing IF something is broken) is not enough. You need Observability (knowing WHY it is broken).
1. The Three Pillars
- Metrics (Prometheus): Numerical data over time (CPU %, Error Rate, Requests per second). Great for Dashboards and Alerts.
- Logging (ELK Stack): Timestamped records of individual events, with full detail about what happened. Great for investigating specific user errors.
- Distributed Tracing (Jaeger): Tracks a single request as it passes through many different microservices, recording how long each hop takes. Essential for finding the exact service causing a delay.
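
To make the difference between the three pillars concrete, here is a minimal sketch that emits all three signal types for one request. It uses only the Python standard library; the `metrics`, `log_lines`, and `spans` containers and the `handle_request` function are hypothetical stand-ins for Prometheus, Elasticsearch, and Jaeger clients, not real client APIs.

```python
import json
import time
import uuid
from collections import Counter

# Hypothetical in-memory stand-ins for the three backends.
metrics = Counter()   # aggregate counters, the kind of data Prometheus scrapes
log_lines = []        # structured events, the kind of data shipped to Elasticsearch
spans = []            # per-hop timings, the kind of data exported to Jaeger


def handle_request(service, trace_id, fail=False):
    """Process one hop of a request and emit a metric, a log line, and a span."""
    start = time.perf_counter()
    status = "error" if fail else "ok"

    # Metric: a cheap aggregate with no per-request detail.
    metrics[(service, status)] += 1

    # Log: one detailed event, tagged with the trace ID so it can be joined to the trace.
    log_lines.append(json.dumps(
        {"service": service, "trace_id": trace_id, "status": status}
    ))

    # Span: how long this hop took, grouped with the other hops by trace ID.
    spans.append({
        "trace_id": trace_id,
        "service": service,
        "duration_s": time.perf_counter() - start,
    })
    return status


# One request crossing three services; the last hop fails.
trace_id = uuid.uuid4().hex
for svc in ["gateway", "orders", "payments"]:
    handle_request(svc, trace_id, fail=(svc == "payments"))
```

The metric tells you that `payments` has errors, the shared `trace_id` shows which request path was affected, and the matching log line carries the detail for that one event.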
2. The "Four Golden Signals"
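Focus on these four metrics, popularized by Google's SRE book: **Latency** (time taken to serve a request), **Traffic** (demand on the system, such as requests per second), **Errors** (rate of failed requests), and **Saturation** (how "full" your service is). If all four are healthy, your system is most likely healthy from the user's point of view.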
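
As a rough illustration, the four signals can be computed from one window of request records. This is a sketch under assumptions: the `Request` type, the `golden_signals` function, and the choice of a nearest-rank 95th percentile for latency and in-flight requests over capacity for saturation are all illustrative, not a standard definition.

```python
from dataclasses import dataclass


@dataclass
class Request:
    latency_ms: float
    ok: bool


def golden_signals(requests, window_s, in_flight, capacity):
    """Summarize one time window of requests into the four golden signals."""
    n = len(requests)
    latencies = sorted(r.latency_ms for r in requests)
    # Nearest-rank 95th percentile, using integer math to avoid float rounding.
    rank = (95 * n + 99) // 100
    p95 = latencies[rank - 1] if n else 0.0
    errors = sum(1 for r in requests if not r.ok)
    return {
        "latency_p95_ms": p95,                      # Latency
        "traffic_rps": n / window_s,                # Traffic
        "error_rate": errors / n if n else 0.0,     # Errors
        "saturation": in_flight / capacity,         # Saturation
    }


# 18 fast successes and 2 slow failures observed over a 10-second window.
sample = [Request(20, True)] * 18 + [Request(400, False), Request(900, False)]
signals = golden_signals(sample, window_s=10, in_flight=45, capacity=50)
```

In production these numbers would normally come from a metrics system rather than from raw records, but the arithmetic behind each signal is the same.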
4. Interview Mastery
Q: "How do you handle 'Log Flooding' during an outage?"
Architect Answer: "During an outage, your services might generate 1,000x more logs than usual, which can overwhelm your logging infrastructure (Elasticsearch). We solve this with **Sampling**. In normal times, we record 100% of logs. During high load, we might only record 5% of 'Success' logs but 100% of 'Error' logs. We also use **Dynamic Log Levels**, which allow us to change log verbosity at runtime without restarting the service: we can raise the threshold (for example, from 'Info' to 'Warn') to shed volume, and lower it to 'Debug' only on the service we are investigating."
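
The sampling idea in this answer can be sketched with Python's standard `logging` module. The `SamplingFilter` and `ListHandler` classes and the `checkout` logger name are hypothetical, and the 5% rate is taken from the answer above; a real deployment would more likely sample in the log shipper or collector, so treat this as an illustration of the policy rather than a recommended setup.

```python
import logging
import random


class SamplingFilter(logging.Filter):
    """Keep every ERROR-or-above record; keep only a fraction of the rest."""

    def __init__(self, sample_rate=1.0, rng=None):
        super().__init__()
        self.sample_rate = sample_rate
        self.rng = rng or random.Random()

    def filter(self, record):
        if record.levelno >= logging.ERROR:
            return True  # errors are never dropped
        return self.rng.random() < self.sample_rate


class ListHandler(logging.Handler):
    """Collects records in memory; stands in for a shipper to Elasticsearch."""

    def __init__(self):
        super().__init__()
        self.records = []

    def emit(self, record):
        self.records.append(record)


logger = logging.getLogger("checkout")  # hypothetical service logger
logger.setLevel(logging.INFO)
logger.propagate = False

handler = ListHandler()
sampler = SamplingFilter(sample_rate=1.0, rng=random.Random(42))  # normal times: keep 100%
handler.addFilter(sampler)
logger.addHandler(handler)

# Outage begins: keep only about 5% of non-error logs, without a restart.
sampler.sample_rate = 0.05
for i in range(1000):
    logger.info("order %d processed", i)
    if i % 10 == 0:
        logger.error("order %d failed", i)

kept_errors = sum(1 for r in handler.records if r.levelno >= logging.ERROR)
kept_info = sum(1 for r in handler.records if r.levelno == logging.INFO)

# Dynamic log level: verbosity can also be changed on the live logger.
logger.setLevel(logging.DEBUG)
```

All 100 error records survive, while only a small fraction of the 1,000 info records are kept. Both `sampler.sample_rate` and `logger.setLevel` are changed on live objects, which is the "no restart" property the answer relies on; in a real service the new values would typically arrive through a config endpoint or a feature flag.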