Site Reliability Engineering (SRE): Error Budgets and SLOs
SRE: Reliability as a Feature
SRE is what happens when you ask a software engineer to design an operations team. It is the discipline of treating operations as a software problem: reliability is measured, given a target, and defended with automation rather than manual effort.
1. SLI vs SLO vs SLA
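- SLI (Indicator): What we measure (e.g., request latency).
- SLO (Objective): Our internal target for that measurement (e.g., "99.9% of requests complete in under 100 ms").
- SLA (Agreement): The legal contract with the customer, including the consequence of a miss (e.g., "If we fall below the agreed level, we refund your money"). The SLA threshold is usually set looser than the internal SLO, so the team gets a warning before the contract is breached.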
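To make the SLI/SLO distinction concrete, here is a minimal sketch in Python. The function names, the sample data, and the 100 ms threshold are illustrative assumptions, not part of any monitoring library: the SLI is the number you compute from observed requests, and the SLO is the target you compare it against.

```python
def latency_sli(latencies_ms, threshold_ms=100.0):
    """SLI: fraction of requests that completed faster than threshold_ms."""
    if not latencies_ms:
        raise ValueError("no requests observed")
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(sli, target=0.999):
    """SLO check: is the measured SLI at or above the target?"""
    return sli >= target


# Hypothetical sample: 998 fast requests and 2 slow ones.
sample = [20.0] * 998 + [250.0] * 2
sli = latency_sli(sample)
print(f"SLI = {sli:.4f}, SLO met: {meets_slo(sli)}")
```

With this sample the SLI is 998/1000, which falls short of a 99.9% target, so the check reports a miss. An SLA would sit on top of this as a contractual rule, not as code.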
2. The Error Budget
If your SLO is 99.9%, you have 0.1% of "Permitted Failure" per month. This is your **Error Budget**.
- If you have budget left: You can deploy risky new features.
- If you are out of budget: You must STOP all new features and focus 100% on reliability until the budget refills.
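The arithmetic behind the budget is simple enough to sketch. The snippet below assumes a 30-day window and a request-based budget; the function names and the sample traffic numbers are hypothetical, chosen only for illustration.

```python
def error_budget_minutes(slo, window_days=30):
    """Total downtime allowed in the window for a time-based SLO."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures


def can_ship_features(remaining):
    """Policy from the text: ship while budget remains, freeze once it is spent."""
    return remaining > 0


# A 99.9% SLO over 30 days allows roughly 43.2 minutes of downtime.
print(f"{error_budget_minutes(0.999):.1f} minutes")

# Hypothetical month: 1,000,000 requests, 400 of them failed.
remaining = budget_remaining(0.999, 1_000_000, 400)
print(f"remaining: {remaining:.0%}, ship features: {can_ship_features(remaining)}")
```

In this example 0.1% of 1,000,000 requests is 1,000 permitted failures, so 400 failures leaves about 60% of the budget and feature work can continue. At 1,000 or more failures the same check would call for a freeze.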
4. Interview Mastery
Q: "How do you handle 'On-call' fatigue in SRE?"
Architect Answer: "We use **Toil Reduction**. Toil is manual, repetitive work (e.g., restarting a crashed server). If an engineer spends >50% of their time on toil, they aren't 'Engineering.' We use automation to handle routine failures and ensure that every on-call alert is 'Actionable.' If an alert doesn't require immediate human intervention, it shouldn't wake someone up at 3 AM."
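The two rules in that answer can be sketched as code. This is an illustrative model only: the `Alert` fields, the `route` function, and the three destinations are assumptions made for the example, not the API of any real alerting tool.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    actionable: bool                # a human can and must do something about it
    urgent: bool                    # it cannot wait until business hours
    auto_remediation: bool = False  # a script can already fix it (hypothetical flag)


def route(alert):
    """Decide where an alert goes: 'automate', 'page', or 'ticket'."""
    if alert.auto_remediation:
        return "automate"  # routine failure, no human needed
    if alert.actionable and alert.urgent:
        return "page"      # the only case that should wake someone at 3 AM
    return "ticket"        # handle during working hours


def exceeds_toil_limit(toil_hours, total_hours, limit=0.5):
    """True if toil takes up more than `limit` of an engineer's time."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours > limit


print(route(Alert("server crashed", actionable=True, urgent=True, auto_remediation=True)))
print(route(Alert("checkout error rate spiking", actionable=True, urgent=True)))
print(route(Alert("disk 80% full", actionable=True, urgent=False)))
print(exceeds_toil_limit(toil_hours=24, total_hours=40))
```

The ordering in `route` is the design choice that matters: automation is checked first, so a failure only reaches a human when no script can handle it, and it only pages when it is both actionable and urgent.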