Site Reliability Engineering (SRE): Error Budgets and SLOs

Updated 6/26/2026

SRE: Reliability as a Feature

SRE is what happens when you ask a software engineer to design an operations team. It is the discipline of treating operations as a software problem: reliability work is done with code and automation rather than manual effort.

1. SLI vs SLO vs SLA

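SLI (Indicator): What we measure, expressed as a number (e.g., the fraction of requests served in under 100 ms).
SLO (Objective): The target we set for an SLI over a time window (e.g., "99.9% of requests < 100ms, measured per month").
SLA (Agreement): The legal contract with the user, which attaches consequences to missing an agreed level (e.g., "If we miss the agreed level, we refund your money"). The SLA threshold is usually set looser than the internal SLO, so the team has a margin before penalties apply.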
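
The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration, not a monitoring implementation: the names `LatencySLO`, `latency_sli`, and `meets_slo` are hypothetical, the latency samples are made up, and treating an empty window as "objective met" is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencySLO:
    threshold_ms: float  # a request is "good" if it finishes faster than this
    target: float        # required fraction of good requests, e.g. 0.999


def latency_sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """Return the SLI: the fraction of requests faster than the threshold."""
    if not latencies_ms:
        return 1.0  # assumption: no traffic counts as meeting the objective
    good = sum(1 for ms in latencies_ms if ms < threshold_ms)
    return good / len(latencies_ms)


def meets_slo(latencies_ms: list[float], slo: LatencySLO) -> bool:
    """Compare the measured SLI against the SLO target."""
    return latency_sli(latencies_ms, slo.threshold_ms) >= slo.target


# Made-up sample: 998 fast requests and 2 slow ones.
slo = LatencySLO(threshold_ms=100.0, target=0.999)
samples = [20.0] * 998 + [250.0] * 2

print(latency_sli(samples, slo.threshold_ms))  # 0.998
print(meets_slo(samples, slo))                 # False
```

Here the SLI is 99.8%, which falls short of the 99.9% objective even though only two requests in a thousand were slow.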

2. The Error Budget

If your SLO is 99.9%, you have 0.1% of "Permitted Failure" per month. This is your **Error Budget**. For an availability SLO over a 30-day month, 0.1% works out to roughly 43 minutes of downtime.

- If you have budget left: You can deploy risky new features.
- If you are out of budget: You must stop all new feature launches and focus 100% on reliability until the budget refills.
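
The arithmetic behind the error budget can be sketched as follows. This is a rough illustration under stated assumptions: a 30-day window, a request-based budget, and a simple "any budget left means you may ship" rule. The function names are hypothetical and the request counts are invented for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of total downtime the SLO permits within the window."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, total_requests: int,
                     failed_requests: int) -> float:
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    allowed_failures = (1.0 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% target or zero traffic leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


def can_ship_features(slo_target: float, total_requests: int,
                      failed_requests: int) -> bool:
    """Assumed policy: risky launches are allowed only while budget remains."""
    return budget_remaining(slo_target, total_requests, failed_requests) > 0


print(round(error_budget_minutes(0.999), 1))                 # 43.2
print(round(budget_remaining(0.999, 1_000_000, 400), 2))     # 0.6
print(can_ship_features(0.999, 1_000_000, 1_200))            # False
```

With one million requests in the window, a 99.9% SLO permits about 1,000 failures. After 400 failures, 60% of the budget is left; after 1,200, the budget is overspent and the policy blocks new launches.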

4. Interview Mastery

Q: "How do you handle 'On-call' fatigue in SRE?"

Architect Answer: "We use **Toil Reduction**. Toil is manual, repetitive work (e.g., restarting a crashed server). If an engineer spends >50% of their time on toil, they aren't 'Engineering.' We use automation to handle routine failures and ensure that every on-call alert is 'Actionable.' If an alert doesn't require immediate human intervention, it shouldn't wake someone up at 3 AM."
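
The two ideas in this answer, the 50% toil threshold and "only page on actionable alerts," can be sketched as simple checks. This is an illustrative sketch only: the `Alert` fields, the `route` function, and the routing labels are hypothetical and do not correspond to any particular alerting tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    name: str
    needs_human_now: bool       # hypothetical: requires immediate human action
    has_auto_remediation: bool  # hypothetical: a script can fix it unattended


def route(alert: Alert) -> str:
    """Decide how to handle an alert: automate, page a human, or file a ticket."""
    if alert.has_auto_remediation:
        return "automate"  # routine failure, no human needed
    if alert.needs_human_now:
        return "page"      # actionable and urgent, worth waking someone up
    return "ticket"        # can wait for business hours


def toil_fraction(toil_hours: float, total_hours: float) -> float:
    """Share of an engineer's time spent on manual, repetitive work."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


print(route(Alert("server-crashed", True, True)))     # automate
print(route(Alert("database-down", True, False)))     # page
print(route(Alert("disk-70-percent", False, False)))  # ticket
print(toil_fraction(24, 40) > 0.5)                    # True
```

An engineer who spends 24 of 40 hours on toil is over the 50% threshold, which signals that automation work should take priority.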