Engineering Fluency OS

Why This Matters

How reliable does your system need to be? The instinct is to say "100% uptime", but that is both impossible and unnecessary. Pursuing 100% means zero deployments, zero changes, zero progress. Instead, engineering teams define precise reliability targets and measure against them.

An SLI is a measurement (like error rate or latency). An SLO is a target for that measurement (like "99.9% of requests succeed"). An SLA is a contract with customers that includes consequences for missing the target. The gap between the SLO and 100% is how much unreliability you can tolerate -- your error budget. When things go wrong (and they will), a blameless postmortem asks "what went wrong and how do we prevent it" instead of "who is to blame." This is how great teams turn failures into improvements.
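
To make the relationship between an SLO and its error budget concrete, here is a minimal sketch for a request-based SLO. The helper name `requestErrorBudget` and the traffic figures are illustrative assumptions, not part of any library.

```javascript
// Illustrative sketch: how an SLO translates into an error budget in requests.
// `requestErrorBudget` is a hypothetical helper name.
function requestErrorBudget(sloPercent, totalRequests, failedRequests) {
  // Allowed failures = (100% - SLO) of all requests, rounded to whole requests
  const allowedFailures = Math.round(((100 - sloPercent) / 100) * totalRequests);
  const remaining = allowedFailures - failedRequests;
  return {
    allowedFailures,
    remaining,
    sloMet: failedRequests <= allowedFailures
  };
}

// Assumed traffic: 1,000,000 requests in the window, 400 of them failed
console.log(requestErrorBudget(99.9, 1000000, 400));
// { allowedFailures: 1000, remaining: 600, sloMet: true }
```

With a 99.9% SLO, one million requests leave room for 1,000 failures; 400 failures so far means 600 are still available before the target is missed.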

Visual Model

Flowchart: Measure SLIs -> Set SLO Target -> Error Budget -> Budget Left? If yes, Ship Features; otherwise, Freeze & Fix.

SLIs measure, SLOs set targets, error budgets balance feature velocity with reliability.

Code Example

// Calculate error budget and availability
function calculateErrorBudget(sloPercent) {
  // Round away floating-point noise (100 - 99.9 is 0.09999999999999432 in JS)
  const errorBudgetPercent = Number((100 - sloPercent).toFixed(6));
  const minutesPerMonth = 30 * 24 * 60; // 43,200 (assumes a 30-day month)
  const downtimeMinutes = (errorBudgetPercent / 100) * minutesPerMonth;
  const downtimeHours = downtimeMinutes / 60;

  return {
    slo: `${sloPercent}%`,
    errorBudget: `${errorBudgetPercent}%`,
    downtimePerMonth: `${downtimeMinutes.toFixed(1)} minutes`,
    // 12 thirty-day months, i.e. a 360-day year
    downtimePerYear: `${(downtimeHours * 12).toFixed(1)} hours`
  };
}

// The nines of availability
console.log(calculateErrorBudget(99));     // 432 min/month (7.2 hours/month)
console.log(calculateErrorBudget(99.9));   // ~43 min/month
console.log(calculateErrorBudget(99.99));  // ~4.3 min/month
console.log(calculateErrorBudget(99.999)); // ~26 sec/month

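// Track SLI: success rate over a time window
// Only 5xx responses count as failures; 4xx responses are treated as client errors
function calculateSLI(requests) {
  const total = requests.length;
  if (total === 0) return { total: 0, successful: 0, sli: "n/a" }; // avoid 0 / 0 = NaN
  const successful = requests.filter(r => r.status < 500).length;
  const sli = (successful / total) * 100;
  return { total, successful, sli: sli.toFixed(3) + "%" };
}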

const recentRequests = [
  { status: 200 }, { status: 200 }, { status: 200 },
  { status: 500 }, { status: 200 }, { status: 200 },
  { status: 200 }, { status: 200 }, { status: 503 },
  { status: 200 }
];
console.log("SLI:", calculateSLI(recentRequests));

Interactive Experiment

Try these exercises:

Calculate the error budget for your favorite service. If it has a 99.9% SLO, how many minutes of downtime per month is acceptable?
Write a function that tracks a running SLI over a stream of request results. After each request, recalculate the success rate.
Create a simple "budget burn" tracker: start with 43 minutes of budget per month, subtract downtime from each incident, and alert when the budget drops below 25%.
Draft a postmortem for a hypothetical incident: "The checkout page returned 500 errors for 15 minutes because a database migration added a lock on the orders table."
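
As a starting point for the postmortem exercise, the record below sketches one possible structure for the hypothetical checkout incident. The field names are illustrative assumptions, and the timeline and action items are deliberately left empty for you to fill in. The last lines relate the incident to the 99.9% monthly budget of 43.2 minutes.

```javascript
// Skeleton of a blameless postmortem record for the hypothetical checkout incident.
// Field names are illustrative; timeline and action items are placeholders.
const postmortem = {
  title: "Checkout page returned 500 errors for 15 minutes",
  impact: { durationMinutes: 15, affected: "checkout page" },
  rootCause: "A database migration added a lock on the orders table",
  timeline: [],    // e.g. { at: "HH:MM", event: "..." } entries
  actionItems: []  // e.g. { owner: "team name", task: "..." } entries
};

// Express the incident as a share of a 99.9% monthly budget (43.2 minutes)
const monthlyBudgetMinutes = 43.2;
const budgetUsedPercent =
  (postmortem.impact.durationMinutes / monthlyBudgetMinutes) * 100;
console.log(`${budgetUsedPercent.toFixed(1)}% of the monthly budget used`);
// 34.7% of the monthly budget used
```

A single 15-minute outage consumes roughly a third of a three-nines monthly budget, which is why the impact section of a postmortem is worth quantifying.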

Coding Challenge

Error Budget Tracker

Write a function called `createBudgetTracker` that takes an `sloPercent` (e.g., 99.9) and `windowMinutes` (e.g., 43200 for 30 days). It returns an object with: `totalBudgetMinutes` (the error budget in minutes), `consumeBudget(minutes)` which subtracts downtime from the remaining budget, `remaining()` which returns remaining minutes rounded to 1 decimal, and `canShip()` which returns true if more than 25% of the budget remains.

Real-World Usage

SLOs and postmortems are core practices at top engineering organizations:

Google SRE pioneered error budgets. When a team burns through their error budget, they freeze feature development and focus on reliability until the budget recovers.
99.9% availability (three nines) is a common target for web services. That allows about 8.8 hours of downtime per year (0.1% of 8,760 hours) -- sounds like a lot, but partial outages and degraded performance use it up fast.
Blameless postmortems at Etsy, Stripe, and Netflix are published internally (and sometimes publicly). They document the timeline, root cause, impact, and action items.
Status pages (like status.github.com) communicate service health and ongoing incidents to customers. When an incident happens, the status page is updated as the situation develops.
Error budget policies define automatic responses: if the budget drops below 50%, increase testing requirements; below 25%, freeze all feature deployments.
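
The threshold-based policy described in the last item can be sketched as a small decision function. The name `budgetPolicy` and the action strings are hypothetical; only the 50% and 25% thresholds come from the text above.

```javascript
// Hypothetical policy sketch using the thresholds named above (50% and 25%).
function budgetPolicy(totalBudgetMinutes, consumedMinutes) {
  const remainingPercent =
    ((totalBudgetMinutes - consumedMinutes) / totalBudgetMinutes) * 100;

  if (remainingPercent < 25) {
    return { remainingPercent, action: "freeze-feature-deployments" };
  }
  if (remainingPercent < 50) {
    return { remainingPercent, action: "increase-testing-requirements" };
  }
  return { remainingPercent, action: "ship-normally" };
}

// 43.2 minutes is the monthly budget for a 99.9% SLO over 30 days
console.log(budgetPolicy(43.2, 10).action); // "ship-normally"
console.log(budgetPolicy(43.2, 25).action); // "increase-testing-requirements"
console.log(budgetPolicy(43.2, 35).action); // "freeze-feature-deployments"
```

Checking the stricter threshold first keeps the branches simple: anything that falls through both checks still has at least half of its budget left.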

SLOs & Postmortems