
Retries & Backoff

Handling transient failures with smart retry strategies


Why This Matters

In a distributed system, failures are not exceptional; they are routine. A server might be temporarily overloaded, a network link might be briefly congested, or a database connection might time out. A well-designed retry strategy handles these transient failures gracefully, while a naive one can make things catastrophically worse.

If every client retries immediately and simultaneously, you get a thundering herd that overwhelms the recovering server. Exponential backoff with jitter spaces out retries intelligently, giving the system time to recover.
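The difference is easy to see in numbers. This small sketch (assuming a 1-second base delay, matching the code example below) prints each attempt's fixed exponential delay next to a jittered one; every client would share the fixed column exactly, while the jittered column differs per client:

```javascript
// Compare fixed exponential delays with jittered ones (assumed 1s base delay).
const base = 1000;
for (let attempt = 0; attempt < 4; attempt++) {
  const fixed = base * 2 ** attempt;              // 1000, 2000, 4000, 8000 ms
  const jittered = fixed + Math.random() * fixed; // anywhere in [fixed, 2 * fixed)
  console.log(`attempt ${attempt}: fixed=${fixed}ms jittered=${Math.round(jittered)}ms`);
}
```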


Visual Model

Client → Server timeline:

  • Attempt 1 fails → wait ~1s (backoff + jitter)
  • Attempt 2 fails → wait ~2s (backoff + jitter)
  • Attempt 3 fails → wait ~4s (backoff + jitter)
  • Attempt 4 succeeds → 200 OK


Exponential backoff with jitter: wait longer between each retry and add randomness to avoid thundering herds.

Code Example

// Exponential backoff with jitter

function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function retryWithBackoff(fn, options = {}) {
  const {
    maxRetries = 3,
    baseDelay = 1000,  // 1 second
    maxDelay = 30000,  // 30 seconds
  } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) {
        console.log(`All ${maxRetries} retries exhausted`);
        throw error;
      }

      // Exponential backoff: 1s, 2s, 4s, 8s...
      const exponentialDelay = baseDelay * Math.pow(2, attempt);

      // Add jitter: random value between 0 and the delay
      const jitter = Math.random() * exponentialDelay;
      const delay = Math.min(exponentialDelay + jitter, maxDelay);

      console.log(`Attempt ${attempt + 1} failed. Retrying in ${Math.round(delay)}ms`);
      await sleep(delay);
    }
  }
}

// Usage example
let callCount = 0;
async function unreliableService() {
  callCount++;
  if (callCount < 3) {
    throw new Error("Service temporarily unavailable");
  }
  return { data: "success" };
}

// This will fail twice, then succeed on the third attempt
retryWithBackoff(unreliableService).then(console.log);

Interactive Experiment

Try these modifications to build intuition:

  • Remove the jitter and run 10 simulated clients retrying simultaneously. Notice how they all retry at the same time (thundering herd).
  • Add jitter back and observe how retries spread out.
  • Change the base delay and max retries. What happens with a very short base delay? A very long one?
  • Add a check for retryable errors: only retry on specific error codes (503, 429) and immediately fail on others (400, 404).
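To try the first two bullets without wiring up real clients, here is a small simulation sketch (the 10-client setup and the `retryTime` helper are illustrative, not part of the lesson's code). It computes when each client would make its third retry:

```javascript
// Cumulative time at which the Nth retry happens under exponential backoff.
function retryTime(attempt, baseDelay, useJitter) {
  let t = 0;
  for (let i = 0; i < attempt; i++) {
    const d = baseDelay * 2 ** i;
    t += useJitter ? d + Math.random() * d : d; // jitter spreads each wait
  }
  return t;
}

// 10 simulated clients, all starting at t=0 with a 1s base delay.
const noJitter = Array.from({ length: 10 }, () => retryTime(3, 1000, false));
const withJitter = Array.from({ length: 10 }, () => retryTime(3, 1000, true));

console.log(new Set(noJitter).size);   // 1: every client retries at exactly 7000ms
console.log(new Set(withJitter).size); // almost certainly 10: retries are spread out
```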


Coding Challenge

Backoff Calculator

Write a function called `calculateBackoff` that takes the attempt number (0-based), a base delay in milliseconds, and returns the backoff delay with jitter. Use the formula: delay = baseDelay * 2^attempt, then add random jitter between 0 and delay. Cap the result at 30000ms.
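If you get stuck, one straightforward sketch of that formula looks like this (try it yourself first):

```javascript
// Exponential backoff with jitter, per the challenge's formula.
function calculateBackoff(attempt, baseDelay) {
  const delay = baseDelay * 2 ** attempt;  // delay = baseDelay * 2^attempt
  const jitter = Math.random() * delay;    // random jitter in [0, delay)
  return Math.min(delay + jitter, 30000);  // cap at 30000ms
}
```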


Real-World Usage

Retry strategies are built into every major cloud SDK and API client:

  • AWS SDK: All AWS SDKs use exponential backoff with jitter by default. The retry config is tunable per service.
  • gRPC: Has built-in retry policies with configurable backoff. The gRPC spec defines retry semantics for different status codes.
  • Stripe API: Returns Retry-After headers when rate-limited. Their client libraries implement automatic backoff.
  • HTTP 429 (Too Many Requests): The standard HTTP status code that tells clients to slow down, often including a Retry-After header.
  • Kubernetes: Pod restart backoff uses exponential backoff (10s, 20s, 40s...) up to 5 minutes when containers crash repeatedly.
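Several of these services signal the wait explicitly. A client can prefer the server's Retry-After value on 429/503 responses and fall back to jittered backoff when it is absent. A minimal helper sketch (the name `computeRetryDelay` is illustrative, and it assumes the delta-seconds form of Retry-After, which can also be an HTTP-date):

```javascript
// Pick a retry delay: honor Retry-After when the server sent one,
// otherwise fall back to jittered exponential backoff.
function computeRetryDelay(attempt, baseDelay, retryAfterHeader) {
  if (retryAfterHeader != null) {
    return Number(retryAfterHeader) * 1000; // server-specified wait, in ms
  }
  const backoff = baseDelay * 2 ** attempt;
  return backoff + Math.random() * backoff; // jittered fallback
}

// e.g. inside a retry loop, after receiving a 429 response `res`:
//   const delay = computeRetryDelay(attempt, 1000, res.headers.get("Retry-After"));
```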

Connections