Why This Matters
In a distributed system, failures are not exceptional -- they are routine. A server might be temporarily overloaded, a network link might be briefly congested, or a database connection might time out. A well-designed retry strategy handles these transient failures gracefully, while a naive one can make things catastrophically worse.
If every client retries immediately and simultaneously, you get a thundering herd that overwhelms the recovering server. Exponential backoff with jitter spaces out retries intelligently, giving the system time to recover.
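To build intuition for why jitter matters, here is a small illustrative sketch (the helper names `backoff` and `clientDelays` are made up for this example, not part of any library) that prints the delays a group of clients would pick for the same retry attempt, with and without randomness:

```javascript
// Sketch: compare retry delays for 5 clients, with and without jitter.
function backoff(attempt, baseDelay, withJitter) {
  const exponential = baseDelay * Math.pow(2, attempt);
  return withJitter ? exponential + Math.random() * exponential : exponential;
}

function clientDelays(numClients, attempt, withJitter) {
  return Array.from({ length: numClients }, () =>
    Math.round(backoff(attempt, 1000, withJitter))
  );
}

console.log(clientDelays(5, 2, false)); // all identical: [4000, 4000, 4000, 4000, 4000]
console.log(clientDelays(5, 2, true));  // spread out, e.g. [4519, 7822, 5103, 6647, 4201]
```

Without jitter, every client lands on the same instant and hammers the server together; with jitter, the retries spread across a window twice the base delay.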
Define Terms
Visual Model
The full process at a glance.
Exponential backoff with jitter: wait longer between each retry and add randomness to avoid thundering herds.
Code Example
```javascript
// Exponential backoff with jitter
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

async function retryWithBackoff(fn, options = {}) {
  const {
    maxRetries = 3,
    baseDelay = 1000, // 1 second
    maxDelay = 30000, // 30 seconds
  } = options;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (error) {
      if (attempt === maxRetries) {
        console.log(`All ${maxRetries} retries exhausted`);
        throw error;
      }
      // Exponential backoff: 1s, 2s, 4s, 8s...
      const exponentialDelay = baseDelay * Math.pow(2, attempt);
      // Add jitter: a random value between 0 and the current delay
      const jitter = Math.random() * exponentialDelay;
      const delay = Math.min(exponentialDelay + jitter, maxDelay);
      console.log(`Attempt ${attempt + 1} failed. Retrying in ${Math.round(delay)}ms`);
      await sleep(delay);
    }
  }
}

// Usage example
let callCount = 0;
async function unreliableService() {
  callCount++;
  if (callCount < 3) {
    throw new Error("Service temporarily unavailable");
  }
  return { data: "success" };
}

// This will fail twice, then succeed on the third attempt
retryWithBackoff(unreliableService).then(console.log);
```

Interactive Experiment
Try these modifications to build intuition:
- Remove the jitter and run 10 simulated clients retrying simultaneously. Notice how they all retry at the same time (thundering herd).
- Add jitter back and observe how retries spread out.
- Change the base delay and max retries. What happens with a very short base delay? A very long one?
- Add a check for retryable errors: only retry on specific error codes (503, 429) and immediately fail on others (400, 404).
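The last experiment can be sketched as a small guard. This is one possible shape, not the only one; it assumes your errors carry a numeric `status` property, so adapt it to whatever error shape your client actually throws:

```javascript
// Sketch: only retry errors whose status code suggests a transient failure.
// Assumes errors carry a numeric `status` property (an assumption for this example).
const RETRYABLE_STATUSES = new Set([429, 500, 502, 503, 504]);

function isRetryable(error) {
  return RETRYABLE_STATUSES.has(error.status);
}

// Inside the catch block of retryWithBackoff, fail fast on permanent errors:
// if (!isRetryable(error)) throw error;

console.log(isRetryable({ status: 503 })); // true
console.log(isRetryable({ status: 404 })); // false
```

Failing fast on 400s and 404s matters: those errors will not fix themselves, so retrying only wastes time and adds load.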
Quick Quiz
Coding Challenge
Write a function called `calculateBackoff` that takes the attempt number (0-based), a base delay in milliseconds, and returns the backoff delay with jitter. Use the formula: delay = baseDelay * 2^attempt, then add random jitter between 0 and delay. Cap the result at 30000ms.
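One possible solution, following the formula in the prompt (peek only after attempting it yourself):

```javascript
// One possible solution to the challenge above.
function calculateBackoff(attempt, baseDelay) {
  const delay = baseDelay * Math.pow(2, attempt);      // baseDelay * 2^attempt
  const jitter = Math.random() * delay;                // random jitter in [0, delay)
  return Math.min(delay + jitter, 30000);              // cap at 30 seconds
}

console.log(calculateBackoff(0, 1000));  // between 1000 and 2000
console.log(calculateBackoff(10, 1000)); // capped at 30000
```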
Real-World Usage
Retry strategies are built into every major cloud SDK and API client:
- AWS SDK: All AWS SDKs use exponential backoff with jitter by default. The retry config is tunable per service.
- gRPC: Has built-in retry policies with configurable backoff. The gRPC spec defines retry semantics for different status codes.
- Stripe API: Returns `Retry-After` headers when rate-limited. Their client libraries implement automatic backoff.
- HTTP 429 (Too Many Requests): The standard HTTP status code that tells clients to slow down, often including a `Retry-After` header.
- Kubernetes: Pod restart backoff uses exponential backoff (10s, 20s, 40s...) up to 5 minutes when containers crash repeatedly.