Why This Matters
In a group of servers, one node needs to be in charge. The leader coordinates writes, assigns tasks, or manages shared resources. But what happens when the leader crashes? The remaining nodes must elect a new leader quickly, correctly, and without disagreement. If two nodes both believe they are the leader, you get split brain, one of the most dangerous situations in distributed systems, because both may accept conflicting writes.
Leader election is used in database replication, distributed locks, job scheduling, and consensus protocols. Understanding it is key to building systems that recover automatically from failures.
Visual Model
The full process at a glance.
Leader election: detect failure, elect a new leader, and use fencing tokens to prevent split brain.
Code Example
// Simulating leader election with the bully algorithm.
// Simplification: the Cluster object picks the winner directly; a real
// bully algorithm has nodes exchange election messages over the network.
class Node {
  constructor(id) {
    this.id = id;
    this.isAlive = true;
    this.isLeader = false;
    this.leaderId = null;
  }
}

class Cluster {
  constructor(nodeCount) {
    this.nodes = [];
    this.fencingToken = 0;
    for (let i = 0; i < nodeCount; i++) {
      this.nodes.push(new Node(i));
    }
    // Highest ID starts as leader
    this.electLeader();
  }

  electLeader() {
    // Reset all leadership first, so a crashed node never stays marked
    // as leader even when no node is left to replace it
    for (const node of this.nodes) {
      node.isLeader = false;
    }
    // Bully algorithm: highest alive ID wins
    const alive = this.nodes.filter(n => n.isAlive);
    if (alive.length === 0) {
      console.log("No nodes alive!");
      return;
    }
    const winner = alive.reduce((a, b) => (a.id > b.id ? a : b));
    winner.isLeader = true;
    // A fresh token per election lets resources reject the previous leader
    this.fencingToken++;
    for (const node of alive) {
      node.leaderId = winner.id;
    }
    console.log(`Node ${winner.id} elected as leader (token: ${this.fencingToken})`);
  }

  killNode(id) {
    this.nodes[id].isAlive = false;
    console.log(`Node ${id} crashed`);
    // Only the loss of the current leader triggers a new election
    if (this.nodes[id].isLeader) {
      console.log("Leader is down! Starting election...");
      this.electLeader();
    }
  }
}

const cluster = new Cluster(5);
cluster.killNode(4); // Kill the leader
cluster.killNode(3); // Kill the new leader
Interactive Experiment
Try these modifications:
- Add a `reviveNode` method. What should happen when a crashed node comes back online with an old fencing token?
- Implement a heartbeat system: the leader sends heartbeats every second, and followers start an election after 3 missed heartbeats.
- Simulate a network partition where nodes 0-2 cannot communicate with nodes 3-4. What happens with the bully algorithm?
- Why is it dangerous to use a very short heartbeat timeout? What about a very long one?
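As a starting point for the heartbeat exercise, here is a minimal sketch of the follower-side check. The `HeartbeatMonitor` class and its method names are hypothetical, and timestamps are passed in explicitly (rather than read from a timer) so the logic is easy to test; a real implementation would call these methods from a periodic timer.

```javascript
// Hypothetical follower-side failure detector; all names are illustrative.
class HeartbeatMonitor {
  constructor(intervalMs = 1000, maxMissed = 3) {
    this.intervalMs = intervalMs; // how often the leader is expected to send
    this.maxMissed = maxMissed;   // missed heartbeats tolerated before electing
    this.lastHeartbeatAt = 0;
  }

  // Call whenever a heartbeat arrives from the leader.
  recordHeartbeat(nowMs) {
    this.lastHeartbeatAt = nowMs;
  }

  // Whole heartbeat intervals that have elapsed since the last heartbeat.
  missedHeartbeats(nowMs) {
    return Math.floor((nowMs - this.lastHeartbeatAt) / this.intervalMs);
  }

  // True once the leader has been silent for maxMissed intervals.
  shouldStartElection(nowMs) {
    return this.missedHeartbeats(nowMs) >= this.maxMissed;
  }
}

const monitor = new HeartbeatMonitor(1000, 3);
monitor.recordHeartbeat(0);
console.log(monitor.shouldStartElection(2500)); // false: only 2 intervals missed
console.log(monitor.shouldStartElection(3000)); // true: 3 intervals missed
```

The two constructor parameters are the knobs the last question asks about: together they set how long a follower waits before it gives up on the leader.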
Coding Challenge
Write a function called `validateOperation` that takes the current valid fencing token and an operation object with a `token` and `action` field. Return true only if the operation's token matches the current valid token. This prevents stale leaders from executing operations.
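One possible solution sketch follows. The operation shape comes from the challenge text; treating a missing or non-object operation as invalid is an added assumption, not part of the stated requirements.

```javascript
// One possible solution sketch for the challenge above.
function validateOperation(currentToken, operation) {
  // Assumption: malformed operations are rejected rather than throwing.
  if (operation === null || typeof operation !== "object") {
    return false;
  }
  // Strict equality: a token issued to an earlier leader never matches.
  return operation.token === currentToken;
}

console.log(validateOperation(3, { token: 3, action: "write" })); // true
console.log(validateOperation(3, { token: 2, action: "write" })); // false: stale leader
```

A common variant has the protected resource remember the highest token it has seen and reject anything lower, so it does not need to be told the current token separately.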
Real-World Usage
Leader election is critical in production distributed systems:
- ZooKeeper: Uses Zab (ZooKeeper Atomic Broadcast), whose leader election phase picks a leader that coordinates all writes across the ensemble.
- etcd/Raft: Uses the Raft consensus algorithm, where candidates request votes and a candidate that receives votes from a majority of nodes becomes the leader.
- Kafka: Each partition has a leader broker that handles all writes and, by default, all reads for that partition. If the leader fails, an in-sync follower is promoted.
- Redis Sentinel: Monitors Redis instances and automatically promotes a replica to master when the current master fails.
- Kubernetes: Control plane components use lease-based leader election through the API server (with state stored in etcd) to ensure only one scheduler and one controller manager instance is active at a time.
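The majority rule mentioned for Raft can be sketched in a few lines. `winsElection` is a hypothetical helper written for this lesson, not an API from etcd or any Raft library, and it leaves out terms, logs, and timeouts entirely.

```javascript
// Hypothetical helper illustrating the majority rule only.
function winsElection(votesReceived, clusterSize) {
  // Strict majority of the full cluster, counting crashed nodes too.
  const majority = Math.floor(clusterSize / 2) + 1;
  return votesReceived >= majority;
}

console.log(winsElection(3, 5)); // true: 3 of 5 is a majority
console.log(winsElection(2, 5)); // false
console.log(winsElection(2, 4)); // false: a 2-2 split elects nobody
```

Because the threshold is computed from the full cluster size, the smaller side of a network partition can never reach it, which is how majority voting avoids the split brain that a highest-ID rule allows.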