Engineering Fluency OS

Why This Matters

What happens when your database server crashes? If it is the only copy of your data, your application is down until the server is fixed -- and if the disk fails, the data may be gone forever. Replication copies your data to multiple servers so that if one fails, another can take over with little or no downtime.

But replication is not just about reliability. If your app serves users worldwide, a primary-replica setup lets read queries go to the nearest replica, cutting response times dramatically. The tradeoff is replication lag -- the delay between a write on the primary and its appearance on replicas. Understanding this tradeoff is essential for building resilient, globally distributed systems.

Visual Model

Diagram: the application sends writes (INSERT / UPDATE) to the primary (leader / master) and reads (SELECT) to the replicas. The primary streams its WAL to Replica 1 synchronously and to Replica 2 asynchronously.

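The primary handles writes and the replicas handle reads. Synchronous replication prevents the loss of committed writes on failover, at the cost of slower writes; asynchronous replication acknowledges writes faster but can lose the most recent ones if the primary fails.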
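
The difference between the two modes can be sketched with a toy model in Python. This is an illustration under simplifying assumptions, not how PostgreSQL is implemented: the `Primary` and `Replica` classes, the in-memory queue, and the `ship_async` method are all hypothetical names invented for this example.

```python
from collections import deque


class Replica:
    """A replica that applies changes shipped from the primary."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value


class Primary:
    """Toy primary with one synchronous and one asynchronous replica."""

    def __init__(self, sync_replica, async_replica):
        self.data = {}
        self.sync_replica = sync_replica
        self.async_replica = async_replica
        self.async_queue = deque()  # changes not yet shipped to the async replica

    def write(self, key, value):
        self.data[key] = value
        # Synchronous: the write is not acknowledged until the replica has it.
        self.sync_replica.apply(key, value)
        # Asynchronous: queue the change and acknowledge right away.
        self.async_queue.append((key, value))
        return "ack"

    def ship_async(self):
        """Deliver queued changes to the async replica (runs some time later)."""
        while self.async_queue:
            self.async_replica.apply(*self.async_queue.popleft())


replica1 = Replica("replica1")  # synchronous
replica2 = Replica("replica2")  # asynchronous
primary = Primary(replica1, replica2)

primary.write("alice", "alice@test.com")

# Suppose the primary crashes here, before ship_async() runs.
print("alice" in replica1.data)  # True: promoting replica1 keeps the write
print("alice" in replica2.data)  # False: promoting replica2 would lose it

primary.ship_async()
print("alice" in replica2.data)  # True once the queued change has been applied
```

In this model the window between `write` and `ship_async` is the replication lag: any write that sits in the queue when the primary fails exists only on the primary and on the synchronous replica.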

Code Example

-- PostgreSQL: Setting up streaming replication

-- On the PRIMARY server:
-- postgresql.conf
-- wal_level = replica
-- max_wal_senders = 3
-- synchronous_standby_names = 'replica1'   (must match the standby's application_name)

-- Create a replication user
CREATE ROLE replicator WITH REPLICATION LOGIN PASSWORD 'secret';

-- On the REPLICA server:
-- Use pg_basebackup to clone the primary
-- $ pg_basebackup -h primary-host -U replicator -D /var/lib/postgresql/data -Fp -Xs -R

-- Application routing: writes to primary, reads to replica
-- Primary connection (for writes)
-- host=primary-db port=5432 dbname=myapp
INSERT INTO users (name, email)
VALUES ('Alice', 'alice@test.com');

-- Replica connection (for reads)
-- host=replica-db port=5432 dbname=myapp
SELECT * FROM users WHERE email = 'alice@test.com';
-- WARNING: may not see Alice yet if replication lag > 0!

-- Check replication lag on the primary
SELECT client_addr,
       state,
       sent_lsn,
       write_lsn,
       flush_lsn,
       replay_lsn,
       (sent_lsn - replay_lsn) AS replication_lag_bytes  -- WAL bytes sent but not yet replayed
FROM pg_stat_replication;

-- Synchronous vs Asynchronous:
-- Synchronous: primary waits for replica to confirm write
--   Slower writes, no loss of committed data when failing over to that replica
-- Asynchronous: primary does NOT wait
--   Faster writes, possible data loss on failover
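
The LSN columns in the lag query are printed as two hexadecimal numbers separated by a slash, and subtracting one LSN from another gives a distance in bytes of WAL. As a rough sketch of what that subtraction means, the Python below converts the textual form to an integer. The function names `lsn_to_int` and `lag_bytes` are made up for this example, and the sample values are illustrative rather than taken from a real server.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a textual LSN such as '16/B374D848' to a byte position."""
    high, low = lsn.split("/")
    # The part before the slash is the high 32 bits, the part after is the low 32 bits.
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(sent_lsn: str, replay_lsn: str) -> int:
    """Bytes of WAL sent by the primary but not yet replayed by the replica."""
    return lsn_to_int(sent_lsn) - lsn_to_int(replay_lsn)


# Illustrative values only.
print(lag_bytes("16/B374D848", "16/B374D000"))  # 2120
print(lag_bytes("17/00000010", "16/FFFFFFF0"))  # 32
```

A lag of zero means the replica has replayed everything the primary has sent; a steadily growing value suggests the replica is falling behind.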

Interactive Experiment

Try these exercises:

Simulate replication with two dictionaries: a primary store and a replica store. Write to primary, then copy changes to replica with a simulated delay.
Demonstrate replication lag: write to primary, immediately read from replica (it should not have the new data yet), wait, then read again.
Simulate failover: "crash" the primary (disable writes) and promote the replica to primary.
Implement read-your-own-writes: after a write, route the next read to the primary instead of the replica.
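
One possible way to approach the read-your-own-writes exercise is to remember, per user, the position of that user's last write in the change log, and to send a read to the replica only once the replica has applied at least that much. The Python sketch below assumes a single replica and an in-memory log; the `Router` class and all of its attribute names are hypothetical.

```python
class Router:
    """Routes a read to the replica only when it has caught up with the caller."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self.log = []             # ordered (key, value) writes, standing in for the WAL
        self.replica_applied = 0  # number of log entries the replica has applied
        self.last_write = {}      # user -> log position of that user's latest write

    def write(self, user, key, value):
        self.primary[key] = value
        self.log.append((key, value))
        self.last_write[user] = len(self.log)

    def replicate(self):
        """Apply every outstanding log entry to the replica."""
        for key, value in self.log[self.replica_applied:]:
            self.replica[key] = value
        self.replica_applied = len(self.log)

    def read(self, user, key):
        """Return (value, source), using the primary if the replica is stale for this user."""
        if self.last_write.get(user, 0) > self.replica_applied:
            return self.primary.get(key), "primary"
        return self.replica.get(key), "replica"


router = Router()
router.write("alice", "email", "alice@test.com")

print(router.read("alice", "email"))  # ('alice@test.com', 'primary')
print(router.read("bob", "email"))    # (None, 'replica')

router.replicate()
print(router.read("alice", "email"))  # ('alice@test.com', 'replica')
```

Alice sees her own write immediately because her read is routed to the primary, while Bob, who has written nothing, still reads from the lagging replica. Comparing log positions instead of waiting a fixed time keeps the example deterministic; a time-based window after each write is a simpler alternative with weaker guarantees.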

Coding Challenge

Simulate Primary-Replica Replication

Write a class called `ReplicatedDB` with a primary store and a replica store. `write(key, value)` writes to the primary immediately. `replicate()` copies all unreplicated changes to the replica (simulating WAL streaming). `readPrimary(key)` reads from the primary. `readReplica(key)` reads from the replica (which may be behind). Track replication lag as the number of unreplicated writes.
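
As a starting point, here is a minimal Python sketch of one way the class could look. It is not a reference solution: it assumes Python, so the methods are spelled `read_primary` and `read_replica` rather than `readPrimary` and `readReplica`, and the `lag` property and the `_pending` list are naming choices made for this example.

```python
class ReplicatedDB:
    """Toy primary-replica store with explicit, manual replication."""

    def __init__(self):
        self._primary = {}
        self._replica = {}
        self._pending = []  # writes not yet applied to the replica, in order

    def write(self, key, value):
        """Write to the primary immediately and record the change for later shipping."""
        self._primary[key] = value
        self._pending.append((key, value))

    def replicate(self):
        """Apply all unreplicated changes to the replica and return how many were applied."""
        applied = len(self._pending)
        for key, value in self._pending:
            self._replica[key] = value
        self._pending.clear()
        return applied

    def read_primary(self, key):
        return self._primary.get(key)

    def read_replica(self, key):
        """Read from the replica, which may be behind the primary."""
        return self._replica.get(key)

    @property
    def lag(self):
        """Replication lag, measured as the number of unreplicated writes."""
        return len(self._pending)


db = ReplicatedDB()
db.write("a", 1)
db.write("b", 2)
print(db.lag)                # 2
print(db.read_replica("a"))  # None
db.replicate()
print(db.lag)                # 0
print(db.read_replica("a"))  # 1
```

Keeping the pending writes in an ordered list, rather than copying the whole primary dictionary, mirrors the idea of WAL streaming: the replica replays changes in the order they were made.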


Real-World Usage

Replication is the foundation of reliable production systems:

High availability: Services like Amazon RDS and Google Cloud SQL automatically maintain replicas and handle failover. Your app stays online even when hardware fails.
Global distribution: Companies like Netflix and Spotify place read replicas in multiple regions (US, EU, Asia) so users read from the nearest server.
Read scaling: Twitter, with hundreds of thousands of reads per second, uses replicas to distribute the read load across many servers.
Disaster recovery: Cross-region replication ensures data survives even if an entire data center goes offline.
Blue-green deployments: Replicas can be used to test new application versions against a copy of production data before switching traffic.
