databases25 min

Normalization

Organizing data to reduce redundancy and improve integrity

0/9Not Started

Why This Matters

Imagine a spreadsheet where every order row also contains the customer's name, address, and phone number. When a customer moves, you have to update their address in every single order row. Miss one, and now you have conflicting addresses for the same person. This is the problem normalization solves -- it organizes data to eliminate redundancy and prevent inconsistencies.

Normalization is a formal process with specific rules called normal forms. It is how database designers decide which columns belong in which tables. Get it right, and your data stays consistent and your queries stay clean. Get it wrong, and you will fight data corruption and update anomalies for the life of your application.

Define Terms

Visual Model

Denormalizedname, email, city, product, price
1NFAtomic values only
2NFNo partial deps
3NFNo transitive deps
Customersid | name | email
Ordersid | cust_id | date
Productsid | name | price
split arrays
remove partials
remove transitive

The full process at a glance. Click Start tour to walk through each step.

Normalization removes redundancy step by step. Each normal form eliminates a class of data anomalies.

Code Example

Code
-- UNNORMALIZED: customer data repeated in every order
CREATE TABLE orders_bad (
  order_id      INTEGER PRIMARY KEY,
  customer_name TEXT,
  customer_email TEXT,
  customer_city TEXT,
  product_name  TEXT,
  product_price DECIMAL(10,2),
  quantity      INTEGER
);
-- Problem: if Alice moves cities, update EVERY order row
-- Problem: if product price changes, old orders show wrong price

-- NORMALIZED (3NF): separate concerns into distinct tables
CREATE TABLE customers (
  id    INTEGER PRIMARY KEY,
  name  TEXT NOT NULL,
  email TEXT UNIQUE NOT NULL,
  city  TEXT
);

CREATE TABLE products (
  id    INTEGER PRIMARY KEY,
  name  TEXT NOT NULL,
  price DECIMAL(10,2) NOT NULL
);

CREATE TABLE orders (
  id          INTEGER PRIMARY KEY,
  customer_id INTEGER REFERENCES customers(id),
  created_at  TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

CREATE TABLE order_items (
  id         INTEGER PRIMARY KEY,
  order_id   INTEGER REFERENCES orders(id),
  product_id INTEGER REFERENCES products(id),
  quantity   INTEGER NOT NULL,
  unit_price DECIMAL(10,2) NOT NULL  -- snapshot at time of order
);

-- Now: update customer city in ONE place
UPDATE customers SET city = New York WHERE id = 1;

-- Query with JOINs to reconstruct the full picture
SELECT c.name, p.name AS product, oi.quantity, oi.unit_price
FROM orders o
JOIN customers c ON o.customer_id = c.id
JOIN order_items oi ON o.id = oi.order_id
JOIN products p ON oi.product_id = p.id
WHERE o.id = 1;

Interactive Experiment

Try these exercises:

  • Take a flat spreadsheet with repeated data (e.g., order rows with customer info) and normalize it into 3NF by creating separate tables.
  • Identify the update anomaly: change a customer's email in the unnormalized version. How many rows must you update?
  • Create the normalized version and make the same change. How many rows now?
  • Think of a case where denormalization makes sense. When would you store redundant data on purpose?

Quick Quiz

Coding Challenge

Normalize Flat Data

Write a function called `normalize` that takes an array of flat order objects (each with customer_name, customer_email, product_name, product_price, quantity) and returns a normalized structure with separate customers, products, and orders collections. Each entity should have a unique ID, and orders should reference customers and products by ID.

Loading editor...

Real-World Usage

Normalization decisions affect every aspect of application design:

  • Schema design: Database architects normalize to 3NF by default, then selectively denormalize based on query patterns.
  • ORMs: Object-relational mappers expect normalized schemas. Each model class maps to one table, and relationships map to foreign keys.
  • Data warehousing: Analytics databases often denormalize into "star schemas" with a central fact table surrounded by dimension tables, optimized for aggregate queries.
  • Caching layers: Denormalized cache entries (Redis, Memcached) store pre-joined data to avoid expensive real-time JOINs.
  • API design: REST resources often map to normalized tables. A /users/1/orders endpoint performs the JOIN for you.

Connections