Why This Matters
Your CPU can execute billions of instructions per second, but it can only work on data that is right next to it. A single trip to main memory takes roughly 100 nanoseconds -- during which a modern processor could have completed hundreds of operations. A trip to an SSD takes tens of microseconds, and a spinning hard drive takes milliseconds. The speed gap between the CPU and storage is enormous.
The memory hierarchy exists to bridge this gap. It places small, extremely fast memory (registers, cache) close to the CPU and progressively larger, slower, cheaper memory (RAM, storage devices) further away. When your code runs slowly, it is often not because of bad algorithms but because of poor memory access patterns that defeat the cache. Understanding this hierarchy is essential for writing performant software at every level, from loop optimization to database tuning to system architecture.
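You can feel this effect directly from a scripting language. The sketch below, with an assumed array size of ~4M elements, walks the same data twice: once in order, once in a precomputed random order. Only the access pattern changes, yet the random walk typically runs several times slower because nearly every access misses the cache.

```javascript
// Sketch: sequential vs random access over the same array.
// Sizes are illustrative; results vary by machine and cache configuration.
const N = 1 << 22; // ~4M elements
const data = new Float64Array(N).fill(1);

// Random visit order, precomputed so shuffle cost is excluded from timing.
const order = new Uint32Array(N);
for (let i = 0; i < N; i++) order[i] = i;
for (let i = N - 1; i > 0; i--) {
  const j = Math.floor(Math.random() * (i + 1));
  [order[i], order[j]] = [order[j], order[i]];
}

console.time("sequential");
let s1 = 0;
for (let i = 0; i < N; i++) s1 += data[i];
console.timeEnd("sequential");

console.time("random");
let s2 = 0;
for (let i = 0; i < N; i++) s2 += data[order[i]];
console.timeEnd("random");
// Both sums equal N; only the order of memory accesses differs.
```

Both loops do identical arithmetic; the gap you measure is almost entirely memory stalls.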
Visual Model
The memory hierarchy pyramid: each level is faster, smaller, and more expensive than the one below it.
Code Example
// Demonstrating cache-friendly vs cache-unfriendly access

// Sequential access (cache-friendly)
function sumRowMajor(matrix, rows, cols) {
  let sum = 0;
  for (let r = 0; r < rows; r++) {
    for (let c = 0; c < cols; c++) {
      sum += matrix[r * cols + c]; // Sequential in memory
    }
  }
  return sum;
}

// Strided access (cache-unfriendly)
function sumColMajor(matrix, rows, cols) {
  let sum = 0;
  for (let c = 0; c < cols; c++) {
    for (let r = 0; r < rows; r++) {
      sum += matrix[r * cols + c]; // Jumps by cols each step
    }
  }
  return sum;
}

// Create a large flat array simulating a 2D matrix
const rows = 1000;
const cols = 1000;
const matrix = new Array(rows * cols);
for (let i = 0; i < matrix.length; i++) matrix[i] = i;

console.time("Row-major (cache-friendly)");
sumRowMajor(matrix, rows, cols);
console.timeEnd("Row-major (cache-friendly)");

console.time("Column-major (cache-unfriendly)");
sumColMajor(matrix, rows, cols);
console.timeEnd("Column-major (cache-unfriendly)");

// Row-major is often 2-5x faster due to cache locality

Interactive Experiment
Try these exercises:
- Run the row-major vs column-major benchmark above. How much faster is the cache-friendly version on your machine?
- Increase the matrix size from 1000x1000 to 5000x5000. Does the speed difference grow or shrink? Why?
- Try summing every other element (a stride of i += 2) vs every element. How does stride length affect performance?
- In the row-major version, what happens if you process the array backwards? Is it still cache-friendly?
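As a starting point for the stride exercise, here is one possible benchmark sketch. The buffer size and stride values are illustrative, not tied to any particular cache; the pattern to look for is that time per traversal does not shrink proportionally as the stride grows, because each cache line is fetched whole regardless of how much of it you use.

```javascript
// Sketch: how stride length affects traversal time over the same buffer.
const len = 1 << 24; // ~16M elements (illustrative size)
const buf = new Float64Array(len).fill(1);

const results = {};
for (const stride of [1, 2, 4, 8, 16]) {
  console.time(`stride ${stride}`);
  let sum = 0;
  for (let i = 0; i < len; i += stride) sum += buf[i];
  console.timeEnd(`stride ${stride}`);
  results[stride] = sum; // equals the number of elements visited
}
```

Once the stride exceeds the number of elements per cache line (8 doubles on a typical 64-byte line), every access touches a new line and the time per traversal stops improving.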
Coding Challenge
Write a function called `transpose` that takes a 2D array (array of arrays) and returns its transpose (rows become columns, columns become rows). The catch: use a cache-friendly access pattern. Iterate so that writes are sequential in memory. For a matrix [[1,2,3],[4,5,6]], the transpose is [[1,4],[2,5],[3,6]].
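If you get stuck, here is one possible sketch (try the challenge yourself first). It builds each output row in full before moving to the next, so writes walk the result sequentially; the strided reads from the input are the unavoidable cost of a transpose.

```javascript
// One cache-friendly sketch: iterate over the OUTPUT so writes are
// sequential in memory; reads from the input are strided instead.
function transpose(matrix) {
  const rows = matrix.length;
  const cols = matrix[0].length;
  const result = [];
  for (let c = 0; c < cols; c++) {
    const row = new Array(rows); // output row c, filled left to right
    for (let r = 0; r < rows; r++) {
      row[r] = matrix[r][c];
    }
    result.push(row);
  }
  return result;
}

console.log(transpose([[1, 2, 3], [4, 5, 6]]));
// [[1, 4], [2, 5], [3, 6]]
```

For large matrices, production code often goes further and transposes in cache-sized tiles so both reads and writes stay local; that refinement is left as an extension.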
Real-World Usage
The memory hierarchy shapes the design of every high-performance system:
- DDR5 RAM doubles the bandwidth of DDR4, reaching 6400 MT/s and beyond. Modern CPUs pair multi-channel DDR5 with ever-larger L3 caches (Apple M-series chips have up to 192 MB of system-level cache).
- NVMe SSDs connect directly to the CPU via PCIe, bypassing the old SATA bottleneck. They deliver sequential reads exceeding 7 GB/s, closing the gap between storage and RAM.
- CPU cache wars: Intel and AMD compete on L3 cache size. AMD 3D V-Cache technology stacks extra cache vertically on the chip, tripling L3 to 96 MB+ and delivering 15-25% gaming performance gains purely from better cache hit rates.
- Database design is fundamentally shaped by the hierarchy. B-tree nodes are sized to fill a storage page, so each node arrives in a single I/O; cache-conscious in-memory variants size nodes to cache lines instead. Column-oriented databases (ClickHouse, DuckDB) achieve high throughput because sequential scans exploit spatial locality.
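The B-tree point can be made concrete with a back-of-envelope calculation. The sizes below are assumptions (an 8-byte key, an 8-byte child pointer, a 4 KB node, no header overhead); real storage engines reserve space for headers and slack, so actual fanout is somewhat lower.

```javascript
// Back-of-envelope sketch: B-tree fanout for a page-sized node.
// Key/pointer/page sizes are assumed; real engines add header overhead.
const pageSize = 4096;   // bytes per node (one storage page)
const keySize = 8;       // bytes per key
const pointerSize = 8;   // bytes per child pointer
// A node holding k keys needs k + 1 child pointers.
const fanout = Math.floor((pageSize - pointerSize) / (keySize + pointerSize));
console.log(fanout); // 255 keys per node -> branching factor 256

// With that branching factor, a billion keys fit in very few levels.
const keys = 1e9;
const levels = Math.ceil(Math.log(keys) / Math.log(fanout + 1));
console.log(levels); // 4 page reads reach any of a billion keys
```

This is exactly why a wide, shallow tree beats a binary tree on disk: each page read pulls in hundreds of keys at once instead of one.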