Why This Matters
A single attention layer is powerful, but not enough to build an LLM. The transformer combines attention with feed-forward networks, residual connections, and layer normalization into a single block that can be stacked dozens or hundreds of times.
GPT-4 is rumored to have over 100 transformer blocks. Claude, BERT, and Llama are all built by stacking these same blocks. Understanding the full block — not just attention — is what separates someone who has heard of transformers from someone who truly understands how language models work.
Visual Model
The full process at a glance.
A transformer block: Input, Multi-Head Attention, Add and Norm, FFN, Add and Norm, Output. Stack N times.
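The flow above can be sketched in code. This is a minimal sketch of the post-norm block order shown in the diagram (attention, add and norm, FFN, add and norm); `selfAttention` here is a hypothetical placeholder that just averages over positions, standing in for real multi-head attention, and `identityFFN` is a trivial stand-in for the feed-forward network defined in the code example below.

```javascript
// Layer normalization: zero mean, unit variance per vector
function layerNorm(v) {
  const mean = v.reduce((a, b) => a + b, 0) / v.length;
  const variance = v.reduce((a, b) => a + (b - mean) ** 2, 0) / v.length;
  return v.map(x => (x - mean) / Math.sqrt(variance + 1e-6));
}

// Placeholder attention: every position attends equally to all positions.
// NOT real multi-head attention -- just enough to show the block's wiring.
function selfAttention(tokens) {
  const dim = tokens[0].length;
  const avg = Array.from({ length: dim }, (_, j) =>
    tokens.reduce((s, t) => s + t[j], 0) / tokens.length
  );
  return tokens.map(() => avg.slice());
}

// One transformer block, post-norm order: sublayer -> residual add -> norm
function transformerBlock(tokens, ffn) {
  // 1. Attention + residual + norm
  const attnOut = selfAttention(tokens);
  const x1 = tokens.map((t, i) =>
    layerNorm(t.map((v, j) => v + attnOut[i][j]))
  );
  // 2. Feed-forward + residual + norm, applied per position
  return x1.map(t => {
    const f = ffn(t);
    return layerNorm(t.map((v, j) => v + f[j]));
  });
}

// Demo with a trivial identity FFN on two 3-dimensional token vectors
const identityFFN = t => t.slice();
const out = transformerBlock([[1, 0, 2], [0, 1, 1]], identityFFN);
console.log(out);
```

To "stack N times," you would feed `out` back into `transformerBlock` repeatedly; each output vector is normalized, so every position leaves the block with roughly zero mean.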
Code Example
// Simplified transformer block components

// Layer normalization
function layerNorm(vector) {
  const mean = vector.reduce((a, b) => a + b, 0) / vector.length;
  const variance = vector.reduce((a, b) => a + (b - mean) ** 2, 0) / vector.length;
  const std = Math.sqrt(variance + 1e-6);
  return vector.map(x => (x - mean) / std);
}

// Feed-forward network (expand then compress)
function feedForward(x, w1, b1, w2, b2) {
  // First layer: expand (e.g., 4x) with ReLU
  const hidden = w1.map((row, i) =>
    Math.max(0, row.reduce((s, w, j) => s + w * x[j], 0) + b1[i])
  );
  // Second layer: compress back
  return w2.map((row, i) =>
    row.reduce((s, w, j) => s + w * hidden[j], 0) + b2[i]
  );
}

// Residual connection: add input to layer output
function residualAdd(input, layerOutput) {
  return input.map((val, i) => val + layerOutput[i]);
}

// Demo
const input = [1.0, -0.5, 0.3, 0.8];
console.log("Input:", input);
console.log("After LayerNorm:", layerNorm(input));

const afterAttention = [0.2, 0.1, -0.1, 0.3]; // simulated attention output
const afterResidual = residualAdd(input, afterAttention);
console.log("After residual add:", afterResidual);
console.log("After norm again:", layerNorm(afterResidual));

Interactive Experiment
Try these exercises:
- Apply `layerNorm` to `[100, 200, 300, 400]` and then to `[1, 2, 3, 4]`. Are the outputs the same? Why?
- Remove the residual connection (don't add input back). Pass a vector through 10 rounds of normalization + simulated attention. What happens to the values?
- Increase the FFN hidden dimension from 4x to 8x. How does this change the number of parameters?
- What would happen if you used 1 attention head instead of 12? What information might be lost?
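For the FFN exercise, the parameter count can be computed directly: with model dimension d and hidden dimension h, the two weight matrices and biases contribute d·h + h + h·d + d parameters. A quick sketch (d = 768 is an illustrative assumption, the dimension used by GPT-2 small):

```javascript
// Parameters in a two-layer FFN: W1 (d x h) + b1 (h) + W2 (h x d) + b2 (d)
function ffnParams(d, expansion) {
  const h = d * expansion;
  return d * h + h + h * d + d;
}

const d = 768; // illustrative model dimension
console.log("4x expansion:", ffnParams(d, 4)); // 4722432
console.log("8x expansion:", ffnParams(d, 8)); // 9444096
```

Doubling the expansion factor roughly doubles the FFN's parameters, since the two d·h weight matrices dominate the bias terms.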
Coding Challenge
Write a function called `layerNorm` that takes a vector of numbers and returns the layer-normalized version: subtract the mean, then divide by the standard deviation (add epsilon = 1e-6 to the variance before the square root for numerical stability). The output should have approximately zero mean and unit variance.
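One way to check a solution is to verify that the output has near-zero mean and near-unit standard deviation. The reference `layerNorm` below is the same implementation shown in the code example above; `checkLayerNorm` is a hypothetical test harness, not part of the challenge itself.

```javascript
// Reference implementation (from the code example above)
function layerNorm(vector) {
  const mean = vector.reduce((a, b) => a + b, 0) / vector.length;
  const variance = vector.reduce((a, b) => a + (b - mean) ** 2, 0) / vector.length;
  const std = Math.sqrt(variance + 1e-6);
  return vector.map(x => (x - mean) / std);
}

// Hypothetical harness: does fn's output have mean ~ 0 and std ~ 1?
function checkLayerNorm(fn, input) {
  const out = fn(input);
  const mean = out.reduce((a, b) => a + b, 0) / out.length;
  const variance = out.reduce((a, b) => a + (b - mean) ** 2, 0) / out.length;
  return Math.abs(mean) < 1e-6 && Math.abs(Math.sqrt(variance) - 1) < 1e-3;
}

console.log(checkLayerNorm(layerNorm, [100, 200, 300, 400])); // true
console.log(checkLayerNorm(layerNorm, [1, 2, 3, 4]));         // true
```

The std check uses a loose 1e-3 tolerance because the epsilon inside the square root pulls the output's standard deviation slightly below 1.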
Real-World Usage
The transformer block is the repeating unit behind the most powerful AI systems:
- GPT-4 / Claude: Stack 80+ transformer blocks to process and generate language with remarkable fluency.
- BERT: Uses encoder transformer blocks for bidirectional understanding, powering search and classification.
- Stable Diffusion: Uses transformer blocks in its U-Net architecture to generate images from text descriptions.
- Whisper: Applies transformer encoder-decoder blocks to convert speech audio into text.
- AlphaFold 2: Uses a variant called Evoformer — specialized transformer blocks that predict 3D protein structures.