Deep Learning · LLMs · 30 min

Transformer Blocks

The repeating building block that powers every modern language model


Why This Matters

A single attention layer is powerful, but not enough to build an LLM. The transformer combines attention with feed-forward networks, residual connections, and layer normalization into a single block that can be stacked dozens or hundreds of times.

GPT-4 is rumored to have over 100 transformer blocks. Claude, BERT, and Llama are all built by stacking these same blocks. Understanding the full block — not just attention — is what separates someone who has heard of transformers from someone who truly understands how language models work.
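The stacking idea can be sketched in a few lines. This is a toy illustration, not a real model: `toyBlock` is a made-up stand-in for a full block's attention and FFN sublayers, and the point is only the repeated structure.

```javascript
// Toy illustration: an LLM is the same block applied N times.
// toyBlock is a placeholder for attention + FFN + residuals + norms.
const toyBlock = (x) => x.map((v) => v * 0.9 + 0.1);

function stackBlocks(input, numBlocks) {
  let x = input;
  for (let i = 0; i < numBlocks; i++) {
    x = toyBlock(x); // each block refines the representation
  }
  return x;
}

console.log(stackBlocks([1, 2, 3], 12)); // 12 blocks, roughly GPT-2-small depth
```

Every block has the same shape, so making a model "deeper" is just increasing the loop count.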


Visual Model

One block, from input to output:

  • Input — embeddings + position
  • LayerNorm
  • Multi-Head Attention
  • Add & Norm (residual connection)
  • FFN — expand then compress
  • Add & Norm (residual connection)
  • Output — to the next block

A transformer block: Input, Multi-Head Attention, Add & Norm, FFN, Add & Norm, Output. Stack N times.
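The sequence above can be written as a single function. This is a minimal sketch in the pre-norm style (normalize before each sublayer, one common variant); `attention` and `ffn` are passed in as placeholders rather than real learned sublayers.

```javascript
// Minimal sketch of one transformer block (pre-norm variant).
function layerNorm(v) {
  const mean = v.reduce((a, b) => a + b, 0) / v.length;
  const variance = v.reduce((a, b) => a + (b - mean) ** 2, 0) / v.length;
  const std = Math.sqrt(variance + 1e-6);
  return v.map((x) => (x - mean) / std);
}

const add = (a, b) => a.map((v, i) => v + b[i]);

function transformerBlock(x, attention, ffn) {
  // Attention sublayer with residual connection
  x = add(x, attention(layerNorm(x)));
  // Feed-forward sublayer with residual connection
  x = add(x, ffn(layerNorm(x)));
  return x;
}

// Demo with made-up stand-in sublayers
const fakeAttention = (v) => v.map((x) => x * 0.5);
const fakeFFN = (v) => v.map((x) => Math.max(0, x)); // ReLU as a stand-in
console.log(transformerBlock([1.0, -0.5, 0.3, 0.8], fakeAttention, fakeFFN));
```

Note how the residual `add` wraps each sublayer: the input always flows through untouched, and the sublayer only contributes a correction.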

Code Example

// Simplified transformer block components

// Layer normalization
function layerNorm(vector) {
  const mean = vector.reduce((a, b) => a + b, 0) / vector.length;
  const variance = vector.reduce((a, b) => a + (b - mean) ** 2, 0) / vector.length;
  const std = Math.sqrt(variance + 1e-6);
  return vector.map(x => (x - mean) / std);
}

// Feed-forward network (expand then compress)
function feedForward(x, w1, b1, w2, b2) {
  // First layer: expand (e.g., 4x) with ReLU
  const hidden = w1.map((row, i) =>
    Math.max(0, row.reduce((s, w, j) => s + w * x[j], 0) + b1[i])
  );
  // Second layer: compress back
  return w2.map((row, i) =>
    row.reduce((s, w, j) => s + w * hidden[j], 0) + b2[i]
  );
}

// Residual connection: add input to layer output
function residualAdd(input, layerOutput) {
  return input.map((val, i) => val + layerOutput[i]);
}

// Demo
const input = [1.0, -0.5, 0.3, 0.8];
console.log("Input:", input);
console.log("After LayerNorm:", layerNorm(input));

const afterAttention = [0.2, 0.1, -0.1, 0.3]; // simulated
const afterResidual = residualAdd(input, afterAttention);
console.log("After residual add:", afterResidual);
console.log("After norm again:", layerNorm(afterResidual));
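The demo above never calls feedForward, so here is a standalone usage sketch with a 4x expansion (dimension 2 → 8 → 2). The weights are made up for illustration, not learned values, and the feedForward definition is repeated so the snippet runs on its own.

```javascript
// Same FFN as above, repeated so this snippet is self-contained
function feedForward(x, w1, b1, w2, b2) {
  const hidden = w1.map((row, i) =>
    Math.max(0, row.reduce((s, w, j) => s + w * x[j], 0) + b1[i])
  );
  return w2.map((row, i) =>
    row.reduce((s, w, j) => s + w * hidden[j], 0) + b2[i]
  );
}

const dModel = 2, dHidden = 8; // 4x expansion
// w1: dHidden rows of dModel weights; w2: dModel rows of dHidden weights
const w1 = Array.from({ length: dHidden }, (_, i) =>
  Array.from({ length: dModel }, (_, j) => ((i + j) % 2 === 0 ? 0.1 : -0.1))
);
const b1 = new Array(dHidden).fill(0.05);
const w2 = Array.from({ length: dModel }, () => new Array(dHidden).fill(0.1));
const b2 = new Array(dModel).fill(0);

console.log(feedForward([1.0, -0.5], w1, b1, w2, b2));
```

The hidden layer is four times wider than the input, which is where most of a transformer block's parameters live.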

Interactive Experiment

Try these exercises:

  • Apply layerNorm to [100, 200, 300, 400] and then to [1, 2, 3, 4]. Are the outputs the same? Why?
  • Remove the residual connection (don't add input back). Pass a vector through 10 rounds of normalization + simulated attention. What happens to the values?
  • Increase the FFN hidden dimension from 4x to 8x. How does this change the number of parameters?
  • What would happen if you used 1 attention head instead of 12? What information might be lost?
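For the parameter-count exercise, the arithmetic can be done directly. This sketch assumes a hypothetical model dimension of 768 (the size used in GPT-2 small); the formula just counts the entries of w1, b1, w2, and b2.

```javascript
// FFN parameters: w1 (d*h) + b1 (h) + w2 (h*d) + b2 (d), where h = d * expansion
function ffnParams(dModel, expansion) {
  const h = dModel * expansion;
  return dModel * h + h + h * dModel + dModel;
}

const d = 768; // assumed model dimension for illustration
console.log(ffnParams(d, 4)); // 4x expansion
console.log(ffnParams(d, 8)); // 8x expansion: roughly double the FFN parameters
```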


Coding Challenge

Implement Layer Normalization

Write a function called `layerNorm` that takes a vector of numbers and returns the layer-normalized version: subtract the mean, divide by the standard deviation (add epsilon=1e-6 for numerical stability). The output should have approximately zero mean and unit variance.


Real-World Usage

The transformer block is the repeating unit behind the most powerful AI systems:

  • GPT-4 / Claude: Stack 80+ transformer blocks to process and generate language with remarkable fluency.
  • BERT: Uses encoder transformer blocks for bidirectional understanding, powering search and classification.
  • Stable Diffusion: Uses transformer blocks in its U-Net architecture to generate images from text descriptions.
  • Whisper: Applies transformer encoder-decoder blocks to convert speech audio into text.
  • AlphaFold 2: Uses a variant called Evoformer — specialized transformer blocks that predict 3D protein structures.

Connections