Why This Matters
Before attention, recurrent neural networks processed sequences one step at a time, left to right, squeezing everything they had read into a single fixed-size hidden state. By the time the model reached the end of a long sentence, much of the beginning had been lost. The attention mechanism solved this by letting every token look directly at every other token, all at once.
The 2017 paper "Attention Is All You Need" replaced recurrence entirely with attention, creating the transformer architecture. This single idea is the foundation of GPT, Claude, BERT, and every modern language model. Understanding attention is understanding how LLMs think.
Visual Model
Attention computes which tokens are relevant to each other, then creates context-aware representations through weighted sums.
Code Example
// Scaled dot-product attention from scratch
function dotProduct(a, b) {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

function softmax(arr) {
  const max = Math.max(...arr); // subtract max for numerical stability
  const exps = arr.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(x => x / sum);
}

function attention(queries, keys, values) {
  const dk = keys[0].length;
  const scale = Math.sqrt(dk);
  const outputs = [];
  for (const q of queries) {
    // Score each key against this query
    const scores = keys.map(k => dotProduct(q, k) / scale);
    // Normalize with softmax
    const weights = softmax(scores);
    console.log("Attention weights:", weights.map(w => w.toFixed(3)));
    // Weighted sum of values
    const output = values[0].map((_, dim) =>
      values.reduce((sum, v, i) => sum + weights[i] * v[dim], 0)
    );
    outputs.push(output);
  }
  return outputs;
}
// 3 tokens, each with 4-dimensional embeddings
const Q = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]];
const K = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]];
const V = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]];
const result = attention(Q, K, V);
console.log("Output:", result);
Interactive Experiment
Try these exercises:
- Make two tokens have identical Q and K vectors. What happens to the attention weights between them?
- Set one token's Key to all zeros. Does any Query attend to it? Why or why not?
- Remove the `/ scale` division. Rerun with the same inputs. How do the attention weights change? (They become more extreme.)
- In a sentence like "The bank by the river", which word should "bank" attend to most? Think about how Q/K similarity encodes this.
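The scaling exercise can be checked in a few lines. The raw scores below are made-up dot products for d_k = 4, not outputs from the example above, but they show the pattern: without the sqrt(d_k) division, softmax concentrates almost all the weight on one token.

```javascript
// Sketch for the scaling exercise: compare softmax of raw vs. scaled scores.
function softmax(arr) {
  const max = Math.max(...arr);
  const exps = arr.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(x => x / sum);
}

const rawScores = [8, 4, 2];                         // made-up dot products
const scaled = rawScores.map(s => s / Math.sqrt(4)); // divide by sqrt(d_k)

console.log(softmax(rawScores).map(w => w.toFixed(3))); // ["0.980", "0.018", "0.002"]
console.log(softmax(scaled).map(w => w.toFixed(3)));    // ["0.844", "0.114", "0.042"]
```

The unscaled distribution is far more peaked, which is exactly why the paper divides by sqrt(d_k): it keeps gradients from vanishing when dot products grow with dimension.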
Coding Challenge
Write a function called `attentionWeights` that takes a single query vector and an array of key vectors, and returns the attention weights (after scaling by sqrt(d_k) and applying softmax). The weights should sum to 1.
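If you want to check your answer afterward, here is one possible solution. It is a self-contained sketch, not the only correct shape; it inlines the same dot-product and stable-softmax logic used in the code example.

```javascript
// One possible attentionWeights implementation (a sketch, not the only answer).
function attentionWeights(query, keys) {
  const dk = keys[0].length;
  // Dot product of the query with each key, scaled by sqrt(d_k)
  const scores = keys.map(k =>
    k.reduce((sum, val, i) => sum + val * query[i], 0) / Math.sqrt(dk)
  );
  // Numerically stable softmax: the weights sum to 1
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total);
}

console.log(attentionWeights([1, 0, 1, 0], [[1, 0, 1, 0], [0, 1, 0, 1]]));
// ≈ [0.731, 0.269] — the query attends most to the matching key
```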
Real-World Usage
Attention is the core mechanism behind the most powerful AI systems:
- Large language models: GPT and Claude use self-attention in every layer to understand context across thousands of tokens.
- Machine translation: Attention lets the model align source and target words (e.g., knowing "chat" in French maps to "cat" in English).
- Image recognition: Vision Transformers (ViT) apply attention to image patches, letting the model focus on relevant regions.
- Code completion: Copilot attends to your existing code, comments, and function signatures to predict what comes next.
- Protein folding: AlphaFold uses attention to determine which amino acids interact across a protein sequence.