Why This Matters
Before attention, recurrent neural networks processed sequences one step at a time, left to right, squeezing everything they had read into a single fixed-size hidden state. By the time the model reached the end of a long sentence, much of the beginning had been lost. The attention mechanism solved this by letting every token look directly at every other token, all at once.
The 2017 paper "Attention Is All You Need" replaced recurrence entirely with attention, creating the transformer architecture. This single idea is the foundation of GPT, Claude, BERT, and every modern language model. Understanding attention is understanding how LLMs think.
Visual Model
Attention computes which tokens are relevant to each other, then creates context-aware representations through weighted sums.
Code Example
// Scaled dot-product attention from scratch
function dotProduct(a, b) {
  return a.reduce((sum, val, i) => sum + val * b[i], 0);
}

function softmax(arr) {
  const max = Math.max(...arr); // subtract max for numerical stability
  const exps = arr.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(x => x / sum);
}

function attention(queries, keys, values) {
  const dk = keys[0].length;
  const scale = Math.sqrt(dk);
  const outputs = [];
  for (const q of queries) {
    // Score each key against this query
    const scores = keys.map(k => dotProduct(q, k) / scale);
    // Normalize with softmax
    const weights = softmax(scores);
    console.log("Attention weights:", weights.map(w => w.toFixed(3)));
    // Weighted sum of values
    const output = values[0].map((_, dim) =>
      values.reduce((sum, v, i) => sum + weights[i] * v[dim], 0)
    );
    outputs.push(output);
  }
  return outputs;
}
// 3 tokens, each with 4-dimensional embeddings
const Q = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]];
const K = [[1, 0, 1, 0], [0, 1, 0, 1], [1, 1, 0, 0]];
const V = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]];
const result = attention(Q, K, V);
console.log("Output:", result);
Interactive Experiment
Try these exercises:
- Make two tokens have identical Q and K vectors. What happens to the attention weights between them?
- Set one token's Key to all zeros. Does any Query attend to it? Why or why not?
- Remove the `/ scale` division. Rerun with the same inputs. How do the attention weights change? (They become more extreme.)
- In a sentence like "The bank by the river", which word should "bank" attend to most? Think about how Q/K similarity encodes this.
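The scaling exercise can be checked in a few lines. The raw scores below are made-up dot products for d_k = 4, not outputs from the example above, but they show the pattern: without the sqrt(d_k) division, softmax concentrates almost all the weight on one token.

```javascript
// Sketch for the scaling exercise: compare softmax of raw vs. scaled scores.
function softmax(arr) {
  const max = Math.max(...arr);
  const exps = arr.map(x => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map(x => x / sum);
}

const rawScores = [8, 4, 2];                         // made-up dot products
const scaled = rawScores.map(s => s / Math.sqrt(4)); // divide by sqrt(d_k)

console.log(softmax(rawScores).map(w => w.toFixed(3))); // ["0.980", "0.018", "0.002"]
console.log(softmax(scaled).map(w => w.toFixed(3)));    // ["0.844", "0.114", "0.042"]
```

The unscaled distribution is far more peaked, which is exactly why the paper divides by sqrt(d_k): it keeps gradients from vanishing when dot products grow with dimension.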
Coding Challenge
Write a function called `attentionWeights` that takes a single query vector and an array of key vectors, and returns the attention weights (after scaling by sqrt(d_k) and applying softmax). The weights should sum to 1.
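If you want to check your answer afterward, here is one possible solution. It is a self-contained sketch, not the only correct shape; it inlines the same dot-product and stable-softmax logic used in the code example.

```javascript
// One possible attentionWeights implementation (a sketch, not the only answer).
function attentionWeights(query, keys) {
  const dk = keys[0].length;
  // Dot product of the query with each key, scaled by sqrt(d_k)
  const scores = keys.map(k =>
    k.reduce((sum, val, i) => sum + val * query[i], 0) / Math.sqrt(dk)
  );
  // Numerically stable softmax: the weights sum to 1
  const max = Math.max(...scores);
  const exps = scores.map(s => Math.exp(s - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map(e => e / total);
}

console.log(attentionWeights([1, 0, 1, 0], [[1, 0, 1, 0], [0, 1, 0, 1]]));
// ≈ [0.731, 0.269] — the query attends most to the matching key
```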
Real-World Usage
Attention is the core mechanism behind the most powerful AI systems:
- Large language models: GPT and Claude use self-attention in every layer to understand context across thousands of tokens.
- Machine translation: Attention lets the model align source and target words (e.g., knowing "chat" in French maps to "cat" in English).
- Image recognition: Vision Transformers (ViT) apply attention to image patches, letting the model focus on relevant regions.
- Code completion: Copilot attends to your existing code, comments, and function signatures to predict what comes next.
- Protein folding: AlphaFold uses attention to determine which amino acids interact across a protein sequence.