Why This Matters
Knowing that a loss function measures error is not enough — you need a strategy to actually reduce that error. Stochastic Gradient Descent (SGD) and its variants are the engines that power virtually all modern machine learning. From training a simple linear regression to fine-tuning a billion-parameter language model, gradient descent is doing the heavy lifting.
Understanding how optimizers work lets you debug training problems, tune hyperparameters, and reason about why a model is learning slowly, oscillating, or getting stuck.
Define Terms
Visual Model
The full process at a glance.
The training loop: forward pass, compute loss, backward pass, update parameters, repeat.
Code Example
```javascript
// Mini-batch gradient descent for linear regression
// y = weight * x + bias
function trainLinearModel(data, lr = 0.01, epochs = 100, batchSize = 4) {
  let weight = 0;
  let bias = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {
    // Shuffle data each epoch (Fisher-Yates; a random sort comparator
    // produces a biased shuffle)
    const shuffled = [...data];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    // Process mini-batches
    for (let i = 0; i < shuffled.length; i += batchSize) {
      const batch = shuffled.slice(i, i + batchSize);
      let dWeight = 0;
      let dBias = 0;
      for (const { x, y } of batch) {
        const pred = weight * x + bias;
        const error = pred - y;
        // Gradients of mean squared error, averaged over the batch
        dWeight += (2 * error * x) / batch.length;
        dBias += (2 * error) / batch.length;
      }
      // Update parameters: step opposite the gradient
      weight -= lr * dWeight;
      bias -= lr * dBias;
    }
    if (epoch % 25 === 0) {
      const totalLoss = data.reduce((s, { x, y }) =>
        s + (weight * x + bias - y) ** 2, 0) / data.length;
      console.log(`Epoch ${epoch}: loss=${totalLoss.toFixed(4)}`);
    }
  }
  return { weight: +weight.toFixed(3), bias: +bias.toFixed(3) };
}

const data = [{x:1,y:3},{x:2,y:5},{x:3,y:7},{x:4,y:9},{x:5,y:11}];
console.log(trainLinearModel(data));
// Converges toward weight=2, bias=1 (y = 2x + 1)
```
Interactive Experiment
Try modifying the code above:
- Change the learning rate to 0.001 (slower) and 0.5 (faster). What happens to convergence?
- Set batch size to 1 (pure SGD) vs. the full dataset size (batch gradient descent). Compare the loss curve smoothness.
- Try training on non-linear data like y = x^2. Can linear regression learn it? What happens to the loss?
- Add more epochs. How many does it take to get loss very close to zero?
Quick Quiz
Coding Challenge
Implement a function called `sgdWithMomentum` that trains a single-variable linear model (y = w*x) using SGD with momentum. The function takes training data, a learning rate, a momentum factor (beta), and the number of epochs. Momentum maintains a velocity that accumulates past gradients: velocity = beta * velocity + gradient, then w = w - lr * velocity.
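Try the challenge on your own first. If you get stuck, here is one possible sketch of the momentum update described above; the dataset and hyperparameters below are illustrative, not part of the challenge spec:

```javascript
// SGD with momentum for a single-variable linear model y = w * x.
// velocity accumulates past gradients: velocity = beta * velocity + gradient,
// then w = w - lr * velocity.
function sgdWithMomentum(data, lr = 0.01, beta = 0.9, epochs = 100) {
  let w = 0;
  let velocity = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { x, y } of data) {
      // Gradient of the squared error (w*x - y)^2 with respect to w
      const grad = 2 * (w * x - y) * x;
      velocity = beta * velocity + grad;
      w -= lr * velocity;
    }
  }
  return w;
}

// Illustrative data drawn from y = 2x; w should approach 2
const w = sgdWithMomentum([{ x: 1, y: 2 }, { x: 2, y: 4 }, { x: 3, y: 6 }], 0.01, 0.9, 300);
console.log(w);
```

Because the velocity keeps a decaying sum of past gradients, consecutive steps in the same direction build up speed, which is why momentum often converges faster than plain SGD.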
Real-World Usage
Gradient descent optimizers are the backbone of all deep learning training:
- Large language models: Models like GPT are trained using AdamW (Adam with weight decay) on billions of tokens of text.
- Image classifiers: CNNs use SGD with momentum to learn visual features from millions of labeled images.
- Reinforcement learning: Policy gradient methods use variants of gradient descent to optimize agent behavior.
- Learning rate schedulers: Production training runs often warm up the learning rate, then decay it, following a schedule tuned for the specific task.
- Distributed training: Gradient descent is parallelized across hundreds of GPUs, each processing different mini-batches and synchronizing gradients.
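The warm-up-then-decay pattern mentioned above can be sketched as a plain schedule function. The peak rate, warmup length, and cosine decay shape here are illustrative defaults, not values from any particular framework:

```javascript
// Linear warmup to peakLr, then cosine decay toward zero.
// All three parameters are illustrative and would be tuned per task.
function lrSchedule(step, { peakLr = 0.001, warmupSteps = 100, totalSteps = 1000 } = {}) {
  if (step < warmupSteps) {
    // Warmup: ramp linearly from near 0 up to peakLr
    return peakLr * (step + 1) / warmupSteps;
  }
  // Decay: follow half a cosine from peakLr down to 0
  const progress = (step - warmupSteps) / (totalSteps - warmupSteps);
  return peakLr * 0.5 * (1 + Math.cos(Math.PI * progress));
}

console.log(lrSchedule(0));    // small warmup value
console.log(lrSchedule(99));   // peak learning rate
console.log(lrSchedule(1000)); // decayed to ~0
```

At each training step you would look up the current rate with `lrSchedule(step)` and use it in place of a fixed `lr` in the update rule.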