Machine Learning · 30 min

Gradient Descent in Practice

How optimization algorithms train machine learning models by iteratively reducing loss


Why This Matters

Knowing that a loss function measures error is not enough — you need a strategy to actually reduce that error. Stochastic Gradient Descent (SGD) and its variants are the engines that power virtually all modern machine learning. From training a simple linear regression to fine-tuning a billion-parameter language model, gradient descent is doing the heavy lifting.

Understanding how optimizers work gives you the ability to debug training problems, tune hyperparameters, and reason about why a model is learning slowly, oscillating, or getting stuck.

Define Terms

Visual Model

The full process at a glance:

  • Parameters: the weights and biases
  • Forward Pass: predict
  • Loss: measure the error
  • Backward Pass: compute gradients
  • Update: w = w - lr * grad
  • Repeat: move to the next mini-batch; each update adjusts the parameters, closing the loop

The training loop: forward pass, compute loss, backward pass, update parameters, repeat.

Code Example

// Mini-batch gradient descent for linear regression
// y = weight * x + bias

function trainLinearModel(data, lr = 0.01, epochs = 100, batchSize = 4) {
  let weight = 0;
  let bias = 0;

  for (let epoch = 0; epoch < epochs; epoch++) {
    // Shuffle data each epoch (Fisher-Yates; sorting with a random
    // comparator is biased and does not produce a uniform shuffle)
    const shuffled = [...data];
    for (let j = shuffled.length - 1; j > 0; j--) {
      const k = Math.floor(Math.random() * (j + 1));
      [shuffled[j], shuffled[k]] = [shuffled[k], shuffled[j]];
    }

    // Process mini-batches
    for (let i = 0; i < shuffled.length; i += batchSize) {
      const batch = shuffled.slice(i, i + batchSize);
      let dWeight = 0;
      let dBias = 0;

      for (const { x, y } of batch) {
        const pred = weight * x + bias;
        const error = pred - y;
        dWeight += (2 * error * x) / batch.length;
        dBias += (2 * error) / batch.length;
      }

      // Update parameters
      weight -= lr * dWeight;
      bias -= lr * dBias;
    }

    if (epoch % 25 === 0) {
      const totalLoss = data.reduce((s, { x, y }) =>
        s + (weight * x + bias - y) ** 2, 0) / data.length;
      console.log(`Epoch ${epoch}: loss=${totalLoss.toFixed(4)}`);
    }
  }
  return { weight: +weight.toFixed(3), bias: +bias.toFixed(3) };
}

const data = [{x:1,y:3},{x:2,y:5},{x:3,y:7},{x:4,y:9},{x:5,y:11}];
console.log(trainLinearModel(data));
// Converges toward weight=2, bias=1 (y = 2x + 1)

Interactive Experiment

Try modifying the code above:

  • Change the learning rate to 0.001 (slower) and 0.5 (faster). What happens to convergence?
  • Set batch size to 1 (pure SGD) vs. the full dataset size (batch gradient descent). Compare the loss curve smoothness.
  • Try training on non-linear data like y = x^2. Can linear regression learn it? What happens to the loss?
  • Add more epochs. How many does it take to get loss very close to zero?

Quick Quiz

Coding Challenge

SGD with Momentum

Implement a function called `sgdWithMomentum` that trains a single-variable linear model (y = w*x) using SGD with momentum. The function takes training data, a learning rate, a momentum factor (beta), and the number of epochs. Momentum maintains a velocity that accumulates past gradients: velocity = beta * velocity + gradient, then w = w - lr * velocity.
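Try the challenge yourself first. For reference, one possible shape for a solution, following the update rule stated above and processing one sample at a time (the dataset and hyperparameter defaults below are illustrative, not part of the challenge spec):

```javascript
// SGD with momentum for y = w * x (no bias), per the challenge spec:
// velocity = beta * velocity + gradient; w = w - lr * velocity
function sgdWithMomentum(data, lr = 0.01, beta = 0.9, epochs = 100) {
  let w = 0;
  let velocity = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { x, y } of data) {        // one sample at a time (pure SGD)
      const grad = 2 * (w * x - y) * x;   // d/dw of (w*x - y)^2
      velocity = beta * velocity + grad;  // accumulate past gradients
      w -= lr * velocity;                 // step along the velocity
    }
  }
  return w;
}

const data = [{x:1,y:2},{x:2,y:4},{x:3,y:6}];
console.log(sgdWithMomentum(data)); // should approach w = 2
```

Because the velocity accumulates past gradients, consecutive steps in the same direction build up speed, which is what lets momentum cut through shallow, consistent slopes faster than plain SGD.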


Real-World Usage

Gradient descent optimizers are the backbone of deep learning training:

  • Large language models: Models like GPT are trained using AdamW (Adam with weight decay) on billions of tokens of text.
  • Image classifiers: CNNs use SGD with momentum to learn visual features from millions of labeled images.
  • Reinforcement learning: Policy gradient methods use variants of gradient descent to optimize agent behavior.
  • Learning rate schedulers: Production training runs often warm up the learning rate, then decay it, following a schedule tuned for the specific task.
  • Distributed training: Gradient descent is parallelized across hundreds of GPUs, each processing different mini-batches and synchronizing gradients.
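The warm-up-then-decay pattern from the scheduler bullet can be sketched as a simple function of the step number. The linear warmup, cosine decay shape, and the specific step counts below are illustrative choices, not a standard API:

```javascript
// Linear warmup followed by cosine decay, a common schedule shape.
// peakLr, warmupSteps, and totalSteps are illustrative values.
function learningRate(step, peakLr = 0.001, warmupSteps = 100, totalSteps = 1000) {
  if (step < warmupSteps) {
    // Warmup: ramp linearly from near 0 up to peakLr
    return peakLr * (step + 1) / warmupSteps;
  }
  // Decay: cosine curve from peakLr down to 0 over the remaining steps
  const progress = (step - warmupSteps) / (totalSteps - warmupSteps);
  return peakLr * 0.5 * (1 + Math.cos(Math.PI * progress));
}

console.log(learningRate(0));    // small: start of warmup
console.log(learningRate(99));   // 0.001: peak, end of warmup
console.log(learningRate(1000)); // ~0: fully decayed
```

The warmup keeps early updates small while gradients are still noisy; the decay lets the model settle into a minimum instead of bouncing around it.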

Connections