Why This Matters
Knowing that a loss function measures error is not enough — you need a strategy to actually reduce that error. Stochastic Gradient Descent (SGD) and its variants are the engines that power virtually all modern machine learning. From training a simple linear regression to fine-tuning a billion-parameter language model, gradient descent is doing the heavy lifting.
Understanding how optimizers work lets you debug training problems, tune hyperparameters, and reason about why a model is learning slowly, oscillating, or getting stuck.
Define Terms
Visual Model
The full process at a glance.
The training loop: forward pass, compute loss, backward pass, update parameters, repeat.
Code Example
```javascript
// Mini-batch gradient descent for linear regression
// y = weight * x + bias
function trainLinearModel(data, lr = 0.01, epochs = 100, batchSize = 4) {
  let weight = 0;
  let bias = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {
    // Shuffle data each epoch (Fisher-Yates; a random sort comparator
    // produces a biased shuffle)
    const shuffled = [...data];
    for (let i = shuffled.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
    }
    // Process mini-batches
    for (let i = 0; i < shuffled.length; i += batchSize) {
      const batch = shuffled.slice(i, i + batchSize);
      let dWeight = 0;
      let dBias = 0;
      for (const { x, y } of batch) {
        const pred = weight * x + bias;
        const error = pred - y;
        // Gradients of mean squared error, averaged over the batch
        dWeight += (2 * error * x) / batch.length;
        dBias += (2 * error) / batch.length;
      }
      // Update parameters: step opposite the gradient
      weight -= lr * dWeight;
      bias -= lr * dBias;
    }
    if (epoch % 25 === 0) {
      const totalLoss = data.reduce((s, { x, y }) =>
        s + (weight * x + bias - y) ** 2, 0) / data.length;
      console.log(`Epoch ${epoch}: loss=${totalLoss.toFixed(4)}`);
    }
  }
  return { weight: +weight.toFixed(3), bias: +bias.toFixed(3) };
}

const data = [{x:1,y:3},{x:2,y:5},{x:3,y:7},{x:4,y:9},{x:5,y:11}];
console.log(trainLinearModel(data));
// Converges toward weight=2, bias=1 (y = 2x + 1)
```
Interactive Experiment
Try modifying the code above:
- Change the learning rate to 0.001 (slower) and 0.5 (faster). What happens to convergence?
- Set batch size to 1 (pure SGD) vs. the full dataset size (batch gradient descent). Compare the loss curve smoothness.
- Try training on non-linear data like y = x^2. Can linear regression learn it? What happens to the loss?
- Add more epochs. How many does it take to get loss very close to zero?
Quick Quiz
Coding Challenge
Implement a function called `sgdWithMomentum` that trains a single-variable linear model (y = w*x) using SGD with momentum. The function takes training data, a learning rate, a momentum factor (beta), and the number of epochs. Momentum maintains a velocity that accumulates past gradients: velocity = beta * velocity + gradient, then w = w - lr * velocity.
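Try the challenge on your own first. If you get stuck, here is one possible sketch of the momentum update described above; the dataset and hyperparameters below are illustrative, not part of the challenge spec:

```javascript
// SGD with momentum for a single-variable linear model y = w * x.
// velocity accumulates past gradients: velocity = beta * velocity + gradient,
// then w = w - lr * velocity.
function sgdWithMomentum(data, lr = 0.01, beta = 0.9, epochs = 100) {
  let w = 0;
  let velocity = 0;
  for (let epoch = 0; epoch < epochs; epoch++) {
    for (const { x, y } of data) {
      // Gradient of the squared error (w*x - y)^2 with respect to w
      const grad = 2 * (w * x - y) * x;
      velocity = beta * velocity + grad;
      w -= lr * velocity;
    }
  }
  return w;
}

// Illustrative data drawn from y = 2x; w should approach 2
const w = sgdWithMomentum([{ x: 1, y: 2 }, { x: 2, y: 4 }, { x: 3, y: 6 }], 0.01, 0.9, 300);
console.log(w);
```

Because the velocity keeps a decaying sum of past gradients, consecutive steps in the same direction build up speed, which is why momentum often converges faster than plain SGD.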
Real-World Usage
Gradient descent optimizers are the backbone of all deep learning training:
- Large language models: Models like GPT are trained using AdamW (Adam with weight decay) on billions of tokens of text.
- Image classifiers: CNNs use SGD with momentum to learn visual features from millions of labeled images.
- Reinforcement learning: Policy gradient methods use variants of gradient descent to optimize agent behavior.
- Learning rate schedulers: Production training runs often warm up the learning rate, then decay it, following a schedule tuned for the specific task.
- Distributed training: Gradient descent is parallelized across hundreds of GPUs, each processing different mini-batches and synchronizing gradients.
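The warm-up-then-decay pattern mentioned above can be sketched as a plain schedule function. The peak rate, warmup length, and cosine decay shape here are illustrative defaults, not values from any particular framework:

```javascript
// Linear warmup to peakLr, then cosine decay toward zero.
// All three parameters are illustrative and would be tuned per task.
function lrSchedule(step, { peakLr = 0.001, warmupSteps = 100, totalSteps = 1000 } = {}) {
  if (step < warmupSteps) {
    // Warmup: ramp linearly from near 0 up to peakLr
    return peakLr * (step + 1) / warmupSteps;
  }
  // Decay: follow half a cosine from peakLr down to 0
  const progress = (step - warmupSteps) / (totalSteps - warmupSteps);
  return peakLr * 0.5 * (1 + Math.cos(Math.PI * progress));
}

console.log(lrSchedule(0));    // small warmup value
console.log(lrSchedule(99));   // peak learning rate
console.log(lrSchedule(1000)); // decayed to ~0
```

At each training step you would look up the current rate with `lrSchedule(step)` and use it in place of a fixed `lr` in the update rule.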