mathematics · 30 min

Gradient Descent

The algorithm that trains neural networks — stepping downhill to minimize loss


Why This Matters

You know that the derivative tells you which direction is uphill. Now flip it: go the opposite direction to walk downhill. That is gradient descent -- the algorithm that trains virtually every neural network and ML model in production today.

The gradient is the multi-dimensional version of the derivative. It is a vector of partial derivatives that points in the direction of steepest ascent. Gradient descent takes a step in the opposite direction, scaled by a learning rate that controls step size. Too big a step and you overshoot. Too small and training takes forever. Getting this right is the art of ML engineering, and it starts with understanding the math.
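The definitions above can be seen end to end in a few lines. The sketch below (illustrative, not part of the lesson's interactive code) approximates the gradient numerically with central differences — one partial derivative per coordinate — and then takes a single descent step scaled by the learning rate:

```javascript
// Numerical gradient via central differences: one partial derivative
// per coordinate. (Real frameworks compute exact gradients with
// backpropagation; this is just to make the definition concrete.)
function numericalGradient(f, w, h = 1e-5) {
  return w.map((_, i) => {
    const plus = w.map((v, j) => (j === i ? v + h : v));
    const minus = w.map((v, j) => (j === i ? v - h : v));
    return (f(plus) - f(minus)) / (2 * h); // partial derivative along axis i
  });
}

// f(w1, w2) = w1^2 + 3*w2^2 has exact gradient [2*w1, 6*w2]
const f = w => w[0] ** 2 + 3 * w[1] ** 2;
const grad = numericalGradient(f, [1, 2]);
console.log(grad); // ~[2, 12]

// One gradient-descent step with learning rate 0.1
const lr = 0.1;
const wNext = [1, 2].map((wi, i) => wi - lr * grad[i]);
console.log(wNext); // ~[0.8, 0.8]
```

Note that the gradient points uphill, so subtracting it moves both coordinates toward the minimum at the origin.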


Visual Model

Loss surface L(w)
  → Current position
  → Compute gradient
  → Take step: w = w - lr * grad
  → New position
  → Converged? If no, loop back to computing the gradient; if yes, stop at the minimum.

The full process at a glance.

Gradient descent iteratively steps downhill until the loss function reaches a minimum.

Code Example

// Gradient descent for f(x) = x^2
// Minimum is at x = 0
function gradientDescent1D(df, x0, lr, steps) {
  let x = x0;
  const history = [x];
  for (let i = 0; i < steps; i++) {
    const grad = df(x);     // derivative at current x
    x = x - lr * grad;      // step opposite to gradient
    history.push(+x.toFixed(6));
  }
  return { result: +x.toFixed(6), history };
}

// f(x) = x^2, derivative = 2x
const result1 = gradientDescent1D(x => 2 * x, 10, 0.1, 20);
console.log("Minimum at x =", result1.result);    // ~0
console.log("Path:", result1.history.slice(0, 6)); // [10, 8, 6.4, ...]

// 2D gradient descent: f(w1, w2) = w1^2 + w2^2
function gradientDescent2D(gradFn, w0, lr, steps) {
  let w = [...w0];
  for (let i = 0; i < steps; i++) {
    const grad = gradFn(w);
    w = w.map((wi, j) => wi - lr * grad[j]);
  }
  return w.map(v => +v.toFixed(6));
}

// Gradient of f(w1,w2) = w1^2 + w2^2 is [2*w1, 2*w2]
const result2 = gradientDescent2D(
  w => [2 * w[0], 2 * w[1]],
  [5, 3],   // start
  0.1,      // learning rate
  50        // steps
);
console.log("2D min at:", result2); // ~[0, 0]

// Effect of learning rate
const slow = gradientDescent1D(x => 2 * x, 10, 0.01, 20);
const fast = gradientDescent1D(x => 2 * x, 10, 0.1, 20);
const tooBig = gradientDescent1D(x => 2 * x, 10, 1.1, 20);
console.log("slow (lr=0.01):", slow.result);    // still far from 0
console.log("good (lr=0.1):", fast.result);     // close to 0
console.log("too big (lr=1.1):", tooBig.result); // diverges!

Interactive Experiment

Try these exercises:

  • Run gradient descent on f(x) = x^2 starting from x = 10 with learning rate 0.1. How many steps does it take to get within 0.01 of the minimum?
  • Try learning rates of 0.01, 0.1, 0.5, and 1.0. How does each affect the speed of convergence?
  • What happens with a learning rate greater than 1 for f(x) = x^2? Why does it diverge?
  • Run 2D gradient descent on f(x, y) = (x - 3)^2 + (y + 2)^2. Does it find (3, -2)?
  • Try starting from 100 different random points. Does gradient descent always find the same minimum for a convex function?
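One way to set up the fourth exercise, reusing the `gradientDescent2D` helper from the code example (repeated here so the snippet runs on its own):

```javascript
// Same helper as in the code example above.
function gradientDescent2D(gradFn, w0, lr, steps) {
  let w = [...w0];
  for (let i = 0; i < steps; i++) {
    const grad = gradFn(w);
    w = w.map((wi, j) => wi - lr * grad[j]);
  }
  return w.map(v => +v.toFixed(6));
}

// f(x, y) = (x - 3)^2 + (y + 2)^2 has gradient [2*(x - 3), 2*(y + 2)]
const min = gradientDescent2D(
  w => [2 * (w[0] - 3), 2 * (w[1] + 2)],
  [0, 0],  // start at the origin
  0.1,     // learning rate
  100      // steps
);
console.log(min); // ~[3, -2]
```

The starting point and step count here are just one choice; try others and watch whether the answer changes.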

Quick Quiz

Coding Challenge

Linear Regression via Gradient Descent

Write a function called `linearRegression` that uses gradient descent to find the best slope (m) and intercept (b) for a set of data points. The loss function is mean squared error: L = (1/n) * sum((y_i - (m * x_i + b))^2). Start with m = 0, b = 0. Use learning rate 0.01 and 1000 iterations. Return an object {m, b} with values rounded to 2 decimal places.
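If you get stuck, the key step is differentiating the MSE with respect to each parameter: dL/dm = -(2/n) * sum(x_i * (y_i - (m*x_i + b))) and dL/db = -(2/n) * sum(y_i - (m*x_i + b)). One possible sketch of the solution (the dataset below is illustrative, not part of the challenge's tests):

```javascript
// Linear regression via gradient descent on mean squared error.
function linearRegression(xs, ys, lr = 0.01, iters = 1000) {
  const n = xs.length;
  let m = 0, b = 0;
  for (let k = 0; k < iters; k++) {
    let dm = 0, db = 0;
    for (let i = 0; i < n; i++) {
      const err = ys[i] - (m * xs[i] + b); // residual for point i
      dm += (-2 * xs[i] * err) / n;        // dL/dm contribution
      db += (-2 * err) / n;                // dL/db contribution
    }
    m -= lr * dm; // step both parameters downhill
    b -= lr * db;
  }
  return { m: +m.toFixed(2), b: +b.toFixed(2) };
}

// Points sampled exactly from y = 2x + 1
const fit = linearRegression([-1.5, -0.5, 0.5, 1.5], [-2, 0, 2, 4]);
console.log(fit); // { m: 2, b: 1 }
```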


Real-World Usage

Gradient descent is the engine behind modern AI:

  • Neural network training: Every deep learning model -- GPT, DALL-E, AlphaFold -- is trained using variants of gradient descent (Adam, SGD with momentum, AdaGrad).
  • Recommendation systems: Collaborative filtering models optimize user-item interaction predictions with gradient descent on embedding vectors.
  • Natural language processing: Word2Vec, BERT, and transformer models learn word representations by minimizing a loss function via gradient descent.
  • Autonomous vehicles: Perception models that detect lanes, signs, and obstacles are all trained with gradient-based optimization.
  • Scientific computing: Protein folding, drug discovery, and climate modeling use gradient-based optimization to fit parameters to observed data.
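As a taste of those variants, here is a minimal sketch of momentum added to the 1D routine from the code example (`momentumDescent1D` is an illustrative name, not a standard API). Momentum keeps a running "velocity" of past gradients, which damps oscillation and speeds travel along consistently downhill directions:

```javascript
// Gradient descent with momentum on f(x) = x^2 (derivative 2x).
function momentumDescent1D(df, x0, lr, beta, steps) {
  let x = x0;
  let v = 0; // velocity: decaying accumulation of past gradients
  for (let i = 0; i < steps; i++) {
    v = beta * v + df(x); // accumulate gradient history
    x = x - lr * v;       // step along the velocity, not the raw gradient
  }
  return +x.toFixed(6);
}

const xMin = momentumDescent1D(x => 2 * x, 10, 0.1, 0.9, 200);
console.log("momentum min at x =", xMin); // ~0
```

Adam and AdaGrad go one step further and adapt the effective learning rate per parameter, but the core loop is the same one you implemented above.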

Connections