Why This Matters
You know that the derivative tells you which direction is uphill. Now flip it: go the opposite direction to walk downhill. That is gradient descent -- the algorithm that trains virtually every neural network and ML model in production today.
The gradient is the multi-dimensional version of the derivative. It is a vector of partial derivatives that points in the direction of steepest ascent. Gradient descent takes a step in the opposite direction, scaled by a learning rate that controls step size. Too big a step and you overshoot. Too small and training takes forever. Getting this right is the art of ML engineering, and it starts with understanding the math.
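In symbols, each update is x_new = x - lr * f'(x) in one dimension, or w_new = w - lr * grad(L(w)) when w is a vector of parameters and L is the loss.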
Visual Model
Gradient descent iteratively steps downhill until the loss function reaches a minimum.
Code Example
// Gradient descent for f(x) = x^2
// Minimum is at x = 0
function gradientDescent1D(df, x0, lr, steps) {
  let x = x0;
  const history = [x];
  for (let i = 0; i < steps; i++) {
    const grad = df(x);  // derivative at current x
    x = x - lr * grad;   // step opposite to gradient
    history.push(+x.toFixed(6));
  }
  return { result: +x.toFixed(6), history };
}
// f(x) = x^2, derivative = 2x
const result1 = gradientDescent1D(x => 2 * x, 10, 0.1, 20);
console.log("Minimum at x =", result1.result); // ~0
console.log("Path:", result1.history.slice(0, 6)); // [10, 8, 6.4, ...]
// 2D gradient descent: f(w1, w2) = w1^2 + w2^2
function gradientDescent2D(gradFn, w0, lr, steps) {
  let w = [...w0];
  for (let i = 0; i < steps; i++) {
    const grad = gradFn(w);
    w = w.map((wi, j) => wi - lr * grad[j]);
  }
  return w.map(v => +v.toFixed(6));
}
// Gradient of f(w1,w2) = w1^2 + w2^2 is [2*w1, 2*w2]
const result2 = gradientDescent2D(
  w => [2 * w[0], 2 * w[1]],
  [5, 3],  // start
  0.1,     // learning rate
  50       // steps
);
console.log("2D min at:", result2); // ~[0, 0]
// Effect of learning rate
const slow = gradientDescent1D(x => 2 * x, 10, 0.01, 20);
const fast = gradientDescent1D(x => 2 * x, 10, 0.1, 20);
const tooBig = gradientDescent1D(x => 2 * x, 10, 1.1, 20);
console.log("slow (lr=0.01):", slow.result); // still far from 0
console.log("good (lr=0.1):", fast.result); // close to 0
console.log("too big (lr=1.1):", tooBig.result); // diverges!Interactive Experiment
Try these exercises (a sketch for the first and fourth appears after the list):
- Run gradient descent on f(x) = x^2 starting from x = 10 with learning rate 0.1. How many steps does it take to get within 0.01 of the minimum?
- Try learning rates of 0.01, 0.1, 0.5, and 1.0. How does each affect the speed of convergence?
- What happens with a learning rate greater than 1 for f(x) = x^2? Why does it diverge?
- Run 2D gradient descent on f(x, y) = (x - 3)^2 + (y + 2)^2. Does it find (3, -2)?
- Try starting from 100 different random points. Does gradient descent always find the same minimum for a convex function?
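A minimal sketch for the first and fourth exercises, reusing the functions defined above. The helper stepsToConverge and its tolerance parameter are names introduced here for illustration, not part of the lesson's code.
// Exercise 1: count steps until x is within `tolerance` of the minimum at x = 0
function stepsToConverge(df, x0, lr, tolerance) {
  let x = x0;
  let steps = 0;
  while (Math.abs(x) > tolerance) {
    x = x - lr * df(x);  // same update rule as gradientDescent1D
    steps++;
  }
  return steps;
}
console.log(stepsToConverge(x => 2 * x, 10, 0.1, 0.01)); // 31: |10 * 0.8^n| first drops below 0.01 at n = 31
// Exercise 4: gradient of f(x, y) = (x - 3)^2 + (y + 2)^2 is [2*(x - 3), 2*(y + 2)]
const shifted = gradientDescent2D(
  w => [2 * (w[0] - 3), 2 * (w[1] + 2)],
  [0, 0],  // start
  0.1,     // learning rate
  100      // steps
);
console.log(shifted); // ~[3, -2]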
Coding Challenge
Write a function called `linearRegression` that uses gradient descent to find the best slope (m) and intercept (b) for a set of data points. The loss function is mean squared error: L = (1/n) * sum((y_i - (m * x_i + b))^2). Start with m = 0, b = 0. Use learning rate 0.01 and 1000 iterations. Return an object {m, b} with values rounded to 2 decimal places.
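One possible sketch, not the only valid solution. The gradients below follow from differentiating the MSE loss with respect to m and b, and the points argument is assumed to be an array of [x, y] pairs.
// Partial derivatives of L = (1/n) * sum((y_i - (m * x_i + b))^2):
//   dL/dm = (-2/n) * sum(x_i * (y_i - (m * x_i + b)))
//   dL/db = (-2/n) * sum(y_i - (m * x_i + b))
function linearRegression(points) {
  let m = 0, b = 0;
  const n = points.length;
  const lr = 0.01;
  for (let i = 0; i < 1000; i++) {
    let gradM = 0, gradB = 0;
    for (const [x, y] of points) {
      const err = y - (m * x + b);  // residual at this point
      gradM += (-2 / n) * x * err;
      gradB += (-2 / n) * err;
    }
    m -= lr * gradM;  // step opposite to the gradient
    b -= lr * gradB;
  }
  return { m: +m.toFixed(2), b: +b.toFixed(2) };
}
console.log(linearRegression([[1, 2], [2, 4], [3, 6]])); // close to { m: 2, b: 0 } for data on y = 2x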
Real-World Usage
Gradient descent is the engine behind modern AI:
- Neural network training: Every deep learning model -- GPT, DALL-E, AlphaFold -- is trained using variants of gradient descent (Adam, SGD with momentum, AdaGrad); a minimal momentum sketch follows this list.
- Recommendation systems: Collaborative filtering models optimize user-item interaction predictions with gradient descent on embedding vectors.
- Natural language processing: Word2Vec, BERT, and transformer models learn word representations by minimizing a loss function via gradient descent.
- Autonomous vehicles: Perception models that detect lanes, signs, and obstacles are all trained with gradient-based optimization.
- Scientific computing: Protein folding, drug discovery, and climate modeling use gradient-based optimization to fit parameters to observed data.
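As a taste of those variants, here is a minimal sketch of gradient descent with momentum, reusing the 1D setup from above. The hyperparameter values (lr = 0.1, beta = 0.9) are common defaults assumed for illustration.
// Momentum keeps a running velocity v that smooths successive gradients.
function momentumDescent1D(df, x0, lr, beta, steps) {
  let x = x0;
  let v = 0;
  for (let i = 0; i < steps; i++) {
    v = beta * v + df(x);  // accumulate gradient history
    x = x - lr * v;        // step along the smoothed direction
  }
  return +x.toFixed(6);
}
console.log(momentumDescent1D(x => 2 * x, 10, 0.1, 0.9, 200)); // ~0 (oscillates on the way down)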