Machine Learning · 30 min

Overfitting & Regularization

Understanding why models memorize training data and how to prevent it


Why This Matters

You built a model that achieves 99% accuracy on your training data. You deploy it. It fails miserably on real-world data. This is overfitting — the most common and most costly mistake in machine learning.

Overfitting means the model has memorized the training data instead of learning generalizable patterns. Regularization techniques are the antidote: they constrain the model so it focuses on real patterns rather than noise. Understanding this tradeoff is what separates a working ML system from an impressive-looking demo that collapses in production.
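The idea of "constraining the model" can be made concrete with an L2 penalty: the loss becomes the usual error plus a term that grows with the size of the weights. A minimal sketch, assuming a single-weight linear model `y = w*x + b` (the function name `l2Loss` and the sample points are illustrative):

```javascript
// L2-regularized loss: mean squared error plus a penalty on weight size.
// With lambda = 0 the penalty vanishes; larger lambda pulls w toward 0.
function l2Loss(data, w, b, lambda) {
  const mse = data.reduce((s, d) => s + (w * d.x + b - d.y) ** 2, 0) / data.length;
  return mse + lambda * w * w; // penalty discourages large weights
}

const pts = [{ x: 0, y: 1 }, { x: 1, y: 3 }, { x: 2, y: 5 }]; // exactly y = 2x + 1
console.log(l2Loss(pts, 2, 1, 0));   // perfect fit, no penalty: 0
console.log(l2Loss(pts, 2, 1, 0.1)); // same fit plus 0.1 * 2^2 = 0.4
```

Note that the penalty is paid even by a perfect fit: regularization deliberately trades a little training error for weights that are less able to chase noise.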


Visual Model

The diagram shows two paths from the same training data:

  • No regularization: the model overfits, fitting a wiggly curve that memorizes noise (high train accuracy, low test accuracy).
  • With regularization: the model generalizes, fitting a smooth, robust curve (train and test accuracy stay close).

Code Example

Code
// Detecting overfitting: compare train and test error
// Simple linear regression on noisy data (appropriate model complexity)

// Generate noisy data from y = 2x + 1
function generateData(n) {
  const data = [];
  for (let i = 0; i < n; i++) {
    const x = i / n * 10;
    const y = 2 * x + 1 + (Math.random() - 0.5) * 4; // noise
    data.push({ x, y });
  }
  return data;
}

// Simple linear fit (appropriate complexity)
function linearFit(data) {
  const n = data.length;
  const sumX = data.reduce((s, d) => s + d.x, 0);
  const sumY = data.reduce((s, d) => s + d.y, 0);
  const sumXY = data.reduce((s, d) => s + d.x * d.y, 0);
  const sumX2 = data.reduce((s, d) => s + d.x * d.x, 0);
  const slope = (n * sumXY - sumX * sumY) / (n * sumX2 - sumX * sumX);
  const intercept = (sumY - slope * sumX) / n;
  return { slope: +slope.toFixed(2), intercept: +intercept.toFixed(2) };
}

// Evaluate: MSE on data
function mse(data, slope, intercept) {
  return data.reduce((s, d) =>
    s + (slope * d.x + intercept - d.y) ** 2, 0) / data.length;
}

const train = generateData(20);
const test = generateData(10);
const { slope, intercept } = linearFit(train);
console.log(`Model: y = ${slope}x + ${intercept}`);
console.log(`Train MSE: ${mse(train, slope, intercept).toFixed(2)}`);
console.log(`Test MSE:  ${mse(test, slope, intercept).toFixed(2)}`);
// A good fit has similar train and test MSE

Interactive Experiment

Try these exercises to see overfitting in action:

  • Generate only 5 training points and fit the model. How does test MSE compare to train MSE?
  • Increase training data to 1000 points. Does the gap between train and test MSE shrink?
  • Increase the noise multiplier from 4 to 20. How does more noise affect the model?
  • Add an L2 regularization term: penalize large weights by adding lambda * (slope^2) to the loss. Does it improve test performance?
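For the last exercise, one possible sketch is to fit the line by gradient descent and add the penalty's gradient, `2 * lambda * w`, to the slope update. The names here (`ridgeFit`, `lr`, `steps`) and the learning-rate/step defaults are illustrative choices, not the only ones:

```javascript
// Gradient descent on MSE + lambda * slope^2 for a line y = w*x + b.
function ridgeFit(data, lambda, lr = 0.01, steps = 5000) {
  let w = 0, b = 0;
  const n = data.length;
  for (let i = 0; i < steps; i++) {
    let gw = 0, gb = 0;
    for (const d of data) {
      const err = w * d.x + b - d.y; // prediction error for this point
      gw += (2 / n) * err * d.x;
      gb += (2 / n) * err;
    }
    gw += 2 * lambda * w; // gradient of the L2 penalty term
    w -= lr * gw;
    b -= lr * gb;
  }
  return { slope: w, intercept: b };
}

// With lambda = 0 this approaches the ordinary least-squares fit;
// increasing lambda shrinks the slope toward zero.
const demo = [{ x: 0, y: 1 }, { x: 1, y: 3 }, { x: 2, y: 5 }, { x: 3, y: 7 }];
console.log(ridgeFit(demo, 0)); // slope near 2, intercept near 1
console.log(ridgeFit(demo, 1)); // noticeably smaller slope
```

On noisy data, try comparing test MSE for a few lambda values: too small and the penalty does nothing, too large and the model underfits.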

Quick Quiz

Coding Challenge

L2 Regularized Linear Regression

Write a function called `ridgeRegression` that performs linear regression with L2 regularization. Given training data and a regularization strength `lambda`, it should learn weight and bias by running gradient descent where the weight gradient includes a penalty term: gradient = normal_gradient + 2 * lambda * weight. Return the trained weight and bias as an object.


Real-World Usage

Overfitting and regularization are central concerns in every production ML system:

  • Deep learning: Dropout is used in nearly every neural network to prevent co-adaptation of neurons during training.
  • Natural language processing: Weight decay (L2 regularization) is standard when fine-tuning large language models on small datasets.
  • Computer vision: Data augmentation (flipping, rotating, cropping images) acts as implicit regularization by expanding the effective training set.
  • Medical AI: With limited patient data, regularization is critical to prevent models from memorizing individual patients rather than learning disease patterns.
  • Ensemble methods: Random forests and gradient boosting use tree depth limits and minimum sample sizes as regularization to prevent overfitting.

Connections