Production engineering, 25 min

Production Observability

Understanding what your application is doing in production through logs, metrics, and traces

Why This Matters

Your application is deployed. Users are hitting it. Then something goes wrong: response times spike, errors increase, a feature stops working. How do you figure out what happened? You cannot attach a debugger to production. You cannot add console.log and redeploy mid-incident. You need observability: the ability to understand what your system is doing by examining its outputs.

Observability rests on three pillars: logs (what happened), metrics (how much), and traces (the path a request took). Together, they let you answer the question every on-call engineer dreads at 2 AM: "What is broken and why?" Good structured logging and alerting mean you find out about problems before your users do.

Visual Model

Diagram: Application → Logs (what happened), Metrics (how much), Traces (request path) → Dashboard & Alerts → Engineer Acts

The three pillars of observability: logs, metrics, and traces flow into dashboards and alerts.
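The "alerts" step in the model can be illustrated with a small threshold rule. This is a minimal sketch, not how any particular alerting product works: the function name `createErrorRateAlert` and the options `windowSize`, `thresholdPct`, and `minSamples` are all hypothetical, chosen only for this example.

```javascript
// Hypothetical rule: fire when the error rate over the most recent
// requests exceeds a threshold. All names and defaults are assumptions.
function createErrorRateAlert({ windowSize = 100, thresholdPct = 5, minSamples = 20 } = {}) {
  const recent = []; // true = request succeeded, false = request failed

  return function record(success) {
    recent.push(success);
    if (recent.length > windowSize) recent.shift(); // keep only the latest window

    const errors = recent.filter((ok) => !ok).length;
    const errorRatePct = (errors * 100) / recent.length;
    // Require a minimum number of samples so one early failure does not page anyone.
    const firing = recent.length >= minSamples && errorRatePct > thresholdPct;
    return { errorRatePct, firing };
  };
}

const record = createErrorRateAlert({ windowSize: 10, thresholdPct: 20, minSamples: 5 });
let state;
for (const ok of [true, true, false, true, false]) state = record(ok);
console.log(state); // { errorRatePct: 40, firing: true }
```

In a real system the rule would typically be evaluated by the monitoring platform rather than inside the application; the sketch only shows the shape of the decision.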

Code Example

// Structured logging (JSON format)
const log = (level, message, data = {}) => {
  const entry = {
    timestamp: new Date().toISOString(),
    level,
    message,
    service: "payment-api",
    ...data
  };
  console.log(JSON.stringify(entry));
};

// Log levels indicate severity
log("info", "Server started", { port: 3000 });
log("info", "Payment processed", { userId: "u123", amount: 49.99 });
log("warn", "Slow database query", { queryMs: 2500, table: "orders" });
log("error", "Payment failed", { userId: "u456", error: "Card declined" });

// Metrics: track request duration
const startTime = Date.now();
// ... handle request ...
const durationMs = Date.now() - startTime;
log("info", "Request completed", {
  method: "POST",
  path: "/api/pay",
  statusCode: 200,
  durationMs
});

// Simple metrics counter
const metrics = { requests: 0, errors: 0 };
function trackRequest(success) {
  metrics.requests++;
  if (!success) metrics.errors++;
  const errorRate = (metrics.errors / metrics.requests * 100).toFixed(2);
  console.log(`Error rate: ${errorRate}%`);
}
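
The example above covers logs and metrics but not the third pillar. As a rough sketch of the idea behind traces, the code below tags every timed step of one request with a shared trace ID. The helpers `newId` and `startTrace`, and the span names, are hypothetical and work only for synchronous steps; production tracing libraries use standardized IDs and propagate them across services.

```javascript
// Hypothetical helper: a short random hex ID, for illustration only.
const newId = () => Math.random().toString(16).slice(2, 10).padEnd(8, "0");

// Start a trace for one incoming request. Every span shares its traceId.
function startTrace() {
  const traceId = newId();
  const spans = [];

  // Run one synchronous step of the request and record how long it took.
  function span(name, fn) {
    const start = Date.now();
    try {
      return fn();
    } finally {
      spans.push({ traceId, spanId: newId(), name, durationMs: Date.now() - start });
    }
  }

  return { traceId, spans, span };
}

const trace = startTrace();
trace.span("checkInventory", () => { /* call the inventory service here */ });
const total = trace.span("chargeCard", () => 49.99);

// Each span is emitted as a structured entry that can be grouped by traceId.
for (const s of trace.spans) console.log(JSON.stringify(s));
```

Because each entry carries the same `traceId`, the steps of a single request can be grouped and ordered later, which is what makes a slow step stand out.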

Interactive Experiment

Try these exercises:

  • Add structured JSON logging to an existing project. Include timestamp, level, message, and relevant data fields. Filter the output with jq.
  • Build a simple request counter that tracks total requests, errors, and average response time. Print a summary every 10 requests.
  • Create a log function that only outputs messages at or above a configured level (e.g., setting level to "warn" suppresses "info" and "debug").
  • Time a database query or API call. Log the duration and flag anything over 1 second as a warning.

Coding Challenge

Log Analyzer

Write a function called `analyzeLogs` that takes an array of log entry objects, each with `level` ('info', 'warn', 'error') and `durationMs` (number). Return an object with: `total` (total log count), `errors` (count of error-level logs), `errorRate` (percentage of errors, rounded to 1 decimal), `avgDuration` (average durationMs, rounded to nearest integer), and `slowRequests` (count of entries with durationMs > 1000).

Real-World Usage

Observability is non-negotiable for production systems:

  • Datadog, New Relic, and Grafana are observability platforms that ingest logs, metrics, and traces from thousands of services, providing dashboards and alerting.
  • Prometheus scrapes metrics from application endpoints and stores time-series data. Grafana visualizes Prometheus data in real-time dashboards.
  • PagerDuty and Opsgenie receive alerts and route them to the right on-call engineer via phone call, SMS, or Slack.
  • Distributed tracing tools like Jaeger and Honeycomb follow requests across microservices, revealing, for example, that a slow checkout is caused by a slow inventory service call.
  • Log levels in production are typically set to "info" or "warn". Debug logging is enabled temporarily when investigating specific issues.
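
To make the Prometheus item above more concrete, here is a minimal sketch of what an application might return when its metrics endpoint is scraped. It renders two counters in the Prometheus text exposition format. The metric names, help strings, and the `renderMetrics` function are hypothetical; real services usually rely on a client library rather than formatting this by hand.

```javascript
// Counters like the ones in the lesson's metrics object (sample values).
const counters = { requests: 42, errors: 3 };

// Render counters in the Prometheus text exposition format.
// Metric names and help text below are assumptions for this example.
function renderMetrics({ requests, errors }) {
  return [
    "# HELP http_requests_total Total HTTP requests handled.",
    "# TYPE http_requests_total counter",
    `http_requests_total ${requests}`,
    "# HELP http_request_errors_total Total HTTP requests that failed.",
    "# TYPE http_request_errors_total counter",
    `http_request_errors_total ${errors}`,
    "" // the output ends with a trailing newline
  ].join("\n");
}

console.log(renderMetrics(counters));
```

By convention this text is served over HTTP at a path such as /metrics, and the scraper stores each value with a timestamp to build the time series.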
