Observability Overview

Metrics, logs, traces — what each is for, where they overlap, the cardinality rules that keep your storage bill sane, and the minimum signal set every service should emit.

If you only remember six things
  • Metrics answer "is something wrong and how bad?" — cheap, aggregatable, low-cardinality.
  • Logs answer "what exactly happened?" — high-cardinality context, expensive per byte.
  • Traces answer "where in the call graph is the time going?" — required the moment you have more than two services.
  • Every metric label multiplies your series count: the total is the cross-product of every label's distinct values. customer_id as a label will bankrupt you. status_code won't.
  • Instrument at the edge of your service (inbound requests, outbound dependencies). Not every function.
  • An SLO you can't compute from your existing telemetry is aspirational, not an SLO.

The three pillars

You will hear "metrics, logs, and traces" in every observability talk. It's not wrong, but the useful framing is about the question each answers.

| Pillar | The question it answers | Cardinality tolerance | Cost shape |
|---|---|---|---|
| Metrics | Is it up? Is it slow? How much of it? | Low: every label combination is a time series | Cheap per series, explodes with labels |
| Logs | What exactly did this request do, in order? | High: every line is free-form | Pay per byte ingested + retention |
| Traces | Where in a distributed call did the time go? | Medium: per-span attributes | Pay per span; sample aggressively |

When each is the right tool

Where they overlap

Modern stacks blur the lines on purpose:

Cardinality — the one rule you must not break

A metric's cost isn't driven by how often you emit it. It's driven by how many distinct label combinations you emit. Each unique combination is a new time series your TSDB has to index, churn on compaction, and keep in memory for the active window.

The rule: metric labels must be bounded. If a label can take on more than a few hundred distinct values over the lifetime of the metric, it doesn't belong on a metric.
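To see why the bound matters, here is a rough sketch of the arithmetic. The label names and counts are made-up assumptions, and the product is an upper bound: real series counts are lower when some combinations never occur.

```python
from math import prod

def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())

# Hypothetical cardinalities for a single request counter.
bounded = {"route": 30, "method": 5, "status": 8}
with_customer = {**bounded, "customer_id": 50_000}

print(series_count(bounded))        # 1200
print(series_count(with_customer))  # 60000000
```

One extra label took the same counter from about a thousand series to sixty million. Emission rate never entered the calculation.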

Good labels

Bad labels

Logs are the opposite: they tolerate unbounded context. Put the user_id on the log line. Put the trace ID on the log line. Leave the metric counter with just {route, method, status}.
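A minimal sketch of that split, using only the standard library. The Counter is a stand-in for a real metrics client, and record_request and its field names are illustrative assumptions, not a specific library's API.

```python
import json
from collections import Counter

# Stand-in for a metrics client: one counter keyed by bounded labels only.
http_requests_total = Counter()

def record_request(route: str, method: str, status: int,
                   user_id: str, trace_id: str) -> str:
    """Count the request on the metric and return its structured log line."""
    # Metric side: route is the template ("/users/{id}"), never the raw path.
    http_requests_total[(route, method, str(status))] += 1
    # Log side: unbounded identifiers are fine here.
    return json.dumps(
        {
            "msg": "request handled",
            "route": route,
            "method": method,
            "status": status,
            "user_id": user_id,
            "trace_id": trace_id,
        },
        sort_keys=True,
    )

for uid in ("u-1", "u-2", "u-3"):
    line = record_request("/users/{id}", "GET", 200, uid, f"trace-{uid}")

print(len(http_requests_total))  # 1: three users, still one series
print(line)
```

Three different users produce three log lines but only one time series, which is the property you want.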

A quick check on Prometheus: prometheus_tsdb_head_series is the live series count. If it climbs without the fleet growing, something is emitting a high-cardinality label. topk(10, count by (__name__)({__name__=~".+"})) shows the worst offenders.
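If you want to see what that query counts without touching the server, the same idea can be run offline against a scraped /metrics payload. This is a rough sketch: the sample payload is invented, and the parser ignores edge cases such as escaped characters in label values.

```python
from collections import Counter

def series_per_metric(exposition: str) -> list[tuple[str, int]]:
    """Count sample lines per metric name, worst offenders first."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (no labels).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts.most_common()

# Invented sample in the Prometheus text format.
SAMPLE = """\
# TYPE http_requests_total counter
http_requests_total{route="/users/{id}",status="200"} 10
http_requests_total{route="/users/{id}",status="500"} 1
http_requests_total{route="/orders",status="200"} 7
up 1
"""

print(series_per_metric(SAMPLE))
```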

SLI and SLO framing

The compressed version of the Google SRE book's Chapter 4:

A useful SLI has two properties:

  1. Observable from outside the service. If the only way to tell the SLI is failing is to check the same service that's failing, you've built a tautology. Measure at the load balancer, at the client, at the edge.
  2. Computable from your telemetry today. An SLO you can't actually measure is aspirational at best and dishonest at worst.
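A sketch of what "computable today" means in practice: if you already have good and total request counts from the load balancer, an availability SLI is one division. The counts and the 99.9% target below are made-up assumptions.

```python
def availability_sli(good: int, total: int) -> float:
    """Fraction of requests that were good; an idle window counts as fully available."""
    return 1.0 if total == 0 else good / total

def meets_slo(sli: float, slo_target: float) -> bool:
    """True when the measured SLI is at or above the target."""
    return sli >= slo_target

# Hypothetical 30-day counts taken at the load balancer, not inside the service.
total_requests = 1_000_000
good_requests = 999_500

sli = availability_sli(good_requests, total_requests)
print(f"{sli:.4%}")           # 99.9500%
print(meets_slo(sli, 0.999))  # True
```

If you cannot fill in those two counts from telemetry you already collect, the SLO fails the second property.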

See SLOs and On-Call for burn-rate alerting, alert hygiene, and the on-call handover template.

Signals every service should emit

Before you design a dashboard, make sure the telemetry exists. The list below is the minimum a production service should emit. If you can't tick all of these, that's your next sprint.

Inbound request signals (RED)

Resource signals (USE)

Dependency signals

Every outbound call should have its own RED metrics, labeled by dependency name, not URL:

# Good: bounded by the list of dependencies you have.
http_client_requests_total{dep="users-api", status="200"}
http_client_request_duration_seconds_bucket{dep="billing", le="0.1"}

# Bad: url is unbounded.
http_client_requests_total{url="https://api.example.com/v1/users/14821"}
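One way to get from a URL to a bounded dep label is a fixed lookup on the host. The sketch below assumes a hand-maintained map; the hostnames and the "other" fallback are illustrative choices.

```python
from urllib.parse import urlsplit

# Hypothetical host-to-dependency map; keep it in sync with what you actually call.
DEPENDENCIES = {
    "api.example.com": "users-api",
    "billing.internal": "billing",
}

def dep_label(url: str) -> str:
    """Map an outbound URL to a bounded dependency name for use as a metric label."""
    host = urlsplit(url).hostname or ""
    # Unknown hosts collapse into one value so the label set stays bounded.
    return DEPENDENCIES.get(host, "other")

print(dep_label("https://api.example.com/v1/users/14821"))  # users-api
print(dep_label("https://unknown.example.org/anything"))    # other
```

The path and the user ID never reach the metric. If you need them, put them on the log line or the span.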

Business signals

A handful of counters that matter to humans, not engineers. Orders placed, signups completed, invoices sent. When these go to zero, something is wrong even if every technical dashboard is green.

Logs

Traces

The telemetry stack, layer by layer

You rarely build this from a clean slate. You layer it.

Layer 1 — host metrics

node_exporter on every Linux host. CPU, memory, disk, network, filesystem, systemd units. Installed by your config-management tool and monitored with a handful of simple alerts (filesystem full, load average high, up == 0).

Layer 2 — application metrics

A Prometheus client library in your app, or an OpenTelemetry SDK exporting to Prometheus. Expose /metrics, scrape it. Emit the RED + USE + dependency signals above.
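To make "expose /metrics" concrete, here is a standard-library sketch of roughly what a client library does for you. In a real service, use the client library: this version skips label-value escaping, histograms, and thread safety, and the counter values are invented.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Invented counter values keyed by (route, method, status).
REQUESTS = {("/orders", "GET", "200"): 42, ("/orders", "POST", "500"): 1}

def render_metrics(requests: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = ["# TYPE http_requests_total counter"]
    for (route, method, status), value in sorted(requests.items()):
        lines.append(
            f'http_requests_total{{route="{route}",method="{method}",'
            f'status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

def serve(port: int = 8000) -> None:
    """Blocking call: serve /metrics on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), MetricsHandler).serve_forever()

print(render_metrics(REQUESTS), end="")
```

Call serve() to start the endpoint, then point a scrape job at it.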

Layer 3 — logs

Apps write structured logs to stdout (in containers) or journald/rsyslog (on VMs). A shipper (Promtail, Grafana Agent, Vector, Fluent Bit) forwards to the store. See rsyslog, rsyslog forwarding, and Loki.

Layer 4 — traces

OpenTelemetry SDK in the app, OTEL Collector in the middle, Tempo/Jaeger/Zipkin as the backend. Start with trace propagation and a sampling policy; add span attributes as you need them.
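Trace propagation is what the SDK does when it forwards context on outbound calls. The sketch below hand-rolls it for the W3C traceparent header (version 00) to show the mechanics; in practice the OpenTelemetry SDK handles this. Marking brand-new traces as sampled (flags 01) is an assumption of this example.

```python
import re
import secrets
from typing import Optional

# version "00" - 32 hex trace ID - 16 hex parent span ID - 2 hex flags
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def child_traceparent(incoming: Optional[str]) -> str:
    """Header for an outbound call: keep the trace ID, mint a fresh span ID."""
    match = TRACEPARENT.match(incoming or "")
    valid = (
        match is not None
        and match.group(1) != "0" * 32   # all-zero trace ID is invalid
        and match.group(2) != "0" * 16   # all-zero span ID is invalid
    )
    if valid:
        trace_id, flags = match.group(1), match.group(3)
    else:
        # No usable context: start a new trace (assumed sampled here).
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{flags}"

inbound = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(child_traceparent(inbound))  # same trace ID, new span ID
print(child_traceparent(None))     # brand-new trace
```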

Layer 5 — dashboards and alerts

Grafana for humans; Prometheus alerting rules + Alertmanager (or Grafana unified alerting) for the pager. Every alert links to a runbook — no silent knowledge.
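The runbook rule is easy to enforce mechanically. A rough sketch of a CI check, assuming your alert rules have already been loaded into dicts shaped like Prometheus alerting rules; the rule contents and the runbook URL are hypothetical.

```python
def alerts_missing_runbook(rules: list[dict]) -> list[str]:
    """Names of alert rules whose annotations lack a non-empty runbook_url."""
    return [
        rule.get("alert", "<unnamed>")
        for rule in rules
        if not rule.get("annotations", {}).get("runbook_url")
    ]

# Hypothetical rules, as they might look after parsing a rules file.
RULES = [
    {
        "alert": "InstanceDown",
        "expr": "up == 0",
        "for": "5m",
        "annotations": {"runbook_url": "https://runbooks.example.com/instance-down"},
    },
    {
        "alert": "FilesystemFull",
        "expr": "node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1",
        "for": "10m",
        "annotations": {"summary": "Filesystem nearly full"},
    },
]

print(alerts_missing_runbook(RULES))  # ['FilesystemFull']
```

Fail the build when the list is non-empty and the rule enforces itself.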

Anti-patterns

| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| Alerting on log strings | Expensive, slow, fires on harmless grep matches after a message-text change | Emit a metric counter, alert on the metric |
| customer_id as a metric label | Cardinality explosion; TSDB OOM | Log it, don't label it |
| "We'll add tracing later" | By the time you have 10 services, retrofitting is painful | Add OTEL SDKs on day one, even if the backend is a no-op |
| Dashboards with 40 panels | Nobody reads them; the signal drowns | Top panel: SLO status. Second: RED. Details below. |
| Alerts without runbook links | On-call googles in a panic; tribal knowledge | Every alert annotation has a runbook_url |
| One giant "app" dashboard | Cognitive load; diff-spotting impossible | One per service, linked from a top-level index |
| 100% trace sampling in prod | Network and storage cost; no benefit over 5% | Tail-based sampling keeping all errors + a fraction of success |
| Logging secrets, tokens, PII | Compliance nightmare; one grep away from a breach | Redact at the logger; fail CI if authorization appears verbatim |
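The tail-based sampling fix from the table can be sketched as a single decision made after a trace completes. This is an illustration of the policy, not a collector configuration; the function name and the hash-based bucketing are assumptions of this example.

```python
import hashlib

def keep_trace(trace_id: str, has_error: bool, success_fraction: float = 0.05) -> bool:
    """Tail-sampling decision: keep every errored trace, plus a fraction of the rest."""
    if has_error:
        return True
    # Hash the trace ID into [0, 1) so every collector makes the same call for a trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return bucket < success_fraction

# Invented trace IDs to show the approximate keep rate for successful traces.
kept = sum(keep_trace(f"trace-{i}", has_error=False) for i in range(10_000))
print(kept)  # roughly 500 of 10,000
print(keep_trace("trace-1", has_error=True))  # True
```

Because the decision needs to know whether the trace errored, it has to run where whole traces are visible, typically the collector rather than the app.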

Where to go next

The per-tool guides in this series: