Observability Overview

Metrics, logs, traces — what each is for, where they overlap, the cardinality rules that keep your storage bill sane, and the minimum signal set every service should emit.

If you only remember six things

Metrics answer "is something wrong and how bad?" — cheap, aggregatable, low-cardinality.
Logs answer "what exactly happened?" — high-cardinality context, expensive per byte.
Traces answer "where in the call graph is the time going?" — required the moment you have more than two services.
Every metric label multiplies your series count: the total is the cross-product of all label values. customer_id as a label will bankrupt you. status_code won't.
Instrument at the edge of your service (inbound requests, outbound dependencies). Not every function.
An SLO you can't compute from your existing telemetry is aspirational, not an SLO.

On this page

The three pillars
Cardinality — the one rule you must not break
SLI and SLO framing
Signals every service should emit
The telemetry stack, layer by layer
Anti-patterns
Where to go next

The three pillars

You will hear "metrics, logs, and traces" in every observability talk. It's not wrong, but the useful framing is about the question each answers.

Pillar	The question it answers	Cardinality tolerance	Cost shape
Metrics	Is it up? Is it slow? How much of it?	Low — every label combination is a time series	Cheap per series, explodes with labels
Logs	What exactly did this request do, in order?	High — every line is free-form	Pay per byte ingested + retention
Traces	Where in a distributed call did the time go?	Medium — per-span attributes	Pay per span; sample aggressively

When each is the right tool

Alert on metrics. You want every instance to feed the same aggregate, so one rule covers the whole fleet. Paging someone because one log line appeared is how you get paged at 3am for a warning nobody reads.
Debug with logs. Once metrics tell you something is wrong, logs tell you what the program actually did at that moment. Correlate with the trace ID.
Split latency with traces. "The API is slow" and you have five downstream services? Traces are non-negotiable. Without them you're grepping timestamps across hosts by hand.

Where they overlap

Modern stacks blur the lines on purpose:

Metrics from logs. LogQL's rate(), Elastic's metric aggregations, and Splunk's stats all let you derive a metric from a log stream. Useful for things you forgot to instrument, not a replacement for native metrics (too slow, too expensive for alerting).
Exemplars. A metric sample that carries a trace ID. Click the latency spike, jump to a real example trace. Prometheus supports them, Grafana renders them.
Span events as structured logs. OpenTelemetry span events let you attach "log lines" to a trace span. They are better than free-text logs for anything that is clearly part of a request.

Cardinality — the one rule you must not break

A metric's cost isn't driven by how often you emit it. It's driven by how many distinct label combinations you emit. Each unique combination is a new time series your TSDB has to index, churn on compaction, and keep in memory for the active window.

The rule: metric labels must be bounded. If a label can take on more than a few hundred distinct values over the life of the metric, it doesn't belong on a metric.
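To make the cross-product concrete, here is a rough back-of-the-envelope sketch. The label counts (40 routes, 5 methods, 60 status codes, 100,000 customers) are illustrative assumptions, and estimate_series is a hypothetical helper, not part of any client library. It gives an upper bound, since not every combination occurs in practice.

```python
from math import prod

def estimate_series(label_values: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_values.values())

# Assumed, illustrative label cardinalities.
bounded = {"route": 40, "method": 5, "status": 60}
with_customer = {**bounded, "customer_id": 100_000}

print(estimate_series(bounded))        # 12000
print(estimate_series(with_customer))  # 1200000000
```

One unbounded label turns a metric with thousands of series into one with over a billion potential series.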

Good labels

method (GET, POST, …)
status_code (200, 404, 500 — 60-ish values)
route (template, not expanded — /users/:id, not /users/14821)
service, env, region, instance
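Many web frameworks expose the matched route pattern directly, and that is the better source for the route label. Where it isn't available, a fallback along these lines can collapse expanded paths into templates. This is a minimal sketch: route_template and its two rules (numeric IDs and UUIDs) are assumptions for illustration, and real paths may need more rules.

```python
import re

# Assumed rule set: numeric segments and UUIDs are treated as identifiers.
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)

def route_template(path: str) -> str:
    """Collapse identifier-like path segments into placeholders."""
    path = path.split("?", 1)[0]  # never let the query string into a label
    segments = []
    for seg in path.split("/"):
        if seg.isdigit():
            segments.append(":id")
        elif _UUID.match(seg):
            segments.append(":uuid")
        else:
            segments.append(seg)
    return "/".join(segments)

print(route_template("/users/14821"))  # /users/:id
```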

Bad labels

user_id / customer_id / order_id — unbounded
url with query string — unbounded
error_message — free text, will contain timestamps and UUIDs
host_ip in a dynamic infrastructure — churns as pods come and go

Logs are the opposite: they tolerate unbounded context. Put the user_id on the log line. Put the trace ID on the log line. Leave the metric counter with just {route, method, status}.

A quick check on Prometheus: prometheus_tsdb_head_series is the live series count. If it climbs without the fleet growing, something is emitting a high-cardinality label. topk(10, count by (__name__)({__name__=~".+"})) shows the worst offenders.
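The same check can be run against a single service before it ever reaches Prometheus, by counting series per metric name in its /metrics output. The sketch below assumes the plain Prometheus text exposition format; series_per_metric and the sample payload are made up for illustration.

```python
from collections import Counter

def series_per_metric(exposition: str) -> Counter:
    """Count sample lines (one per series) for each metric name."""
    counts = Counter()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        # The metric name ends at the first "{" (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        counts[name] += 1
    return counts

# Illustrative scrape payload, not real output.
sample = """\
# TYPE http_requests_total counter
http_requests_total{route="/users/:id",method="GET",status="200"} 1027
http_requests_total{route="/users/:id",method="GET",status="404"} 3
http_requests_total{route="/orders",method="POST",status="201"} 88
process_open_fds 42
"""
print(series_per_metric(sample).most_common(10))
```

A metric name whose count keeps growing between scrapes is the one carrying the unbounded label.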

SLI and SLO framing

The compressed version of the Google SRE book's Chapter 4:

SLI (Service Level Indicator): a measurement. "Fraction of requests that returned 2xx/3xx in under 300 ms."
SLO (Service Level Objective): a target for that measurement over a window. "99.5% of requests over a rolling 30 days."
SLA (Service Level Agreement): the contractual version, with money attached. If you aren't billing or being billed on it, you don't have an SLA, you have an SLO.
Error budget: 1 - SLO. For 99.5% over 30 days, that's 0.5% × 30 × 24 × 60 = 216 minutes of total outage allowed (for a request-based SLI, read it as 0.5% of the requests in the window). When you've burned through it, ship fewer features and more reliability.
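The error-budget arithmetic is easy to script. The sketch below shows both readings: minutes of full outage for a time-based view, and the fraction of budget remaining for a request-based SLI. Function names and the request counts are hypothetical.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of full outage the SLO tolerates over the window."""
    window_minutes = window_days * 24 * 60
    return (1 - slo) * window_minutes

def budget_remaining(slo: float, good: int, total: int) -> float:
    """Fraction of the error budget left for a request-based SLI (negative = overspent)."""
    if total == 0:
        return 1.0  # no traffic, nothing burned
    allowed_bad = (1 - slo) * total
    actual_bad = total - good
    return 1 - actual_bad / allowed_bad

print(round(error_budget_minutes(0.995, 30)))                    # 216
print(round(budget_remaining(0.995, 998_000, 1_000_000), 2))     # 0.6
```

In the second call, 2,000 bad requests out of an allowed 5,000 leaves 60% of the budget.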

A useful SLI has two properties:

Observable from outside the service. If the only way to tell the SLI is failing is to check the same service that's failing, you've built a tautology. Measure at the load balancer, at the client, at the edge.
Computable from your telemetry today. An SLO you can't actually measure is aspirational at best and dishonest at worst.

See SLOs and On-Call for burn-rate alerting, alert hygiene, and the on-call handover template.

Signals every service should emit

Before you design a dashboard, make sure the telemetry exists. The list below is the minimum a production service should emit. If you can't tick all of these, that's your next sprint.

Inbound request signals (RED)

Rate: http_requests_total{route,method,status} — counter.
Errors: no separate metric needed; derive them from the counter above with rate(http_requests_total{status=~"5.."}[5m]).
Duration: http_request_duration_seconds_bucket{route,method,le} — histogram, so you can compute p50/p95/p99.
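It helps to see why buckets are enough to estimate a percentile. The sketch below is a simplified version of the linear interpolation that PromQL's histogram_quantile applies to cumulative le buckets; it is an illustration, not Prometheus' actual code. estimate_quantile and the bucket counts are hypothetical.

```python
import math

def estimate_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """buckets: (le, cumulative_count) sorted by le, ending with (+Inf, total)."""
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                return prev_le  # beyond the highest finite bound: report that bound
            # Assume observations are spread evenly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le

# Illustrative cumulative counts for le = 0.1, 0.3, 1.0, +Inf.
buckets = [(0.1, 50), (0.3, 90), (1.0, 99), (float("inf"), 100)]
print(estimate_quantile(0.5, buckets))
print(estimate_quantile(0.95, buckets))
```

The estimate is only as precise as the bucket boundaries, so place bounds near the latencies your SLO cares about.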

Resource signals (USE)

Utilization: CPU %, memory used / memory limit, GC pause ratio, disk %.
Saturation: request queue depth, connection pool in-use count, thread pool queue length.
Errors: panics, OOM kills, restart count (kube_pod_container_status_restarts_total), failed fsyncs.

Dependency signals

Every outbound call should have its own RED metrics, labeled by dependency name, not URL:

# Good: bounded by the list of dependencies you have.
http_client_requests_total{dep="users-api", status="200"}
http_client_request_duration_seconds_bucket{dep="billing", le="0.1"}

# Bad: url is unbounded.
http_client_requests_total{url="https://api.example.com/v1/users/14821"}
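One way to keep the dep label bounded is to resolve it from a fixed table rather than from the request URL. A minimal sketch, assuming a hand-maintained host-to-name mapping; the hosts and the dep_label helper are hypothetical:

```python
from urllib.parse import urlsplit

# Hypothetical mapping; hosts are illustrative.
DEPENDENCIES = {
    "api.example.com": "users-api",
    "billing.internal.example.com": "billing",
}

def dep_label(url: str) -> str:
    """Map an outbound URL to a bounded dependency name."""
    host = urlsplit(url).hostname or ""
    # Unknown hosts collapse into one value so the label stays bounded.
    return DEPENDENCIES.get(host, "unknown")

print(dep_label("https://api.example.com/v1/users/14821"))  # users-api
```

A non-trivial rate on dep="unknown" is itself a useful signal: something is calling a dependency nobody registered.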

Business signals

A handful of counters that matter to humans, not engineers. Orders placed, signups completed, invoices sent. When these go to zero, something is wrong even if every technical dashboard is green.

Logs

Structured (JSON or logfmt), not free text. Your aggregator can parse either format; free text defeats field extraction and leaves you writing regexes.
Include trace_id, span_id, user_id (in logs, not in metric labels), request_id, service, env.
Levels: DEBUG off in prod, INFO for business events, WARN for recoverable problems, ERROR for things a human should see.
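The log rules above can be sketched with the standard library alone. This is one possible shape, not a prescribed one: JsonFormatter, the service name "checkout", and the field values are assumptions, and a StringIO stands in for stdout.

```python
import io
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with the correlation fields."""
    FIELDS = ("trace_id", "span_id", "user_id", "request_id")

    def __init__(self, service: str, env: str):
        super().__init__()
        self.static = {"service": service, "env": env}

    def format(self, record: logging.LogRecord) -> str:
        entry = {"level": record.levelname, "msg": record.getMessage(), **self.static}
        for field in self.FIELDS:
            if hasattr(record, field):  # set via the `extra` argument
                entry[field] = getattr(record, field)
        return json.dumps(entry)

stream = io.StringIO()  # stands in for stdout
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="checkout", env="prod"))
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)  # DEBUG stays off in prod
log.propagate = False

log.info("order placed", extra={
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
    "user_id": "u-14821",
})
print(stream.getvalue(), end="")
```

user_id lives on the log line here, which is exactly where unbounded context belongs.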

Traces

Inbound spans for every request; outbound spans for every RPC/HTTP/SQL call.
Propagate traceparent (W3C) headers on every outbound call. One missing hop and the trace dies.
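Sample: 100% on errors, 1–10% on success in prod, 100% in dev. Keeping every error requires tail-based sampling, because a head-based decision is made before the outcome is known. Pick one approach and apply it consistently.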
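In practice the OpenTelemetry SDK's propagators handle the header for you, but it is worth knowing what is on the wire. The sketch below parses and builds a version-00 W3C traceparent header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags). The function names are hypothetical, and tracestate and future header versions are deliberately ignored.

```python
import re
import secrets

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

def parse_traceparent(header: str):
    """Return (trace_id, parent_span_id, sampled), or None if the header is invalid."""
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return trace_id, span_id, bool(int(flags, 16) & 0x01)

def outbound_traceparent(trace_id: str, sampled: bool) -> str:
    """Header for an outbound call: same trace ID, fresh span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"

incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
trace_id, parent_id, sampled = parse_traceparent(incoming)
print(outbound_traceparent(trace_id, sampled))
```

The trace ID is the only part that survives every hop, which is why a single service that drops the header splits the trace in two.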

The telemetry stack, layer by layer

You rarely build this from a clean slate. You layer it.

Layer 1 — host metrics

node_exporter on every Linux host. CPU, memory, disk, network, filesystem, systemd units. Installed by your config-management tool and monitored with a handful of simple alerts (filesystem full, load average high, up == 0).

Layer 2 — application metrics

A Prometheus client library in your app, or an OpenTelemetry SDK exporting to Prometheus. Expose /metrics, scrape it. Emit the RED + USE + dependency signals above.

Layer 3 — logs

Apps write structured logs to stdout (in containers) or journald/rsyslog (on VMs). A shipper (Promtail, Grafana Agent, Vector, Fluent Bit) forwards to the store. See rsyslog, rsyslog forwarding, and Loki.

Layer 4 — traces

OpenTelemetry SDK in the app, OpenTelemetry Collector in the middle, Tempo/Jaeger/Zipkin as the backend. Start with trace propagation and a sampling policy; add span attributes as you need them.
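A sampling policy of "all errors plus a fraction of successes" usually lives in the collector rather than in application code, but the decision itself is small. Here is a rough sketch of a tail-style decision, made once the trace's outcome is known. keep_trace and the 5% default are assumptions for illustration.

```python
import hashlib

def keep_trace(trace_id: str, had_error: bool, success_rate: float = 0.05) -> bool:
    """Decide, after the trace has finished, whether to store it."""
    if had_error:
        return True  # errors are always kept
    # Hash the trace ID so every collector instance reaches the same decision.
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
    return bucket < success_rate

kept = sum(keep_trace(f"trace-{i}", had_error=False) for i in range(10_000))
print(kept)  # close to 500 for a 5% rate
```

Hashing the trace ID instead of calling a random number generator keeps the decision deterministic, so spans of one trace arriving at different instances are kept or dropped together.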

Layer 5 — dashboards and alerts

Grafana for humans; Prometheus alerting rules + Alertmanager (or Grafana unified alerting) for the pager. Every alert links to a runbook — no silent knowledge.

Anti-patterns

Anti-pattern	Why it's bad	Fix
Alerting on log strings	Expensive, slow; fires on harmless matches and silently stops firing when the message text changes	Emit a metric counter, alert on the metric
`customer_id` as a metric label	Cardinality explosion; TSDB OOM	Log it, don't label it
"We'll add tracing later"	By the time you have 10 services, retrofitting is painful	Add OTEL SDKs on day one, even if the backend is a no-op
Dashboards with 40 panels	Nobody reads them; the signal drowns	Top panel: SLO status. Second: RED. Details below.
Alerts without runbook links	On-call googles in a panic; tribal knowledge	Every alert annotation has a `runbook_url`
One giant "app" dashboard	Cognitive load; diff-spotting impossible	One per service, linked from a top-level index
100% trace sampling in prod	Network and storage cost; successful traces beyond a small sample add little debugging value	Tail-based sampling keeping all errors + a fraction of success
Logging secrets, tokens, PII	Compliance nightmare; one `grep` away from a breach	Redact at the logger; fail CI if `authorization` appears verbatim
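The "redact at the logger" fix from the last row can be sketched as a logging filter attached to the handler. The pattern list here is an assumption and far from complete; it covers only the formatted message, not exception tracebacks or extra fields.

```python
import io
import logging
import re

# Hypothetical patterns; extend for the secrets your services actually handle.
_SECRET = re.compile(r"(?i)\b(authorization|token|password)(\s*[:=]\s*)(Bearer\s+)?\S+")

class RedactingFilter(logging.Filter):
    """Replace secret-looking values in the formatted message before it is emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # apply %-args first, then scrub
        record.msg = _SECRET.sub(r"\1\2[REDACTED]", message)
        record.args = ()
        return True

stream = io.StringIO()  # stands in for stdout
handler = logging.StreamHandler(stream)
handler.addFilter(RedactingFilter())
log = logging.getLogger("redaction-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

log.info("calling billing with Authorization: Bearer %s", "abc.def.ghi")
print(stream.getvalue(), end="")  # calling billing with Authorization: [REDACTED]
```

Putting the filter on the handler rather than on one logger means it applies to every record that handler emits.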

Where to go next

The per-tool guides in this series:

Prometheus & node_exporter — the collector and the host-level exporter.
Grafana Basics — dashboards, variables, unified alerting.
Loki Logs — log aggregation with LogQL.
OpenTelemetry Traces — distributed tracing: spans, context propagation, OTLP, sampling.
SLOs and On-Call — burn-rate alerts, alert hygiene, handover template.
rsyslog and rsyslog forwarding — the unglamorous pipe that moves logs off the host.