Observability Overview
- Metrics answer "is something wrong and how bad?" — cheap, aggregatable, low-cardinality.
- Logs answer "what exactly happened?" — high-cardinality context, expensive per byte.
- Traces answer "where in the call graph is the time going?" — required the moment you have more than two services.
- Every metric label is a cross-product. customer_id as a label will bankrupt you. status_code won't.
- Instrument at the edge of your service (inbound requests, outbound dependencies), not every function.
- An SLO you can't compute from your existing telemetry is aspirational, not an SLO.
The three pillars
You will hear "metrics, logs, and traces" in every observability talk. It's not wrong, but the useful framing is about the question each answers.
| Pillar | The question it answers | Cardinality tolerance | Cost shape |
|---|---|---|---|
| Metrics | Is it up? Is it slow? How much of it? | Low — every label combination is a time series | Cheap per series, explodes with labels |
| Logs | What exactly did this request do, in order? | High — every line is free-form | Pay per byte ingested + retention |
| Traces | Where in a distributed call did the time go? | Medium — per-span attributes | Pay per span; sample aggressively |
When each is the right tool
- Alert on metrics. You want every instance to contribute to the same aggregated signal. Paging someone because one log line appeared is how you get paged at 3am for a warning nobody reads.
- Debug with logs. Once metrics tell you something is wrong, logs tell you what the program actually did at that moment. Correlate with the trace ID.
- Split latency with traces. "The API is slow" and you have five downstream services? Traces are non-negotiable. Without them you're grepping timestamps across hosts by hand.
Where they overlap
Modern stacks blur the lines on purpose:
- Metrics from logs. LogQL's rate(), Elastic's metric aggregations, and Splunk's stats all let you derive a metric from a log stream. Useful for things you forgot to instrument, but not a replacement for native metrics (too slow and too expensive for alerting).
- Exemplars. A metric sample that carries a trace ID. Click the latency spike, jump to a real example trace. Prometheus supports them, Grafana renders them.
- Span events as structured logs. OpenTelemetry span events let you attach "log lines" to a trace span. They are better than free-text logs for anything that is clearly part of a request.
Cardinality — the one rule you must not break
A metric's cost isn't driven by how often you emit it. It's driven by how many distinct label combinations you emit. Each unique combination is a new time series your TSDB has to index, churn on compaction, and keep in memory for the active window.
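To make the cross-product concrete, here is a minimal sketch of the worst-case series count for a single metric. The label cardinalities are hypothetical numbers chosen for illustration, and real series counts are usually lower because not every combination occurs in practice.

```python
from math import prod
from typing import Dict


def worst_case_series(label_cardinalities: Dict[str, int]) -> int:
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


# Hypothetical cardinalities, for illustration only.
bounded = {"route": 40, "method": 5, "status": 60}
with_customer = {**bounded, "customer_id": 100_000}

print(worst_case_series(bounded))        # 12000
print(worst_case_series(with_customer))  # 1200000000
```

Adding one unbounded label multiplies the existing bound by the number of distinct values, no matter how rarely the metric is emitted.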
Good labels
- method (GET, POST, …)
- status_code (200, 404, 500; 60-ish values)
- route (template, not expanded: /users/:id, not /users/14821)
- service, env, region, instance
Bad labels
- user_id / customer_id / order_id: unbounded
- url with query string: unbounded
- error_message: free text, will contain timestamps and UUIDs
- host_ip in a dynamic infrastructure: churns as pods come and go
Logs are the opposite: they tolerate unbounded context. Put the user_id on the log line. Put the trace ID on the log line. Leave the metric counter with just {route, method, status}.
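The route label only stays bounded if concrete paths are collapsed into templates before they reach the metric. A rough sketch of that normalization follows; the function name and the two patterns are assumptions for illustration. Where your web framework exposes the matched route template, prefer that over guessing from the path.

```python
import re

# Hypothetical patterns: numeric IDs and UUIDs are treated as path parameters.
_NUMERIC = re.compile(r"^\d+$")
_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_template(path: str) -> str:
    """Collapse a concrete request path into a bounded template for use as a label."""
    path = path.split("?", 1)[0]  # drop the query string, which is unbounded
    segments = [
        ":id" if _NUMERIC.match(segment) or _UUID.match(segment) else segment
        for segment in path.split("/")
    ]
    return "/".join(segments)


print(route_template("/users/14821"))         # /users/:id
print(route_template("/users/14821?page=2"))  # /users/:id
```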
prometheus_tsdb_head_series is the live series count. If it climbs without the fleet growing, something is emitting a high-cardinality label. topk(10, count by (__name__)({__name__=~".+"})) shows the worst offenders.
SLI and SLO framing
The compressed version of the Google SRE book's Chapter 4:
- SLI (Service Level Indicator): a measurement. "Fraction of requests that returned 2xx/3xx in under 300 ms."
- SLO (Service Level Objective): a target for that measurement over a window. "99.5% of requests over a rolling 30 days."
- SLA (Service Level Agreement): the contractual version, with money attached. If you aren't billing or being billed on it, you don't have an SLA, you have an SLO.
- Error budget: 1 - SLO. For 99.5% over 30 days, that's 0.5% × 30 × 24 × 60 = 216 minutes of badness allowed. When you've burned through it, ship fewer features and more reliability.
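The error-budget arithmetic above can be sketched in a few lines. The function names are illustrative, and expressing the budget in minutes assumes a time-based reading of the SLO, as the worked figure does.

```python
def error_budget_minutes(slo: float, window_days: int) -> float:
    """Minutes of allowed badness: (1 - SLO) times the window length in minutes."""
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining_minutes(slo: float, window_days: int, bad_minutes: float) -> float:
    """Budget left after subtracting the bad minutes already observed in the window."""
    return error_budget_minutes(slo, window_days) - bad_minutes


print(round(error_budget_minutes(0.995, 30), 1))             # 216.0
print(round(budget_remaining_minutes(0.995, 30, 150.0), 1))  # 66.0
```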
A useful SLI has two properties:
- Observable from outside the service. If the only way to tell the SLI is failing is to check the same service that's failing, you've built a tautology. Measure at the load balancer, at the client, at the edge.
- Computable from your telemetry today. An SLO you can't actually measure is aspirational at best and dishonest at worst.
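As a check on "computable from your telemetry", the example SLI ("fraction of requests that returned 2xx/3xx in under 300 ms") can be sketched over a list of request records. The record shape, a (status, duration) pair, is an assumption. In practice this would come from load-balancer logs or a histogram, and treating an empty window as fully good is one possible design choice among several.

```python
from typing import Iterable, Tuple


def good_request_fraction(
    requests: Iterable[Tuple[int, float]], threshold_s: float = 0.3
) -> float:
    """Fraction of requests with a 2xx/3xx status that finished under threshold_s."""
    total = 0
    good = 0
    for status, duration_s in requests:
        total += 1
        if 200 <= status < 400 and duration_s < threshold_s:
            good += 1
    # Assumption: no traffic counts as meeting the objective.
    return good / total if total else 1.0


# Hypothetical sample: one slow success and one fast failure both count as bad.
sample = [(200, 0.12), (200, 0.45), (302, 0.05), (500, 0.02)]
print(good_request_fraction(sample))  # 0.5
```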
See SLOs and On-Call for burn-rate alerting, alert hygiene, and the on-call handover template.
Signals every service should emit
Before you design a dashboard, make sure the telemetry exists. The list below is the minimum a production service should emit. If you can't tick all of these, that's your next sprint.
Inbound request signals (RED)
- Rate: http_requests_total{route,method,status}, a counter.
- Errors: derivable from the above: rate(http_requests_total{status=~"5.."}[5m]).
- Duration: http_request_duration_seconds_bucket{route,method,le}, a histogram, so you can compute p50/p95/p99.
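To show why a bucketed histogram is enough to estimate p50/p95/p99, here is a simplified sketch of quantile estimation by linear interpolation inside the matching bucket. It is in the spirit of PromQL's histogram_quantile(), not a reimplementation of it. The bucket bounds and counts are made up, and the buckets are assumed to be cumulative, as le buckets are.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # The quantile lies beyond the last finite bucket; report that bound.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


# Hypothetical cumulative buckets: le=0.1, 0.3, 1.0, +Inf.
buckets = [(0.1, 50), (0.3, 90), (1.0, 99), (math.inf, 100)]
for q in (0.5, 0.95):
    print(q, round(histogram_quantile(q, buckets), 3))
```

The estimate is only as fine-grained as the bucket boundaries, so it helps to place a boundary near any latency threshold you alert on.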
Resource signals (USE)
- Utilization: CPU %, memory used / memory limit, GC pause ratio, disk %.
- Saturation: request queue depth, connection pool in-use count, thread pool queue length.
- Errors: panics, OOM kills, restart count (kube_pod_container_status_restarts_total), failed fsyncs.
Dependency signals
Every outbound call should have its own RED metrics, labeled by dependency name, not URL:
# Good: bounded by the list of dependencies you have.
http_client_requests_total{dep="users-api", status="200"}
http_client_request_duration_seconds_bucket{dep="billing", le="0.1"}
# Bad: url is unbounded.
http_client_requests_total{url="https://api.example.com/v1/users/14821"}
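One way to keep the dep label bounded is to map the outbound host to a known dependency name and fall back to a single catch-all value. The sketch below assumes a hand-maintained table. The host names and the "other" fallback are hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical mapping from outbound host to dependency name.
KNOWN_DEPS = {
    "api.example.com": "users-api",
    "billing.internal": "billing",
}


def dep_label(url: str) -> str:
    """Return a bounded dependency label for an outbound URL."""
    host = urlsplit(url).hostname or ""
    # Unknown hosts share one label value so a new URL can never add a series per path.
    return KNOWN_DEPS.get(host, "other")


print(dep_label("https://api.example.com/v1/users/14821"))  # users-api
print(dep_label("https://unknown.example.org/anything"))    # other
```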
Business signals
A handful of counters that matter to humans, not engineers. Orders placed, signups completed, invoices sent. When these go to zero, something is wrong even if every technical dashboard is green.
Logs
- Structured (JSON or logfmt), not free text. Your aggregator can parse either format; free text leaves it with nothing reliable to parse or index.
- Include trace_id, span_id, user_id (in logs, not in metric labels), request_id, service, env.
- Levels: DEBUG off in prod, INFO for business events, WARN for recoverable problems, ERROR for things a human should see.
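The log fields listed above can be emitted with the standard library alone. This is a minimal sketch, not a recommended logging library. The formatter class, the "checkout" service name, and the sample IDs are assumptions, and the context fields are passed through the logging module's extra mechanism.

```python
import json
import logging

# Per-request fields copied onto the log line when present on the record.
CONTEXT_FIELDS = ("trace_id", "span_id", "user_id", "request_id")


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str, env: str) -> None:
        super().__init__()
        self.service = service
        self.env = env

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
            "service": self.service,
            "env": self.env,
        }
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                payload[field] = value
        return json.dumps(payload)


handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter(service="checkout", env="prod"))
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)  # DEBUG stays off, as suggested for prod

log.info(
    "order placed",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "user_id": "u-123"},
)
```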
Traces
- Inbound spans for every request; outbound spans for every RPC/HTTP/SQL call.
- Propagate traceparent (W3C) headers on every outbound call. One missing hop and the trace dies.
- Sample: 100% on errors, 1–10% on success in prod, 100% in dev. Head-based or tail-based: pick one.
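For readers who have not looked inside the header, here is a hedged sketch of what propagating traceparent involves: keep the inbound trace ID, mint a new span ID, and forward the flags. It handles only version 00 of the W3C format, and the function names are illustrative. In a real service the OpenTelemetry SDK's propagators would do this for you.

```python
import re
import secrets
from typing import Optional, Tuple

# version 00: 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_id, flags), or None if the header is invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are not valid
    return trace_id, parent_id, flags


def outbound_traceparent(inbound: Optional[str]) -> str:
    """Build the header for an outbound call, continuing the inbound trace if there is one."""
    parsed = parse_traceparent(inbound) if inbound else None
    if parsed:
        trace_id, _, flags = parsed
    else:
        # No usable inbound context: start a new trace, marked as sampled.
        trace_id, flags = secrets.token_hex(16), "01"
    span_id = secrets.token_hex(8)  # this hop's own span ID
    return f"00-{trace_id}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(outbound_traceparent(incoming))
```

A hop that drops the header causes the next service to start a fresh trace ID, which is how one missing hop splits the trace.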
The telemetry stack, layer by layer
You rarely build this from a clean slate. You layer it.
Layer 1 — host metrics
node_exporter on every Linux host. CPU, memory, disk, network, filesystem, systemd units. Installed by your config-management tool and monitored with a handful of simple alerts (filesystem full, load average high, up == 0).
Layer 2 — application metrics
A Prometheus client library in your app, or an OpenTelemetry SDK exporting to Prometheus. Expose /metrics, scrape it. Emit the RED + USE + dependency signals above.
Layer 3 — logs
Apps write structured logs to stdout (in containers) or journald/rsyslog (on VMs). A shipper (Promtail, Grafana Agent, Vector, Fluent Bit) forwards to the store. See rsyslog, rsyslog forwarding, and Loki.
Layer 4 — traces
OpenTelemetry SDK in the app, OTEL Collector in the middle, Tempo/Jaeger/Zipkin as the backend. Start with trace propagation and a sampling policy; add span attributes as you need them.
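A sampling policy of the kind described earlier (all errors, a fraction of successes) can be sketched as a single decision function. This assumes the decision is made after the trace outcome is known, that is, tail-based. The 5% default is a hypothetical rate, and hashing on the trace ID is one common way to make every collector reach the same verdict.

```python
def keep_trace(trace_id: str, had_error: bool, success_rate: float = 0.05) -> bool:
    """Decide whether to keep a finished trace.

    trace_id is the 32-character hex trace ID; success_rate is the fraction
    of error-free traces to keep.
    """
    if had_error:
        return True  # always keep traces that contain an error
    # Deterministic: map the last 8 hex digits to [0, 1) so the same trace ID
    # always produces the same decision.
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < success_rate


print(keep_trace("4bf92f3577b34da6a3ce929d0e0e4736", had_error=True))  # True
print(keep_trace("f" * 32, had_error=False))                           # False
```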
Layer 5 — dashboards and alerts
Grafana for humans; Prometheus alerting rules + Alertmanager (or Grafana unified alerting) for the pager. Every alert links to a runbook — no silent knowledge.
Anti-patterns
| Anti-pattern | Why it's bad | Fix |
|---|---|---|
| Alerting on log strings | Expensive, slow, fires on harmless grep matches after a message-text change | Emit a metric counter, alert on the metric |
| customer_id as a metric label | Cardinality explosion; TSDB OOM | Log it, don't label it |
| "We'll add tracing later" | By the time you have 10 services, retrofitting is painful | Add OTEL SDKs on day one, even if the backend is a no-op |
| Dashboards with 40 panels | Nobody reads them; the signal drowns | Top panel: SLO status. Second: RED. Details below. |
| Alerts without runbook links | On-call googles in a panic; tribal knowledge | Every alert annotation has a runbook_url |
| One giant "app" dashboard | Cognitive load; diff-spotting impossible | One per service, linked from a top-level index |
| 100% trace sampling in prod | Network and storage cost; no benefit over 5% | Tail-based sampling keeping all errors + a fraction of success |
| Logging secrets, tokens, PII | Compliance nightmare; one grep away from a breach | Redact at the logger; fail CI if authorization appears verbatim |
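"Redact at the logger" from the last row can be sketched as a logging filter. The key names in the pattern are assumptions. Pattern-based redaction is a backstop that will miss secrets it does not recognize, so it complements rather than replaces keeping secrets out of log calls.

```python
import logging
import re

# Hypothetical pattern: key names that usually precede a secret value.
_SECRET = re.compile(
    r"(?i)\b(authorization|token|password)(\s*[:=]\s*)(Bearer\s+)?\S+"
)


class RedactSecrets(logging.Filter):
    """Replace values that follow known secret keys before the record is emitted."""

    def filter(self, record: logging.LogRecord) -> bool:
        # Format the message first so secrets passed as arguments are covered too.
        record.msg = _SECRET.sub(r"\1\2[REDACTED]", record.getMessage())
        record.args = ()
        return True  # never drop the record, only rewrite it


handler = logging.StreamHandler()
handler.addFilter(RedactSecrets())
log = logging.getLogger("redaction-demo")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.info("calling billing with Authorization: Bearer %s", "abc123")
```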
Where to go next
The per-tool guides in this series:
- Prometheus & node_exporter — the collector and the host-level exporter.
- Grafana Basics — dashboards, variables, unified alerting.
- Loki Logs — log aggregation with LogQL.
- OpenTelemetry Traces — distributed tracing: spans, context propagation, OTLP, sampling.
- SLOs and On-Call — burn-rate alerts, alert hygiene, handover template.
- rsyslog and rsyslog forwarding — the unglamorous pipe that moves logs off the host.