OpenTelemetry Traces

This page covers traces and spans, context propagation, OTLP, sampling, and how to connect traces to your metrics and logs. Read Observability overview for the pillars and cost trade-offs; this page is the practical, hands-on guide to tracing. Traces connect to Loki through the trace_id field in logs, and to SLOs & on-call when you tie error budgets back to example traces.

If you only remember four things

A trace is a tree of spans for one end-to-end request. Spans are the unit you store, query, and pay for.
Context must propagate on every hop (W3C traceparent). One service that drops the header breaks the tree into orphan spans.
Sample in production. 100% of everything is a bill and a liability; keep errors, sample successes.
Put the trace ID in structured logs. That is the glue between "slow span in Jaeger" and "exact log lines for that request".

On this page

Traces, spans, and attributes
Context propagation (W3C)
End-to-end: app → OTLP → collector → backend
OTLP and the OpenTelemetry Collector
Head vs tail sampling and when to use what
Production foot-guns
Backends and querying
Tying together metrics, logs, and traces
Troubleshooting
Where to go next

Traces, spans, and attributes

A span is one unit of work with a start time, end time, name, and optional structured attributes. A trace is the set of spans sharing a root — usually one inbound HTTP or RPC that fans out to databases and internal services. Parent/child links form the call tree. That tree answers: "The checkout API is slow — is it LDAP, the cart service, or Postgres?"
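As a rough illustration of how the call tree answers that question, here is a minimal sketch in plain Python. The Span class, the span names, and the timings are all hypothetical, and it assumes children run one after another (no overlap), which real traces do not guarantee. A trace backend does this for you; the point is only to show the shape of the data.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int
    end_ms: int
    children: list["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def build_tree(spans: list[Span]) -> Span:
    """Link spans by parent_id and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def self_time_ms(span: Span) -> int:
    """Time spent in the span itself, assuming children do not overlap."""
    return span.duration_ms - sum(c.duration_ms for c in span.children)


def slowest(root: Span) -> Span:
    """Return the span with the largest self time anywhere in the tree."""
    best = root
    stack = [root]
    while stack:
        s = stack.pop()
        if self_time_ms(s) > self_time_ms(best):
            best = s
        stack.extend(s.children)
    return best


# Hypothetical checkout request with three outbound calls.
spans = [
    Span("a", None, "POST /checkout", 0, 900),
    Span("b", "a", "ldap.bind", 10, 60),
    Span("c", "a", "cart.GetCart", 70, 170),
    Span("d", "a", "postgres.query", 180, 880),
]
root = build_tree(spans)
culprit = slowest(root)
print(culprit.name, self_time_ms(culprit))  # postgres.query 700
```

In this made-up trace the root only spends 50 ms in its own code; the 700 ms Postgres span is where the time goes.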

Span attributes should be bounded, like metric labels. Good: db.name, rpc.service, http.route as a template, error = true / false. Bad: full SQL text, raw URLs with query parameters, or user identifiers as primary grouping — those belong in logs tied by trace ID, not in every span.
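One way to keep http.route bounded when the framework does not hand you the matched route is to collapse raw paths into templates before setting the attribute. The sketch below is a fallback under stated assumptions: the route_template helper and its placeholder names ({id}, {uuid}) are hypothetical, and most web frameworks already expose the matched route, which is preferable when available.

```python
import re

# Hypothetical rule set: numeric segments and UUIDs become placeholders.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def route_template(url_path: str) -> str:
    """Collapse a raw URL path into a low-cardinality template.

    The query string is dropped entirely; it belongs in logs, not span attributes.
    """
    path = url_path.split("?", 1)[0]
    parts = []
    for segment in path.split("/"):
        if segment.isdigit():
            parts.append("{id}")
        elif _UUID.match(segment):
            parts.append("{uuid}")
        else:
            parts.append(segment)
    return "/".join(parts)


print(route_template("/users/42/orders/7?expand=items"))
# /users/{id}/orders/{id}
```

Every user and order now maps to the same attribute value, so the trace store indexes one route instead of millions.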

Instrument boundaries, not every function: one span per inbound request, one per meaningful outbound call (HTTP client, gRPC, DB query over some threshold), plus optional spans for long internal phases you actually debug.

Context propagation (W3C)

Each hop passes two things to the next: the trace ID (which trace this request belongs to) and the parent span ID, so the next hop can attach its span as a child. The common wire format is the W3C Trace Context header:

traceparent: 00-<trace-id>-<parent-span-id>-<flags>

Middleware in each framework must: parse incoming traceparent (or start a new trace if absent), create a span for the request, and on every outbound call inject the header on that call's client. If you have async jobs, the trace context has to be carried in the job payload, not only thread-local state.
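The three middleware steps can be sketched in plain Python as follows. This is not the OpenTelemetry SDK, which ships propagators that do this for you; it only shows the mechanics. In a version 00 header the trace ID is 32 lowercase hex characters, the parent span ID is 16, and the flags are 2 (01 means sampled). The function names, the lowercase header key, the choice to mark new traces as sampled, and the job payload shape are assumptions made for illustration.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    flags: str


def parse_traceparent(header: Optional[str]) -> Optional[TraceContext]:
    """Return the caller's context, or None if the header is absent or invalid."""
    if not header:
        return None
    m = _TRACEPARENT.match(header.strip())
    if not m:
        return None
    version, trace_id, span_id, flags = m.groups()
    # Version ff and all-zero ids are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, flags)


def start_server_span(headers: dict[str, str]) -> tuple[TraceContext, Optional[str]]:
    """Continue the caller's trace or start a new one.

    Returns (context of the new span, parent span id or None for a new root).
    Assumes header keys are already lowercased.
    """
    incoming = parse_traceparent(headers.get("traceparent"))
    if incoming is None:
        # New trace; marking it sampled ("01") is an assumption of this sketch.
        return TraceContext(secrets.token_hex(16), secrets.token_hex(8), "01"), None
    child = TraceContext(incoming.trace_id, secrets.token_hex(8), incoming.flags)
    return child, incoming.span_id


def inject(ctx: TraceContext, headers: dict[str, str]) -> None:
    """Write the current span's context onto an outbound request's headers."""
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{ctx.flags}"


incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
ctx, parent_id = start_server_span(incoming)

outbound: dict[str, str] = {}
inject(ctx, outbound)

# Async job: the context travels in the payload, not in thread-local state.
job_payload = {"task": "send_receipt", "traceparent": outbound["traceparent"]}
```

The worker that picks up job_payload would call parse_traceparent on the stored value and start its span as a child, exactly as an HTTP handler would.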

One broken hop = a useless trace. When debugging missing spans, the first check is: if A calls B and B calls C, does B forward traceparent to C? Nginx, Envoy, and language HTTP clients are common culprits.

End-to-end: app → OTLP → collector → backend

One working picture you can copy to a whiteboard. Adjust hostnames, but keep the single ingress for telemetry so the app stays dumb about vendors:

Application — OpenTelemetry SDK / auto-instrumentation creates spans and exports over OTLP (gRPC or HTTP) to a known host/port (often localhost:4317 to a local agent).
Collector (recommended) — Receives OTLP, batches, retries, may tail-sample or redact attributes, then fans out. Running one per host or one per cluster is normal.
Trace backend — For example Tempo or Jaeger over their native ingest or OTLP. The backend stores spans and answers trace-id queries; dashboards stay in Grafana.

Example shape (conceptual) — a single collector pipeline receiving from apps and writing traces:

# opentelemetry-collector config (excerpt) — not exhaustive
receivers:
  otlp:
    protocols:
      grpc:
      http:

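processors:
  batch: {}
  # memory_limiter needs explicit limits; an empty block fails config validation.
  # The values below are placeholders to size for your collector.
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
  # tail_sampling lives here in advanced setups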

exporters:
  otlp:
    endpoint: tempo.internal:4317
    tls:
      insecure: false

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]

Metrics and logs can run through the same collector with different pipelines, but trace IDs in logs (see Loki) are what bind the story together during an incident.

OTLP and the OpenTelemetry Collector

OTLP (OpenTelemetry Protocol) is the standard way the SDK talks to a backend, usually over gRPC or HTTP/protobuf to a collector. The collector is the single place to receive from many agents, batch and retry, sample, redact or drop attributes, fan out to Tempo, Jaeger, and commercial vendors, and export metrics in parallel.

A typical path: app SDK → (localhost) OTLP → collector → storage. The SDK should not need to know whether the backend is Grafana Tempo or Jaeger; the collector’s exporters do.

Head vs tail sampling and when to use what

Head-based — decide in the SDK whether to keep the trace, before you finish the request. Simple, but you cannot know yet if a deep child will 500.
Tail-based — the collector (or a dedicated processor) holds spans until the trace completes, then keeps, for example, "all error traces" + "1% of success traces". This is the pattern you want in large prod fleets if you can afford the buffering.
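The difference between the two decisions can be sketched in a few lines of Python. This is a simplified illustration, not the SDK's or the collector's actual algorithm: head_keep, tail_keep, the FinishedSpan type, and the thresholds are hypothetical. The head decision sees only the trace ID; the tail decision sees every finished span.

```python
from dataclasses import dataclass


def head_keep(trace_id: str, ratio: float) -> bool:
    """Head decision: deterministic on the trace id, so every service agrees.

    Uses the low 64 bits of the id as a uniform value (a simplification).
    """
    return int(trace_id[-16:], 16) < ratio * 2**64


@dataclass
class FinishedSpan:
    trace_id: str
    is_error: bool
    duration_ms: float


def tail_keep(spans: list[FinishedSpan], success_ratio: float, slow_ms: float) -> bool:
    """Tail decision: made only after all spans of the trace have arrived."""
    if not spans:
        return False
    if any(s.is_error for s in spans):
        return True  # keep every trace that contains an error
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True  # keep latency outliers
    return head_keep(spans[0].trace_id, success_ratio)  # sample the healthy rest


tid = "4bf92f3577b34da6a3ce929d0e0e4736"
healthy = [FinishedSpan(tid, False, 40.0), FinishedSpan(tid, False, 12.0)]
failing = healthy + [FinishedSpan(tid, True, 8.0)]  # error in a deep child

print(tail_keep(healthy, success_ratio=0.0, slow_ms=1000.0))  # False
print(tail_keep(failing, success_ratio=0.0, slow_ms=1000.0))  # True
```

A head sampler run at request start could not tell these two traces apart, because the failing child had not executed yet. That is the whole argument for paying the buffering cost of tail sampling.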

Sampling decisions in practice

Situation	Reasonable default	Why
Local dev, CI, and most staging	Always on or very high rate	Cheap at low volume; you want the full tree when debugging without fighting odds.
Prod, modest traffic, simple ops	Head: parent-based + higher rate for errors	Low moving parts; accept that a rare 500 deep in the tree might be missed if the root looked healthy at decision time.
Prod, high QPS, cost- or PII-sensitive	Tail in the collector (keep errors + latency outliers + a small % of success)	You pay storage for meaningful traces, not a random cross-section of healthy traffic. Requires memory/depth limits you monitor.
Strict retention / compliance	Explicitly drop or hash attributes at the collector	Sampling does not remove legal obligations; pair policy with log and trace redaction.

Rule of thumb: 100% in dev and staging. In prod, keep all errors plus a low fraction of successes, which in practice means tail sampling (if the stack supports it and you sized the collector) or another error-aware policy; plain head sampling only controls the overall rate. Never ship "record everything" with no cap: that becomes both a bill and a cardinality problem.
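To get a feel for what the rule of thumb saves, a back-of-the-envelope estimate helps. The function and all the numbers below (1,000 requests per second, 20 spans per trace, 1% errors, 1% of successes kept) are hypothetical inputs, not measurements; plug in your own.

```python
def stored_spans_per_day(
    requests_per_second: float,
    spans_per_trace: float,
    error_rate: float,
    success_sample_rate: float,
) -> float:
    """Spans stored per day when all error traces and a fraction of successes are kept."""
    traces_per_day = requests_per_second * 86_400
    kept_fraction = error_rate + (1 - error_rate) * success_sample_rate
    return traces_per_day * kept_fraction * spans_per_trace


everything = stored_spans_per_day(1000, 20, error_rate=0.01, success_sample_rate=1.0)
sampled = stored_spans_per_day(1000, 20, error_rate=0.01, success_sample_rate=0.01)

print(f"record everything: {everything:,.0f} spans/day")
print(f"errors + 1% of successes: {sampled:,.0f} spans/day")
```

With these inputs the sampled policy stores roughly 34 million spans a day instead of about 1.7 billion, while still keeping every error trace.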

Production foot-guns

These show up in real outages and postmortems. None are hypothetical.

Unbounded high-cardinality labels — Raw URLs, user ids, or shard keys on every span as tags explode index cost in the trace store (same failure mode as bad metric labels; see observability overview). Prefer route templates, bounded enums, and move detail to logs with trace_id.
Tail sampling without capacity planning — Buffers that hold almost-complete traces under spike load can OOM the collector or drop data silently. Set limits, watch memory, and alert on drop / refused reasons.
PII in span attributes — Email, name, and free-text fields often violate policy and bloat storage. Scrub or hash at the collector; default SDK attributes are not automatically safe.
Head-only sampling on deep graphs — If the root span always looks fine, you may never record a failing leaf (for example, a 500 in a service three hops away). Mitigate with error-aware samplers, tail sampling, or error metrics that link via exemplars to traces.
Trace–log–SLO disconnect: If on-call has SLOs in Grafana but metrics lack exemplars and logs lack trace_id, the team burns time jumping between tools. Standardize service.name and trace_id across all three signals, and keep the error-budget view in the same workspace as traces and logs.
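For the PII item, the scrubbing logic itself is small. In practice this belongs in a collector processor rather than application code; the Python below only sketches the policy. The attribute keys, the DROP_KEYS and HASH_KEYS sets, and the scrub_attributes helper are hypothetical.

```python
import hashlib
import hmac

# Hypothetical policy: drop free-text PII outright, keep a keyed hash of ids.
DROP_KEYS = {"user.email", "user.name"}
HASH_KEYS = {"user.id"}


def scrub_attributes(attrs: dict[str, str], secret: bytes) -> dict[str, str]:
    """Return a copy of span attributes with PII removed or pseudonymized."""
    clean: dict[str, str] = {}
    for key, value in attrs.items():
        if key in DROP_KEYS:
            continue  # never leaves the collector
        if key in HASH_KEYS:
            # Keyed hash, so values cannot be reversed by hashing guessed ids.
            digest = hmac.new(secret, value.encode(), hashlib.sha256).hexdigest()
            clean[key] = digest[:16]
        else:
            clean[key] = value
    return clean


raw = {
    "http.route": "/users/{id}/orders",
    "user.id": "12345",
    "user.email": "pat@example.com",
}
print(scrub_attributes(raw, secret=b"rotate-me"))
```

Note that hashing fixes the privacy problem, not the cardinality problem: a hashed user id is still one distinct value per user, so it should stay out of anything the trace store indexes.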

Backends and querying

Tempo (Grafana), Jaeger, and Zipkin all store the same model: trace id → span list. The UX differs (Grafana Explore vs Jaeger UI). Pick the one that fits your Grafana/Prometheus world; the OpenTelemetry part is the same.

When searching, start from a known trace ID (from a log line or a metric exemplar) rather than "find slow things in the last hour". Open-ended search only works with good service and operation names on your spans, which comes back to clean instrumentation at the edge.

Tying together metrics, logs, and traces

Logs — every structured log line for a request should include trace_id (and often span_id). Loki LogQL can pivot from a trace you found in Grafana Explore to the exact log lines for the same request.
Metrics — exemplars on histograms can carry a trace id so a latency spike in Prometheus or Grafana points at a real trace. Prometheus & node_exporter and Grafana cover the graph side; the exemplar is the bridge from SLI to span.
SLOs and incidents — burn-rate alerts in SLOs & on-call should lead to a trace or to a runbook, not a dead end. In practice: SLI from metrics, drill to exemplar or span, confirm with logs.
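Getting trace_id into every log line is mostly plumbing. The sketch below uses only the Python standard library: a context variable holds the current IDs, a logging filter copies them onto each record, and a formatter emits JSON. The field names, the "checkout" service name, and the use of an in-memory stream (so the example can inspect its own output) are assumptions; with the OpenTelemetry SDK you would read the IDs from the active span instead of setting them by hand.

```python
import contextvars
import io
import json
import logging

# Set by request middleware once the trace context is known.
current_trace_id = contextvars.ContextVar("current_trace_id", default="")
current_span_id = contextvars.ContextVar("current_span_id", default="")


class TraceContextFilter(logging.Filter):
    """Copy the current trace and span ids onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = current_trace_id.get()
        record.span_id = current_span_id.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line with a fixed field name for the trace id."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps(
            {
                "level": record.levelname,
                "message": record.getMessage(),
                "service.name": "checkout",  # hypothetical service name
                "trace_id": record.trace_id,
                "span_id": record.span_id,
            }
        )


stream = io.StringIO()  # a real service would write to stdout
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
logger.info("payment declined")

line = json.loads(stream.getvalue())
print(line["trace_id"], line["message"])
```

With lines shaped like this, a LogQL query along the lines of {service_name="checkout"} | json | trace_id="4bf92f3577b34da6a3ce929d0e0e4736" pulls the exact lines for one request (the label name depends on how your logs are shipped).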

The full mental model of the three pillars and when not to add a signal lives in observability overview. This page assumes you are already committed to distributed tracing; use that page to sanity-check cost and team workflow.

Troubleshooting

Symptom	Likely cause	What to check
Traces appear only in one service	Context not injected on one outbound call	Client library, manual HTTP, or a proxy stripping headers
Enormous span volume, storage climbing	Sampling off or 100% in prod	SDK sampler + collector policy; cap span attributes
“Unknown” parent in UI	Clock skew, late spans, or wrong parent id	NTP/chrony; ensure single trace id across services
Can't jump from log to trace	`trace_id` not in log schema	Structured logging + one field name everywhere
High cardinality in span attributes	Raw URLs or user ids on every span	Move to logs; use templates on `http.route`
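The first and third rows can be checked mechanically if you can export a sample of spans. The sketch below assumes each span is a dict with trace_id, span_id, parent_id, and service keys, and that you know which services sit at the edge; both the shape and the diagnose helper are hypothetical. It flags roots that appear in non-edge services (a dropped header upstream) and spans whose parent never arrived.

```python
from collections import defaultdict
from typing import Optional, TypedDict


class SpanRecord(TypedDict):
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str


def diagnose(spans: list[SpanRecord], edge_services: set[str]) -> list[tuple[str, str, str]]:
    """Return (trace_id, service, finding) for spans that break the tree."""
    by_trace: dict[str, list[SpanRecord]] = defaultdict(list)
    for s in spans:
        by_trace[s["trace_id"]].append(s)

    findings = []
    for trace_id, group in by_trace.items():
        ids = {s["span_id"] for s in group}
        for s in group:
            parent = s["parent_id"]
            if parent is None and s["service"] not in edge_services:
                # A new trace started mid-graph: the caller did not send traceparent.
                findings.append((trace_id, s["service"], "root in non-edge service"))
            elif parent is not None and parent not in ids:
                # Parent was sampled out, not exported, or is still in flight.
                findings.append((trace_id, s["service"], "parent span missing"))
    return findings


sample: list[SpanRecord] = [
    {"trace_id": "t1", "span_id": "a1", "parent_id": None, "service": "gateway"},
    {"trace_id": "t1", "span_id": "b1", "parent_id": "a1", "service": "cart"},
    {"trace_id": "t2", "span_id": "c1", "parent_id": None, "service": "inventory"},
    {"trace_id": "t1", "span_id": "d1", "parent_id": "zz", "service": "billing"},
]
for finding in diagnose(sample, edge_services={"gateway"}):
    print(finding)
```

Here inventory starts its own trace, which points at whatever calls it (cart, or a proxy in between) as the hop that drops the header.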

Where to go next

SLOs and On-Call — next in the observability track.
Loki Logs — correlate logs with trace IDs.
Grafana Basics — Explore, exemplars, trace-to-log.