OpenTelemetry Traces
- A trace is a tree of spans for one end-to-end request. Spans are the unit you store, query, and pay for.
- Context must propagate on every hop (W3C traceparent). One service that drops the header breaks the tree into orphan spans.
- Sample in production. 100% of everything is a bill and a liability; keep errors, sample successes.
- Put the trace ID in structured logs. That is the glue between "slow span in Jaeger" and "exact log lines for that request".
Traces, spans, and attributes
A span is one unit of work with a start time, end time, name, and optional structured attributes. A trace is the set of spans sharing a root — usually one inbound HTTP or RPC that fans out to databases and internal services. Parent/child links form the call tree. That tree answers: "The checkout API is slow — is it LDAP, the cart service, or Postgres?"
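The call-tree question above can be illustrated with a small sketch. It is not the OpenTelemetry data model: the `Span` class, the span names, and the timings are hypothetical, and children are assumed to run one after another without overlap.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    name: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_times(spans):
    """Map span name to the time not covered by its direct children (ms).

    Assumes children of a span do not overlap in time.
    """
    child_total = {}
    for s in spans:
        if s.parent_id is not None:
            child_total[s.parent_id] = child_total.get(s.parent_id, 0.0) + s.duration_ms
    return {s.name: s.duration_ms - child_total.get(s.span_id, 0.0) for s in spans}


# One hypothetical checkout request that fans out to three dependencies.
trace = [
    Span("a1", None, "POST /checkout", 0, 900),
    Span("b1", "a1", "ldap.bind", 10, 60),
    Span("b2", "a1", "cart-service GetCart", 70, 170),
    Span("b3", "a1", "postgres SELECT orders", 180, 880),
]

slowest_name, slowest_ms = max(self_times(trace).items(), key=lambda kv: kv[1])
print(slowest_name, slowest_ms)
```

Self time (duration minus time spent in children) is what separates "the checkout handler is slow" from "the checkout handler is waiting on Postgres".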
Span attributes should be bounded, like metric labels. Good: db.name, rpc.service, http.route as a template, error = true / false. Bad: full SQL text, raw URLs with query parameters, or user identifiers as primary grouping — those belong in logs tied by trace ID, not in every span.
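One way to keep attributes bounded is sketched below, under some assumptions: the allowlist and the `{id}` placeholder are hypothetical, and most frameworks can supply the route template directly, so the URL rewriting here is only a fallback when the template is not available.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist: keys with a small, known set of values.
ALLOWED_KEYS = {"db.name", "rpc.service", "http.route", "error"}

_UUID = re.compile(
    r"^[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}$"
)


def route_template(raw_url: str) -> str:
    """Drop the query string and replace ID-like path segments with {id}."""
    path = urlsplit(raw_url).path
    parts = []
    for segment in path.split("/"):
        if segment.isdigit() or _UUID.match(segment):
            parts.append("{id}")
        else:
            parts.append(segment)
    return "/".join(parts)


def bounded_attributes(attrs: dict) -> dict:
    """Keep only allowlisted keys; everything else belongs in logs."""
    return {k: v for k, v in attrs.items() if k in ALLOWED_KEYS}


raw = "https://shop.example/api/users/8841/orders?token=abc"
attrs = bounded_attributes({
    "http.route": route_template(raw),
    "db.name": "orders",
    "error": False,
    "db.statement": "SELECT * FROM orders WHERE user_id = 8841",  # dropped
    "user.id": "8841",                                            # dropped
})
print(attrs)
```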
Instrument boundaries, not every function: one span per inbound request, one per meaningful outbound call (HTTP client, gRPC, DB query over some threshold), plus optional spans for long internal phases you actually debug.
Context propagation (W3C)
Each hop passes two things to the next: the trace ID, which identifies the trace, and the parent span ID, so the next hop can attach its span as a child. The common wire format is the W3C trace context header:
traceparent: 00-<trace-id>-<parent-span-id>-<flags>
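As a rough illustration of the header layout, here is a minimal parser sketch. It handles only version 00 (a simplification; the specification also describes how to treat later versions), and the `TraceParent` type and `parse_traceparent` name are made up for this example.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    if version != "00":  # simplification: this sketch only accepts version 00
        return None
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid
    # The lowest flag bit is the "sampled" flag.
    return TraceParent(trace_id, parent_id, bool(int(flags, 16) & 0x01))


tp = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(tp)
```

Returning `None` for a malformed header lets the caller fall back to starting a new trace instead of failing the request.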
Middleware in each framework must: parse incoming traceparent (or start a new trace if absent), create a span for the request, and on every outbound call inject the header on that call's client. If you have async jobs, the trace context has to be carried in the job payload, not only thread-local state.
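The three middleware duties can be sketched without any framework. Everything here is illustrative: spans are plain dicts, header keys are assumed to be lowercased already, and `start_request_span` and `inject` are hypothetical helpers rather than OpenTelemetry API calls.

```python
import json
import secrets


def start_request_span(carrier: dict) -> dict:
    """Continue the caller's trace if traceparent is present, else start a new one."""
    parts = carrier.get("traceparent", "").split("-")
    if len(parts) == 4 and len(parts[1]) == 32 and len(parts[2]) == 16:
        trace_id, parent_id, flags = parts[1], parts[2], parts[3]
    else:
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"
    return {
        "trace_id": trace_id,
        "span_id": secrets.token_hex(8),  # new id for this hop's span
        "parent_id": parent_id,
        "flags": flags,
    }


def inject(span: dict, carrier: dict) -> dict:
    """Write the current span's identity into outbound headers or a job payload."""
    carrier["traceparent"] = f"00-{span['trace_id']}-{span['span_id']}-{span['flags']}"
    return carrier


# Inbound request carrying context from an upstream service.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
span = start_request_span(incoming)

# Outbound HTTP call: same trace id, and this span becomes the parent.
outbound_headers = inject(span, {"accept": "application/json"})

# Async job: the context rides in the payload, not in thread-local state.
job = json.dumps(inject(span, {"task": "send_receipt", "order_id": 42}))
worker_span = start_request_span(json.loads(job))
print(worker_span["trace_id"] == span["trace_id"])
```

Treating the job payload as just another carrier means the worker uses the same extract step as an HTTP handler.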
When a trace breaks partway through a call chain (say A → B → C), check each hop: did B forward traceparent to C? Nginx, Envoy, and language HTTP clients are common culprits.
End-to-end: app → OTLP → collector → backend
One working picture you can copy to a whiteboard. Adjust hostnames, but keep the single ingress for telemetry so the app stays dumb about vendors:
- Application: OpenTelemetry SDK / auto-instrumentation creates spans and exports over OTLP (gRPC or HTTP) to a known host/port (often localhost:4317 to a local agent).
- Collector (recommended): Receives OTLP, batches, retries, may tail-sample or redact attributes, then fans out. Running one per host or one per cluster is normal.
- Trace backend: For example Tempo or Jaeger over their native ingest or OTLP. The backend stores spans and answers trace-id queries; dashboards stay in Grafana.
Example shape (conceptual) — a single collector pipeline receiving from apps and writing traces:
# opentelemetry-collector config (excerpt), not exhaustive
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  batch: {}
  memory_limiter:
    # The processor needs an interval and a limit; these are example values.
    check_interval: 1s
    limit_percentage: 80
    spike_limit_percentage: 20
  # tail_sampling lives here in advanced setups
exporters:
  otlp:
    endpoint: tempo.internal:4317
    tls:
      insecure: false
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
Metrics and logs can run through the same collector with different pipelines, but trace IDs in logs (see Loki) are what bind the story together during an incident.
OTLP and the OpenTelemetry Collector
OTLP (OpenTelemetry Protocol) is the standard way the SDK talks to a backend, usually over gRPC or HTTP/protobuf to a collector. The collector is a single place to receive from many agents, batch and retry, sample, redact or drop attributes, fan out to Tempo, Jaeger, and commercial vendors, and export metrics in parallel.
A typical path: app SDK → (localhost) OTLP → collector → storage. The SDK should not need to know whether the backend is Grafana Tempo or Jaeger; the collector’s exporters do.
Head vs tail sampling and when to use what
- Head-based — decide in the SDK whether to keep the trace, before you finish the request. Simple, but you cannot know yet if a deep child will 500.
- Tail-based — the collector (or a dedicated processor) holds spans until the trace completes, then keeps, for example, "all error traces" + "1% of success traces". This is the pattern you want in large prod fleets if you can afford the buffering.
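A head-based decision can be sketched as a pure function of the trace ID, so every service reaches the same answer without coordination. This is an illustration in the spirit of ratio-plus-parent sampling, not the SDK's own sampler; the function name and parameters are hypothetical.

```python
from typing import Optional


def head_sample(trace_id: str, ratio: float, parent_sampled: Optional[bool] = None) -> bool:
    """Decide at the start of a span whether to record the trace.

    If the caller already decided (parent_sampled is not None), follow it so
    the tree stays complete. Otherwise keep a fixed fraction of traces.
    """
    if parent_sampled is not None:
        return parent_sampled
    # Treat the low 64 bits of the trace id as a uniform number in [0, 2**64).
    return int(trace_id[-16:], 16) < ratio * 2**64


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(head_sample(trace_id, ratio=0.10))                       # root decision at 10%
print(head_sample(trace_id, ratio=0.10, parent_sampled=True))  # child follows parent
```

The decision is made before any child runs, which is why a 500 deep in the tree cannot influence it.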
Sampling decisions in practice
| Situation | Reasonable default | Why |
|---|---|---|
| Local dev, CI, and most staging | Always on or very high rate | Cheap at low volume; you want the full tree when debugging without fighting odds. |
| Prod, modest traffic, simple ops | Head: parent-based + higher rate for errors | Low moving parts; accept that a rare 500 deep in the tree might be missed if the root looked healthy at decision time. |
| Prod, high QPS, cost- or PII-sensitive | Tail in the collector (keep errors + latency outliers + a small % of success) | You pay storage for meaningful traces, not a random cross-section of healthy traffic. Requires memory/depth limits you monitor. |
| Strict retention / compliance | Explicitly drop or hash attributes at the collector | Sampling does not remove legal obligations; pair policy with log and trace redaction. |
Rule of thumb: 100% in dev and staging; in prod, all errors and a low fraction of success (or tail sampling if the stack supports it and you sized the collector). Never ship "record everything" with no cap — that becomes both a bill and a cardinality problem.
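The "keep errors, keep outliers, sample the rest" policy could look roughly like the sketch below once a trace is complete. In a real deployment this logic lives in the collector rather than in application code; the span dict shape, the 1000 ms threshold, and the 1% ratio are assumptions for illustration.

```python
def keep_trace(spans, latency_threshold_ms=1000.0, success_ratio=0.01):
    """Tail decision over a completed trace (a list of span dicts)."""
    # 1. Always keep traces with at least one failed span.
    if any(s.get("error") for s in spans):
        return True
    # 2. Keep latency outliers, measured across the whole trace.
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration >= latency_threshold_ms:
        return True
    # 3. Keep a small deterministic fraction of healthy traces.
    trace_id = spans[0]["trace_id"]
    return int(trace_id[-16:], 16) < success_ratio * 2**64


T = "4bf92f3577b34da6a3ce929d0e0e4736"
error_trace = [
    {"trace_id": T, "start_ms": 0, "end_ms": 40, "error": False},
    {"trace_id": T, "start_ms": 5, "end_ms": 30, "error": True},  # failing leaf
]
slow_trace = [{"trace_id": T, "start_ms": 0, "end_ms": 2500, "error": False}]
healthy_trace = [{"trace_id": T, "start_ms": 0, "end_ms": 40, "error": False}]

print(keep_trace(error_trace), keep_trace(slow_trace), keep_trace(healthy_trace))
```

Unlike a head decision, the failing leaf in `error_trace` is visible here even though the root span looked healthy.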
Production foot-guns
These show up in real outages and postmortems. None are hypothetical.
- Unbounded high-cardinality labels: Raw URLs, user ids, or shard keys on every span as tags explode index cost in the trace store (same failure mode as bad metric labels; see observability overview). Prefer route templates, bounded enums, and move detail to logs with trace_id.
- Tail sampling without capacity planning: Buffers that hold almost-complete traces under spike load can OOM the collector or drop data silently. Set limits, watch memory, and alert on drop / refused reasons.
- PII in span attributes: Email, name, and free-text fields often violate policy and bloat storage. Scrub or hash at the collector; default SDK attributes are not automatically safe.
- Head-only sampling on deep graphs: If the root span always looks fine, you may never record a failing leaf (for example, a 500 in a service three hops away). Mitigate with error-aware samplers, tail sampling, or error metrics that link via exemplars to traces.
- Trace–log–SLO disconnect: If on-call has SLOs in Grafana but traces lack exemplars and logs lack trace_id, the team burns time jumping tools. Standardize the three strings: service.name, trace_id, and the error-budget view in one workspace.
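For the PII item above, a scrub step might look like this sketch. The key lists and the secret are placeholders, and in practice this would run in the collector rather than in each service. Note that hashing addresses privacy only; a hashed user id is still high-cardinality.

```python
import hashlib
import hmac

# Hypothetical policy: which attribute keys to drop and which to pseudonymize.
DROP_KEYS = {"user.email", "user.full_name", "http.request.body"}
HASH_KEYS = {"user.id"}
SECRET = b"rotate-me"  # placeholder; load from a secret store in real use


def scrub(attrs: dict, secret: bytes = SECRET) -> dict:
    """Drop free-text PII and replace identifiers with a keyed hash."""
    cleaned = {}
    for key, value in attrs.items():
        if key in DROP_KEYS:
            continue
        if key in HASH_KEYS:
            digest = hmac.new(secret, str(value).encode(), hashlib.sha256).hexdigest()
            cleaned[key] = digest[:16]  # shortened for readability
        else:
            cleaned[key] = value
    return cleaned


result = scrub({
    "http.route": "/api/users/{id}/orders",
    "user.id": "8841",
    "user.email": "ada@example.com",
})
print(result)
```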
Backends and querying
Tempo (Grafana), Jaeger, and Zipkin all store the same model: trace id → span list. The UX differs (Grafana Explore vs Jaeger UI). Pick the one that fits your Grafana/Prometheus world; the OpenTelemetry part is the same.
When searching, start from a known trace id (from a log line or a metric exemplar) rather than "find slow things in the last hour". Open-ended search only works with good service and operation names on your spans, which comes back to clean instrumentation at the edge.
Tying together metrics, logs, and traces
- Logs: every structured log line for a request should include trace_id (and often span_id). Loki LogQL can pivot from a trace you found in Grafana Explore to the exact log lines for the same request.
- Metrics: exemplars on histograms can carry a trace id so a latency spike in Prometheus or Grafana points at a real trace. Prometheus & node_exporter and Grafana cover the graph side; the exemplar is the bridge from SLI to span.
- SLOs and incidents: burn-rate alerts in SLOs & on-call should lead to a trace or to a runbook, not a dead end. In practice: SLI from metrics, drill to exemplar or span, confirm with logs.
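The log side of that correlation can be sketched with the standard library alone. The context variables below stand in for whatever your tracing library exposes as the current span; their names, the logger name, and the JSON field layout are assumptions.

```python
import contextvars
import io
import json
import logging

# Stand-ins for "the current span" as tracked by a tracing library.
current_trace_id = contextvars.ContextVar("current_trace_id", default=None)
current_span_id = contextvars.ContextVar("current_span_id", default=None)


class JsonTraceFormatter(logging.Formatter):
    """Emit one JSON object per line with trace_id and span_id attached."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": current_trace_id.get(),
            "span_id": current_span_id.get(),
        })


stream = io.StringIO()  # stand-in for stdout or a log shipper
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonTraceFormatter())
log = logging.getLogger("checkout")
log.addHandler(handler)
log.setLevel(logging.INFO)
log.propagate = False

# Request middleware would set these once per request.
current_trace_id.set("4bf92f3577b34da6a3ce929d0e0e4736")
current_span_id.set("00f067aa0ba902b7")
log.info("payment authorized")

line = json.loads(stream.getvalue())
print(line)
```

Using one field name (`trace_id`) everywhere is what makes the pivot from a trace to its log lines a single query.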
The full mental model of the three pillars and when not to add a signal lives in observability overview. This page assumes you are already committed to distributed tracing; use that page to sanity-check cost and team workflow.
Troubleshooting
| Symptom | Likely cause | What to check |
|---|---|---|
| Traces appear only in one service | Context not injected on one outbound call | Client library, manual HTTP, or a proxy stripping headers |
| Enormous span volume, storage climbing | Sampling off or 100% in prod | SDK sampler + collector policy; cap span attributes |
| “Unknown” parent in UI | Clock skew, late spans, or wrong parent id | NTP/chrony; ensure single trace id across services |
| Can't jump from log to trace | trace_id not in log schema | Structured logging + one field name everywhere |
| High cardinality in span attributes | Raw URLs or user ids on every span | Move to logs; use templates on http.route |
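When chasing the first and third rows of the table, it can help to list the spans whose parent never arrived. The sketch below works on exported span dicts with hypothetical field names and service names; a missing parent may also just be late or sampled out, so treat the result as a starting point.

```python
from collections import defaultdict


def find_orphans(spans):
    """Return spans whose parent_id is not present in the same trace."""
    ids_by_trace = defaultdict(set)
    for s in spans:
        ids_by_trace[s["trace_id"]].add(s["span_id"])
    return [
        s for s in spans
        if s["parent_id"] is not None
        and s["parent_id"] not in ids_by_trace[s["trace_id"]]
    ]


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_id": None, "service": "gateway"},
    {"trace_id": "t1", "span_id": "b", "parent_id": "a", "service": "cart"},
    # Parent "zz" was never exported: the hop in front of "billing" is suspect.
    {"trace_id": "t1", "span_id": "c", "parent_id": "zz", "service": "billing"},
]

orphans = find_orphans(spans)
print([s["service"] for s in orphans])
```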
Where to go next
- SLOs and On-Call — next in the observability track.
- Loki Logs — correlate logs with trace IDs.
- Grafana Basics — Explore, exemplars, trace-to-log.