Loki Logs

Loki architecture, which shipper to run, LogQL from stream selectors to metric queries, deriving SLI metrics out of access logs, and retention and compaction that don't surprise you on the invoice.

If you only remember six things
  • Loki indexes labels, not content. Keep the label set small; make the line itself searchable with |=.
  • Start with the single-binary deployment. Only split into microservices once you have a reason.
  • Stream labels are the cardinality unit. service, env, level — good. request_id as a label — death.
  • Pick one shipper and stop. Promtail still works but is deprecated; Grafana Alloy and Vector are fine for new installs. Running all three is not.
  • LogQL can compute metrics (rate, count_over_time). Use it for ad-hoc SLIs, not for alerting — native metrics are cheaper.
  • Retention + compaction only work if they're both enabled. Check loki_compactor_apply_retention_last_successful_run_timestamp_seconds.

Architecture

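Loki has three deployment modes:

  • Monolithic (single binary): every component runs in one process (-target=all).
  • Simple scalable: the components are grouped into separate read, write, and backend targets.
  • Microservices: each component is deployed and scaled on its own.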

The components exist in every mode — they're just collapsed. When you read Loki docs, the role names (distributor, ingester, querier, compactor) apply even when you're running one process.

What flows where

Shipper → distributor (label validation, sharding) → ingester (chunks in RAM, flushed to object storage) → object storage (S3/GCS/Azure Blob) + index (boltdb-shipper or TSDB). Queries hit the querier, which fans out to ingesters (fresh data) and object storage (old data), merging results.
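
To make the first hop concrete, here is a minimal sketch of the JSON body a shipper sends to the push endpoint (the /loki/api/v1/push path also appears in the Promtail client URL further down). The helper name build_push_payload and the sample labels and lines are made up for illustration; real shippers also batch, compress, and retry.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build the JSON body for POST /loki/api/v1/push with a single stream.

    labels: dict of stream labels (keep this set small and bounded).
    lines:  list of raw log lines; anything high-cardinality stays in here.
    """
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [timestamp in nanoseconds as a string, log line].
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"service": "checkout", "env": "prod", "level": "error"},
    ["payment declined request_id=abc123", "retrying request_id=abc123"],
    now_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)  # this string is what goes over the wire
```

The distributor validates the stream labels of each entry and uses them to pick an ingester; request_id stays inside the line, which is the split the rest of this page argues for.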

Storage layout — BoltDB-shipper vs TSDB

Two index formats exist. Use TSDB on a new install; BoltDB-shipper is the legacy option.

# loki.yaml — single binary, TSDB index, S3 for chunks
auth_enabled: false

common:
  path_prefix: /var/lib/loki
  replication_factor: 1
  storage:
    s3:
      endpoint: s3.eu-west-1.amazonaws.com
      bucketnames: logs-loki-prod
      region: eu-west-1
      s3forcepathstyle: false

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

limits_config:
  retention_period: 30d
  ingestion_rate_mb: 12
  ingestion_burst_size_mb: 24
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 15

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: s3

ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules

Setting retention_enabled: true on the compactor is required. The retention_period in limits_config only states the intent; unless the compactor actually performs the deletion, chunks sit in S3 forever.

Pick a shipper

Shipper | Best at | Worst at
Promtail | Journald + file tailing, Loki-native, tiny binary | Deprecated (long-term support began in February 2025); Grafana recommends migrating to Alloy
Grafana Agent / Alloy | One agent for metrics + logs + traces, Loki-native | Larger memory footprint; steeper config. Grafana Agent itself is deprecated in favor of Alloy
Vector | Transforms, multiple sinks, great observability of itself | Not Loki-specific; Loki is one of many outputs
Fluent Bit | Kubernetes-native, low footprint, widely deployed | Loki output historically lagged; check feature parity

Promtail example — journald + nginx access logs

server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: https://loki.internal/loki/api/v1/push
    basic_auth:
      username: fleet
      password_file: /etc/promtail/loki.pass

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal__hostname']
        target_label: host

  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: "{{ inventory_hostname }}"   # Ansible template variable, filled in at deploy time; Promtail does not expand it
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - json:
          expressions:
            method: method
            status: status
            path: request_uri
            duration: request_time
      - labels:
          method:
          status:
      - metrics:
          nginx_requests_total:
            type: Counter
            description: "HTTP requests from access log"
            source: status
            config:
              action: inc
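
The json stage assumes nginx writes its access log as JSON (a custom log_format). Note that method and status become labels, but path does not. Paths are effectively unbounded (IDs, query strings), so leave them in the log line and search with |= "/users/" instead.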
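
A rough way to see why path must not become a label: the number of streams is bounded above by the product of the distinct values of every label. The sketch below uses invented counts (20 hosts, 7 methods, 15 status codes, 5000 paths) purely for illustration.

```python
from math import prod


def estimated_streams(label_value_counts):
    """Upper bound on stream count: product of distinct values per label."""
    return prod(label_value_counts.values())


# Hypothetical fleet: the numbers are assumptions, not measurements.
bounded = {"job": 1, "host": 20, "method": 7, "status": 15}
with_path = {**bounded, "path": 5000}

print(estimated_streams(bounded))    # 2100
print(estimated_streams(with_path))  # 10500000
```

Not every combination occurs in practice, so the real number is lower, but the multiplication is what makes one unbounded label so expensive.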

LogQL basics

A LogQL query is stream selector + filter chain, optionally wrapped in a metric function.

Stream selector

{job="nginx", env="prod"}

That already selects log lines. Prometheus-style matchers: =, !=, =~, !~.

Line filters — fast path

{job="nginx"} |= "500"         # contains
{job="nginx"} != "GET"         # does not contain
{job="nginx"} |~ "5\\d\\d"     # regex match
{job="nginx"} !~ "/health"     # regex not-match

Line filters placed before any parser run against the raw line, so Loki can discard non-matching lines without parsing them. Put them first; they are by far the cheapest part of a query.
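
The effect of filtering before parsing can be modelled in a few lines. This is a toy model, not how Loki is implemented: it only counts how many lines reach the JSON parser in each ordering. The sample lines and function names are invented.

```python
import json

lines = [
    '{"method": "GET", "status": 200, "request_uri": "/health"}',
    '{"method": "POST", "status": 500, "request_uri": "/users/7"}',
    '{"method": "GET", "status": 200, "request_uri": "/users/7"}',
    '{"method": "GET", "status": 502, "request_uri": "/cart"}',
]


def parse_then_filter(lines):
    """Like `| json | status >= 500`: every line is parsed."""
    parsed = [json.loads(line) for line in lines]
    hits = [p for p in parsed if p["status"] >= 500]
    return hits, len(parsed)


def filter_then_parse(lines):
    """Like `|= "..." | json | status >= 500`: substring check first."""
    candidates = [line for line in lines if '"status": 5' in line]
    parsed = [json.loads(line) for line in candidates]
    hits = [p for p in parsed if p["status"] >= 500]
    return hits, len(parsed)


slow_hits, slow_parses = parse_then_filter(lines)
fast_hits, fast_parses = filter_then_parse(lines)
print(slow_parses, fast_parses)  # 4 2
```

Both orderings return the same two matching entries; the filtered version simply parses half as many lines, and on real streams the ratio is usually far larger.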

Parsers — extract structured fields

{job="app"} | json                              # parse as JSON
{job="app"} | logfmt                            # parse as logfmt
{job="app"} | regexp `user=(?P<user>\S+)`       # custom regex; backtick strings need no backslash escaping
{job="app"} | pattern `<_> <method> <path> <status>`
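
The regexp parser turns named capture groups into labels. Python's re module accepts the same (?P<name>...) syntax, so the behaviour can be sketched locally; extract_labels and the sample lines are illustrative, not part of any Loki client.

```python
import re

# Same expression as the regexp example above; a Python raw string, like a
# LogQL backtick string, takes the backslash literally.
USER_RE = re.compile(r"user=(?P<user>\S+)")


def extract_labels(line):
    """Return the named groups of the first match as a label dict."""
    match = USER_RE.search(line)
    return match.groupdict() if match else {}


print(extract_labels("level=info user=alice action=login"))  # {'user': 'alice'}
print(extract_labels("level=info action=healthcheck"))       # {}
```

A line that does not match yields no extracted labels, which is why a label filter on an extracted field silently drops such lines.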

Label filters after parsing

{job="app"} | json | status >= 500 | duration > 500ms

Line format — rewrite the output

{job="nginx"} | json | line_format `{{.method}} {{.request_uri}} -> {{.status}}`

Metrics from logs

Wrap any stream in a range aggregation to get a metric.

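# 5xx requests per second, per route
# (request_uri and request_time are the JSON keys of the access log shown earlier)
sum by (request_uri) (
  rate({job="nginx"} | json | status =~ "5.." [5m])
)

# p99 latency from access logs (nginx request_time is in seconds)
quantile_over_time(0.99,
  {job="nginx"} | json | unwrap request_time [5m]
) by (request_uri)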

# Lines per minute per service
sum by (service) (count_over_time({env="prod"}[1m]))

These are on-the-fly computations. They work, but they're slow and re-run on every dashboard refresh. For alerting, emit a real Prometheus counter from the app, or use a Promtail metrics stage to count once at ingest time. LogQL metrics are for exploration, not for the pager.
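
What count_over_time and rate compute can be sketched over a plain list of entry timestamps. This is a simplified model: it assumes a window that is open on the left and closed on the right, and it ignores steps, sharding, and label grouping. Timestamps are in seconds and invented.

```python
def count_over_time(timestamps, eval_time, range_s):
    """Count entries with eval_time - range_s < ts <= eval_time."""
    return sum(1 for ts in timestamps if eval_time - range_s < ts <= eval_time)


def rate(timestamps, eval_time, range_s):
    """Entries per second over the window: count divided by window length."""
    return count_over_time(timestamps, eval_time, range_s) / range_s


entries = [10, 70, 130, 250, 290, 299]  # arrival times of matching log lines

print(count_over_time(entries, 300, 60))  # 3
print(rate(entries, 300, 300))            # 0.02
```

Every evaluation has to re-scan the lines in the window, which is the cost the paragraph above warns about.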

Recording rules

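Loki's ruler can evaluate LogQL metric queries as recording rules and push the resulting samples to Prometheus via remote_write. Remote write has to be enabled in the ruler block; the config shown earlier only sets up rule storage.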

# /etc/loki/rules/fake/nginx.yml  (with auth_enabled: false the tenant ID is "fake")
groups:
  - name: nginx
    interval: 1m
    rules:
      - record: nginx:5xx:rate5m
        expr: |
          sum by (host) (
            rate({job="nginx"} | json | status =~ "5.." [5m])
          )

Retention and compaction

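Three knobs: retention_enabled on the compactor, the global retention_period, and per-stream overrides in retention_stream. The last two live in limits_config: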

limits_config:
  retention_period: 30d
  retention_stream:
    - selector: '{category="audit"}'
      priority: 1
      period: 365d
    - selector: '{env="dev"}'
      priority: 2
      period: 7d
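
How the per-stream rules combine is easy to get wrong, so here is a small sketch of the selection logic as I understand it: among the rules whose selector matches a stream, the one with the highest priority value wins, and the global retention_period applies when none match. The sketch only handles equality matchers and is not Loki's implementation.

```python
from datetime import timedelta

GLOBAL_RETENTION = timedelta(days=30)

# Mirrors the retention_stream entries above; selectors reduced to plain dicts.
RULES = [
    {"selector": {"category": "audit"}, "priority": 1, "period": timedelta(days=365)},
    {"selector": {"env": "dev"}, "priority": 2, "period": timedelta(days=7)},
]


def effective_retention(labels):
    """Pick the matching rule with the highest priority, else the global period."""
    matching = [
        rule for rule in RULES
        if all(labels.get(k) == v for k, v in rule["selector"].items())
    ]
    if not matching:
        return GLOBAL_RETENTION
    return max(matching, key=lambda rule: rule["priority"])["period"]


print(effective_retention({"env": "prod", "service": "api"}).days)    # 30
print(effective_retention({"env": "prod", "category": "audit"}).days) # 365
print(effective_retention({"env": "dev", "category": "audit"}).days)  # 7
```

Note the last case: with these priorities, audit logs from dev are kept for 7 days, not 365. If that is not what you want, give the audit rule the higher priority.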

Monitor:

time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds > 24*3600

Storage still growing despite retention being set? The compactor probably isn't running, or isn't applying retention. Check its logs; it's usually an object-storage permissions problem.

Useful query patterns

Everything for one request, by trace id

{env="prod"} |= "trace_id=8f3a2b1c"

Top 10 error messages in the last hour

topk(10,
  sum by (msg) (
    count_over_time({env="prod", level="error"} | json [1h])
  )
)

Requests from one IP that got a 401

{job="nginx"} |= "203.0.113.42" | json | remote_addr="203.0.113.42" | status="401"

Bytes shipped per service per hour

sum by (service) (
  bytes_over_time({env="prod"}[1h])
)
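
Any of these patterns can also be run outside Grafana through Loki's HTTP API. The sketch below only builds the request URL for the query_range endpoint; the base URL is borrowed from the Promtail client config earlier on this page, the timestamps are arbitrary, and authentication and the actual HTTP call are left out.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def query_range_url(base_url, logql, start_ns, end_ns, limit=100):
    """Build a GET URL for /loki/api/v1/query_range (timestamps in nanoseconds)."""
    params = {
        "query": logql,
        "start": start_ns,
        "end": end_ns,
        "limit": limit,
        "direction": "backward",  # newest entries first
    }
    return f"{base_url}/loki/api/v1/query_range?{urlencode(params)}"


url = query_range_url(
    "https://loki.internal",
    '{env="prod"} |= "trace_id=8f3a2b1c"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_003_600_000_000_000,
)

# Round-trip check: the encoded query decodes back to the original LogQL.
decoded = parse_qs(urlparse(url).query)
print(decoded["query"][0])
```

urlencode takes care of the braces, quotes, and pipes, which is the part that usually goes wrong when such URLs are assembled by hand.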

Tail live, filtered

In Grafana Explore, enter the query and click "Live". Loki's tail endpoint then streams matching lines over a websocket. Useful for watching a deploy.

Troubleshooting

Symptom | Likely cause | Fix
"too many outstanding requests" | Query fans out into more sub-queries than the per-tenant queue on the query frontend/scheduler can hold | Shorten time range; add a line filter; raise max_outstanding_per_tenant
"parse error: too many label-value pairs" | Cardinality bomb; some label has > max distinct values | logcli labels; find the culprit; rewrite shipper pipeline to drop the label or move it into the line
Queries fast for recent, slow for old | Ingesters have fresh data in memory; old data reads S3 + index | Normal; use query-frontend with splitting + caching; add index-gateway at scale
Logs visible in Loki UI but not in Grafana | Data source uid mismatch, or CORS/proxy misconfigured | Always access Loki via Grafana proxy (access: proxy), never browser-direct
Disk filling despite retention 30d | Compactor not enabled, or not running, or missing S3 delete permission | Check loki_compactor_* metrics; S3 policy needs s3:DeleteObject
Intermittent "entry out of order" | Clock skew on a shipper host, or concurrent writers to the same stream | Sync clocks; check for two Promtails with the same labels on one host

See also rsyslog for the host-side log pipeline, and rsyslog forwarding if you prefer syslog → Loki over Promtail. Next in this track: OpenTelemetry Traces, then SLOs and On-Call.