Loki Logs

Loki architecture, which shipper to run, LogQL from stream selectors to metric queries, deriving SLI metrics out of access logs, and retention and compaction that don't surprise you on the invoice.

If you only remember six things

Loki indexes labels, not content. Keep the label set small; make the line itself searchable with |=.
Start with the single-binary deployment. Only split into microservices once you have a reason.
Stream labels are the cardinality unit. service, env, level — good. request_id as a label — death.
Pick one shipper and stop. Promtail works but is deprecated; Grafana Alloy and Vector are both fine. Running all three is not.
LogQL can compute metrics (rate, count_over_time). Use it for ad-hoc SLIs, not for alerting — native metrics are cheaper.
Retention + compaction only work if they're both enabled. Check loki_compactor_apply_retention_last_successful_run_timestamp_seconds.

On this page

Architecture
Storage layout — BoltDB-shipper vs TSDB
Pick a shipper
LogQL basics
Metrics from logs
Retention and compaction
Useful query patterns
Troubleshooting

Architecture

Loki has three deployment modes:

Monolithic / single binary: everything in one process. Grafana's sizing guidance puts this at roughly 20 GB/day. Start here.
Simple scalable: read, write, and backend split into three deployments. Good sweet spot up to a few TB/day without microservices pain.
Microservices: distributor, ingester, querier, query-frontend, compactor, ruler, index-gateway. Required at scale; expensive to run.

The components exist in every mode — they're just collapsed. When you read Loki docs, the role names (distributor, ingester, querier, compactor) apply even when you're running one process.

What flows where

Shipper → distributor (label validation, sharding) → ingester (chunks in RAM, flushed to object storage) → object storage (S3/GCS/Azure Blob) + index (boltdb-shipper or TSDB). Queries hit the querier, which fans out to ingesters (fresh data) and object storage (old data), merging results.
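To make the first hop concrete, here is a minimal sketch of the JSON body a shipper sends to the distributor's push endpoint (/loki/api/v1/push). The helper name build_push_payload and the sample labels are illustrative, not part of any Loki client; nothing is sent over the network.

```python
import json
import time


def build_push_payload(labels, lines, now_ns=None):
    """Build the JSON body for POST /loki/api/v1/push for a single stream."""
    if now_ns is None:
        now_ns = time.time_ns()
    # Each value is [timestamp in nanoseconds as a string, log line].
    # Timestamps are bumped by 1 ns per line to keep the batch ordered.
    values = [[str(now_ns + i), line] for i, line in enumerate(lines)]
    return {"streams": [{"stream": dict(labels), "values": values}]}


payload = build_push_payload(
    {"service": "checkout", "env": "prod", "level": "error"},
    ["payment declined request_id=abc123", "retrying request_id=abc123"],
    now_ns=1_700_000_000_000_000_000,
)
body = json.dumps(payload)  # what a shipper would POST with Content-Type: application/json
```

Note that request_id lives in the line, not in the stream labels: the label set is what the distributor hashes to pick an ingester, so it has to stay small.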

Storage layout — BoltDB-shipper vs TSDB

Two index formats exist. Use TSDB on a new install; BoltDB-shipper is the legacy option.

# loki.yaml — single binary, TSDB index, S3 for chunks
auth_enabled: false

common:
  path_prefix: /var/lib/loki
  replication_factor: 1
  ring:
    kvstore:
      store: inmemory   # single process: keep the ring in memory, no Consul/etcd
  storage:
    s3:
      endpoint: s3.eu-west-1.amazonaws.com
      bucketnames: logs-loki-prod
      region: eu-west-1
      s3forcepathstyle: false

schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h

limits_config:
  retention_period: 30d
  ingestion_rate_mb: 12
  ingestion_burst_size_mb: 24
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 15

compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: s3

ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules

retention_enabled: true on the compactor is required. The retention_period in limits_config is only the intent; without the compactor actually doing the deletion, chunks sit in S3 forever.

Pick a shipper

Shipper	Best at	Worst at
Promtail	Journald + file tailing, Loki-native, tiny binary	Deprecated (long-term support only since February 2025); Grafana recommends migrating to Alloy
Grafana Alloy (successor to Grafana Agent)	One agent for metrics + logs + traces, Loki-native	Larger memory footprint; steeper config
Vector	Transforms, multiple sinks, great observability of itself	Not Loki-specific — one of many outputs
Fluent Bit	Kubernetes-native, low footprint, widely deployed	Loki output historically lagged; check feature parity

Promtail example — journald + nginx access logs

server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: https://loki.internal/loki/api/v1/push
    basic_auth:
      username: fleet
      password_file: /etc/promtail/loki.pass

scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal__hostname']
        target_label: host

  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: "{{ inventory_hostname }}"   # Ansible template variable, rendered at deploy time
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - json:
          expressions:
            method: method
            status: status
            path: request_uri
            duration: request_time
      - labels:
          method:
          status:
      - metrics:
          nginx_requests_total:
            type: Counter
            description: "HTTP requests from access log"
            source: status
            config:
              action: inc

Note that method and status become labels, but path does not. Paths are unbounded per user — leave them in the log line and search with |= "/users/" instead.
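A rough way to see why path stays out of the label set: the worst-case number of streams is the product of the distinct values of every label. The sketch below is back-of-the-envelope arithmetic with made-up host and path counts, not a measurement from a real cluster.

```python
from math import prod


def estimate_streams(label_values):
    """Upper bound on stream count: product of distinct values per label."""
    return prod(len(values) for values in label_values.values())


labels = {
    "job": ["nginx"],
    "host": [f"web-{i}" for i in range(20)],
    "method": ["GET", "POST", "PUT", "DELETE"],
    "status": ["200", "301", "404", "500", "502"],
}
bounded = estimate_streams(labels)  # 1 * 20 * 4 * 5

# Same label set, plus a path label with one value per user.
labels_with_path = dict(labels, path=[f"/users/{i}" for i in range(10_000)])
unbounded = estimate_streams(labels_with_path)
```

With these assumed numbers the label set goes from 400 possible streams to 4,000,000 once path is added, and every stream carries its own chunks and index entries.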

LogQL basics

A LogQL query is stream selector + filter chain, optionally wrapped in a metric function.

Stream selector

{job="nginx", env="prod"}

That already selects log lines. Prometheus-style matchers: =, !=, =~, !~.

Line filters — fast path

{job="nginx"} |= "500"         # contains
{job="nginx"} != "GET"         # does not contain
{job="nginx"} |~ "5\\d\\d"     # regex match
{job="nginx"} !~ "/health"     # regex not-match

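Pipeline stages run left to right, so put line filters before any parser. A line filter is a plain substring or regex test on the raw line and is by far the cheapest part of a query; every line it drops is a line the parser never has to touch.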
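The semantics of the four operators can be sketched in a few lines of Python. This is an approximation for intuition only: Loki uses Go's RE2 regex engine, which differs from Python's re module in some features, and the function name apply_line_filters is made up for this example.

```python
import re


def apply_line_filters(lines, filters):
    """Apply (operator, argument) filters in order, like a LogQL filter chain."""
    ops = {
        "|=": lambda line, arg: arg in line,                        # contains
        "!=": lambda line, arg: arg not in line,                    # does not contain
        "|~": lambda line, arg: re.search(arg, line) is not None,   # regex match
        "!~": lambda line, arg: re.search(arg, line) is None,       # regex not-match
    }
    kept = list(lines)
    for op, arg in filters:
        # Each filter only sees the lines that survived the previous one.
        kept = [line for line in kept if ops[op](line, arg)]
    return kept


lines = [
    "GET /health 200",
    "GET /users/7 500",
    "POST /orders 502",
    "GET /users/9 200",
]
result = apply_line_filters(lines, [("|~", r"\s5\d\d$"), ("!=", "POST")])
```

Here result holds only "GET /users/7 500": the regex keeps the two 5xx lines, then the negative filter drops the POST.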

Parsers — extract structured fields

{job="app"} | json                              # parse as JSON
{job="app"} | logfmt                            # parse as logfmt
{job="app"} | regexp `user=(?P<user>\S+)`       # custom regex (backticks take backslashes literally)
{job="app"} | pattern `<_> <method> <path> <status>`
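What a parser stage does to one line can be sketched in Python. This is a simplification, not Loki's implementation: parse_logfmt leans on shlex for quote handling (so a stray apostrophe in the line would raise), and both function names are invented for the example. The named-group syntax (?P<user>...) happens to be the same in Python and in Go's RE2.

```python
import re
import shlex


def parse_logfmt(line):
    """Split key=value pairs, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # skip bare words with no '='
            fields[key] = value
    return fields


def parse_regexp(line, pattern):
    """Return the named capture groups of the first match, or {} if none."""
    match = re.search(pattern, line)
    return match.groupdict() if match else {}


line = 'level=error msg="upstream timed out" user=alice duration=1.2s'
logfmt_fields = parse_logfmt(line)
regexp_fields = parse_regexp(line, r"user=(?P<user>\S+)")
```

logfmt_fields contains every key on the line, while regexp_fields contains only user. In LogQL, each extracted field becomes a label that exists for the duration of the query, without touching the stored stream labels.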

Label filters after parsing

{job="app"} | json | status >= 500 | duration > 500ms

Line format — rewrite the output

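{job="nginx"} | json | line_format `{{.method}} {{.request_uri}} -> {{.status}}`

The field is request_uri here because | json exposes the keys exactly as they appear in the stored line; the path alias exists only inside the Promtail pipeline.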

Metrics from logs

Wrap any stream in a range aggregation to get a metric.

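# 5xx rate per route
# (| json exposes the raw keys request_uri and request_time, not the Promtail aliases)
sum by (request_uri) (
  rate({job="nginx"} | json | status =~ "5.." [5m])
)

# p99 latency in seconds from access logs
quantile_over_time(0.99,
  {job="nginx"} | json | unwrap request_time [5m]
) by (request_uri)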

# Lines per minute per service
sum by (service) (count_over_time({env="prod"}[1m]))

These are on-the-fly computations. They work, but they're slow and re-run on every dashboard refresh. For alerting, emit a real Prometheus counter from the app, or use a Promtail metric stage to write it once. LogQL metrics are for exploration, not for the pager.
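What count_over_time and rate compute at a single evaluation point can be sketched as follows. The exact window boundary handling (open on the left, closed on the right) is an assumption made for the example; the point is that rate is the entry count divided by the window length in seconds.

```python
def count_over_time(timestamps, eval_time, window_s):
    """Number of entries with eval_time - window_s < t <= eval_time."""
    return sum(1 for t in timestamps if eval_time - window_s < t <= eval_time)


def rate(timestamps, eval_time, window_s):
    """Entries per second over the window: count_over_time / window length."""
    return count_over_time(timestamps, eval_time, window_s) / window_s


# Arrival times (seconds) of matching log lines in one stream.
timestamps = [0, 30, 100, 250, 290, 300]

count_5m = count_over_time(timestamps, eval_time=300, window_s=300)
rate_5m = rate(timestamps, eval_time=300, window_s=300)
```

Loki has to fetch and scan every matching line in the window to produce each point, which is why the same query on a dashboard that refreshes every few seconds gets expensive.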

Recording rules

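Loki's ruler can evaluate LogQL metric queries as Prometheus-style recording rules and push the resulting samples to Prometheus via remote_write. That needs a ruler.remote_write section in loki.yaml; the ruler block shown earlier only sets rule storage. With auth_enabled: false the tenant ID is fake, so the rules live under that directory:

# /etc/loki/rules/fake/nginx.yml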
groups:
  - name: nginx
    interval: 1m
    rules:
      - record: nginx:5xx:rate5m
        expr: |
          sum by (host) (
            rate({job="nginx"} | json | status =~ "5.." [5m])
          )

Retention and compaction

Three knobs:

limits_config.retention_period — global default.
limits_config.retention_stream — per-stream overrides (e.g. keep audit logs 1y, everything else 30d).
compactor.retention_enabled: true — enables deletion.

limits_config:
  retention_period: 30d
  retention_stream:
    - selector: '{category="audit"}'
      priority: 1
      period: 365d
    - selector: '{env="dev"}'
      priority: 2
      period: 7d
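How the per-stream rules interact is easy to get wrong, so here is a small sketch of the selection logic. It assumes that among matching rules the highest priority value wins and that unmatched streams fall back to retention_period, which is how the Loki retention documentation describes it. Only exact-match selectors are modelled, and the function name effective_retention is invented for the example.

```python
def effective_retention(stream_labels, global_period, rules):
    """Pick the retention period for one stream.

    rules: list of dicts with 'match' (label -> exact value), 'priority', 'period'.
    """
    matching = [
        rule for rule in rules
        if all(stream_labels.get(k) == v for k, v in rule["match"].items())
    ]
    if not matching:
        return global_period  # no per-stream rule applies
    # Assumption: the matching rule with the highest priority value wins.
    return max(matching, key=lambda rule: rule["priority"])["period"]


RULES = [
    {"match": {"category": "audit"}, "priority": 1, "period": "365d"},
    {"match": {"env": "dev"}, "priority": 2, "period": "7d"},
]

audit_prod = effective_retention({"category": "audit", "env": "prod"}, "30d", RULES)
audit_dev = effective_retention({"category": "audit", "env": "dev"}, "30d", RULES)
app_prod = effective_retention({"service": "api", "env": "prod"}, "30d", RULES)
```

Under that assumption a stream that is both audit and dev gets 7d, not 365d, because the dev rule has the higher priority. If audit logs must always be kept for a year, give the audit rule the larger priority value.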

Monitor:

time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds > 24*3600

Bucket (or disk, with filesystem storage) still growing despite retention being set? The compactor isn't running or can't delete. Check its logs; it's usually an object-storage permissions problem.

Useful query patterns

Everything for one request, by trace id

{env="prod"} |= "trace_id=8f3a2b1c"

Top 10 error messages in the last hour

topk(10,
  sum by (msg) (
    count_over_time({env="prod", level="error"} | json msg="msg" [1h])
  )
)

Requests from one IP that got a 401

{job="nginx"} | json | remote_addr="203.0.113.42" | status="401"

Bytes shipped per service per hour

sum by (service) (
  bytes_over_time({env="prod"}[1h])
)

Tail live, filtered

In Grafana Explore, set the query and click "Live". Loki's tail endpoint (/loki/api/v1/tail) streams matching lines over a websocket. Useful for watching a deploy.
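The patterns above can also be run outside Grafana through Loki's HTTP API. The sketch below only builds the URL for /loki/api/v1/query_range and sends nothing. The base URL reuses the loki.internal host from the Promtail example, the timestamps are arbitrary, and query_range_url is a made-up helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def query_range_url(base_url, query, start_ns, end_ns, limit=100):
    """Build a GET URL for Loki's query_range endpoint."""
    params = {
        "query": query,      # the LogQL expression, URL-encoded below
        "start": start_ns,   # Unix epoch in nanoseconds
        "end": end_ns,
        "limit": limit,      # maximum number of log lines returned
    }
    return f"{base_url.rstrip('/')}/loki/api/v1/query_range?{urlencode(params)}"


url = query_range_url(
    "https://loki.internal/",
    '{env="prod"} |= "trace_id=8f3a2b1c"',
    start_ns=1_700_000_000_000_000_000,
    end_ns=1_700_003_600_000_000_000,
)
decoded = parse_qs(urlsplit(url).query)  # round-trip check of the encoding
```

Braces, quotes, and pipes in LogQL all need encoding, so building the query string by hand is where most copy-paste curl attempts go wrong.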

Troubleshooting

Symptom	Likely cause	Fix
"too many outstanding requests"	Query splits into more sub-queries than the query-frontend queue accepts per tenant	Shorten time range; add a line filter; raise `max_outstanding_per_tenant`
"parse error: too many label-value pairs"	Cardinality bomb; some label has > max distinct values	`logcli labels`; find the culprit; rewrite shipper pipeline to drop or move to line
Queries fast for recent, slow for old	Ingesters have fresh data in memory; old data reads S3 + index	Normal — use query-frontend with splitting + caching; add index-gateway at scale
Logs visible in Loki UI but not in Grafana	Data source uid mismatch, or CORS/proxy misconfigured	Always access Loki via Grafana proxy (`access: proxy`), never browser-direct
Disk filling despite retention 30d	Compactor not enabled, or not running, or missing S3 delete permission	`loki_compactor_*` metrics; S3 policy needs `s3:DeleteObject`
Intermittent "entry out of order"	Clock skew on a shipper host, or concurrent writers to the same stream	Sync clocks; check for two Promtails with the same labels on one host
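For the out-of-order row, a quick offline check on captured entries can confirm whether two writers are interleaving on one stream. This is a sketch under assumptions: entries are (labels, timestamp) pairs in arrival order, and tolerance_s is a hypothetical parameter standing in for whatever out-of-order window your Loki version accepts.

```python
def find_out_of_order(entries, tolerance_s=0.0):
    """Return entries older than the newest timestamp already seen for their stream.

    entries: iterable of (stream_labels: dict, timestamp_s: float) in arrival order.
    """
    newest = {}
    offenders = []
    for labels, ts in entries:
        key = tuple(sorted(labels.items()))  # a stream is its full label set
        if key in newest and ts < newest[key] - tolerance_s:
            offenders.append((labels, ts))
        newest[key] = max(ts, newest.get(key, ts))
    return offenders


web1 = {"job": "nginx", "host": "web-1"}
web2 = {"job": "nginx", "host": "web-2"}
entries = [
    (web1, 100.0),
    (web1, 105.0),
    (web2, 90.0),   # different stream, so an older timestamp is fine
    (web1, 101.0),  # second writer on web-1 lagging behind the first
    (web1, 106.0),
]
offenders = find_out_of_order(entries)
```

Only the 101.0 entry on web-1 is flagged. If the offenders cluster on one host, look for a duplicate shipper there; if they are spread across hosts, suspect clocks.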

See also rsyslog for the host-side log pipeline, and rsyslog forwarding if you prefer syslog → Loki over Promtail. Next in this track: OpenTelemetry Traces, then SLOs and On-Call.