Loki Logs
- Loki indexes labels, not content. Keep the label set small; make the line itself searchable with |=.
- Start with the single-binary deployment. Only split into microservices once you have a reason.
- Stream labels are the cardinality unit. service, env, level: good. request_id as a label: death.
- Pick one shipper and stop. Promtail is fine; Grafana Agent and Vector are also fine. Running all three is not.
- LogQL can compute metrics (rate, count_over_time). Use it for ad-hoc SLIs, not for alerting; native metrics are cheaper.
- Retention + compaction only work if they're both enabled. Check loki_compactor_apply_retention_last_successful_run_timestamp_seconds.
Architecture
Loki has three deployment modes:
- Monolithic / single binary: everything in one process. Grafana's deployment guidance puts this at roughly 20 GB/day; how far beyond that a single box goes depends on hardware and query load. Start here.
- Simple scalable: read, write, and backend split into three deployments. Good sweet spot up to a few TB/day without microservices pain.
- Microservices: distributor, ingester, querier, query-frontend, compactor, ruler, index-gateway. Required at scale; expensive to run.
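The three modes run the same binary with the same config file; what changes is the -target each process is started with. As a rough sketch (a Docker Compose layout is assumed here purely for illustration, the image tag is an assumption, and volumes, networking and ring/memberlist settings are left out), the simple scalable mode looks like this:

```yaml
# docker-compose.yaml (illustrative sketch, not a complete deployment)
services:
  loki-read:
    image: grafana/loki:3.0.0   # assumed tag; pin whatever version you actually run
    command: -config.file=/etc/loki/loki.yaml -target=read
  loki-write:
    image: grafana/loki:3.0.0
    command: -config.file=/etc/loki/loki.yaml -target=write
  loki-backend:
    image: grafana/loki:3.0.0
    command: -config.file=/etc/loki/loki.yaml -target=backend
```

The monolithic mode is the same thing with a single process and -target=all; microservices mode uses one target per component (distributor, ingester, querier, and so on).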
What flows where
Shipper → distributor (label validation, sharding) → ingester (chunks in RAM, flushed to object storage) → object storage (S3/GCS/Azure Blob) + index (boltdb-shipper or TSDB). Queries hit the querier, which fans out to ingesters (fresh data) and object storage (old data), merging results.
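The "chunks in RAM, flushed to object storage" step is controlled by a handful of ingester settings. The fragment below is a sketch of the relevant keys; the values are illustrative and should be checked against the defaults of your Loki version before being changed.

```yaml
# Fragment of loki.yaml: when an in-memory chunk is cut and flushed
ingester:
  chunk_idle_period: 30m      # flush a chunk that has received no new lines for this long
  max_chunk_age: 2h           # flush regardless of activity once a chunk is this old
  chunk_target_size: 1572864  # target compressed chunk size in bytes (~1.5 MB)
  chunk_encoding: snappy      # compression used for chunk data
```

Anything still held in the ingester is "fresh data" from the querier's point of view; once flushed, it is read from object storage.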
Storage layout — BoltDB-shipper vs TSDB
Two index formats exist. Use TSDB on a new install; BoltDB-shipper is the legacy option.
# loki.yaml: single binary, TSDB index, S3 for chunks
auth_enabled: false
common:
  path_prefix: /var/lib/loki
  replication_factor: 1
  storage:
    s3:
      endpoint: s3.eu-west-1.amazonaws.com
      bucketnames: logs-loki-prod
      region: eu-west-1
      s3forcepathstyle: false
schema_config:
  configs:
    - from: 2024-01-01
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_index_
        period: 24h
limits_config:
  retention_period: 30d
  ingestion_rate_mb: 12
  ingestion_burst_size_mb: 24
  max_label_name_length: 1024
  max_label_value_length: 2048
  max_label_names_per_series: 15
compactor:
  working_directory: /var/lib/loki/compactor
  retention_enabled: true
  delete_request_store: s3
ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules
retention_enabled: true on the compactor is required. The retention_period in limits_config is only the intent; without the compactor actually doing the deletion, chunks sit in S3 forever.
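Back on the index format: an existing BoltDB-shipper install is normally moved to TSDB by appending a second schema period that starts in the future, not by editing the old one. The sketch below assumes hypothetical dates, a v12 legacy schema and hypothetical index prefixes; only the pattern (old entry untouched, new entry with a later from date) is the point.

```yaml
# Fragment of loki.yaml: switching index format at a period boundary
schema_config:
  configs:
    - from: 2023-01-01          # hypothetical start of the legacy period
      store: boltdb-shipper
      object_store: s3
      schema: v12               # assumed legacy schema version
      index:
        prefix: loki_index_
        period: 24h
    - from: 2024-06-01          # hypothetical; must be a date that has not started yet
      store: tsdb
      object_store: s3
      schema: v13
      index:
        prefix: loki_tsdb_index_   # hypothetical prefix, distinct from the legacy one
        period: 24h
```

Data written before the cutover stays readable through the first entry; new data uses the TSDB entry from the cutover date onward.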
Pick a shipper
| Shipper | Best at | Worst at |
|---|---|---|
| Promtail | Journald + file tailing, Loki-native, tiny binary | Deprecated (deprecation announced in early 2025); Grafana recommends migrating to Alloy |
| Grafana Agent / Alloy | One agent for metrics + logs + traces, Loki-native | Larger memory footprint; steeper config |
| Vector | Transforms, multiple sinks, great observability of itself | Not Loki-specific — one of many outputs |
| Fluent Bit | Kubernetes-native, low footprint, widely deployed | Loki output historically lagged; check feature parity |
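For comparison with the Promtail example that follows, here is a minimal sketch of the same nginx access log shipped with Vector. The source and sink names are arbitrary, the endpoint is assumed to be the same Loki host, and authentication is left out; treat it as a starting point to check against the Vector documentation for your version.

```yaml
# vector.yaml (sketch): tail the nginx access log and ship it to Loki
sources:
  nginx_access:                # arbitrary source name
    type: file
    include:
      - /var/log/nginx/access.log
sinks:
  loki:                        # arbitrary sink name
    type: loki
    inputs:
      - nginx_access
    endpoint: https://loki.internal   # base URL; the sink appends the push path
    encoding:
      codec: text              # send the raw log line unchanged
    labels:
      job: nginx
      host: "{{ host }}"       # Vector template: the host field set by the file source
```

As with any shipper, the labels block is where cardinality is decided, so keep it as small as the Promtail equivalent.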
Promtail example — journald + nginx access logs
server:
  http_listen_port: 9080
positions:
  filename: /var/lib/promtail/positions.yaml
clients:
  - url: https://loki.internal/loki/api/v1/push
    basic_auth:
      username: fleet
      password_file: /etc/promtail/loki.pass
scrape_configs:
  - job_name: journal
    journal:
      max_age: 12h
      labels:
        job: systemd-journal
    relabel_configs:
      - source_labels: ['__journal__systemd_unit']
        target_label: unit
      - source_labels: ['__journal__hostname']
        target_label: host
  - job_name: nginx
    static_configs:
      - targets: [localhost]
        labels:
          job: nginx
          host: "{{ inventory_hostname }}"
          __path__: /var/log/nginx/access.log
    pipeline_stages:
      - json:
          expressions:
            method: method
            status: status
            path: request_uri
            duration: request_time
      - labels:
          method:
          status:
      - metrics:
          nginx_requests_total:
            type: Counter
            description: "HTTP requests from access log"
            source: status
            config:
              action: inc
method and status become labels, but path does not. Paths are effectively unbounded (IDs and query strings make almost every URL distinct), so leave them in the log line and search with |= "/users/" instead.
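If a high-cardinality label is already being attached somewhere upstream (by relabeling or an earlier stage), Promtail's labeldrop stage removes it before the entry is sent. A small sketch, with request_id and path as hypothetical offenders:

```yaml
# Fragment of a Promtail scrape config: strip labels that should live in the line instead
pipeline_stages:
  - labeldrop:
      - request_id   # hypothetical label; unique per request
      - path         # unbounded, as discussed above
```

The values are not lost: they are still in the log line and remain searchable with a line filter.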
LogQL basics
A LogQL query is a stream selector followed by an optional chain of filters and parsers, optionally wrapped in a metric function.
Stream selector
{job="nginx", env="prod"}
That already selects log lines. Prometheus-style matchers: =, !=, =~, !~.
Line filters — fast path
{job="nginx"} |= "500" # contains
{job="nginx"} != "GET" # does not contain
{job="nginx"} |~ "5\\d\\d" # regex match
{job="nginx"} !~ "/health" # regex not-match
Line filters run against the raw line, so put them before any parser. They are by far the cheapest part of a query.
Parsers — extract structured fields
{job="app"} | json # parse as JSON
{job="app"} | logfmt # parse as logfmt
{job="app"} | regexp `user=(?P<user>\S+)` # custom regex (backtick strings are raw, so no double backslash)
{job="app"} | pattern `<_> <method> <path> <status>`
Label filters after parsing
{job="app"} | json | status >= 500 | duration > 500ms
Line format — rewrite the output
{job="nginx"} | json | line_format `{{.method}} {{.request_uri}} -> {{.status}}`
Metrics from logs
Wrap any stream in a range aggregation to get a metric.
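# Error rate per route
# (the JSON keys in the log line are request_uri and request_time, as in the Promtail example, so rename them at parse time)
sum by (path) (
  rate({job="nginx"} | json path="request_uri" | status =~ "5.." [5m])
)
# p99 latency from access logs (nginx request_time is in seconds)
quantile_over_time(0.99,
  {job="nginx"} | json path="request_uri", duration="request_time" | unwrap duration [5m]
) by (path)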
# Lines per minute per service
sum by (service) (count_over_time({env="prod"}[1m]))
Recording rules
Loki's ruler can turn LogQL metric queries into Prometheus recording rules, pushed via remote_write:
# /etc/loki/rules/fake/nginx.yml ("fake" is the tenant ID Loki uses when auth_enabled is false)
groups:
  - name: nginx
    interval: 1m
    rules:
      - record: nginx:5xx:rate5m
        expr: |
          sum by (host) (
            rate({job="nginx"} | json | status =~ "5.." [5m])
          )
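The rule file only defines what to compute; the remote_write side is configured on the ruler itself. The fragment below is a sketch under assumptions: the WAL directory, the client name and the Prometheus URL are all hypothetical, and the exact keys are worth checking against the configuration reference for your Loki version.

```yaml
# Fragment of loki.yaml: let the ruler push recording-rule samples to Prometheus
ruler:
  storage:
    type: local
    local:
      directory: /etc/loki/rules
  wal:
    dir: /var/lib/loki/ruler-wal      # hypothetical path for the ruler's write-ahead log
  remote_write:
    enabled: true
    clients:
      prometheus:                      # arbitrary client name
        url: http://prometheus.internal:9090/api/v1/write   # hypothetical endpoint
```

If the target is a plain Prometheus server, it has to be started with --web.enable-remote-write-receiver to accept these writes.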
Retention and compaction
Three knobs:
- limits_config.retention_period is the global default.
- limits_config.retention_stream holds per-stream overrides (e.g. keep audit logs 1y, everything else 30d).
- compactor.retention_enabled: true is what enables deletion.
limits_config:
  retention_period: 30d
  retention_stream:
    - selector: '{category="audit"}'
      priority: 1
      period: 365d
    - selector: '{env="dev"}'
      priority: 2
      period: 7d
Monitor:
time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds > 24*3600
Storage still growing despite retention being set? The compactor is probably not applying it. Check its logs; it's usually an object-storage permissions problem.
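The Monitor expression above drops straight into a Prometheus alerting rule. A minimal sketch, with the group name, alert name, for duration and severity all chosen for illustration:

```yaml
# Prometheus rule file (sketch)
groups:
  - name: loki-retention                  # hypothetical group name
    rules:
      - alert: LokiRetentionNotRunning    # hypothetical alert name
        expr: time() - loki_compactor_apply_retention_last_successful_run_timestamp_seconds > 24 * 3600
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Loki compactor has not applied retention for over 24h"
```

This is a native-metric alert, which fits the advice above to keep alerting off LogQL.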
Useful query patterns
Everything for one request, by trace id
{env="prod"} |= "trace_id=8f3a2b1c"
Top 10 error messages in the last hour
topk(10,
  sum by (msg) (
    count_over_time({env="prod", level="error"}
      | json | line_format "{{.msg}}" [1h])
  )
)
Requests from one IP that got a 401
{job="nginx"} |= "203.0.113.42" | json | remote_addr="203.0.113.42" | status="401"
Bytes shipped per service per hour
sum by (service) (
bytes_over_time({env="prod"}[1h])
)
Tail live, filtered
In Grafana Explore, set the query and click "Live". Loki's tail endpoint (/loki/api/v1/tail) streams matching lines over a websocket. Useful for watching a deploy.
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| "too many outstanding requests" | Query fans out into more sub-queries than the per-tenant queue on the query frontend (or scheduler) accepts | Shorten time range; add a line filter; raise max_outstanding_per_tenant |
| "parse error: too many label-value pairs" | Cardinality bomb; some label has > max distinct values | logcli labels; find the culprit; rewrite shipper pipeline to drop or move to line |
| Queries fast for recent, slow for old | Ingesters have fresh data in memory; old data reads S3 + index | Normal — use query-frontend with splitting + caching; add index-gateway at scale |
| Logs visible in Loki UI but not in Grafana | Data source uid mismatch, or CORS/proxy misconfigured | Always access Loki via Grafana proxy (access: proxy), never browser-direct |
| Disk filling despite retention 30d | Compactor not enabled, or not running, or missing S3 delete permission | loki_compactor_* metrics; S3 policy needs s3:DeleteObject |
| Intermittent "entry out of order" | Clock skew on a shipper host, or concurrent writers to the same stream | Sync clocks; check for two Promtails with the same labels on one host |
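For the Grafana row in the table above, provisioning the data source from a file makes the proxy access mode explicit and repeatable. A minimal sketch; the file path, data source name and URL are assumptions:

```yaml
# /etc/grafana/provisioning/datasources/loki.yaml (hypothetical path)
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy                 # Grafana's backend talks to Loki; the browser never does
    url: https://loki.internal    # assumed to match the push URL host used by the shipper
```

With access: proxy, CORS settings on Loki stop mattering, because all requests originate from the Grafana server.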
See also rsyslog for the host-side log pipeline, and rsyslog forwarding if you prefer syslog → Loki over Promtail. Next in this track: OpenTelemetry Traces, then SLOs and On-Call.