Grafana Basics
- Data sources are provisioned from files, not clicked into the UI. Same for dashboards if you want them to survive.
- A dashboard with more than 12 panels is two dashboards.
- Variables exist to make one dashboard serve many hosts/environments — don't build per-host dashboards.
- Unified alerting (Grafana 10+) replaces both legacy Grafana alerting and an external Alertmanager for small shops.
- Pick units. Bytes vs bytes (IEC) vs bytes (SI) are different; wrong unit = wrong graph.
- Library panels exist. Use them before you copy-paste that "CPU usage" panel for the twentieth time.
Install Grafana
# Debian/Ubuntu
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | sudo tee /etc/apt/keyrings/grafana.asc > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | \
sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana
sudo systemctl enable --now grafana-server
Listens on :3000. The default login is admin/admin; Grafana prompts for a new password on first login, and the prompt can be skipped, so change it right away. Put it behind nginx with TLS; see Certificate Basics and Nginx Reverse Proxy.
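A quick way to confirm the service came up is Grafana's /api/health endpoint, which returns a small JSON document. The sketch below is a minimal check, assuming Grafana is reachable at http://localhost:3000 (adjust for your host); the helper names are illustrative, not part of any Grafana client.

```python
import json
import urllib.request


def is_healthy(payload: dict) -> bool:
    """Treat the instance as healthy when the health payload reports database "ok"."""
    return payload.get("database") == "ok"


def check(base_url: str = "http://localhost:3000") -> bool:
    """Fetch /api/health and evaluate it. The base URL is an assumption."""
    with urllib.request.urlopen(f"{base_url}/api/health", timeout=5) as resp:
        return resp.status == 200 and is_healthy(json.load(resp))


# Usage (needs a running Grafana): print(check())
```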
Key config bits
# /etc/grafana/grafana.ini
[server]
domain = grafana.internal
root_url = https://grafana.internal/
[security]
disable_initial_admin_creation = false
cookie_secure = true
cookie_samesite = strict
[auth]
disable_login_form = false
[auth.generic_oauth]
enabled = true
name = SSO
client_id = grafana
client_secret = $__file{/etc/grafana/oauth_secret}
auth_url = https://sso.internal/realms/main/protocol/openid-connect/auth
token_url = https://sso.internal/realms/main/protocol/openid-connect/token
allowed_domains = example.com
[unified_alerting]
enabled = true
Provisioning data sources
Clicking "Add data source" in the UI works. It does not survive a rebuild. Drop YAML into /etc/grafana/provisioning/datasources/ and it's rebuilt on every start.
# /etc/grafana/provisioning/datasources/default.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.internal:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST
      manageAlerts: true
    uid: prom-main
  - name: Loki
    type: loki
    access: proxy
    url: http://loki.internal:3100
    jsonData:
      maxLines: 5000
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)"
          url: "/explore?orgId=1&left=\\{\"datasource\":\"tempo\",\"queries\":[\\{\"query\":\"$${__value.raw}\"\\}]\\}"
    uid: loki-main
Set an explicit uid on each data source. Dashboards reference data sources by uid; a fresh reinstall that generates a different uid breaks every dashboard at once.
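Because dashboards point at data sources by uid, a cheap pre-commit check is to walk each dashboard JSON and flag any uid that is not provisioned. This is a minimal sketch, assuming the current dashboard schema where each reference looks like "datasource": {"type": ..., "uid": ...}; the function names and the sample dashboard are made up for illustration.

```python
from typing import Any, Iterator


def iter_datasource_uids(node: Any) -> Iterator[str]:
    """Yield every datasource uid found anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and isinstance(ds.get("uid"), str):
            yield ds["uid"]
        for value in node.values():
            yield from iter_datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_datasource_uids(item)


def unknown_uids(dashboard: dict, provisioned: set[str]) -> set[str]:
    """Return referenced uids that are not provisioned; variables such as ${ds} are skipped."""
    found = set(iter_datasource_uids(dashboard))
    return {u for u in found if u not in provisioned and not u.startswith("$")}


# Hypothetical dashboard: the second panel still points at an old Loki uid.
dashboard = {
    "title": "Node overview",
    "panels": [
        {
            "title": "CPU",
            "datasource": {"type": "prometheus", "uid": "prom-main"},
            "targets": [{"datasource": {"type": "prometheus", "uid": "prom-main"}, "expr": "up"}],
        },
        {"title": "Logs", "datasource": {"type": "loki", "uid": "loki-old"}},
    ],
}
missing = unknown_uids(dashboard, {"prom-main", "loki-main"})
print(missing)
```

Run it over every JSON file in the dashboards directory and fail the build when the returned set is non-empty.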
Dashboards as code
You have two credible choices.
Option A — JSON exported from the UI, committed to git
Build in the UI, click "Share → Export → Save to file", commit the JSON. Provision it from disk:
# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: main
    # folder must be empty when foldersFromFilesStructure is true;
    # put the JSON under <path>/Infra/ to land it in the "Infra" folder
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true
With allowUiUpdates: false, the "Save" button is disabled — people can still edit and then have to export the JSON and commit it. This is the friction you want.
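Exported JSON tends to produce noisy diffs. The sketch below normalizes an export before commit, under the assumption that the top-level id and version fields are instance-specific and safe to drop, while uid must be kept so the dashboard URL stays stable. The function name is illustrative.

```python
import json


def normalize_dashboard(raw: str) -> str:
    """Drop instance-specific fields and emit stable, sorted JSON for clean git diffs."""
    dash = json.loads(raw)
    dash.pop("id", None)       # numeric database id, differs per Grafana instance
    dash.pop("version", None)  # incremented on every save in the UI
    return json.dumps(dash, indent=2, sort_keys=True) + "\n"


exported = '{"version": 7, "uid": "node-overview", "id": 42, "title": "Node overview"}'
print(normalize_dashboard(exported))
```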
Option B — Generated from Grizzly/Grafonnet
Higher upfront cost, pays off as soon as you have > 10 dashboards that share components.
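# Grizzly is a Go binary named grr; the PyPI package grizzly-cli is an unrelated tool.
# Download grr from https://github.com/grafana/grizzly/releases, or on macOS:
brew install grizzly
grr pull # fetch current dashboards into local JSON
# Edit .jsonnet / .libsonnet
grr apply # push changes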
Grafonnet lets you say "a RED dashboard for this service" and stamp out 30 identical dashboards with one function call. Highly recommended once one person owns > 15 dashboards.
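The "one function call per service" idea can be sketched without Jsonnet. The Python below is not Grafonnet; it only illustrates the shape of a generator that stamps out RED dashboards as plain dicts. The metric http_request_duration_seconds_bucket, the service names, and the uid scheme are assumptions for the example.

```python
import json


def red_dashboard(service: str, datasource_uid: str = "prom-main") -> dict:
    """Build a minimal Rate / Errors / Duration dashboard for one service."""
    sel = f'service="{service}"'
    queries = {
        "Rate": f"sum(rate(http_requests_total{{{sel}}}[5m]))",
        "Errors": f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[5m]))',
        "Duration (p95)": (
            "histogram_quantile(0.95, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{{{sel}}}[5m])))"
        ),
    }
    panels = [
        {
            "id": i + 1,
            "type": "timeseries",
            "title": title,
            "datasource": {"type": "prometheus", "uid": datasource_uid},
            # Three panels side by side on Grafana's 24-column grid.
            "gridPos": {"h": 8, "w": 8, "x": 8 * i, "y": 0},
            "targets": [{"refId": "A", "expr": expr}],
        }
        for i, (title, expr) in enumerate(queries.items())
    ]
    return {"uid": f"red-{service}", "title": f"RED / {service}", "panels": panels}


# One call per service; write each result to the provisioned dashboards directory.
dashboards = [red_dashboard(s) for s in ("api", "billing", "search")]
print(json.dumps(dashboards[0], indent=2))
```

Fixing a query in the function fixes it in every generated dashboard, which is the same payoff Grafonnet gives at larger scale.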
Panel types — when to use each
| Panel | Use for | Don't use for |
|---|---|---|
| Time series | Anything changing over time: rates, averages, saturation | A single current value |
| Stat | Single big number (current value, total, SLO %) | Multiple series — becomes illegible |
| Gauge | A value between a known min and max (CPU %, queue depth vs limit) | Unbounded counters |
| Bar gauge | Ranked lists ("top 10 hosts by CPU") | Time-varying — bars jump distractingly |
| Table | Per-instance current values you want to sort/filter | Time series (use time series) |
| Heatmap | Histograms (latency distribution) over time | Anything non-histogram — it lies |
| State timeline | Categorical state over time (deployment, up/down) | Continuous values |
| Logs | Loki queries alongside metrics | A Prometheus query (there's no log) |
Legend hygiene
Default legend (__name__-ish strings) is noise. Override with the template:
{{instance}} — {{mountpoint}}
Rule of thumb: the legend is readable at the size a reader will actually see it. If you have to widen the panel to read the labels, shrink the labels.
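To preview what a legend template will produce for a given series, the substitution can be mimicked in a few lines. This is only a sketch of the idea, assuming a missing label renders as an empty string; the label values are invented.

```python
import re


def render_legend(template: str, labels: dict[str, str]) -> str:
    """Replace each {{label}} placeholder with that label's value."""
    return re.sub(
        r"\{\{\s*(\w+)\s*\}\}",
        lambda m: labels.get(m.group(1), ""),
        template,
    )


series_labels = {
    "__name__": "node_filesystem_avail_bytes",
    "instance": "web1:9100",
    "mountpoint": "/var",
    "job": "node",
}
print(render_legend("{{instance}} - {{mountpoint}}", series_labels))
```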
Variables
Dashboard-level variables make one dashboard serve every host, environment, region. Never build "CPU — prod" and "CPU — stage" as two dashboards.
Common variable patterns
Name: env
Type: Query
Data source: Prometheus
Query: label_values(up, env)
Multi-value: off
Include All: off

Name: instance
Type: Query
Query: label_values(up{env="$env"}, instance)
Multi-value: on
Include All: on

Name: interval
Type: Interval
Values: 1m,5m,15m,1h,6h
Then in your panel query:
rate(node_cpu_seconds_total{env="$env",instance=~"$instance",mode!="idle"}[$interval])
Common mistake: using $instance in an equality matcher (=) instead of a regex matcher (=~). A multi-value variable expands to a|b|c, which is only valid inside a regex matcher.
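The difference between = and =~ is easy to see by simulating the expansion. The sketch below assumes the selected values contain no regex metacharacters and relies on the fact that Prometheus regex matchers are fully anchored; the helper names and instance values are illustrative.

```python
import re


def expand_multi(values: list[str]) -> str:
    """Join the selected values the way a regex matcher expects: a|b|c."""
    return "|".join(values)


def matches_equal(label_value: str, expanded: str) -> bool:
    """instance="$instance": plain string equality against the whole expansion."""
    return label_value == expanded


def matches_regex(label_value: str, expanded: str) -> bool:
    """instance=~"$instance": anchored regex match, as Prometheus does."""
    return re.fullmatch(expanded, label_value) is not None


expanded = expand_multi(["web1:9100", "web2:9100"])
print(expanded)
print(matches_equal("web1:9100", expanded))  # equality never matches a real instance
print(matches_regex("web1:9100", expanded))  # regex matches each selected instance
```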
Unit display
Pick the unit in the panel's "Standard options → Unit" field. "None" is never correct for a quantity.
- Bytes (IEC): 1 KiB = 1024 B. Use for RAM, filesystem sizes, buffer pools.
- Bytes (SI): 1 kB = 1000 B. Use for network throughput (matches ISP marketing and node_network_*_bytes).
- Short: the "just a number, but abbreviate" unit. For counts.
- Percent (0-100) vs Percent (0.0-1.0): these are different. Prometheus ratios are 0-1. Convert explicitly: expr * 100 if you want 0-100.
- Seconds: Prometheus histograms are in seconds. Let Grafana auto-scale to ms/µs.
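A worked example shows how far the two byte units drift apart for the same raw value. This is a small sketch of the arithmetic only, not Grafana's formatter; the humanize helper is hypothetical.

```python
def humanize(value: float, base: int, suffixes: list[str]) -> str:
    """Scale value down by base until it fits, then attach the matching suffix."""
    for suffix in suffixes[:-1]:
        if abs(value) < base:
            return f"{value:.2f} {suffix}"
        value /= base
    return f"{value:.2f} {suffixes[-1]}"


IEC = ["B", "KiB", "MiB", "GiB", "TiB"]  # powers of 1024
SI = ["B", "kB", "MB", "GB", "TB"]       # powers of 1000

raw = 8 * 1024**3  # 8 GiB of RAM, reported in bytes
print(humanize(raw, 1024, IEC))  # 8.00 GiB
print(humanize(raw, 1000, SI))   # 8.59 GB

ratio = 0.25  # a Prometheus ratio in the 0-1 range
print(f"{ratio * 100:.1f}")  # 25.0, the value to pair with Percent (0-100)
```

The same 8 GiB of RAM reads as 8.59 GB under the SI unit, which is the kind of mismatch that makes a graph look wrong.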
Unified alerting
Grafana 10+ unified alerting replaces the legacy per-panel alerts and can also absorb your Prometheus rules. It has three moving pieces:
- Alert rules — a query + condition + "for" duration, evaluated on a schedule.
- Contact points — where notifications go (Slack, email, PagerDuty, webhook).
- Notification policies — a tree that routes alert labels to contact points.
A paste-ready alert rule
# Provisioned via /etc/grafana/provisioning/alerting/rules.yml
apiVersion: 1
groups:
  - orgId: 1
    name: api-slo
    folder: SLO
    interval: 1m
    rules:
      - uid: api-5xx-burn
        title: "API error budget burn (fast)"
        condition: C
        data:
          - refId: A
            datasourceUid: prom-main
            # Query window in seconds before now; data source queries need a non-empty range
            relativeTimeRange: { from: 600, to: 0 }
            model:
              expr: |
                sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
                / sum(rate(http_requests_total{service="api"}[5m]))
              instant: true  # the threshold expression needs one value per series
              refId: A
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator: { params: [0.02], type: gt }
              refId: C
        for: 5m
        annotations:
          runbook_url: https://wiki.internal/runbooks/api-5xx
        labels:
          severity: page
          team: platform
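How the "for" duration interacts with the evaluation interval is easier to follow as a small state machine. The sketch below is a simplified model, assuming only the Normal, Pending, and Alerting states and ignoring NoData and Error handling; it uses the 1m interval, 5m "for", and 0.02 threshold from the rule above, with invented sample values.

```python
from datetime import datetime, timedelta


def evaluate(samples, threshold: float, for_duration: timedelta) -> list[str]:
    """Return the rule state after each evaluation of (timestamp, value)."""
    states = []
    pending_since = None
    for ts, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = ts  # condition just became true
            held = ts - pending_since
            states.append("Alerting" if held >= for_duration else "Pending")
        else:
            pending_since = None  # any dip resets the pending timer
            states.append("Normal")
    return states


start = datetime(2024, 1, 1, 12, 0)
values = [0.01, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03, 0.01]
samples = [(start + timedelta(minutes=i), v) for i, v in enumerate(values)]
states = evaluate(samples, threshold=0.02, for_duration=timedelta(minutes=5))
print(states)
```

A single evaluation below the threshold resets the timer, so a flapping condition can sit in Pending without ever firing.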
Notification policies
Keep the tree small. Typical pattern: match severity=page to pager, severity=ticket to email. Match team=foo into a team-specific route. Default route catches everything else so nothing gets silently dropped.
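The routing pattern above can be modelled as a tree walk. This is a rough sketch, assuming exact label matching and first-match-wins among siblings; the receiver names are placeholders, and real policies also support regex matchers and a continue flag that this ignores.

```python
from dataclasses import dataclass, field


@dataclass
class Route:
    receiver: str
    match: dict[str, str] = field(default_factory=dict)
    children: list["Route"] = field(default_factory=list)


def route(alert_labels: dict[str, str], node: Route) -> str:
    """Descend into the first child whose matchers all match; else use this node."""
    for child in node.children:
        if all(alert_labels.get(k) == v for k, v in child.match.items()):
            return route(alert_labels, child)
    return node.receiver


# The root is the default route, so nothing is silently dropped.
root = Route(
    receiver="default-email",
    children=[
        Route("pager", {"severity": "page"}),
        Route("ticket-email", {"severity": "ticket"}),
        Route("foo-slack", {"team": "foo"}),
    ],
)

print(route({"severity": "page", "team": "platform"}, root))
print(route({"severity": "info", "team": "foo"}, root))
print(route({"severity": "info"}, root))
```

Sibling order matters in this model: an alert with severity=page and team=foo goes to the pager because that route is listed first.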
Folders and permissions
- Group dashboards into folders by team that owns them, not by topic. A team can grant edit on its own folder.
- Give every team a "viewer" role on the top-level "Shared" folder for cross-team dashboards (SLO overview, infra, etc.).
- Never grant "Admin" team-wide. Admin means "can delete data sources" — that's a two-person job at best.
Library panels
"CPU usage" is the same panel in every dashboard in the company. Save it as a library panel, reuse it. One place to fix the query when you migrate to a new exporter version.
- Build the panel once.
- Three-dot menu → "Create library panel", give it a home folder.
- Drop the library panel into any dashboard. It appears with a chain-link icon.
- Edit the library panel once — every dashboard using it updates.
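Before changing a shared panel it helps to know which dashboards use it. The sketch below scans dashboard JSON for library panel references, assuming each such panel carries a "libraryPanel": {"uid": ..., "name": ...} entry and that collapsed rows nest their panels under "panels". The uids and titles are invented.

```python
def library_panel_uids(dashboard: dict) -> set[str]:
    """Collect the uids of all library panels referenced by a dashboard."""
    uids = set()
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        lib = panel.get("libraryPanel")
        if isinstance(lib, dict) and "uid" in lib:
            uids.add(lib["uid"])
        stack.extend(panel.get("panels", []))  # panels nested inside a row
    return uids


dashboard = {
    "title": "Node overview",
    "panels": [
        {"title": "CPU usage", "libraryPanel": {"uid": "lib-cpu", "name": "CPU usage"}},
        {"title": "Disk", "type": "timeseries"},
        {"type": "row", "panels": [{"libraryPanel": {"uid": "lib-mem", "name": "Memory"}}]},
    ],
}
print(sorted(library_panel_uids(dashboard)))
```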
Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| "No data" on a panel, query works in Explore | Dashboard time range is "last 5 minutes" but data is older; or variable is empty | Widen time range; check $var in the query panel |
| Dashboard edits vanish after restart | It's provisioned; UI edits are not persisted | Edit the JSON file on disk, or export and commit |
| Units look off by a factor of 100 | Percent 0.0-1.0 used on a 0-100 value or vice versa | Multiply/divide in the expression; don't fight the unit |
| Loki "too many outstanding requests" | Dashboard hitting Loki with wide time range and no line filter | Add a \|= line filter; shorten time range; lower max_lines |
| Alerts stuck in "Pending" forever | A "for" duration longer than the evaluation interval is not the problem; the rule never actually fires. Check how the condition evaluates in the rule detail | Use the "View" state-history button on the rule |
| OAuth login loops | root_url mismatch with IdP redirect URI | Set root_url to the URL users actually type; restart Grafana |
Next: ship logs into Grafana with Loki Logs.