Grafana Basics

Install Grafana, wire up Prometheus and Loki, treat dashboards as code, pick the right panel types, use variables properly, and configure unified alerting.

If you only remember six things

Data sources are provisioned from files, not clicked into the UI. Same for dashboards if you want them to survive.
A dashboard with more than 12 panels is two dashboards.
Variables exist to make one dashboard serve many hosts/environments — don't build per-host dashboards.
Unified alerting (the default since Grafana 9; legacy alerting is removed as of Grafana 11) replaces legacy Grafana alerting and, for small shops, an external Alertmanager too.
Pick units. Bytes (IEC) and bytes (SI) are different units, and the wrong unit gives a wrong graph.
Library panels exist. Use them before you copy-paste that "CPU usage" panel for the twentieth time.

On this page

Install Grafana
Provisioning data sources
Dashboards as code
Panel types — when to use each
Variables
Unit display
Unified alerting
Folders and permissions
Library panels
Troubleshooting

Install Grafana

# Debian/Ubuntu
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | sudo tee /etc/apt/keyrings/grafana.asc > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | \
  sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana
sudo systemctl enable --now grafana-server

Grafana listens on :3000. The default login is admin/admin, and you are prompted to change the password on first login. Put it behind nginx with TLS; see Certificate Basics and Nginx Reverse Proxy.

Key config bits

# /etc/grafana/grafana.ini
[server]
domain = grafana.internal
root_url = https://grafana.internal/

[security]
disable_initial_admin_creation = false
cookie_secure = true
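# lax rather than strict: with strict, the browser can drop Grafana's OAuth
# state cookie on the redirect back from the IdP, which breaks SSO login.
cookie_samesite = lax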

[auth]
disable_login_form = false

[auth.generic_oauth]
enabled = true
name = SSO
client_id = grafana
client_secret = $__file{/etc/grafana/oauth_secret}
auth_url = https://sso.internal/realms/main/protocol/openid-connect/auth
token_url = https://sso.internal/realms/main/protocol/openid-connect/token
allowed_domains = example.com

[unified_alerting]
enabled = true

Provisioning data sources

Clicking "Add data source" in the UI works, but the result does not survive a rebuild. Drop YAML into /etc/grafana/provisioning/datasources/ and Grafana re-applies it on every start.

# /etc/grafana/provisioning/datasources/default.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.internal:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST
      manageAlerts: true
    uid: prom-main

  - name: Loki
    type: loki
    access: proxy
    url: http://loki.internal:3100
    jsonData:
      maxLines: 5000
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)"
          # Internal link: open the captured ID in the data source with uid "tempo".
          datasourceUid: tempo
          url: "$${__value.raw}"
    uid: loki-main

Use a stable uid on each data source. Dashboards reference data sources by uid; a fresh reinstall with a different uid breaks every dashboard at once.
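
One way to catch a uid mismatch before it reaches production is to scan dashboard JSON for data source references that no provisioned data source declares. The sketch below assumes the two uids from the YAML above and a hand-written example dashboard; the function names are illustrative, not part of any Grafana tooling.

```python
import json

# Assumption: these are the uids declared in the provisioning YAML above.
PROVISIONED_UIDS = {"prom-main", "loki-main"}


def datasource_uids(node):
    """Yield every data source uid referenced anywhere in a dashboard JSON tree."""
    if isinstance(node, dict):
        ds = node.get("datasource")
        if isinstance(ds, dict) and "uid" in ds:
            yield ds["uid"]
        for value in node.values():
            yield from datasource_uids(value)
    elif isinstance(node, list):
        for item in node:
            yield from datasource_uids(item)


def unknown_uids(dashboard, known=PROVISIONED_UIDS):
    """Return referenced uids that no provisioned data source declares."""
    # References such as ${datasource} or "-- Grafana --" are not real uids; skip them.
    refs = {u for u in datasource_uids(dashboard)
            if not u.startswith("$") and not u.startswith("-- ")}
    return sorted(refs - known)


# Hypothetical dashboard: the Logs panel still points at an auto-generated uid.
example = json.loads("""
{"panels": [
  {"title": "CPU", "datasource": {"type": "prometheus", "uid": "prom-main"},
   "targets": [{"datasource": {"type": "prometheus", "uid": "prom-main"}}]},
  {"title": "Logs", "datasource": {"type": "loki", "uid": "P8E80F9AEF21F6940"}}
]}
""")
print(unknown_uids(example))  # ['P8E80F9AEF21F6940']
```

Run against every JSON file in the dashboards directory, a check like this fails loudly in CI instead of showing up as a wall of broken panels after a rebuild.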

Dashboards as code

You have two credible choices.

Option A — JSON exported from the UI, committed to git

Build in the UI, click "Share → Export → Save to file", commit the JSON. Provision it from disk:

# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: main
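    # Leave folder empty: with foldersFromFilesStructure: true, folder names
    # come from subdirectories of options.path (e.g. .../dashboards/Infra).
    folder: ""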
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

With allowUiUpdates: false, a provisioned dashboard cannot be saved from the UI. People can still edit it, but to keep the change they have to export the JSON and commit it. This is the friction you want.
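
Exported JSON carries fields that change on every save or differ between Grafana instances, which makes git diffs noisy. A minimal normalizer, run before committing, might look like the sketch below. It assumes the standard top-level `id` and `version` fields of an exported dashboard; the function name and the sample dashboard are hypothetical.

```python
import json


def normalize_dashboard(raw_json):
    """Return exported dashboard JSON in a stable, diff-friendly form."""
    dashboard = json.loads(raw_json)
    # The numeric id belongs to one Grafana database; the uid is the portable key.
    dashboard["id"] = None
    # version increments on every UI save and only adds noise to diffs.
    dashboard.pop("version", None)
    # Sorted keys and fixed indentation keep the file layout stable between exports.
    return json.dumps(dashboard, indent=2, sort_keys=True) + "\n"


exported = '{"version": 14, "uid": "node-overview", "id": 37, "title": "Node overview", "panels": []}'
print(normalize_dashboard(exported))
```

The function is idempotent, so it is safe to run as a pre-commit hook over files that were already normalized.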

Option B — Generated from Grizzly/Grafonnet

Higher upfront cost, pays off as soon as you have > 10 dashboards that share components.

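# Grizzly is a Go binary named grr, not a pip package. Install it from the
# grafana/grizzly releases page, then set GRAFANA_URL and GRAFANA_TOKEN.
# ./dashboards is an example path.
grr pull ./dashboards    # fetch current dashboards into local files
# Edit .jsonnet / .libsonnet
grr apply ./dashboards   # push changes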

Grafonnet lets you say "a RED dashboard for this service" and stamp out 30 identical dashboards with one function call. Highly recommended once one person owns > 15 dashboards.
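
The "one function call per service" idea can be illustrated without Jsonnet. The sketch below is plain Python, not Grafonnet, and builds a deliberately minimal dashboard dict. The metric `http_request_duration_seconds_bucket`, the `red-<service>` uid scheme, and the service names are assumptions made for the example.

```python
import json


def red_dashboard(service, datasource_uid="prom-main"):
    """Build a minimal Rate/Errors/Duration dashboard dict for one service."""
    selector = f'service="{service}"'
    queries = {
        "Rate": f"sum(rate(http_requests_total{{{selector}}}[5m]))",
        "Errors": f'sum(rate(http_requests_total{{{selector},status=~"5.."}}[5m]))',
        "Duration (p99)": (
            "histogram_quantile(0.99, sum by (le) "
            f"(rate(http_request_duration_seconds_bucket{{{selector}}}[5m])))"
        ),
    }
    panels = [
        {
            "id": index,
            "type": "timeseries",
            "title": title,
            "datasource": {"type": "prometheus", "uid": datasource_uid},
            # Three panels side by side on Grafana's 24-column grid.
            "gridPos": {"h": 8, "w": 8, "x": 8 * (index - 1), "y": 0},
            "targets": [{"refId": "A", "expr": expr}],
        }
        for index, (title, expr) in enumerate(queries.items(), start=1)
    ]
    return {"uid": f"red-{service}", "title": f"RED: {service}", "panels": panels}


# One call per service; every dashboard shares the same layout and queries.
dashboards = [red_dashboard(name) for name in ("api", "checkout", "search")]
print(json.dumps(dashboards[0], indent=2))
```

Changing a query in `red_dashboard` changes it in every generated dashboard, which is the property that makes generation worth the upfront cost.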

Panel types — when to use each

Panel	Use for	Don't use for
Time series	Anything changing over time: rates, averages, saturation	A single current value
Stat	Single big number (current value, total, SLO %)	Multiple series — becomes illegible
Gauge	A value between a known min and max (CPU %, queue depth vs limit)	Unbounded counters
Bar gauge	Ranked lists ("top 10 hosts by CPU")	Time-varying — bars jump distractingly
Table	Per-instance current values you want to sort/filter	Time series (use time series)
Heatmap	Histograms (latency distribution) over time	Anything non-histogram — it lies
State timeline	Categorical state over time (deployment, up/down)	Continuous values
Logs	Loki queries alongside metrics	A Prometheus query (there's no log)

Legend hygiene

The default legend (__name__-ish strings) is noise. Set the query's Legend option to Custom and give it a template:

{{instance}} — {{mountpoint}}

Rule of thumb: the legend should be readable at the size a reader will actually see it. If you have to widen the panel to read the labels, shrink the labels.
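
To see what a legend template buys you, the sketch below mimics the substitution in a few lines of Python. It is a simplified model, assuming a missing label renders as an empty string; the label values are invented, and a plain hyphen is used as the separator.

```python
import re


def render_legend(template, labels):
    """Replace each {{label}} placeholder with that label's value (empty if absent)."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", lambda m: labels.get(m.group(1), ""), template)


# Hypothetical label set for one filesystem series.
series_labels = {
    "__name__": "node_filesystem_avail_bytes",
    "instance": "web-01:9100",
    "mountpoint": "/var",
    "fstype": "ext4",
    "device": "/dev/sda2",
}
print(render_legend("{{instance}} - {{mountpoint}}", series_labels))  # web-01:9100 - /var
```

Five labels collapse into the two that distinguish one line from another, which is all a legend needs to do.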

Variables

Dashboard-level variables make one dashboard serve every host, environment, region. Never build "CPU — prod" and "CPU — stage" as two dashboards.

Common variable patterns

Name: env
Type: Query
Data source: Prometheus
Query: label_values(up, env)
Multi-value: off
Include All: off

Name: instance
Type: Query
Query: label_values(up{env="$env"}, instance)
Multi-value: on
Include All: on

Name: interval
Type: Interval
Values: 1m,5m,15m,1h,6h

Then in your panel query:

rate(node_cpu_seconds_total{env="$env",instance=~"$instance",mode!="idle"}[$interval])

Always pair a multi-value variable with the regex matcher =~, never with =. A multi-value $instance expands to a regex alternation such as (a|b|c), which only makes sense inside a regex matcher.
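
The expansion is easier to reason about once you see the final query text. The sketch below approximates the substitution, assuming the selected values contain no regex metacharacters (Grafana also escapes those, which this model skips). The selected values are examples.

```python
def render(value):
    """Render a variable value; a multi-value selection becomes a regex alternation."""
    if isinstance(value, list):
        return value[0] if len(value) == 1 else "(" + "|".join(value) + ")"
    return value


def interpolate(query, variables):
    """Substitute $name placeholders, longest names first to avoid prefix clashes."""
    for name in sorted(variables, key=len, reverse=True):
        query = query.replace("$" + name, render(variables[name]))
    return query


QUERY = 'rate(node_cpu_seconds_total{env="$env",instance=~"$instance",mode!="idle"}[$interval])'
selected = {"env": "prod", "instance": ["web-01:9100", "web-02:9100"], "interval": "5m"}
print(interpolate(QUERY, selected))
```

With `instance="$instance"` instead of `=~`, the same expansion would ask Prometheus for a series whose instance label is literally `(web-01:9100|web-02:9100)`, and the panel would show no data.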

Unit display

Pick the unit in the panel's "Standard options → Unit" field. "None" is never correct for a quantity.

Bytes (IEC) — 1 KiB = 1024 B. Use for RAM, filesystem sizes, buffer pools.
Bytes (SI) — 1 kB = 1000 B. Use for network throughput (matches ISP marketing and node_network_*_bytes).
Short — the "just a number, but abbreviate" unit. For counts.
Percent (0-100) vs Percent (0.0-1.0) — these are different. Prometheus ratios are 0-1. Convert explicitly: expr * 100 if you want 0-100.
Seconds — Prometheus histograms are in seconds. Let Grafana auto-scale to ms/µs.
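
The arithmetic behind these choices is small enough to check by hand. The sketch below reimplements the formatting in Python to show how far the same raw value drifts under the wrong unit. The function names are illustrative; `percent` and `percentunit` are the ids Grafana uses for the two percent units.

```python
def format_bytes(value, base):
    """Format a byte count with IEC (base=1024) or SI (base=1000) prefixes."""
    units = ["B", "KiB", "MiB", "GiB", "TiB"] if base == 1024 else ["B", "kB", "MB", "GB", "TB"]
    index = 0
    while value >= base and index < len(units) - 1:
        value /= base
        index += 1
    return f"{value:.2f} {units[index]}"


def format_percent(value, unit):
    """unit is 'percentunit' (input 0.0-1.0) or 'percent' (input 0-100)."""
    scaled = value * 100 if unit == "percentunit" else value
    return f"{scaled:.2f}%"


ram = 8 * 1024**3                            # an 8 GiB machine
print(format_bytes(ram, 1024))               # 8.00 GiB
print(format_bytes(ram, 1000))               # 8.59 GB
error_ratio = 0.0234                         # a Prometheus ratio, 0-1
print(format_percent(error_ratio, "percentunit"))  # 2.34%
print(format_percent(error_ratio, "percent"))      # 0.02%
```

The bytes mismatch is about 7% at the giga scale; the percent mismatch is a factor of 100.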

Unified alerting

Unified alerting (the default since Grafana 9) replaces the legacy per-panel alerts and can also absorb your Prometheus rules. It has three moving pieces:

Alert rules — a query + condition + "for" duration, evaluated on a schedule.
Contact points — where notifications go (Slack, email, PagerDuty, webhook).
Notification policies — a tree that routes alert labels to contact points.
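
How the "for" duration interacts with the evaluation schedule is the part people most often get wrong. The sketch below is a simplified model of a rule's state over successive evaluations, assuming a "greater than" condition and a "for" expressed as a number of evaluation intervals. It ignores no-data and error states.

```python
def evaluate(samples, threshold, for_evaluations):
    """Return the rule state after each evaluation of a 'greater than' condition."""
    states, breaching = [], 0
    for value in samples:
        # Any evaluation below the threshold resets the pending timer.
        breaching = breaching + 1 if value > threshold else 0
        if breaching == 0:
            states.append("Normal")
        elif breaching <= for_evaluations:
            states.append("Pending")
        else:
            states.append("Alerting")
    return states


# interval 1m and for 5m: the condition must hold across 5 full minutes.
samples = [0.01, 0.03, 0.03, 0.01, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03]
print(evaluate(samples, threshold=0.02, for_evaluations=5))
```

The dip at the fourth sample resets the timer, so the rule only reaches Alerting on the last evaluation. A condition that flaps like this more often than the "for" duration never fires at all.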

A paste-ready alert rule

# Provisioned via /etc/grafana/provisioning/alerting/rules.yml
apiVersion: 1
groups:
  - orgId: 1
    name: api-slo
    folder: SLO
    interval: 1m
    rules:
      - uid: api-5xx-burn
        title: "API error budget burn (fast)"
        condition: C
        data:
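          - refId: A
            datasourceUid: prom-main
            # Query window: the last 10 minutes.
            relativeTimeRange: { from: 600, to: 0 }
            model:
              expr: |
                sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
                  / sum(rate(http_requests_total{service="api"}[5m]))
              instant: true   # one value per series, which the threshold expects
              refId: A
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator: { params: [0.02], type: gt }
              refId: C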
        for: 5m
        annotations:
          runbook_url: https://wiki.internal/runbooks/api-5xx
        labels:
          severity: page
          team: platform

Notification policies

Keep the tree small. Typical pattern: match severity=page to pager, severity=ticket to email. Match team=foo into a team-specific route. Default route catches everything else so nothing gets silently dropped.
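
The routing described above can be modelled as a depth-first walk over a small tree. The sketch below is a simplified model, not Grafana's implementation: it supports only exact label matches, stops at the first matching sibling, and ignores grouping and the "continue" option. The contact point names are made up.

```python
def route(labels, policy):
    """Return the contact point of the deepest matching policy, depth first."""
    for child in policy.get("routes", []):
        if all(labels.get(key) == value for key, value in child["match"].items()):
            return route(labels, child)
    # No child matched: this policy's own contact point handles the alert.
    return policy["receiver"]


POLICY_TREE = {
    "receiver": "default-email",  # default route, so nothing is silently dropped
    "routes": [
        {"match": {"team": "foo"}, "receiver": "foo-slack", "routes": [
            {"match": {"severity": "page"}, "receiver": "foo-pager"},
        ]},
        {"match": {"severity": "page"}, "receiver": "pager"},
        {"match": {"severity": "ticket"}, "receiver": "email"},
    ],
}

print(route({"severity": "page", "team": "platform"}, POLICY_TREE))  # pager
print(route({"severity": "page", "team": "foo"}, POLICY_TREE))       # foo-pager
print(route({"severity": "info"}, POLICY_TREE))                      # default-email
```

Sibling order matters in this model: the team route is listed first so that team foo's pages reach foo's pager rather than the shared one.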

If you already run external Alertmanager and Prometheus rules, don't duplicate. Use Grafana unified alerting in "external Alertmanager" mode and let AM continue to route.

Folders and permissions

Group dashboards into folders by the team that owns them, not by topic. A team can grant edit on its own folder.
Give every team a "viewer" role on the top-level "Shared" folder for cross-team dashboards (SLO overview, infra, etc.).
Never grant "Admin" team-wide. Admin means "can delete data sources" — that's a two-person job at best.

Library panels

"CPU usage" is the same panel in every dashboard in the company. Save it as a library panel, reuse it. One place to fix the query when you migrate to a new exporter version.

Build the panel once.
Three-dot menu → "Create library panel", give it a home folder.
Drop the library panel into any dashboard. It appears with a chain-link icon.
Edit the library panel once — every dashboard using it updates.
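
Before converting anything, it helps to know which panels are actually duplicated. The sketch below scans dashboard JSON for panels that share a type and query set and are not yet library panel references (those carry a `libraryPanel` key). The fingerprint choice and the sample dashboards are assumptions for illustration.

```python
import hashlib
import json
from collections import defaultdict


def panel_fingerprint(panel):
    """Hash the parts of a panel that define what it shows: type and queries."""
    core = {
        "type": panel.get("type"),
        "exprs": sorted(t.get("expr", "") for t in panel.get("targets", [])),
    }
    return hashlib.sha256(json.dumps(core, sort_keys=True).encode()).hexdigest()[:12]


def library_candidates(dashboards, min_copies=2):
    """Map fingerprint -> [(dashboard title, panel title)] for repeated panels."""
    seen = defaultdict(list)
    for dash in dashboards:
        for panel in dash.get("panels", []):
            if "libraryPanel" in panel:  # already a library panel reference
                continue
            seen[panel_fingerprint(panel)].append((dash["title"], panel.get("title", "")))
    return {fp: uses for fp, uses in seen.items() if len(uses) >= min_copies}


# Hypothetical dashboards that both contain a copy-pasted CPU panel.
cpu = {"type": "timeseries", "title": "CPU usage",
       "targets": [{"expr": 'rate(node_cpu_seconds_total{mode!="idle"}[5m])'}]}
mem = {"type": "timeseries", "title": "Memory",
       "targets": [{"expr": "node_memory_MemAvailable_bytes"}]}
boards = [{"title": "Web", "panels": [cpu, mem]}, {"title": "DB", "panels": [cpu]}]
for uses in library_candidates(boards).values():
    print(uses)  # [('Web', 'CPU usage'), ('DB', 'CPU usage')]
```

Panels that show up in this report many times are the ones worth converting first.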

Troubleshooting

Symptom	Likely cause	Fix
"No data" on a panel, query works in Explore	Dashboard time range is "last 5 minutes" but data is older; or a variable is empty	Widen the time range; check how `$var` expands in the panel's Query inspector
Dashboard edits vanish after restart	It's provisioned; UI edits are not persisted	Edit the JSON file on disk, or export and commit
Units look off by a factor of 100	Percent 0.0-1.0 used on a 0-100 value or vice versa	Multiply/divide in the expression; don't fight the unit
Loki "too many outstanding requests"	Dashboard hitting Loki with a wide time range and no line filter	Add a `|=` line filter; shorten the time range; lower `maxLines`
Alerts stuck in "Pending" forever	The condition does not stay true for the whole "for" duration, so the rule never actually fires. A "for" longer than the evaluation interval is normal and is not the cause.	Use the "View" state-history button on the rule to check how the condition evaluated
OAuth login loops	`root_url` mismatch with IdP redirect URI	Set `root_url` to the URL users actually type; restart Grafana

Next: ship logs into Grafana with Loki Logs.