Grafana Basics

Install Grafana, wire up Prometheus and Loki, treat dashboards as code, pick the right panel types, use variables properly, and configure unified alerting.

If you only remember six things
  • Data sources are provisioned from files, not clicked into the UI. Same for dashboards if you want them to survive.
  • A dashboard with more than 12 panels is two dashboards.
  • Variables exist to make one dashboard serve many hosts/environments — don't build per-host dashboards.
  • Unified alerting (Grafana 10+) replaces both legacy Grafana alerting and an external Alertmanager for small shops.
  • Pick units. Bytes vs bytes (IEC) vs bytes (SI) are different — wrong unit = wrong graph.
  • Library panels exist. Use them before you copy-paste that "CPU usage" panel for the twentieth time.

Install Grafana

# Debian/Ubuntu
sudo apt install -y apt-transport-https software-properties-common
sudo mkdir -p /etc/apt/keyrings
wget -qO - https://apt.grafana.com/gpg.key | sudo tee /etc/apt/keyrings/grafana.asc > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.asc] https://apt.grafana.com stable main" | \
  sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update && sudo apt install -y grafana
sudo systemctl enable --now grafana-server

Listens on :3000. The default login is admin/admin, and Grafana prompts for a password change on first login. Put it behind nginx with TLS; see Certificate Basics and Nginx Reverse Proxy.

Key config bits

# /etc/grafana/grafana.ini
[server]
domain = grafana.internal
root_url = https://grafana.internal/

[security]
disable_initial_admin_creation = false
cookie_secure = true
cookie_samesite = strict

[auth]
disable_login_form = false

[auth.generic_oauth]
enabled = true
name = SSO
client_id = grafana
client_secret = $__file{/etc/grafana/oauth_secret}
auth_url = https://sso.internal/realms/main/protocol/openid-connect/auth
token_url = https://sso.internal/realms/main/protocol/openid-connect/token
allowed_domains = example.com

[unified_alerting]
enabled = true

Provisioning data sources

Clicking "Add data source" in the UI works, but the result does not survive a rebuild. Drop YAML into /etc/grafana/provisioning/datasources/ and the data sources are recreated on every start.

# /etc/grafana/provisioning/datasources/default.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus.internal:9090
    isDefault: true
    jsonData:
      timeInterval: 15s
      httpMethod: POST
      manageAlerts: true
    uid: prom-main

  - name: Loki
    type: loki
    access: proxy
    url: http://loki.internal:3100
    jsonData:
      maxLines: 5000
      derivedFields:
        - name: TraceID
          matcherRegex: "trace_id=(\\w+)"
          url: "/explore?orgId=1&left=\\{\"datasource\":\"tempo\",\"queries\":[\\{\"query\":\"$${__value.raw}\"\\}]\\}"
    uid: loki-main

Use a stable uid on each data source. Dashboards reference data sources by uid; a fresh reinstall with a different uid breaks every dashboard at once.

Dashboards as code

You have two credible choices.

Option A — JSON exported from the UI, committed to git

Build in the UI, click "Share → Export → Save to file", commit the JSON. Provision it from disk:

# /etc/grafana/provisioning/dashboards/default.yml
apiVersion: 1
providers:
  - name: main
    folder: Infra
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    allowUiUpdates: false
    options:
      path: /var/lib/grafana/dashboards
      foldersFromFilesStructure: true

With allowUiUpdates: false, the "Save" button is disabled — people can still edit and then have to export the JSON and commit it. This is the friction you want.
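
Because the exported JSON lives in git, a small check in CI can catch problems before Grafana loads the file. The following is a minimal sketch, not an official tool: the function names are made up, the 12-panel limit is the rule of thumb from the top of this page, and the allowed uid set is assumed to match the data source provisioning file.

```python
import json

# Assumption: these are the uids set in the data source provisioning file.
KNOWN_UIDS = {"prom-main", "loki-main"}
MAX_PANELS = 12  # rule of thumb from this guide, not a Grafana limit


def iter_panels(panels):
    """Yield every non-row panel, descending into collapsed rows."""
    for panel in panels:
        if panel.get("type") == "row":
            yield from iter_panels(panel.get("panels", []))
        else:
            yield panel


def lint_dashboard(dashboard):
    """Return a list of human-readable problems (empty list means OK)."""
    problems = []
    panels = list(iter_panels(dashboard.get("panels", [])))
    if len(panels) > MAX_PANELS:
        problems.append(f"{len(panels)} panels (limit {MAX_PANELS})")
    for panel in panels:
        ds = panel.get("datasource")
        uid = ds.get("uid") if isinstance(ds, dict) else None
        if uid is not None and uid not in KNOWN_UIDS:
            title = panel.get("title", "?")
            problems.append(f"panel '{title}': unknown data source uid '{uid}'")
    return problems


# Hypothetical, heavily trimmed dashboard export used only for illustration.
sample = json.loads("""
{
  "title": "Node overview",
  "panels": [
    {"type": "timeseries", "title": "CPU",
     "datasource": {"type": "prometheus", "uid": "prom-main"}},
    {"type": "row", "title": "Disks", "panels": [
      {"type": "timeseries", "title": "Disk IO",
       "datasource": {"type": "prometheus", "uid": "prom-old"}}
    ]}
  ]
}
""")

for problem in lint_dashboard(sample):
    print(problem)
```

In a real pipeline you would loop over every JSON file under the provisioned dashboards path and fail the build if any list is non-empty.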

Option B — Generated from Grizzly/Grafonnet

Higher upfront cost, pays off as soon as you have > 10 dashboards that share components.

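# Grizzly ships as a single Go binary named grr, not a pip package.
# Download it from the grafana/grizzly releases page and put it on your PATH.
grr pull resources/    # fetch current dashboards into local files (resources/ is an example directory)
# Edit .jsonnet / .libsonnet
grr apply resources/   # push changes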

Grafonnet lets you say "a RED dashboard for this service" and stamp out 30 identical dashboards with one function call. Highly recommended once one person owns > 15 dashboards.
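
Grafonnet itself is written in Jsonnet. To show the "one function call per service" idea without it, here is a rough Python stand-in that emits a trimmed dashboard JSON model. Treat it as a sketch under assumptions: the function names, the uid scheme, and the http_request_duration_seconds_bucket metric are hypothetical, and a real dashboard carries many more fields.

```python
import json


def panel(title, expr, x, unit):
    """Build one time series panel, 8 grid columns wide, on the top row."""
    return {
        "type": "timeseries",
        "title": title,
        "datasource": {"type": "prometheus", "uid": "prom-main"},
        "fieldConfig": {"defaults": {"unit": unit}, "overrides": []},
        "gridPos": {"h": 8, "w": 8, "x": x, "y": 0},
        "targets": [{"refId": "A", "expr": expr}],
    }


def red_dashboard(service):
    """Return a Rate/Errors/Duration dashboard for one service."""
    sel = f'service="{service}"'
    duration = (
        "histogram_quantile(0.95, sum by (le) "
        f"(rate(http_request_duration_seconds_bucket{{{sel}}}[5m])))"
    )
    return {
        "uid": f"red-{service}",
        "title": f"RED / {service}",
        "panels": [
            panel("Rate", f"sum(rate(http_requests_total{{{sel}}}[5m]))", 0, "reqps"),
            panel("Errors", f'sum(rate(http_requests_total{{{sel},status=~"5.."}}[5m]))', 8, "reqps"),
            panel("Duration p95", duration, 16, "s"),
        ],
    }


# Stamp out one dashboard per service; the service names are examples.
dashboards = {name: red_dashboard(name) for name in ["api", "checkout", "search"]}
print(json.dumps(dashboards["api"]["panels"][0]["targets"], indent=2))
```

Writing each result to a file under the provisioned dashboards path would make this fit Option A's provisioning; Grafonnet and Grizzly give you the same pattern with a maintained library behind it.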

Panel types — when to use each

Panel | Use for | Don't use for
Time series | Anything changing over time: rates, averages, saturation | A single current value
Stat | Single big number (current value, total, SLO %) | Multiple series: becomes illegible
Gauge | A value between a known min and max (CPU %, queue depth vs limit) | Unbounded counters
Bar gauge | Ranked lists ("top 10 hosts by CPU") | Time-varying data: bars jump distractingly
Table | Per-instance current values you want to sort/filter | Time series (use time series)
Heatmap | Histograms (latency distribution) over time | Anything non-histogram: it lies
State timeline | Categorical state over time (deployment, up/down) | Continuous values
Logs | Loki queries alongside metrics | A Prometheus query (there's no log)

Legend hygiene

Default legend (__name__-ish strings) is noise. Override with the template:

{{instance}} — {{mountpoint}}

Rule of thumb: the legend should be readable at the size a reader will actually see it. If you have to widen the panel to read the labels, shrink the labels.

Variables

Dashboard-level variables make one dashboard serve every host, environment, region. Never build "CPU — prod" and "CPU — stage" as two dashboards.

Common variable patterns

Name: env
Type: Query
Data source: Prometheus
Query: label_values(up, env)
Multi-value: off
Include All: off

Name: instance
Type: Query
Query: label_values(up{env="$env"}, instance)
Multi-value: on
Include All: on

Name: interval
Type: Interval
Values: 1m,5m,15m,1h,6h

Then in your panel query:

rate(node_cpu_seconds_total{env="$env",instance=~"$instance",mode!="idle"}[$interval])

Always pair a multi-value variable with the =~ regex matcher, never with plain =. A multi-value $instance expands to a|b|c, which is only valid inside a regex matcher.
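
To see why the matcher matters, it helps to expand the query by hand. The sketch below is a simplified model of variable interpolation, not Grafana's implementation: the function names are invented, and it skips the regex escaping Grafana applies to each value.

```python
def expand_multi_value(values):
    """Join the selected values into a regex alternation."""
    if len(values) == 1:
        return values[0]
    return "(" + "|".join(values) + ")"


def interpolate(query, variables):
    """Replace $name placeholders, longest names first to avoid prefix clashes."""
    for name in sorted(variables, key=len, reverse=True):
        query = query.replace("$" + name, variables[name])
    return query


template = (
    'rate(node_cpu_seconds_total{env="$env",instance=~"$instance",'
    'mode!="idle"}[$interval])'
)
variables = {
    "env": "prod",
    "instance": expand_multi_value(["web1:9100", "web2:9100"]),
    "interval": "5m",
}
print(interpolate(template, variables))
```

With instance="$instance" instead, the same expansion would ask Prometheus for a series whose instance label is literally the string (web1:9100|web2:9100), which matches nothing.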

Unit display

Pick the unit in the panel's "Standard options → Unit" field. "None" is never correct for a quantity.
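
A quick calculation shows how much the unit choice changes what the reader sees. This is an illustrative sketch only: the helper names and the "si"/"iec" switch are made up here, and the two percent modes mirror Grafana's "Percent (0-100)" and "Percent (0.0-1.0)" options.

```python
def format_bytes(value, system):
    """Format a byte count using SI (powers of 1000) or IEC (powers of 1024)."""
    if system == "si":
        base, suffixes = 1000, ["B", "kB", "MB", "GB", "TB"]
    elif system == "iec":
        base, suffixes = 1024, ["B", "KiB", "MiB", "GiB", "TiB"]
    else:
        raise ValueError(f"unknown system: {system}")
    index = 0
    while value >= base and index < len(suffixes) - 1:
        value /= base
        index += 1
    return f"{value:.2f} {suffixes[index]}"


def format_percent(value, unit):
    """'percentunit' expects 0.0-1.0; 'percent' expects 0-100."""
    scaled = value * 100 if unit == "percentunit" else value
    return f"{scaled:.1f}%"


disk_bytes = 1_500_000_000
print(format_bytes(disk_bytes, "si"))    # 1.50 GB
print(format_bytes(disk_bytes, "iec"))   # 1.40 GiB

error_ratio = 0.02  # a ratio, so it belongs with the 0.0-1.0 unit
print(format_percent(error_ratio, "percentunit"))  # 2.0%
print(format_percent(error_ratio, "percent"))      # 0.0% (wrong unit, off by 100)
```

The same number reads as 1.50 GB or 1.40 GiB depending on the unit, and a ratio shown with the 0-100 unit all but disappears.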

Unified alerting

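Grafana 10+ unified alerting replaces the legacy per-panel alerts and can also absorb your Prometheus rules. It has three moving pieces:

  • Alert rules: the query plus the condition that decides when something is wrong.
  • Contact points: where notifications are delivered (pager, email, chat).
  • Notification policies: the routing tree that matches alert labels to contact points.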

A paste-ready alert rule

# Provisioned via /etc/grafana/provisioning/alerting/rules.yml
apiVersion: 1
groups:
  - orgId: 1
    name: api-slo
    folder: SLO
    interval: 1m
    rules:
      - uid: api-5xx-burn
        title: "API error budget burn (fast)"
        condition: C
        data:
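          - refId: A
            datasourceUid: prom-main
            # Query window in seconds relative to now (here: the last 5 minutes)
            relativeTimeRange:
              from: 300
              to: 0
            model:
              # Instant query: the threshold expression needs a single value per series
              instant: true
              expr: |
                sum(rate(http_requests_total{service="api",status=~"5.."}[5m]))
                  / sum(rate(http_requests_total{service="api"}[5m]))
              refId: A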
          - refId: C
            datasourceUid: __expr__
            model:
              type: threshold
              expression: A
              conditions:
                - evaluator: { params: [0.02], type: gt }
        for: 5m
        annotations:
          runbook_url: https://wiki.internal/runbooks/api-5xx
        labels:
          severity: page
          team: platform
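
How interval: 1m and for: 5m interact is easier to see in a toy model. The sketch below is a simplified state machine for Normal, Pending and Alerting, written for illustration; it is not Grafana's scheduler, it ignores the NoData and Error states, and the function name and sample values are invented.

```python
def evaluate(samples, threshold=0.02, interval_s=60, for_s=300):
    """Return the rule state after each evaluation of the error ratio."""
    states = []
    pending_since = None  # time at which the condition first became true
    for i, ratio in enumerate(samples):
        now = i * interval_s
        if ratio > threshold:
            if pending_since is None:
                pending_since = now
            held_for = now - pending_since
            states.append("Alerting" if held_for >= for_s else "Pending")
        else:
            pending_since = None  # any healthy evaluation resets the timer
            states.append("Normal")
    return states


# One evaluation per minute; the dip at index 3 resets the pending timer.
ratios = [0.01, 0.03, 0.03, 0.01, 0.03, 0.03, 0.03, 0.03, 0.03, 0.03]
for minute, state in enumerate(evaluate(ratios)):
    print(minute, state)
```

In this model the rule only reaches Alerting at minute 9, after the condition has held for five full minutes. A ratio that keeps dipping below the threshold never gets past Pending.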

Notification policies

Keep the tree small. Typical pattern: match severity=page to pager, severity=ticket to email. Match team=foo into a team-specific route. Default route catches everything else so nothing gets silently dropped.
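
The routing described above can be modelled in a few lines. This is a deliberately flat sketch with hypothetical contact point names; real notification policies also support nesting, regex matchers and "continue matching", none of which are shown here.

```python
# Ordered routes: the first route whose matchers all match wins.
# Contact point names are placeholders for illustration.
ROUTES = [
    ({"severity": "page"}, "pager"),
    ({"severity": "ticket"}, "email"),
    ({"team": "foo"}, "team-foo"),
]
DEFAULT_CONTACT_POINT = "default-inbox"


def route(labels):
    """Return the contact point for an alert's label set."""
    for matchers, contact_point in ROUTES:
        if all(labels.get(key) == value for key, value in matchers.items()):
            return contact_point
    return DEFAULT_CONTACT_POINT  # nothing is silently dropped


print(route({"severity": "page", "team": "platform"}))  # pager
print(route({"severity": "info", "team": "foo"}))       # team-foo
print(route({"severity": "info"}))                      # default-inbox
```

Order matters in this model: a team=foo alert with severity=page goes to the pager, not to the team route, because the severity route is listed first.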

If you already run external Alertmanager and Prometheus rules, don't duplicate. Use Grafana unified alerting in "external Alertmanager" mode and let AM continue to route.

Folders and permissions

Library panels

"CPU usage" is the same panel in every dashboard in the company. Save it as a library panel, reuse it. One place to fix the query when you migrate to a new exporter version.

  1. Build the panel once.
  2. Three-dot menu → "Create library panel", give it a home folder.
  3. Drop the library panel into any dashboard. It appears with a chain-link icon.
  4. Edit the library panel once — every dashboard using it updates.
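
To find candidates for library panels, you can scan the exported dashboard JSON for queries that were copy-pasted across dashboards. The helper below is a rough sketch with an invented name; it only looks at top-level panels with an expr target and skips panels that already reference a library panel.

```python
from collections import defaultdict


def duplicate_queries(dashboards, min_count=2):
    """Map each repeated query to the sorted titles of dashboards using it."""
    seen = defaultdict(set)
    for dash in dashboards:
        for panel in dash.get("panels", []):
            if "libraryPanel" in panel:
                continue  # already shared, nothing to deduplicate
            for target in panel.get("targets", []):
                expr = target.get("expr")
                if expr:
                    normalized = " ".join(expr.split())  # ignore whitespace differences
                    seen[normalized].add(dash.get("title", "?"))
    return {
        expr: sorted(titles)
        for expr, titles in seen.items()
        if len(titles) >= min_count
    }


# Hypothetical trimmed exports used only for illustration.
cpu = 'avg(rate(node_cpu_seconds_total{mode!="idle"}[5m]))'
exports = [
    {"title": "Web", "panels": [{"targets": [{"expr": cpu}]}]},
    {"title": "DB", "panels": [{"targets": [{"expr": cpu + "  "}]}]},
    {"title": "Cache", "panels": [{"libraryPanel": {"uid": "cpu-usage"}}]},
]
print(duplicate_queries(exports))
```

Any query that shows up in the result is a panel worth converting once and reusing.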

Troubleshooting

Symptom | Likely cause | Fix
"No data" on a panel, query works in Explore | Dashboard time range is "last 5 minutes" but data is older; or a variable is empty | Widen the time range; check $var in the query panel
Dashboard edits vanish after restart | It's provisioned; UI edits are not persisted | Edit the JSON file on disk, or export and commit
Units look off by a factor of 100 | Percent (0.0-1.0) unit used on a 0-100 value, or vice versa | Multiply/divide in the expression; don't fight the unit
Loki "too many outstanding requests" | Dashboard hitting Loki with a wide time range and no line filter | Add a |= line filter; shorten the time range; lower maxLines
Alerts stuck in "Pending" forever | The "for" duration being longer than the evaluation interval is not the issue; the rule never actually fires. Check the condition evaluation in the rule detail | Use the "View" state-history button on the rule
OAuth login loops | root_url mismatch with the IdP redirect URI | Set root_url to the URL users actually type; restart Grafana

Next: ship logs into Grafana with Loki Logs.