Skip to main content

Assignment 2: Build a dashboard and a working alert

Goal: Get hands-on with Grafana on its own: connect to existing lab data, build a focused dashboard, and create an alert that you can actually make fire and recover. This sharpens the skills the capstone assumes.

Where: The lab Grafana at 10.100.100.4 (reached via the Jumpbox bastion), querying the Loki log data source at 10.100.100.5 and the metrics data source (InfluxDB). No new app required — use telemetry already flowing from lab hosts.

Tasks

  1. Log in and look around. Open Grafana at 10.100.100.4, go to Connections → Data sources, and note which data sources exist (Loki, InfluxDB, and any others). Confirm the Loki URL points at 10.100.100.5.
  2. Explore first. In Explore, pick Loki and run {job=~".+"} to see what hosts are shipping logs. Pick one host/job and narrow with a line filter, e.g. {host="Jumpbox"} |= "error" (adjust to a real label that exists).
  3. Build a dashboard. Create a new dashboard named after you (e.g. <yourname> — lab health). Add at least three panels:
    • A logs panel from Loki for one host/job.
    • A time-series metric panel (e.g. CPU or memory of a lab host from InfluxDB).
    • A stat or count panel using a LogQL metric query such as sum(count_over_time({job=~".+"} |= "error" [5m])).
  4. Add a dashboard variable. Add a $host (or $job) template variable so the panels can switch targets from a dropdown instead of hard-coding one host.
  5. Create an alert rule. In Alerting → Alert rules, build a rule on a query of your choice (e.g. error-line count above a small threshold, or a host metric over a limit). Set an evaluation interval, a for duration of a few minutes, a clear annotation message, and route it to a contact point.
  6. Make it fire, then recover. Trigger the condition (for a log-based alert, generate some matching log lines on a host you control; for a metric, load it briefly). Watch the rule go Normal → Pending → Firing, confirm the notification, then let it return to Normal.

Deliverable

A short markdown note with: a screenshot of your dashboard (showing the $host/$job dropdown), the LogQL/metric queries used in each panel, the alert rule definition (query, threshold, for, contact point), and a screenshot or log showing the alert transition from Pending → Firing → Resolved.

Acceptance criteria — you're done when:

  • You can name the lab Grafana data sources and confirm Loki points at 10.100.100.5.
  • A dashboard named after you exists with at least three panels (a Loki logs panel, a metric time-series panel, and a LogQL count/stat panel).
  • The dashboard has a working $host or $job template variable that changes the panels.
  • An alert rule exists with a query, a threshold/condition, an evaluation interval, a for duration, and an annotation message.
  • The alert is routed to a contact point.
  • You captured the alert transitioning Normal → Pending → Firing and back to Normal/Resolved.
  • The deliverable note includes every query used and the screenshots above.

Hints

  • Don't hardcode a host that may not exist — run {job=~".+"} in Explore first to discover real labels, then build from what's actually there.
  • A LogQL count query (count_over_time / rate) is the easiest thing to alert on without needing app metrics.
  • Keep the for duration short (e.g. 1m) while testing so you don't wait long, then raise it once it works.
  • "Alert on symptoms, not causes" — but for this exercise any reproducible condition is fine; the goal is to learn the mechanics.
  • If the alert never leaves Pending, your condition probably isn't actually true long enough — re-check the query in Explore.

blocked for >~30 min after re-reading the lessons? Bring what you've tried to your mentor.