Assignment 2: Build a dashboard and a working alert

Goal: Get hands-on with Grafana on its own: connect to existing lab data, build a focused dashboard, and create an alert that you can actually make fire and recover. This sharpens the skills the capstone assumes.

Where: The lab Grafana at 10.100.100.4 (reached via the Jumpbox bastion), querying the Loki log data source at 10.100.100.5 and the metrics data source (InfluxDB). No new app required — use telemetry already flowing from lab hosts.

Tasks

Log in and look around. Open Grafana at 10.100.100.4, go to Connections → Data sources, and note which data sources exist (Loki, InfluxDB, and any others). Confirm the Loki URL points at 10.100.100.5.
Explore first. In Explore, pick Loki and run {job=~".+"} to see what hosts are shipping logs. Pick one host/job and narrow with a line filter, e.g. {host="Jumpbox"} |= "error" (adjust to a real label that exists).
Build a dashboard. Create a new dashboard named after you (e.g. <yourname> — lab health). Add at least three panels:
- A logs panel from Loki for one host/job.
- A time-series metric panel (e.g. CPU or memory of a lab host from InfluxDB).
- A stat or count panel using a LogQL metric query such as sum(count_over_time({job=~".+"} |= "error" [5m])).
Add a dashboard variable. Add a $host (or $job) template variable so the panels can switch targets from a dropdown instead of hard-coding one host.
Create an alert rule. In Alerting → Alert rules, build a rule on a query of your choice (e.g. error-line count above a small threshold, or a host metric over a limit). Set an evaluation interval, a for duration of a few minutes, a clear annotation message, and route it to a contact point.
Make it fire, then recover. Trigger the condition (for a log-based alert, generate some matching log lines on a host you control; for a metric, load it briefly). Watch the rule go Normal → Pending → Firing, confirm the notification, then let it return to Normal.

Deliverable

A short markdown note with: a screenshot of your dashboard (showing the $host/$job dropdown), the LogQL/metric queries used in each panel, the alert rule definition (query, threshold, for, contact point), and a screenshot or log showing the alert transition from Pending → Firing → Resolved.

Acceptance criteria — you're done when:

You can name the lab Grafana data sources and confirm Loki points at 10.100.100.5.
A dashboard named after you exists with at least three panels (a Loki logs panel, a metric time-series panel, and a LogQL count/stat panel).
The dashboard has a working $host or $job template variable that changes the panels.
An alert rule exists with a query, a threshold/condition, an evaluation interval, a for duration, and an annotation message.
The alert is routed to a contact point.
You captured the alert transitioning Normal → Pending → Firing and back to Normal/Resolved.
The deliverable note includes every query used and the screenshots above.

Hints

Don't hardcode a host that may not exist — run {job=~".+"} in Explore first to discover real labels, then build from what's actually there.
A LogQL count query (count_over_time / rate) is the easiest thing to alert on without needing app metrics.
Keep the for duration short (e.g. 1m) while testing so you don't wait long, then raise it once it works.
"Alert on symptoms, not causes" — but for this exercise any reproducible condition is fine; the goal is to learn the mechanics.
If the alert never leaves Pending, your condition probably isn't actually true long enough — re-check the query in Explore.

blocked for >~30 min after re-reading the lessons? Bring what you've tried to your mentor.