Assignment 2: Build a dashboard and a working alert
Goal: Get hands-on with Grafana on its own: connect to existing lab data, build a focused dashboard, and create an alert that you can actually make fire and recover. This sharpens the skills the capstone assumes.
Where: The lab Grafana at 10.100.100.4 (reached via the Jumpbox bastion), querying the Loki log data source at 10.100.100.5 and the metrics data source (InfluxDB). No new app required — use telemetry already flowing from lab hosts.
Tasks
- Log in and look around. Open Grafana at 10.100.100.4, go to Connections → Data sources, and note which data sources exist (Loki, InfluxDB, and any others). Confirm the Loki URL points at 10.100.100.5.
- Explore first. In Explore, pick Loki and run
{job=~".+"}to see what hosts are shipping logs. Pick one host/job and narrow with a line filter, e.g.{host="Jumpbox"} |= "error"(adjust to a real label that exists). - Build a dashboard. Create a new dashboard named after you (e.g.
<yourname> — lab health). Add at least three panels:- A logs panel from Loki for one host/job.
- A time-series metric panel (e.g. CPU or memory of a lab host from InfluxDB).
- A stat or count panel using a LogQL metric query such as
sum(count_over_time({job=~".+"} |= "error" [5m])).
- Add a dashboard variable. Add a
$host(or$job) template variable so the panels can switch targets from a dropdown instead of hard-coding one host. - Create an alert rule. In Alerting → Alert rules, build a rule on a query of your choice (e.g. error-line count above a small threshold, or a host metric over a limit). Set an evaluation interval, a
forduration of a few minutes, a clear annotation message, and route it to a contact point. - Make it fire, then recover. Trigger the condition (for a log-based alert, generate some matching log lines on a host you control; for a metric, load it briefly). Watch the rule go Normal → Pending → Firing, confirm the notification, then let it return to Normal.
Deliverable
A short markdown note with: a screenshot of your dashboard (showing the $host/$job dropdown), the LogQL/metric queries used in each panel, the alert rule definition (query, threshold, for, contact point), and a screenshot or log showing the alert transition from Pending → Firing → Resolved.
Acceptance criteria — you're done when:
- You can name the lab Grafana data sources and confirm Loki points at 10.100.100.5.
- A dashboard named after you exists with at least three panels (a Loki logs panel, a metric time-series panel, and a LogQL count/stat panel).
- The dashboard has a working
$hostor$jobtemplate variable that changes the panels. - An alert rule exists with a query, a threshold/condition, an evaluation interval, a
forduration, and an annotation message. - The alert is routed to a contact point.
- You captured the alert transitioning Normal → Pending → Firing and back to Normal/Resolved.
- The deliverable note includes every query used and the screenshots above.
Hints
- Don't hardcode a host that may not exist — run
{job=~".+"}in Explore first to discover real labels, then build from what's actually there. - A LogQL count query (
count_over_time/rate) is the easiest thing to alert on without needing app metrics. - Keep the
forduration short (e.g.1m) while testing so you don't wait long, then raise it once it works. - "Alert on symptoms, not causes" — but for this exercise any reproducible condition is fine; the goal is to learn the mechanics.
- If the alert never leaves Pending, your condition probably isn't actually true long enough — re-check the query in Explore.
blocked for >~30 min after re-reading the lessons? Bring what you've tried to your mentor.
No comments to display
No comments to display