Lesson: Logs, Metrics & Traces

What you'll learn

Why observability matters when systems break at 3 a.m.
The three pillars — logs, metrics, and traces — and what each one is good at.
The difference between monitoring (watching known things) and observability (asking new questions).
What MTTR means and why faster answers save real money and stress.
The mindset shift from guessing to knowing.

By the end you'll be able to explain, in plain language, what observability is and which pillar to reach for when something goes wrong.

The lesson

1. The 3 a.m. problem

Imagine you deploy an app to the lab Kubernetes cluster. Hours later a user says "the site is slow." You log in. Where do you even look? Is the database overloaded? Is one pod crashing and restarting? Is the network dropping packets? Without good signals coming out of your system, you are guessing — clicking around, restarting things, hoping. That is stressful, slow, and risky.

Observability is the practice of instrumenting your systems so they emit enough signal that you can answer questions about their internal state from the outside, without shipping new code or attaching a debugger. The term comes from control theory: a system is "observable" if you can infer what's happening inside from its outputs.

The opposite of observability is the blind reboot. The goal is to move from guessing ("maybe it's the database?") to knowing ("the checkout pod's p99 latency jumped to 4s at 02:51, and its logs show connection timeouts to Postgres").

2. The three pillars

Observability rests on three kinds of telemetry (telemetry = data a system emits about itself). Each answers a different question.

            THE THREE PILLARS OF OBSERVABILITY

  LOGS                 METRICS                TRACES
  "what happened"      "how much / how many"  "where did time go"
  ----------------     ------------------     ------------------
  discrete events      numbers over time      one request's path
  with timestamps      (counters, gauges)     across services

  "ERROR: db timeout   cpu = 87%              order-api   12ms
   at 02:51:03"        requests = 1240/s      payment   3800ms  <-- slow!
                       errors = 5%            email        8ms
       |                    |                       |
       v                    v                       v
     Loki                 Mimir                  Tempo
   (10.100.100.5)       (metrics)              (traces)
                \           |           /
                 \          |          /
                  v         v         v
                     GRAFANA (10.100.100.4)
                   one place to see it all

Logs are timestamped records of discrete events: "user logged in", "payment failed", "config reloaded". They are rich and human-readable, great for the detail of a single moment. The lab ships every host's logs to a central Loki server at 10.100.100.5 (365-day retention).
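
To make "timestamped record of a discrete event" concrete, here is a minimal sketch using only Python's standard library. It prints one JSON object per event. The logger name `checkout`, the field names, and the message are made up for illustration; the lab's real log format and the way lines reach Loki are not shown here.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            # When the event happened, in UTC, ISO 8601.
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(event)


# Hypothetical service name, used only for this example.
logger = logging.getLogger("checkout")
handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(JsonFormatter())
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.error("db timeout after %d ms", 3000)
```

Structured lines like this are easier to filter later than free-form text, because every event carries the same named fields.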

Metrics are numeric measurements sampled over time: CPU usage, request rate, error count, queue depth. They are cheap to store and perfect for spotting trends and thresholds ("errors above 5% for 5 minutes"). The lab stores metrics in InfluxDB today; the wider stack also uses Mimir for large-scale metrics.
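
The threshold example above ("errors above 5% for 5 minutes") can be sketched in a few lines of Python. This is an illustration of the idea only, not how the lab's alerting is configured: the `MinuteSample` type, the `should_alert` function, and all the numbers are invented, and it assumes one sample per minute.

```python
from dataclasses import dataclass


@dataclass
class MinuteSample:
    """Request and error counts observed during one minute (made-up data)."""
    requests: int
    errors: int


def error_ratio(sample: MinuteSample) -> float:
    """Fraction of requests that failed; 0.0 when there was no traffic."""
    if sample.requests == 0:
        return 0.0
    return sample.errors / sample.requests


def should_alert(samples: list[MinuteSample], threshold: float = 0.05, minutes: int = 5) -> bool:
    """True if the error ratio stayed above `threshold` for the last `minutes` samples."""
    if len(samples) < minutes:
        return False
    recent = samples[-minutes:]
    return all(error_ratio(s) > threshold for s in recent)


# Six minutes of hypothetical traffic: one healthy minute, then five bad ones.
history = [
    MinuteSample(1200, 12),
    MinuteSample(1240, 80),
    MinuteSample(1190, 90),
    MinuteSample(1210, 75),
    MinuteSample(1250, 100),
    MinuteSample(1230, 95),
]
print(should_alert(history))
```

Requiring the condition to hold for several minutes in a row is what keeps a single bad minute from paging anyone.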

Traces follow a single request as it hops between services. A trace is made of spans, where each span is one unit of work (one service call) with a start time and duration. Traces answer "where did the time go in this slow request?" The backend for traces is Tempo.
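
A trace can be pictured as a list of spans. The sketch below is a deliberately simplified model, reusing the numbers from the diagram: the `Span` class is hypothetical, and real tracing systems also record things like a trace ID and parent/child links, which are left out here.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One unit of work inside a trace."""
    service: str
    start_ms: int      # offset from the start of the trace
    duration_ms: int


# One hypothetical request, assuming the three calls run one after another.
trace = [
    Span("order-api", start_ms=0, duration_ms=12),
    Span("payment", start_ms=12, duration_ms=3800),
    Span("email", start_ms=3812, duration_ms=8),
]

# "Where did the time go?" is answered by the span with the longest duration.
slowest = max(trace, key=lambda s: s.duration_ms)
total_ms = max(s.start_ms + s.duration_ms for s in trace) - min(s.start_ms for s in trace)
print(f"{slowest.service} took {slowest.duration_ms} of {total_ms} ms")
```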

3. Logs vs metrics vs traces — when to use which

A simple rule of thumb:

Metrics tell you THAT something is wrong. Your alert fires: error rate is up.
Traces tell you WHERE it is wrong. The trace shows the payment service span took 3.8 seconds.
Logs tell you WHY it is wrong. The payment pod's log says connection refused to postgres:5432.

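You usually work through the pillars in that order: alert (metric) → trace → log. Each step narrows the search: the metric points at the whole system, the trace at one service, the log at one specific error. This is sometimes called "the observability workflow."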

4. Monitoring vs observability

These words get mixed up. The difference is about known vs unknown questions.

Monitoring watches a fixed set of things you already decided to care about: "is CPU above 90%?", "is the disk full?", "is the service responding to health checks?". You build a dashboard and alerts ahead of time. Monitoring is excellent for known failure modes.

Observability lets you ask new questions you didn't predict, after the fact, by exploring the raw telemetry. "Show me only the requests from mobile clients in Germany that hit the v2 API and errored." You never built a dashboard for that — but because the data is there and is high-cardinality (cardinality = the number of distinct label values, e.g. many user IDs or regions), you can slice it on the fly.
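
Here is a small sketch of what "slice it on the fly" means. It assumes each request was recorded as an event with a few fields; the field names and the four events are invented, and in the lab you would run a query against a telemetry backend instead of filtering a Python list.

```python
# Made-up request events; every field name here is hypothetical.
events = [
    {"client": "mobile", "country": "DE", "api": "v2", "status": 500},
    {"client": "mobile", "country": "DE", "api": "v2", "status": 200},
    {"client": "web", "country": "DE", "api": "v2", "status": 500},
    {"client": "mobile", "country": "FR", "api": "v1", "status": 500},
]


def select(events, **wanted):
    """Return the events whose fields match every keyword argument."""
    return [e for e in events if all(e.get(k) == v for k, v in wanted.items())]


# A question nobody planned for: mobile clients in Germany on the v2 API that errored.
matching = select(events, client="mobile", country="DE", api="v2")
failed = [e for e in matching if e["status"] >= 500]
print(len(failed))
```

The point is that `select` accepts any combination of fields. Nothing had to be decided in advance, which is only possible because each event kept its detailed fields.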

Monitoring is best treated as one part of observability rather than a competitor to it. You need both: monitoring catches the predictable, observability handles the surprises.

5. MTTR — why speed matters

MTTR stands for Mean Time To Recovery (sometimes "...to Resolve"): the average time from when an incident starts to when it's fixed. It breaks down roughly into:

incident starts
   |--- time to DETECT ---|--- time to DIAGNOSE ---|--- time to FIX ---|
                                                                    resolved

Good observability shrinks the detect part (alerts fire fast) and especially the diagnose part (you find the cause in minutes, not hours). Diagnosis is often the largest share of an outage: people staring at screens trying to understand what's happening. Cutting that is usually the biggest win, and it's exactly what the three pillars working together give you.
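
As a worked example of the arithmetic, the sketch below computes MTTR and the average length of each phase for three incidents. The timestamps are invented purely for illustration (they are not lab data), and the code assumes every incident starts and ends on the same day.

```python
from datetime import datetime, timedelta

# (started, detected, diagnosed, resolved) as HH:MM on the same day.
# All three incidents are invented for this example.
incidents = [
    ("02:51", "02:56", "03:40", "03:50"),
    ("14:00", "14:02", "14:30", "14:45"),
    ("09:10", "09:20", "10:20", "10:26"),
]


def minutes_between(start: str, end: str) -> float:
    """Minutes from `start` to `end`, both given as HH:MM on the same day."""
    fmt = "%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)) / timedelta(minutes=1)


n = len(incidents)
detect = sum(minutes_between(s, d) for s, d, _, _ in incidents) / n
diagnose = sum(minutes_between(d, g) for _, d, g, _ in incidents) / n
fix = sum(minutes_between(g, r) for _, _, g, r in incidents) / n
# MTTR is the mean of start-to-resolved, i.e. the three phases added together.
mttr = sum(minutes_between(s, r) for s, _, _, r in incidents) / n

print(f"MTTR {mttr:.0f} min = detect {detect:.1f} + diagnose {diagnose:.1f} + fix {fix:.1f}")
```

Splitting the number into phases is more useful than the single average, because it shows which phase to work on first.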

Lower MTTR means happier users, less revenue lost, and far less stress on the on-call engineer (which will be you, one day).

6. The cost of cardinality and noise

A quick caution so you build good habits early. Metrics with very high cardinality (a label per user, per request ID) can blow up storage and slow queries — that kind of detail belongs in logs and traces, not metric labels. And alerts that fire constantly get ignored ("alert fatigue"). Good observability is not "collect everything and alert on everything" — it's collecting the right signals and alerting only on things a human must act on.
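
To see why a per-user label is expensive, it helps to count series. In the worst case, one metric produces a separate time series for every combination of label values, so the count is the product of the distinct values per label. The label names and counts below are assumptions chosen for illustration.

```python
from math import prod


def series_count(label_values: dict[str, int]) -> int:
    """Worst-case number of time series for one metric.

    `label_values` maps each label name to how many distinct values it has.
    Assumes every combination of values actually occurs.
    """
    return prod(label_values.values())


# Hypothetical label sets for a single request-count metric.
low = series_count({"service": 20, "status_code": 5, "region": 3})
high = series_count({"service": 20, "status_code": 5, "user_id": 100_000})

print(low)   # 300
print(high)  # 10000000
```

Swapping one small label for a per-user label takes the same metric from hundreds of series to millions, which is why that detail is better kept in logs and traces.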

7. How it fits together in the lab

Everything funnels into Grafana at 10.100.100.4, which is your single pane of glass. Grafana itself stores no telemetry — it queries data sources: InfluxDB/Mimir for metrics, Loki at 10.100.100.5 for logs, Tempo for traces. The agent that collects and ships all of this is Grafana Alloy (covered in a later lesson). You'll spend the rest of this module learning each piece, then wire them all together in the capstone.

Dig deeper

Search terms

three pillars of observability logs metrics traces
monitoring vs observability explained
what is MTTR mean time to recovery
high cardinality metrics problem
grafana LGTM stack overview
observability workflow alert trace log

Check yourself

Name the three pillars of observability and the one question each is best at answering.
In what order do you typically use the pillars when diagnosing an incident, and why?
What is the core difference between monitoring and observability?
What does MTTR stand for, and which part of it does good observability shrink the most?
Which lab component is the "single pane of glass," and what is its IP address?