
Observability

Logs, traces, metrics, events.

Observability is the property of a system that makes it possible to ask questions about its current state without having to deploy new code. The corpus pattern: observability is built with the feature, not after. By the time the gate is reached, the system is already legible.

The four signal types

| Signal | What it answers | Cost | Read by |
| --- | --- | --- | --- |
| Logs | What happened in this specific request? | High in volume; easy to author | Developer, on-call |
| Traces | How did this request flow across services? | Moderate; depends on sampling | Developer, on-call |
| Metrics | What is happening across many requests? | Low per event; needs cardinality discipline | On-call, Tech Lead |
| Product analytics events | What did the named person do? | Low; needs naming discipline | PO, data |

Each signal has a different consumer and a different cost profile. The corpus uses all four, not as overlap but as complement.

Logs

Structured. Always. JSON format with consistent fields.

```json
{
  "ts": "2026-05-22T08:53:14.029Z",
  "service": "grading-api",
  "level": "info",
  "request_id": "req_a4f2c1",
  "user_id": "usr_2103",
  "endpoint": "GET /submissions/1234",
  "duration_ms": 187,
  "event": "submission.opened",
  "subject_id": "sub_1234",
  "domain_terms": ["submission", "grader"]
}
```

Log fields are picked at design time, not at incident time. The fields appear in the brief as part of the observability section.

The corpus pattern: never log PII. The grading flow logs user_id, not name. The privacy ility (Volume III Part 8) constrains the log shape.
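One way to make the no-PII rule mechanical rather than aspirational is to enforce it at the logger. A minimal sketch, assuming a hand-rolled helper (the field denylist and the `log` function are illustrative, not part of any real service here):

```python
import json
import sys
import time

# Assumed denylist, decided at design time alongside the log fields.
PII_FIELDS = {"name", "email", "address", "phone"}

def log(service, level, event, **fields):
    """Emit one structured JSON log line; reject PII field names outright."""
    leaked = PII_FIELDS & fields.keys()
    if leaked:
        raise ValueError(f"PII fields not allowed in logs: {sorted(leaked)}")
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": service,
        "level": level,
        "event": event,
        **fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")
    return record

# user_id is fine; name would raise before anything is written.
log("grading-api", "info", "submission.opened",
    request_id="req_a4f2c1", user_id="usr_2103", subject_id="sub_1234")
```

The design choice is that a PII leak fails loudly in development instead of silently reaching the log store.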

Traces

Each request gets a trace ID. Spans nest within the trace. Sampling is intelligent — every error trace is captured; healthy traces are sampled at low rate.

Traces show the request's path. Did this request hit the LMS adapter? Yes. Did the LMS adapter respond in time? 124ms. Did the response normalise correctly? Yes.

A team that uses traces solves more incidents in less time. A team that doesn't reads logs sequentially and reconstructs the path mentally — slower, more error-prone.
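The sampling rule above can be stated in a few lines. This is an illustrative sketch, not the API of any particular tracing library; the 1% rate is an assumption:

```python
import random

HEALTHY_SAMPLE_RATE = 0.01  # assumed: keep 1% of healthy traces

def keep_trace(has_error: bool, rng=random.random) -> bool:
    """Tail-sampling decision: every error trace kept, healthy traces sampled."""
    if has_error:
        return True  # every error trace is captured
    return rng() < HEALTHY_SAMPLE_RATE  # healthy traces sampled at a low rate

assert keep_trace(True)  # an error trace is always kept
```

The `rng` parameter exists only so the decision is testable with a fixed value.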

Metrics

Counters and gauges and histograms. Aggregated. Cheap per data point.

The corpus's standard set, per service:

```text
http_request_duration_ms{endpoint, method, status}      histogram
http_requests_total{endpoint, method, status}           counter
http_requests_in_flight{endpoint}                       gauge

job_duration_ms{queue, type, outcome}                   histogram
job_queue_depth{queue}                                  gauge

db_query_duration_ms{query_class}                       histogram
db_connections_active{}                                 gauge

flag_evaluations_total{flag, outcome}                   counter
flag_evaluation_duration_ms{flag}                       histogram
```

Plus the Epic-specific metrics named in the brief.

Cardinality is managed: labels are bounded. A label per username is forbidden; it explodes cardinality. A label per endpoint is fine.
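Bounded labels can be enforced the same way PII-free logs are: at the emission point. A sketch, assuming a hand-rolled counter rather than a real metrics client (`BoundedCounter` and its allowed set are hypothetical):

```python
from collections import Counter

# Declared at design time, like the label sets in the standard list above.
ALLOWED_LABELS = {"endpoint", "method", "status"}

class BoundedCounter:
    """A counter that rejects any label name outside its declared set."""

    def __init__(self, name, allowed=ALLOWED_LABELS):
        self.name = name
        self.allowed = allowed
        self.values = Counter()

    def inc(self, **labels):
        unknown = labels.keys() - self.allowed
        if unknown:
            raise ValueError(f"unbounded label(s) rejected: {sorted(unknown)}")
        self.values[tuple(sorted(labels.items()))] += 1

requests = BoundedCounter("http_requests_total")
requests.inc(endpoint="/submissions/{id}", method="GET", status="200")
# requests.inc(username="gal")  would raise: username is unbounded
```

Note the endpoint value is the route template, not the concrete URL; that is what keeps the endpoint label itself bounded.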

Product analytics events

The named-action signals. The brief names them; the implementation emits them.

```text
Brief: 'When Gal opens a submission, we want to know.'

Event:           submission.opened
Properties:      submission_id, grader_id, ts, duration_to_open_ms

Brief: 'When Gal saves a grade, we want to know.'

Event:           submission.graded
Properties:      submission_id, grader_id, ts, score_count, total_time_ms
```

Events use domain language. They follow subject.verb format. They are versioned conservatively — a property name doesn't change without a migration story.
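The subject.verb convention is cheap to check at emit time. A hedged sketch: the regex is an assumption about how strict the convention is, and `emit_event` is illustrative, not a real analytics SDK call. The property names come from the brief above:

```python
import re

# Assumed shape of the convention: lowercase subject, dot, lowercase verb.
EVENT_NAME = re.compile(r"^[a-z_]+\.[a-z_]+$")

def emit_event(name: str, **properties) -> dict:
    """Validate the subject.verb name, then return the event payload."""
    if not EVENT_NAME.fullmatch(name):
        raise ValueError(f"event name must be subject.verb: {name!r}")
    return {"event": name, "properties": properties}

evt = emit_event("submission.graded",
                 submission_id="sub_1234", grader_id="usr_2103",
                 score_count=4, total_time_ms=612_000)
```

A name like `SubmissionGraded` fails validation, which is the point: the naming discipline is enforced, not just documented.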

These events feed Volume V's signal reading. The prediction 'Gal completes a grading cycle in under 15 minutes' is checked against submission.graded.total_time_ms. The instrumentation is in place by release-gate time.

Alerts

Alerts are derived from metrics, not logs. They fire when an SLO threshold is crossed.

```yaml
- alert: GradingApiHighErrorRate
  expr: >
    sum(rate(http_requests_total{service="grading-api",status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="grading-api"}[5m])) > 0.01
  for: 5m
  labels:
    severity: P1
  annotations:
    runbook: https://runbooks.200apps.example/grading-flow-high-error-rate
    message: "grading-api error rate above 1% for 5 minutes"
```

Each alert has a runbook link. An alert without a runbook is an alert that produces panic, not action.
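The arithmetic behind the alert expression is simple enough to state directly. An illustrative sketch only: the rate function is approximated as counts over a 5-minute window, and `should_fire` is a hypothetical helper, not the alerting engine:

```python
def error_rate(count_5xx: int, count_total: int) -> float:
    """Fraction of requests in a window that returned a 5xx status."""
    return count_5xx / count_total if count_total else 0.0

def should_fire(windows, threshold=0.01):
    """Fires only if every window in the 'for: 5m' period breaches the threshold."""
    return all(error_rate(e, t) > threshold for e, t in windows)

# 1.2% then 1.5%: a sustained breach above 1%, so the alert fires.
should_fire([(12, 1000), (15, 1000)])
```

The `for: 5m` clause is what turns a transient spike into a non-event: a single bad scrape does not page anyone.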

Dashboards

Two kinds.

  • Service dashboard — for the on-call. Latency, error rate, saturation. Read at every sync.
  • Feature dashboard — for the PO. Adoption, completion, error encounter rate, prediction-relevant metrics. Read at every signal reading.

The dashboards are versioned in code (or whatever the platform supports). They live next to the service. Changes to dashboards go through review like code.

What gets instrumented

The corpus rule: instrument what the brief named. If the prediction is checked against time-to-grade, the instrumentation that captures time-to-grade is part of the cycle. It is not deferred.

Instrumentation that isn't needed for the prediction or for operations isn't added. The corpus is opinionated against premature observability — too many metrics make the right ones harder to find.

The signal feeds Volume V

The whole observability stack exists to make Volume V's check possible. The check date arrives. The PO opens the feature dashboard. The metric is there. The check is straightforward.

A team that arrives at the check date and discovers the metric isn't instrumented has discovered, late, that the chain skipped a step. The corpus's discipline: instrument with the feature, not after.

What this volume produces, in one sentence

Volume IV carries the prediction from a signed brief through code, test, and release — with the language preserved, the trunk integrated, the gate held, the flag wrapped, the rollback rehearsed, and the instrumentation in place by the time the flag flips.


200apps · How We Work · NWIRE