
Observability

Logs, traces, metrics, events.

Observability is the property of a system that makes it possible to ask questions about its current state without having to deploy new code. The corpus pattern: observability is built with the feature, not after. By the time the gate is reached, the system is already legible.

The four signal types

| Signal | What it answers | Cost | Read by |
| --- | --- | --- | --- |
| Logs | What happened in this specific request? | High in volume; easy to author | Developer, on-call |
| Traces | How did this request flow across services? | Moderate; depends on sampling | Developer, on-call |
| Metrics | What is happening across many requests? | Low per event; needs cardinality discipline | On-call, Tech Lead |
| Product analytics events | What did the named person do? | Low; needs naming discipline | PO, data |

Each signal has a different consumer and a different cost profile. The corpus uses all four, not as overlap but as complement.

Logs

Structured. Always. JSON format with consistent fields.

```json
{
  "ts": "2026-05-22T08:53:14.029Z",
  "service": "grading-api",
  "level": "info",
  "request_id": "req_a4f2c1",
  "user_id": "usr_2103",
  "endpoint": "GET /submissions/1234",
  "duration_ms": 187,
  "event": "submission.opened",
  "subject_id": "sub_1234",
  "domain_terms": ["submission", "grader"]
}
```

Log fields are picked at design time, not at incident time. The fields appear in the brief as part of the observability section.

The corpus pattern: never log PII. The grading flow logs user_id, not name. The privacy ility (Volume III Part 8) constrains the log shape.
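One way to make the no-PII rule mechanical rather than aspirational is to enforce it at the logger. A minimal sketch, assuming a hand-rolled helper (the field denylist and the `log` function are illustrative, not part of any real service here):

```python
import json
import sys
import time

# Assumed denylist, decided at design time alongside the log fields.
PII_FIELDS = {"name", "email", "address", "phone"}

def log(service, level, event, **fields):
    """Emit one structured JSON log line; reject PII field names outright."""
    leaked = PII_FIELDS & fields.keys()
    if leaked:
        raise ValueError(f"PII fields not allowed in logs: {sorted(leaked)}")
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "service": service,
        "level": level,
        "event": event,
        **fields,
    }
    sys.stdout.write(json.dumps(record) + "\n")
    return record

# user_id is fine; name would raise before anything is written.
log("grading-api", "info", "submission.opened",
    request_id="req_a4f2c1", user_id="usr_2103", subject_id="sub_1234")
```

The design choice is that a PII leak fails loudly in development instead of silently reaching the log store.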

Traces

Each request gets a trace ID. Spans nest within the trace. Sampling is intelligent — every error trace is captured; healthy traces are sampled at low rate.

Traces show the request's path. Did this request hit the LMS adapter? Yes. Did the LMS adapter respond in time? 124ms. Did the response normalise correctly? Yes.

A team that uses traces solves more incidents in less time. A team that doesn't reads logs sequentially and reconstructs the path mentally — slower, more error-prone.
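The sampling rule above can be stated in a few lines. This is an illustrative sketch, not the API of any particular tracing library; the 1% rate is an assumption:

```python
import random

HEALTHY_SAMPLE_RATE = 0.01  # assumed: keep 1% of healthy traces

def keep_trace(has_error: bool, rng=random.random) -> bool:
    """Tail-sampling decision: every error trace kept, healthy traces sampled."""
    if has_error:
        return True  # every error trace is captured
    return rng() < HEALTHY_SAMPLE_RATE  # healthy traces sampled at a low rate

assert keep_trace(True)  # an error trace is always kept
```

The `rng` parameter exists only so the decision is testable with a fixed value.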

Metrics

Counters and gauges and histograms. Aggregated. Cheap per data point.

The corpus's standard set, per service:

```text
http_request_duration_ms{endpoint, method, status}      histogram
http_requests_total{endpoint, method, status}           counter
http_requests_in_flight{endpoint}                       gauge

job_duration_ms{queue, type, outcome}                   histogram
job_queue_depth{queue}                                  gauge

db_query_duration_ms{query_class}                       histogram
db_connections_active{}                                 gauge

flag_evaluations_total{flag, outcome}                   counter
flag_evaluation_duration_ms{flag}                       histogram
```

Plus the Epic-specific metrics named in the brief.

Cardinality is managed: labels are bounded. A label per username is forbidden; it explodes cardinality. A label per endpoint is fine.
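Bounded labels can be enforced the same way PII-free logs are: at the emission point. A sketch, assuming a hand-rolled counter rather than a real metrics client (`BoundedCounter` and its allowed set are hypothetical):

```python
from collections import Counter

# Declared at design time, like the label sets in the standard list above.
ALLOWED_LABELS = {"endpoint", "method", "status"}

class BoundedCounter:
    """A counter that rejects any label name outside its declared set."""

    def __init__(self, name, allowed=ALLOWED_LABELS):
        self.name = name
        self.allowed = allowed
        self.values = Counter()

    def inc(self, **labels):
        unknown = labels.keys() - self.allowed
        if unknown:
            raise ValueError(f"unbounded label(s) rejected: {sorted(unknown)}")
        self.values[tuple(sorted(labels.items()))] += 1

requests = BoundedCounter("http_requests_total")
requests.inc(endpoint="/submissions/{id}", method="GET", status="200")
# requests.inc(username="gal")  would raise: username is unbounded
```

Note the endpoint value is the route template, not the concrete URL; that is what keeps the endpoint label itself bounded.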

Product analytics events

The named-action signals. The brief names them; the implementation emits them.

```text
Brief: 'When Gal opens a submission, we want to know.'

Event:           submission.opened
Properties:      submission_id, grader_id, ts, duration_to_open_ms

Brief: 'When Gal saves a grade, we want to know.'

Event:           submission.graded
Properties:      submission_id, grader_id, ts, score_count, total_time_ms
```

Events use domain language. They follow subject.verb format. They are versioned conservatively — a property name doesn't change without a migration story.
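The subject.verb convention is cheap to check at emit time. A hedged sketch: the regex is an assumption about how strict the convention is, and `emit_event` is illustrative, not a real analytics SDK call. The property names come from the brief above:

```python
import re

# Assumed shape of the convention: lowercase subject, dot, lowercase verb.
EVENT_NAME = re.compile(r"^[a-z_]+\.[a-z_]+$")

def emit_event(name: str, **properties) -> dict:
    """Validate the subject.verb name, then return the event payload."""
    if not EVENT_NAME.fullmatch(name):
        raise ValueError(f"event name must be subject.verb: {name!r}")
    return {"event": name, "properties": properties}

evt = emit_event("submission.graded",
                 submission_id="sub_1234", grader_id="usr_2103",
                 score_count=4, total_time_ms=612_000)
```

A name like `SubmissionGraded` fails validation, which is the point: the naming discipline is enforced, not just documented.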

These events feed Volume V's signal reading. The prediction 'Gal completes a grading cycle in under 15 minutes' is checked against submission.graded.total_time_ms. The instrumentation is in place by release-gate time.

Alerts

Alerts are derived from metrics, not logs. They fire when an SLO threshold is crossed.

```yaml
- alert: GradingApiHighErrorRate
  expr: >
    sum(rate(http_requests_total{service="grading-api",status=~"5.."}[5m]))
    / sum(rate(http_requests_total{service="grading-api"}[5m])) > 0.01
  for: 5m
  labels:
    severity: P1
  annotations:
    runbook: https://runbooks.200apps.example/grading-flow-high-error-rate
    message: "grading-api error rate above 1% for 5 minutes"
```

Each alert has a runbook link. An alert without a runbook is an alert that produces panic, not action.
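The arithmetic behind the alert expression is simple enough to state directly. An illustrative sketch only: the rate function is approximated as counts over a 5-minute window, and `should_fire` is a hypothetical helper, not the alerting engine:

```python
def error_rate(count_5xx: int, count_total: int) -> float:
    """Fraction of requests in a window that returned a 5xx status."""
    return count_5xx / count_total if count_total else 0.0

def should_fire(windows, threshold=0.01):
    """Fires only if every window in the 'for: 5m' period breaches the threshold."""
    return all(error_rate(e, t) > threshold for e, t in windows)

# 1.2% then 1.5%: a sustained breach above 1%, so the alert fires.
should_fire([(12, 1000), (15, 1000)])
```

The `for: 5m` clause is what turns a transient spike into a non-event: a single bad scrape does not page anyone.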

Dashboards

Two kinds.

  • Service dashboard — for the on-call. Latency, error rate, saturation. Read at every sync.
  • Feature dashboard — for the PO. Adoption, completion, error encounter rate, prediction-relevant metrics. Read at every signal reading.

The dashboards are versioned in code (or whatever the platform supports). They live next to the service. Changes to dashboards go through review like code.

What gets instrumented

The corpus rule: instrument what the brief named. If the prediction is checked against time-to-grade, the instrumentation that captures time-to-grade is part of the cycle. It is not deferred.

Instrumentation that isn't needed for the prediction or for operations isn't added. The corpus is opinionated against premature observability — too many metrics make the right ones harder to find.

The signal feeds Volume V

The whole observability stack exists to make Volume V's check possible. The check date arrives. The PO opens the feature dashboard. The metric is there. The check is straightforward.

A team that arrives at the check date and discovers the metric isn't instrumented has discovered, late, that the chain skipped a step. The corpus's discipline: instrument with the feature, not after.

What this volume produces, in one sentence

Volume IV carries the prediction from a signed brief through code, test, and release — with the language preserved, the trunk integrated, the gate held, the flag wrapped, the rollback rehearsed, and the instrumentation in place by the time the flag flips.


200apps · How We Work · NWIRE