
Observability

Logs. Traces. Metrics. Events. Built with the feature, not after. The system is legible by gate time, so on-call doesn't have to write the observability story under pressure at 02:00.

Owners: Tech Lead, DevOps, Developer
Phase it lives in: How We Build (Volume IV)
The corpus principle this enacts: The artefacts the chain produces survive the conversation.

Where it lives in the chain

The four telemetry kinds

Kind    | What it answers                                    | Cost shape
Metrics | "How often, how fast, how much?"                   | Cheap. Aggregated. Fixed cardinality.
Logs    | "What exactly happened in this case?"              | Medium. High volume. Searchable.
Traces  | "How did this request travel through the system?"  | Expensive but invaluable for distributed systems. Sampled.
Events  | "What domain action happened?"                     | Cheap. Named in domain language. Feeds product analytics.

How to do this

  • Domain-named, not framework-named. ExamGraded event, not EntityUpdated. The on-call greps for the domain word and finds the moment.
  • Correlation IDs everywhere. Same request_id across services, logs, traces. Without it, three services' logs are three stories with no shared chapter.
  • Structured logs. JSON, not free-form prose. Searchable, filterable, aggregable.
  • High-cardinality fields kept out of metrics. user_id belongs in logs and traces, not as a Prometheus label.
  • Sensitive fields scrubbed at the SDK level. Not at the logging service. A PII leak in logs is a breach you may not detect for months.
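A minimal sketch of three of the practices above (structured JSON logs, a shared correlation ID, and scrubbing before the record leaves the process), using only the Python standard library. The names `request_id_var`, `SENSITIVE_KEYS`, `JsonFormatter` and the `fields` convention are assumptions made for illustration; a real service would more likely rely on its logging SDK's own hooks.

```python
import contextvars
import io
import json
import logging

# Hypothetical names, chosen for this sketch only.
request_id_var = contextvars.ContextVar("request_id", default="-")
SENSITIVE_KEYS = {"email", "password", "national_id"}


def scrub(fields):
    """Redact sensitive values in-process, before any handler sees them."""
    return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object that carries the correlation ID."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),
            "request_id": request_id_var.get(),
        }
        # Structured fields arrive via logging's `extra` parameter.
        payload.update(scrub(getattr(record, "fields", {})))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream):
    logger = logging.getLogger("grading")
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
request_id_var.set("req-123")  # normally set once per incoming request
log.info("ExamGraded", extra={"fields": {"exam_id": "ex-9", "email": "a@example.com"}})
line = json.loads(buffer.getvalue())
```

Because the redaction runs inside the formatter, the raw email address never reaches the stream; the log line still carries the domain event name and the `request_id` that on-call would search for.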

What good practice looks like

A new feature ships with observability built in:

  • A handful of new metrics for the SLIs the feature is being measured on.
  • Structured logs at the boundaries of the domain operation, with correlation IDs.
  • Traces for any new cross-service call.
  • Domain events named (ExamGraded, GradingDeadlineMissed) and emitted, feeding product analytics.
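As an illustration of the last item, the sketch below emits domain-named events and counts them with a metric labelled only by event name. The event fields, the `emit` helper, the metric name `domain_events_total` and the in-memory sinks are all hypothetical stand-ins for whatever analytics pipeline and metrics client a team actually uses.

```python
from collections import Counter
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExamGraded:
    exam_id: str
    user_id: str
    score: int


@dataclass(frozen=True)
class GradingDeadlineMissed:
    exam_id: str
    user_id: str


# Hypothetical in-memory sinks; real ones would be an event bus and a metrics client.
emitted_events = []
metric_counts = Counter()


def emit(event, request_id):
    """Record the full event for analytics and bump a low-cardinality counter."""
    name = type(event).__name__
    # The event payload keeps user_id: high-cardinality detail belongs here.
    emitted_events.append({"event": name, "request_id": request_id, **asdict(event)})
    # The metric is labelled by event name only, so the series count stays fixed.
    metric_counts[("domain_events_total", name)] += 1


emit(ExamGraded("ex-9", "u-1", 87), "req-123")
emit(ExamGraded("ex-9", "u-2", 64), "req-124")
emit(GradingDeadlineMissed("ex-10", "u-3"), "req-125")
```

Three events from three different users produce only two metric series, one per event type, while the per-user detail stays in the event records.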

When an incident arrives, on-call opens the dashboard for this feature, finds the spike, follows the correlation ID into traces, narrows to a specific user, reads the logs. Total time: minutes.
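The "follow the correlation ID" step can be sketched as a filter over structured log lines from several services. The sample lines, the service names, the `ts` field and the event names `GradeRequested` and `ResultSent` are invented for this example; in practice the same query would be run in the team's log search tool.

```python
import json

# Hypothetical log lines collected from three services.
raw_lines = [
    '{"service": "api", "ts": 3, "request_id": "req-123", "event": "GradeRequested"}',
    '{"service": "grader", "ts": 5, "request_id": "req-123", "event": "ExamGraded"}',
    '{"service": "api", "ts": 4, "request_id": "req-999", "event": "GradeRequested"}',
    '{"service": "notifier", "ts": 7, "request_id": "req-123", "event": "ResultSent"}',
]


def timeline(lines, request_id):
    """Return the records for one request, ordered by timestamp."""
    records = (json.loads(line) for line in lines)
    matching = [r for r in records if r.get("request_id") == request_id]
    return sorted(matching, key=lambda r: r["ts"])


story = [(r["service"], r["event"]) for r in timeline(raw_lines, "req-123")]
```

With a shared `request_id`, the three services' logs collapse into one ordered story for the request; without it, the same filter has nothing to join on.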

A team without observability finds the same incident by adding logging in a hotfix at 02:00, deploying the hotfix to see the next instance, and writing the postmortem about "we couldn't tell what happened the first time."

200apps · How We Work · NWIRE