Pipeline & Operations · master area
Observability
Logs. Traces. Metrics. Events. Built with the feature, not after. The system is legible by gate time — so on-call doesn't write the observability story under pressure at 02:00.
Owners: Tech Lead, DevOps, Developer
Phase it lives in: How We Build (Volume IV)
The corpus principle this enacts: The artefacts the chain produces survive the conversation.
Where it lives in the chain
- How We Build · Testing · System signals — where the SLIs that observability emits are designed
The four telemetry kinds
| Kind | What it answers | Cost shape |
|---|---|---|
| Metrics | "How often, how fast, how much?" | Cheap. Aggregated. Fixed cardinality. |
| Logs | "What exactly happened in this case?" | Medium. High volume. Searchable. |
| Traces | "How did this request travel through the system?" | Expensive but invaluable for distributed systems. Sampled. |
| Events | "What domain action happened?" | Cheap. Named in domain language. Feeds product analytics. |
How to do this
- Domain-named, not framework-named. An `ExamGraded` event, not `EntityUpdated`. The on-call greps for the domain word and finds the moment.
- Correlation IDs everywhere. Same `request_id` across services, logs, traces. Without it, three services' logs are three stories with no shared chapter.
- Structured logs. JSON, not free-form prose. Searchable, filterable, aggregable.
- High-cardinality fields kept out of metrics. `user_id` belongs in logs and traces, not as a Prometheus label.
- Sensitive fields scrubbed at the SDK level. Not at the logging service. A PII leak in logs is a breach you may not detect for months.
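
Several of the practices above (structured JSON logs, a shared `request_id`, and scrubbing before the record leaves the process) can be sketched with Python's standard `logging` module. This is a minimal illustration under stated assumptions, not a prescribed implementation: `request_id_var`, `SENSITIVE_KEYS`, `scrub`, `JsonFormatter`, `build_logger`, the `fields` key passed through `extra`, and the logger name `grading` are all hypothetical names chosen for the example.

```python
import contextvars
import json
import logging
import sys

# Hypothetical: holds the correlation ID for the request being handled.
request_id_var = contextvars.ContextVar("request_id", default="-")

# Assumed list of PII field names; a real list would come from the team's data policy.
SENSITIVE_KEYS = {"email", "password", "national_id"}


def scrub(fields: dict) -> dict:
    """Redact sensitive values in-process, before any log line is written."""
    return {k: ("[REDACTED]" if k in SENSITIVE_KEYS else v) for k, v in fields.items()}


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "event": record.getMessage(),        # the domain word, e.g. ExamGraded
            "request_id": request_id_var.get(),  # same ID in every service
            **scrub(getattr(record, "fields", {})),
        }
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes one JSON line per record to `stream`."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("grading")
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger(sys.stdout)
    request_id_var.set("req-123")
    # user_id is fine in a log line; it would not be used as a metric label.
    log.info("ExamGraded", extra={"fields": {"exam_id": "e-42", "user_id": "u-7", "email": "a@b.c"}})
```

Running the sketch prints one JSON line in which `email` is replaced by `[REDACTED]` while `request_id`, `exam_id`, and `user_id` remain searchable. Scrubbing inside the formatter is one way to keep the raw value from ever reaching the log pipeline.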
What good practice looks like
A new feature ships with observability built in:
- A handful of new metrics for the SLIs the feature is being measured on.
- Structured logs at the boundaries of the domain operation, with correlation IDs.
- Traces for any new cross-service call.
- Domain events named (`ExamGraded`, `GradingDeadlineMissed`) and emitted, feeding product analytics.
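
As a rough sketch of what "named and emitted" might look like, the two events mentioned above can be modelled as small immutable records whose class name is the event name. The field names (`exam_id`, `score`, `deadline`, `request_id`, `occurred_at`), the `serialize` and `emit` helpers, and the list used as a sink are assumptions for illustration; the source does not specify an event schema or transport.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone


def _now() -> str:
    """UTC timestamp in ISO 8601 form."""
    return datetime.now(timezone.utc).isoformat()


@dataclass(frozen=True)
class ExamGraded:
    exam_id: str
    score: float
    request_id: str  # ties the event back to logs and traces
    occurred_at: str = field(default_factory=_now)


@dataclass(frozen=True)
class GradingDeadlineMissed:
    exam_id: str
    deadline: str
    request_id: str
    occurred_at: str = field(default_factory=_now)


def serialize(event) -> str:
    """Wrap the event so its name is the domain word, not a framework term."""
    return json.dumps({"event": type(event).__name__, **asdict(event)}, sort_keys=True)


def emit(event, sink) -> None:
    """Hand the serialized event to a sink; here, any object with .append()."""
    sink.append(serialize(event))


if __name__ == "__main__":
    outbox = []  # stand-in for a queue or analytics pipeline
    emit(ExamGraded(exam_id="e-42", score=87.5, request_id="req-123"), outbox)
    emit(GradingDeadlineMissed(exam_id="e-43", deadline="2024-06-01T12:00:00+00:00",
                               request_id="req-124"), outbox)
    for line in outbox:
        print(line)
```

Deriving the event name from the class keeps the grep-able word and the code in step: renaming the event means renaming the type.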
When an incident arrives, on-call opens the dashboard for this feature, finds the spike, follows the correlation ID into traces, narrows to a specific user, reads the logs. Total time: minutes.
A team without observability finds the same incident by adding logging in a hotfix at 02:00, deploying the hotfix to see the next instance, and writing the postmortem about "we couldn't tell what happened the first time."
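
The minutes-long path in the first scenario depends on every service writing the same correlation ID. A minimal sketch of that lookup, assuming each service emits JSON lines carrying `request_id`, `service`, and an ISO 8601 `ts` field (all three field names, and the `trail` helper, are hypothetical), might look like this:

```python
import json


def trail(log_lines, request_id):
    """Collect every record for one request across services, ordered by timestamp."""
    records = []
    for line in log_lines:
        try:
            rec = json.loads(line)
        except json.JSONDecodeError:
            continue  # free-form lines cannot be filtered, so they are skipped
        if isinstance(rec, dict) and rec.get("request_id") == request_id:
            records.append(rec)
    # ISO 8601 timestamps in the same timezone sort correctly as strings.
    return sorted(records, key=lambda r: r.get("ts", ""))


if __name__ == "__main__":
    lines = [
        '{"ts": "2024-06-01T02:00:03Z", "service": "grading", "request_id": "req-123", "event": "ExamGraded"}',
        '{"ts": "2024-06-01T02:00:01Z", "service": "gateway", "request_id": "req-123", "event": "RequestReceived"}',
        '{"ts": "2024-06-01T02:00:02Z", "service": "gateway", "request_id": "req-999", "event": "RequestReceived"}',
        "plain text line with no structure",
    ]
    for rec in trail(lines, "req-123"):
        print(rec["ts"], rec["service"], rec["event"])
```

In practice a log platform's query language would do this filtering; the sketch only shows why a shared `request_id` and structured lines make the query possible at all.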
Related crafts
- Monitoring & Alerting — observability's consumer
- System Signal Tracking (DORA) — what observability data feeds
- Product Analytics — the event side