Observability (OpenTelemetry)
Category: Quality Attributes · Areas: api, backend, infra
Description
Category: observability
Areas: api, backend, infra
Components
- Standard: OpenTelemetry (traces, metrics, logs)
- Traces: Distributed tracing with correlation IDs
- Metrics: RED metrics (Rate, Errors, Duration) for all services
- Logs: Structured JSON logging with trace context
Constraints
- All HTTP/gRPC endpoints must emit latency and error metrics
- All cross-service calls must propagate trace context
- Logs must be structured JSON with correlation IDs
- No console.log/print for operational logging
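
The logging constraints above can be sketched with Python's standard logging module. This is a minimal illustration, not a prescribed implementation: the JsonFormatter class, the build_logger helper, the field names, and the "orders" logger name are all assumptions made for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": self.formatTime(record, "%Y-%m-%dT%H:%M:%S"),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Correlation fields arrive via `extra=`; they are None if absent.
            "trace_id": getattr(record, "trace_id", None),
            "span_id": getattr(record, "span_id", None),
        }
        if record.exc_info:
            # Attach the stack trace when the record carries an exception.
            payload["stack_trace"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(stream) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (hypothetical helper)."""
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger
```

A call such as log.info("order created", extra={"trace_id": ..., "span_id": ...}) then yields one JSON object per line that carries the correlation IDs alongside the message.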
When to use
Any project with backend services, APIs, or distributed systems. Essential for production debugging, performance monitoring, and incident response.
ADR References
Practices by activity
Agents working in any of these activities inherit the practices below via the bead’s context digest.
Requirements (Frame activity)
- All services must define SLOs for availability and latency
- Incident response requires structured logs and traces
Design
- Use OpenTelemetry SDK for instrumentation (not vendor-specific SDKs)
- All cross-service calls propagate W3C Trace Context headers
- Define span naming conventions per service type
- Metric names follow OpenTelemetry semantic conventions
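
To make the W3C Trace Context bullet concrete, here is a simplified sketch of reading and forwarding the traceparent header using only the standard library. In practice the OpenTelemetry SDK's propagators handle this; the SpanContext type and the parse_traceparent and child_traceparent functions below are hypothetical names, and the parser only accepts the four-field layout defined for version 00.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Parse a traceparent header; return None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per the specification.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: SpanContext) -> str:
    """Build the header for an outgoing call: same trace, new span ID."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{span_id}-{flags}"
```

The key property is that the trace ID stays constant across every hop while each service contributes its own span ID.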
Implementation
- Structured JSON logging — no unstructured text logs in production
- Every HTTP handler records: request ID, trace ID, duration, status code
- Every database query records: duration, table, operation type
- Every external API call records: duration, endpoint, status code
- Error logs include stack trace and request context
- Use log levels consistently: ERROR (actionable), WARN (degraded), INFO (business events), DEBUG (development only)
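
One way to capture duration and status for every handler is a decorator that feeds RED-style counters. The sketch below uses in-memory dictionaries purely for illustration; a real service would record through the OpenTelemetry metrics API instead. RedMetrics, METRICS, and instrumented are hypothetical names, and the example assumes handlers return an integer status code and that only 5xx responses count as errors.

```python
import time
from collections import defaultdict
from functools import wraps
from typing import Callable, Dict, List


class RedMetrics:
    """In-memory Rate / Errors / Duration store, keyed by route."""

    def __init__(self) -> None:
        self.requests: Dict[str, int] = defaultdict(int)
        self.errors: Dict[str, int] = defaultdict(int)
        self.durations: Dict[str, List[float]] = defaultdict(list)

    def record(self, route: str, status: int, seconds: float) -> None:
        self.requests[route] += 1
        if status >= 500:  # assumption: only server errors count as errors
            self.errors[route] += 1
        self.durations[route].append(seconds)


METRICS = RedMetrics()


def instrumented(route: str):
    """Wrap a handler so each call records its duration and status code."""

    def decorator(handler: Callable[..., int]) -> Callable[..., int]:
        @wraps(handler)
        def wrapper(*args, **kwargs) -> int:
            start = time.perf_counter()
            status = 500  # reported if the handler raises
            try:
                status = handler(*args, **kwargs)
                return status
            finally:
                METRICS.record(route, status, time.perf_counter() - start)

        return wrapper

    return decorator
```

Recording in a finally block means a handler that raises still contributes to the request count, the error count, and the duration series.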
Testing
- Verify trace propagation in integration tests
- Verify structured log format in unit tests
- Load test with tracing enabled to validate overhead < 5%
- Alert on missing trace context in production logs
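
The 5% overhead budget can be checked with simple arithmetic on load-test results. The sketch below assumes overhead is measured as the relative increase in median request latency between a run without tracing and a run with it; the source does not say which quantity the budget applies to, so that choice, the tracing_overhead function, and its parameter names are illustrative.

```python
from statistics import median
from typing import Sequence


def tracing_overhead(baseline_ms: Sequence[float], traced_ms: Sequence[float]) -> float:
    """Return the relative increase in median latency caused by tracing."""
    base = median(baseline_ms)
    if base <= 0:
        raise ValueError("baseline median latency must be positive")
    return (median(traced_ms) - base) / base
```

A load-test gate could then fail the run when tracing_overhead(baseline, traced) is 0.05 or higher.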
Deployment
- Configure the OpenTelemetry Collector as a sidecar or DaemonSet
- Export to the project’s observability backend (Grafana, Datadog, etc.)
- Set sampling rate appropriate to traffic volume
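
To show what a sampling rate means in practice, here is a sketch of a deterministic, ratio-based decision derived from the trace ID, so every service that sees the same trace reaches the same verdict. It is written in the spirit of ratio-based samplers and is not a copy of any SDK's implementation; should_sample is a hypothetical function, and production code would normally configure the SDK's built-in sampler instead.

```python
def should_sample(trace_id_hex: str, ratio: float) -> bool:
    """Decide whether to keep a trace, given a sampling ratio in [0.0, 1.0]."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # Use the low 64 bits of the trace ID as a uniformly distributed value.
    low64 = int(trace_id_hex, 16) & ((1 << 64) - 1)
    # Keep the trace if that value falls below the ratio's share of the range.
    return low64 < int(ratio * (1 << 64))
```

A high-traffic service might run with a ratio such as 0.01, while a low-volume service can often afford 1.0.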