BMAD-METHOD/docs/improvements/observability-implementatio...

119 lines
10 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: Observability Implementation Plan
---
# Observability Implementation Plan
**Date:** 2026-06-15
**Targets:** HIGH item #1 (Tracing 1/5), with synergistic wins on #5 (I/O Validation) and #9 (Evaluation Metrics)
**Source analysis:** [`agent-loop-2026-gap-analysis.md`](./agent-loop-2026-gap-analysis.md) — Appendix A
**Status:** Phases 02 implemented on branch `feat/observability-tracing` (gated behind `BMAD_TRACE=1`). Phase 3 (`--json-schema` enforcement) and Phase 4 (OTLP bridge) deferred to separate PRs. Chain-level trace id (decision #5) still deferred.
**Locked decisions:** (1) ambient vars · (2) jq live renderer, capture tee placed upstream of jq so renderer failure is cosmetic-only · (3) separate PRs · (4) hard-fail on jq when `BMAD_TRACE=1` · (5) chain-level trace id deferred.
**Added beyond original plan — intra-phase heartbeat.** Per request, `start_phase_heartbeat`/`emit_heartbeat`/`stop_phase_heartbeat` append a liveness beat to `epic-<id>-trace.live.jsonl` every `BMAD_TRACE_HEARTBEAT_INTERVAL` seconds (default 10) while a phase runs — so a `kill -9` mid-phase leaves a forensic trail (which phase, elapsed, context size, last assistant text). Reads only the latest assistant event (bounded I/O, no slurp); self-terminates within one interval if the main process dies; defensively stopped in `cleanup()`.
**Implementation note:** all new tracing logic lives in `scripts/epic-execute-lib/observability.sh` (new module). `epic-execute.sh` changes are limited to: sourcing the module, a startup dep-check + `init_observability`, the `run_claude_to_file` rewrite, a `record_span` call inside that helper, `rollup_telemetry` in `cleanup()`, one temp-file cleanup line, and 9 one-line `set_span_context` calls. Verified against the live CLI: real span lands with correct model/tokens/cost/ctx%, clean-text compatibility holds (downstream `IMPLEMENTATION COMPLETE` detection still works), crash path writes a degraded error span, legacy path (`BMAD_TRACE` unset) is byte-for-byte unchanged, and the hard-fail exits 1 when jq is absent.
---
## Goal
Stop discarding the telemetry the `claude` CLI already emits on every call. Capture it as OTel-shaped trace spans, roll it up into deterministic (non-fabricated) metrics, and — as a side effect of the same code-path change — replace fragile stdout-scraping with clean `.result` parsing.
## Verified CLI facts (claude 2.1.177)
All confirmed against the installed CLI on 2026-06-15:
1. `--output-format json` returns one result object with: `session_id`, `total_cost_usd`, `usage.{input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens}`, `modelUsage[model].{costUSD,inputTokens,outputTokens,contextWindow}`, `duration_ms`, `duration_api_ms`, `ttft_ms`, `num_turns`, `stop_reason`, `is_error`, `api_error_status`, `permission_denials`.
2. `--output-format stream-json` **requires `--verbose`** (hard error otherwise). Emits JSONL: `system/init``rate_limit_event``assistant`(s)/`tool_use`/`tool_result` → **`result`** (always the last line).
3. The final `result` line carries `.result` = the clean final assistant text, plus the full telemetry envelope from (1). It is a single line, so `tail -n 1 | jq` extracts it cheaply regardless of phase length — the memory-safe "read the tail" design is preserved.
4. A stderr warning (`no stdin data received in 3s…`) leaks into `2>&1`; suppressed with `< /dev/null`.
5. `modelUsage` attributes cost across models — internal sub-agents (e.g. a Haiku helper inside an Opus phase) show up separately. Record verbatim; do not flatten.
6. `uuidgen`, `jq`, `yq` all present on the dev machine.
## Design decision: stream-json + `.result` extraction (compatibility-preserving)
`run_claude_to_file` is the single chokepoint (`epic-execute.sh:518-529`); all 9 phase call sites follow `run_claude_to_file …; result=$(read_phase_tail)`. The downstream parsers (`extract_json_result`, `check_phase_completion`) expect `PHASE_OUTPUT_FILE` to hold the rendered assistant text.
**Plan:** switch the invocation to `stream-json --verbose`, write the raw JSONL to a separate stream file, then post-process:
- Extract `.result` (clean text) from the final `result` line → write to `PHASE_OUTPUT_FILE`. **All 9 call sites keep working unchanged**, and parsing improves (no interleaved tool output → fixes the 9-mismark incident, gap #5).
- Extract telemetry from the same line → `record_span` (new) appends a JSONL span.
- Graceful degradation: if `jq` is absent, no `result` line is found, or the call crashed, fall back to today's raw-text behavior so nothing regresses.
This keeps the blast radius to one function plus two new helpers; the 9 call sites are untouched.
---
## Work breakdown (phased)
### Phase 0 — Prerequisites & scaffolding (low risk)
- [ ] Add `jq` to a startup capability check; decide hard-fail vs. graceful-degrade (see Open Decision 4). Mirror the existing `yq` check pattern.
- [ ] Create `TRACES_DIR="$SPRINT_ARTIFACTS_DIR/traces"`; `mkdir -p` alongside the other artifact dirs (`epic-execute.sh:138-145`).
- [ ] Generate an epic-level `TRACE_ID=$(uuidgen)` at startup (near `EPIC_START_SECONDS`, `:1283`). This is the single correlating ID `$$` never provided.
### Phase 1 — Instrument the invocation (core, additive)
- [ ] Rewrite `run_claude_to_file` to use `stream-json --verbose`, raw JSONL → `$PHASE_STREAM_FILE`, live-render assistant text to the log (Open Decision 2), then derive `PHASE_OUTPUT_FILE` from `.result`.
- [ ] Add `record_span <phase> <story_id>` — reads `tail -n 1 "$PHASE_STREAM_FILE"`, validates `.type=="result"`, appends one OTel-shaped span to `$TRACES_DIR/epic-<id>-trace.jsonl`.
- [ ] Pass `phase` + `story_id` into `run_claude_to_file` (currently positional `$1`/`-f`); thread through the 9 call sites or set module-level `CURRENT_PHASE`/`CURRENT_STORY_ID` before each call (Open Decision 1 — signature vs. ambient vars).
- [ ] Compute `ctx_util_pct` = (input + cache_read + cache_creation) / `contextWindow` and `log_warn` when > 80% (feeds Context Strategy, gap #7).
Span schema (one line per phase):
```json
{"trace_id":"…","span_id":"<session_id>","parent":"<story_id>","name":"dev",
"story_id":"4-3","model":"claude-opus-4-8[1m]","input_tokens":2705,
"output_tokens":4,"cache_read":19080,"cost_usd":0.0237,"duration_ms":1089,
"ttft_ms":1012,"num_turns":1,"is_error":false,"api_error_status":null,
"ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…Z"}
```
### Phase 2 — Deterministic rollup (replaces fabrication)
- [ ] Add `rollup_telemetry` — sum the JSONL spans into a `telemetry:` block in `metrics.yaml` (`total_cost_usd`, `total_input_tokens`, `total_output_tokens`, `cache_read_tokens`, `by_phase`). Call from `finalize_metrics` (`:765`).
- [ ] Add derived metrics (gap #9): Task Completion Rate, Escalation Rate (`max_retries_hit/stories` + real `is_error` rate), Tool Call Success Rate (`is_error=false` ÷ phases).
- [ ] Chain: delete the fabricated `Estimated Token Usage` / `Cost Estimates` sections (`epic-chain-execution-report.md` generator, `epic-chain.sh:831-876`); have the report read measured `telemetry:` from each epic's metrics. Also instrument the chain's own report-generator `claude` call (`epic-chain.sh:884`).
### Phase 3 — Contract enforcement (synergistic, higher risk — gate separately)
- [ ] Use `--json-schema` to make the completion `status` field structurally mandatory, removing the fuzzy-regex fallback (`json-output.sh:311-387`). Sequence **after** Phases 12 prove stable, since it changes completion semantics.
### Phase 4 — OTel bridge (optional, later)
- [ ] Post-processor: JSONL spans → OTLP exporter to a collector. No hot-path change. Defer until someone needs a dashboard.
---
## Test strategy
- **Unit (bats or shell asserts):** feed a captured `result` JSONL fixture into `record_span`/`rollup_telemetry`; assert span fields and YAML sums. Capture one real fixture now (we already have sample envelopes) and commit under `test/fixtures/`.
- **Degradation:** run the new path with `jq` shadowed out of `PATH`; assert it falls back to raw text and does not crash.
- **Crash path:** feed a truncated stream (no `result` line); assert graceful fallback + a span marked `is_error`/incomplete.
- **End-to-end:** `--dry-run` won't call claude, so add a tiny real one-story epic to a scratch fixture and confirm a populated `trace.jsonl` + `telemetry:` block.
- **Regression:** confirm all 9 phases still parse completion correctly (the `.result` text is a superset of what they saw before).
## Rollout / safety
- Gate Phase 1 behind an env flag (e.g. `BMAD_TRACE=1`) for the first iteration so the old text path stays default until the new one is proven, then flip the default.
- Phases 12 are observe-only (no behavior change beyond cleaner parsing). Phase 3 changes control flow — ship it on its own PR per the repo's "<800 lines, split larger changes" rule.
## Open decisions (need sign-off before coding)
1. **Threading phase/story into the invocation** change `run_claude_to_file`'s signature, or set ambient `CURRENT_PHASE`/`CURRENT_STORY_ID` before each call? Ambient is a smaller diff (no call-site edits) but more implicit. *Recommendation: ambient vars set in the phase functions.*
2. **Live log format** tee raw JSONL to the log (ugly for humans tailing it) vs. pipe through a small `jq` renderer that prints assistant text live. *Recommendation: jq renderer for the terminal/log, raw JSONL to the stream file.*
3. **Bundle scope** observability-only (Phases 02), or include the contract enforcement (Phase 3) in the same effort since it touches the same path? *Recommendation: separate PRs; land 02 first.*
4. **`jq` dependency** hard prerequisite (fail fast at startup) or graceful degradation? Telemetry is useless without it, but failing hard is a sharper UX change. *Recommendation: graceful-degrade with a loud one-time warning, consistent with the current `yq` handling.*
5. **Trace ID propagation to chain** should `epic-chain.sh` mint a chain-level trace ID that each epic inherits as a parent span, giving one trace across the whole chain? *Recommendation: yes, but defer to Phase 2.*
---
## Effort estimate
| Phase | Scope | Risk | Rough size |
|-------|-------|------|-----------|
| 0 | dirs, trace id, jq check | low | ~30 lines |
| 1 | invocation + record_span | medium | ~120 lines (1 fn rewrite + 1 new) |
| 2 | rollup + derived metrics + chain report | lowmed | ~150 lines |
| 3 | --json-schema enforcement | mediumhigh | separate PR |
| 4 | OTLP bridge | low (isolated) | deferred |
Phases 02 are the core observability win and are mostly additive plumbing.
</content>
</invoke>