10 KiB
| title |
|---|
| Observability Implementation Plan |
Observability Implementation Plan
Date: 2026-06-15
Targets: HIGH item #1 (Tracing 1/5), with synergistic wins on #5 (I/O Validation) and #9 (Evaluation Metrics)
Source analysis: agent-loop-2026-gap-analysis.md — Appendix A
Status: Phases 0–2 implemented on branch feat/observability-tracing (gated behind BMAD_TRACE=1). Phase 3 (--json-schema enforcement) and Phase 4 (OTLP bridge) deferred to separate PRs. Chain-level trace id (decision #5) still deferred.
Locked decisions: (1) ambient vars · (2) jq live renderer, capture tee placed upstream of jq so renderer failure is cosmetic-only · (3) separate PRs · (4) hard-fail on jq when BMAD_TRACE=1 · (5) chain-level trace id deferred.
Added beyond original plan — intra-phase heartbeat. Per request, start_phase_heartbeat/emit_heartbeat/stop_phase_heartbeat append a liveness beat to epic-<id>-trace.live.jsonl every BMAD_TRACE_HEARTBEAT_INTERVAL seconds (default 10) while a phase runs — so a kill -9 mid-phase leaves a forensic trail (which phase, elapsed, context size, last assistant text). Reads only the latest assistant event (bounded I/O, no slurp); self-terminates within one interval if the main process dies; defensively stopped in cleanup().
Implementation note: all new tracing logic lives in scripts/epic-execute-lib/observability.sh (new module). epic-execute.sh changes are limited to: sourcing the module, a startup dep-check + init_observability, the run_claude_to_file rewrite, a record_span call inside that helper, rollup_telemetry in cleanup(), one temp-file cleanup line, and 9 one-line set_span_context calls. Verified against the live CLI: real span lands with correct model/tokens/cost/ctx%, clean-text compatibility holds (downstream IMPLEMENTATION COMPLETE detection still works), crash path writes a degraded error span, legacy path (BMAD_TRACE unset) is byte-for-byte unchanged, and the hard-fail exits 1 when jq is absent.
Goal
Stop discarding the telemetry the claude CLI already emits on every call. Capture it as OTel-shaped trace spans, roll it up into deterministic (non-fabricated) metrics, and — as a side effect of the same code-path change — replace fragile stdout-scraping with clean .result parsing.
Verified CLI facts (claude 2.1.177)
All confirmed against the installed CLI on 2026-06-15:
--output-format jsonreturns one result object with:session_id,total_cost_usd,usage.{input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens},modelUsage[model].{costUSD,inputTokens,outputTokens,contextWindow},duration_ms,duration_api_ms,ttft_ms,num_turns,stop_reason,is_error,api_error_status,permission_denials.--output-format stream-jsonrequires--verbose(hard error otherwise). Emits JSONL:system/init→rate_limit_event→assistant(s)/tool_use/tool_result→result(always the last line).- The final
resultline carries.result= the clean final assistant text, plus the full telemetry envelope from (1). It is a single line, sotail -n 1 | jqextracts it cheaply regardless of phase length — the memory-safe "read the tail" design is preserved. - A stderr warning (
no stdin data received in 3s…) leaks into2>&1; suppressed with< /dev/null. modelUsageattributes cost across models — internal sub-agents (e.g. a Haiku helper inside an Opus phase) show up separately. Record verbatim; do not flatten.uuidgen,jq,yqall present on the dev machine.
Design decision: stream-json + .result extraction (compatibility-preserving)
run_claude_to_file is the single chokepoint (epic-execute.sh:518-529); all 9 phase call sites follow run_claude_to_file …; result=$(read_phase_tail). The downstream parsers (extract_json_result, check_phase_completion) expect PHASE_OUTPUT_FILE to hold the rendered assistant text.
Plan: switch the invocation to stream-json --verbose, write the raw JSONL to a separate stream file, then post-process:
- Extract
.result(clean text) from the finalresultline → write toPHASE_OUTPUT_FILE. All 9 call sites keep working unchanged, and parsing improves (no interleaved tool output → fixes the 9-mismark incident, gap #5). - Extract telemetry from the same line →
record_span(new) appends a JSONL span. - Graceful degradation: if
jqis absent, noresultline is found, or the call crashed, fall back to today's raw-text behavior so nothing regresses.
This keeps the blast radius to one function plus two new helpers; the 9 call sites are untouched.
Work breakdown (phased)
Phase 0 — Prerequisites & scaffolding (low risk)
- Add
jqto a startup capability check; decide hard-fail vs. graceful-degrade (see Open Decision 4). Mirror the existingyqcheck pattern. - Create
TRACES_DIR="$SPRINT_ARTIFACTS_DIR/traces";mkdir -palongside the other artifact dirs (epic-execute.sh:138-145). - Generate an epic-level
TRACE_ID=$(uuidgen)at startup (nearEPIC_START_SECONDS,:1283). This is the single correlating ID$$never provided.
Phase 1 — Instrument the invocation (core, additive)
- Rewrite
run_claude_to_fileto usestream-json --verbose, raw JSONL →$PHASE_STREAM_FILE, live-render assistant text to the log (Open Decision 2), then derivePHASE_OUTPUT_FILEfrom.result. - Add
record_span <phase> <story_id>— readstail -n 1 "$PHASE_STREAM_FILE", validates.type=="result", appends one OTel-shaped span to$TRACES_DIR/epic-<id>-trace.jsonl. - Pass
phase+story_idintorun_claude_to_file(currently positional$1/-f); thread through the 9 call sites or set module-levelCURRENT_PHASE/CURRENT_STORY_IDbefore each call (Open Decision 1 — signature vs. ambient vars). - Compute
ctx_util_pct= (input + cache_read + cache_creation) /contextWindowandlog_warnwhen > 80% (feeds Context Strategy, gap #7).
Span schema (one line per phase):
{"trace_id":"…","span_id":"<session_id>","parent":"<story_id>","name":"dev",
"story_id":"4-3","model":"claude-opus-4-8[1m]","input_tokens":2705,
"output_tokens":4,"cache_read":19080,"cost_usd":0.0237,"duration_ms":1089,
"ttft_ms":1012,"num_turns":1,"is_error":false,"api_error_status":null,
"ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…Z"}
Phase 2 — Deterministic rollup (replaces fabrication)
- Add
rollup_telemetry— sum the JSONL spans into atelemetry:block inmetrics.yaml(total_cost_usd,total_input_tokens,total_output_tokens,cache_read_tokens,by_phase). Call fromfinalize_metrics(:765). - Add derived metrics (gap #9): Task Completion Rate, Escalation Rate (
max_retries_hit/stories+ realis_errorrate), Tool Call Success Rate (is_error=false÷ phases). - Chain: delete the fabricated
Estimated Token Usage/Cost Estimatessections (epic-chain-execution-report.mdgenerator,epic-chain.sh:831-876); have the report read measuredtelemetry:from each epic's metrics. Also instrument the chain's own report-generatorclaudecall (epic-chain.sh:884).
Phase 3 — Contract enforcement (synergistic, higher risk — gate separately)
- Use
--json-schemato make the completionstatusfield structurally mandatory, removing the fuzzy-regex fallback (json-output.sh:311-387). Sequence after Phases 1–2 prove stable, since it changes completion semantics.
Phase 4 — OTel bridge (optional, later)
- Post-processor: JSONL spans → OTLP exporter to a collector. No hot-path change. Defer until someone needs a dashboard.
Test strategy
- Unit (bats or shell asserts): feed a captured
resultJSONL fixture intorecord_span/rollup_telemetry; assert span fields and YAML sums. Capture one real fixture now (we already have sample envelopes) and commit undertest/fixtures/. - Degradation: run the new path with
jqshadowed out ofPATH; assert it falls back to raw text and does not crash. - Crash path: feed a truncated stream (no
resultline); assert graceful fallback + a span markedis_error/incomplete. - End-to-end:
--dry-runwon't call claude, so add a tiny real one-story epic to a scratch fixture and confirm a populatedtrace.jsonl+telemetry:block. - Regression: confirm all 9 phases still parse completion correctly (the
.resulttext is a superset of what they saw before).
Rollout / safety
- Gate Phase 1 behind an env flag (e.g.
BMAD_TRACE=1) for the first iteration so the old text path stays default until the new one is proven, then flip the default. - Phases 1–2 are observe-only (no behavior change beyond cleaner parsing). Phase 3 changes control flow — ship it on its own PR per the repo's "<800 lines, split larger changes" rule.
Open decisions (need sign-off before coding)
- Threading phase/story into the invocation — change
run_claude_to_file's signature, or set ambientCURRENT_PHASE/CURRENT_STORY_IDbefore each call? Ambient is a smaller diff (no call-site edits) but more implicit. Recommendation: ambient vars set in the phase functions. - Live log format — tee raw JSONL to the log (ugly for humans tailing it) vs. pipe through a small
jqrenderer that prints assistant text live. Recommendation: jq renderer for the terminal/log, raw JSONL to the stream file. - Bundle scope — observability-only (Phases 0–2), or include the contract enforcement (Phase 3) in the same effort since it touches the same path? Recommendation: separate PRs; land 0–2 first.
jqdependency — hard prerequisite (fail fast at startup) or graceful degradation? Telemetry is useless without it, but failing hard is a sharper UX change. Recommendation: graceful-degrade with a loud one-time warning, consistent with the currentyqhandling.- Trace ID propagation to chain — should
epic-chain.shmint a chain-level trace ID that each epic inherits as a parent span, giving one trace across the whole chain? Recommendation: yes, but defer to Phase 2.
Effort estimate
| Phase | Scope | Risk | Rough size |
|---|---|---|---|
| 0 | dirs, trace id, jq check | low | ~30 lines |
| 1 | invocation + record_span | medium | ~120 lines (1 fn rewrite + 1 new) |
| 2 | rollup + derived metrics + chain report | low–med | ~150 lines |
| 3 | --json-schema enforcement | medium–high | separate PR |
| 4 | OTLP bridge | low (isolated) | deferred |
Phases 0–2 are the core observability win and are mostly additive plumbing.