feat(epic-execute): add BMAD_TRACE observability + telemetry rollup

Capture the telemetry the claude CLI already emits (session id, tokens, cost, latency, context window) as OTel-shaped trace spans and roll them up into deterministic metrics. Gated behind BMAD_TRACE=1; the legacy text path is unchanged when tracing is off.

- New scripts/epic-execute-lib/observability.sh: span recording, rollup, jq dep enforcement, and an intra-phase heartbeat for crash forensics
- epic-execute.sh: stream-json capture in run_claude_to_file with clean .result extraction, per-phase set_span_context calls, rollup in cleanup
- epic-chain.sh: measured (non-fabricated) telemetry section in reports
- Guard set -e aborts on malformed stream lines so crash/timeout paths degrade gracefully instead of killing the run
- Docs: gap analysis + observability implementation plan

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-15 06:07:48 -05:00 · 2026-06-15 06:07:48 -05:00 · c525d673fb
parent 3d6aae746c
commit c525d673fb
5 changed files with 932 additions and 5 deletions
--- a/docs/improvements/agent-loop-2026-gap-analysis.md
+++ b/docs/improvements/agent-loop-2026-gap-analysis.md
@@ -0,0 +1,280 @@
 ---
 title: Agent Loop 2026 Framework — Gap Analysis
 ---
 # Agent Loop 2026 Framework — Gap Analysis
 **Date:** 2026-06-15
 **Workflows Reviewed:** epic-execute, epic-chain (+ `scripts/epic-execute-lib/`)
 **Benchmark:** "Agent Loop Evaluation Framework (2026 Standard)" — Manus AI 5-category rubric
 **Status:** Active
 ---
 ## Overview
 This document scores the epic-execute / epic-chain shell automation against the 2026 Agent Loop Evaluation Framework (context engineering, tool design, runtime governance, error recovery, observability). Each criterion is rated on the rubric's Legacy↔2026 axis (1–5) with concrete `file:line` evidence, followed by a prioritized remediation roadmap.
 The headline: the **objective machinery is already strong** — durable external state, per-phase context isolation, real-tooling gates, and deterministic state transitions are at or near 2026 standard. The gaps cluster in **governance, observability, and "prose where code belongs"** — places where a rule is told to the model instead of enforced in the harness.
 **Scope note:** This is a point-in-time review of the scripts as of the date above. Line numbers reference the current `scripts/epic-execute.sh` (3,328 lines), `scripts/epic-chain.sh` (946 lines), and `scripts/epic-execute-lib/*.sh`.
 ---
 ## Scorecard
 | # | Category | Criterion | Score (1–5) | Verdict |
 |---|----------|-----------|:-----------:|---------|
 | 1 | Context & Memory | Context Strategy | 3 | Partial — isolation yes, summarization no |
 | 1 | Context & Memory | Tool Disclosure | 5 | 2026 — lean per-phase step templates |
 | 1 | Context & Memory | Memory Persistence | 5 | 2026 — checkpoints / metrics / decision log |
 | 1 | Context & Memory | Scaffolding Type | 3 | Partial — strong gates, residual personas |
 | 2 | Tool Design | Tool Scoping | 2 | Legacy — `--dangerously-skip-permissions` + `eval` |
 | 2 | Tool Design | Input/Output Validation | 3 | Partial — schema declared, not enforced |
 | 2 | Tool Design | Deterministic Delegation | 5 | 2026 — state/math/writes all in bash |
 | 2 | Tool Design | Failure Handling | 4 | 2026-leaning — self-heal loops |
 | 3 | Governance (OWASP) | Policy Enforcement | 3 | Partial — one structural, one prompt-only |
 | 3 | Governance (OWASP) | Privilege Tiers | 2 | Legacy — one standing autonomous tier, no HITL |
 | 3 | Governance (OWASP) | Identity Management | 2 | Legacy — ambient long-lived creds |
 | 3 | Governance (OWASP) | Memory Security | 2 | Legacy — unvalidated write-back + re-ingest |
 | 4 | Error Recovery / RALF | Session Architecture | 3 | Partial — fresh process/phase, no watchdog |
 | 4 | Error Recovery / RALF | Error Diagnosis | 3 | Partial — self-heal yes, retry wrapper dead code |
 | 4 | Error Recovery / RALF | Loop Guardrails | 4 | 2026 — hard caps; missing time/token budgets |
 | 4 | Error Recovery / RALF | Exit Conditions | 4 | 2026 — deterministic gates; some advisory |
 | 5 | Observability | Tracing Infrastructure | 1 | Legacy — no trace ID, no token/cost/OTel |
 | 5 | Observability | Evaluation Metrics | 3 | Partial — raw counters, self-graded scores |
 | 5 | Observability | Evaluation Integration | 4 | 2026-leaning — embedded inline gates |
 **Category averages:** Context **4.0** · Tools **3.5** · Governance **2.25** · Error Recovery **3.5** · Observability **2.67**
 ---
 ## What's Already at the 2026 Bar
 ### Memory Persistence (5/5)
 Durable external state survives session death — the rubric's "external state files" pattern:
 - Checkpoint files with 7-day expiry — `utils.sh:500-551`, written on exit `epic-execute.sh:74-82`
 - Metrics YAML that **resumes and accumulates** counters across sessions — `epic-execute.sh:646-664`, `:779-785`
 - Per-story design plans persisted for post-resume dev phases — `epic-execute.sh:144-145`
 - Decision log + sprint-status.yaml as external workflow state — `decision-log.sh`, `epic-execute.sh:832-906`
 ### Tool Disclosure (5/5)
 Progressive disclosure: loads lean ~4–9KB step templates per phase instead of embedding ~40KB workflow YAML — `epic-execute.sh:551-563`, per-phase template selection `:179-183`.
 ### Deterministic Delegation (5/5)
 The LLM emits intent only (a status enum + findings); all counting, duration math, status writes, and pass/fail verdicts run in bash — `epic-execute.sh:693-797`, `:803-906`; `contract-exec.sh:144-170`. Playwright specs are **generated from the harness** (`contract-exec.sh:220-296`), not authored by the model.
 ### Embedded Probes (Evaluation Integration, 4/5)
 Quality gates run **inline per story**, not offline batch: arch / test-quality / traceability / static-analysis / regression / contract. The static-analysis and contract gates run **real tooling** (`tsc`, lint, build, pytest, `curl`, `playwright`) whose actual output gates progression and can fail the epic's exit code — `epic-execute.sh:1986-2307`, `:3203-3208`.
 ---
 ## Improvement Areas
 ### HIGH Priority
 #### 1. Observability foundation: no trace ID, no token/cost telemetry (Tracing 1/5)
 The weakest foundational layer — and the rubric explicitly advises fixing the lowest foundational layer first.
 There is no session/trace ID, no OpenTelemetry, and zero token/cost/latency capture. LLM calls are an opaque blob teed to a PID-named text log; the only correlation key is `$$`.
 **Evidence:**
 - Plain-text logging, no structured fields/spans — `epic-execute.sh:199-217`
 - LLM call uninstrumented (no tokens/cost/latency/model captured) — `run_claude_to_file`, `epic-execute.sh:518-529`
 - Metrics schema has no tokens/cost/trace ID — `epic-execute.sh:668-688`
 - Cost figures in reports are openly **estimated** ("may vary 50–200%") — `epic-chain-execution-report.md:225-248`
 **Fix:** Generate a session/trace ID at startup and thread it through every phase + the metrics/log/decision-log writes. Capture real usage by invoking `claude` with `--output-format json` (or `--output-format stream-json`) and parsing the `usage` / `total_cost_usd` fields per phase. This unlocks every downstream metric.
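The parsing half of this fix can be sketched as follows. This is a minimal illustration, assuming `jq` is installed and that the envelope carries the `session_id`, `total_cost_usd`, and `usage.*` fields named above. `envelope_field` is a hypothetical helper, and the sample envelope is a hand-written stand-in rather than captured CLI output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print one field of a result envelope.
# $1 = envelope JSON, $2 = jq path (for example .usage.input_tokens)
# Returns non-zero when jq is missing or the field is absent/null, so callers
# can skip telemetry instead of crashing.
envelope_field() {
  command -v jq >/dev/null 2>&1 || return 1
  printf '%s\n' "$1" | jq -er "$2"
}

# Stand-in envelope with illustrative values (assumption: a real run would
# capture this from the CLI's JSON output).
sample='{"session_id":"abc-123","total_cost_usd":0.0586,"usage":{"input_tokens":2629,"output_tokens":4}}'

if session=$(envelope_field "$sample" '.session_id') &&
   cost=$(envelope_field "$sample" '.total_cost_usd') &&
   in_tok=$(envelope_field "$sample" '.usage.input_tokens') &&
   out_tok=$(envelope_field "$sample" '.usage.output_tokens'); then
  echo "session=$session cost_usd=$cost in=$in_tok out=$out_tok"
else
  echo "telemetry unavailable (jq missing or field absent)" >&2
fi
```

Reading each field separately keeps the helper trivial to test; a production version would more likely extract everything in one `jq` call.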
 ---
 #### 2. Resilience layer is fully built but never wired in (Error Recovery)
 `execute_claude_with_retry`, `run_with_timeout`, and `CLAUDE_TIMEOUT` (600s default) all exist in `utils.sh:55-150` — but have **zero callers**. Every phase uses bare `run_claude_to_file` (`epic-execute.sh:518-529`) with `|| true`, so:
 - A hung phase blocks forever (`CLAUDE_TIMEOUT` never applied to the real path)
 - Transient errors (429/timeout/503) are not retried
 - Crashes are swallowed and silently judged "incomplete"
 There is also no watchdog/supervisor and no stuck-loop progress detection — fix loops burn all attempts even on an identical failure set, despite computing failure signatures in `test-failure-filter.sh:139-151`.
 **Fix:** Route `run_claude_to_file` through the existing `run_with_timeout` + retry wrapper (mostly plumbing — the code already exists). Add a progress check that compares failure signatures across fix iterations and aborts early when the set is unchanged.
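The progress check could look roughly like the sketch below. It does not reproduce the signature logic in `test-failure-filter.sh`; `failure_signature`, `attempt_failures`, and `MAX_FIX_ATTEMPTS` are hypothetical names, and the failing-test lists are simulated.

```bash
#!/usr/bin/env bash
# Hypothetical helper: reduce a list of failing test names (one per line on
# stdin) to an order-independent signature.
failure_signature() {
  sort -u | cksum | cut -d' ' -f1
}

# Stand-in for "run the tests after fix attempt N and list what still fails".
attempt_failures() {
  case "$1" in
    1) printf 'test_login\ntest_cart\n' ;;
    2) printf 'test_cart\ntest_login\n' ;;   # same set, different order
    *) printf 'test_cart\n' ;;
  esac
}

MAX_FIX_ATTEMPTS=5   # assumed cap; the real script has its own limit
prev_sig=""
attempt=1
result="exhausted"
while [ "$attempt" -le "$MAX_FIX_ATTEMPTS" ]; do
  sig=$(attempt_failures "$attempt" | failure_signature)
  if [ -n "$prev_sig" ] && [ "$sig" = "$prev_sig" ]; then
    result="stuck"   # identical failure set twice in a row: stop burning attempts
    break
  fi
  prev_sig=$sig
  attempt=$((attempt + 1))
done
echo "result=$result after attempt $attempt"
```

Sorting before hashing is what makes the comparison robust to test-runner ordering, which is why attempt 2 is treated as no progress here.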
 ---
 #### 3. Governance is structurally bypassable (Privilege 2/5, Policy partial)
 The strong git policies — `check_sensitive_files` (`epic-execute.sh:351-391`), `git add -u` (`:2937`), `check_branch_protection` (`utils.sh:444-474`) — only guard the **final `commit_story` path**. Mid-phase, every agent runs `--dangerously-skip-permissions` (`:524`, `:527`) and can `git add -A` / commit anything itself. The "don't use git add -A" rule is injected as **prose in 7+ places** (`:597-598`, `:1772`, `:1875`, …) but enforced in code only at commit time. There is no read/write/destructive privilege separation and no HITL gate ("AUTOMATED… do NOT pause for user confirmation" — `:593-595`).
 **Fix:** Move the staging policy into a git **pre-commit hook** (structural, not prose) so it governs agent-authored commits too; re-run `check_sensitive_files` on any commit. Introduce an approval tier (even a coarse env-gated one) for destructive operations.
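A structural version of the staging policy might be sketched like this. The pattern list is purely illustrative and is not the list used by `check_sensitive_files`; `reject_sensitive_paths` is a hypothetical name.

```bash
#!/usr/bin/env bash
# Hypothetical pre-commit check: read staged paths on stdin, report any that
# match a sensitive pattern, and return non-zero if at least one matched.
reject_sensitive_paths() {
  blocked=0
  while IFS= read -r path; do
    case "$path" in
      .env|*/.env|.env.*|*/.env.*|*.pem|*.key|*id_rsa*)
        echo "blocked: $path" >&2
        blocked=1
        ;;
    esac
  done
  return "$blocked"
}

# In a real hook (.git/hooks/pre-commit) the input would come from git:
#   git diff --cached --name-only | reject_sensitive_paths || exit 1
# Here the staged list is simulated.
if printf '%s\n' 'src/app.ts' 'config/.env' | reject_sensitive_paths; then
  echo "commit allowed"
else
  echo "commit rejected"
fi
```

Because the hook runs on every commit, it also covers commits an agent authors mid-phase, which prose instructions cannot guarantee.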
 ---
 #### 4. Memory poisoning loop is unguarded (Memory Security 2/5)
 `append_to_decision_log` writes raw agent output straight to disk (`decision-log.sh:54-76`), and `get_decision_log_context` re-injects the whole log **verbatim** into the next phase's prompt (`decision-log.sh:80-87`) — the exact RAG/memory-poisoning loop OWASP 2026 (ASI06) warns against, with no validation, segmentation, or provenance tagging. `add_metrics_issue` / `record_fix_attempt` also interpolate agent-influenced strings directly into `yq` expressions (`epic-execute.sh:725`, `:745`) — a YAML/expression-injection surface.
 **Fix:** Validate and length-bound decision-log entries before commit; tag provenance (which phase/story produced each entry); sanitize or parameterize strings before they enter `yq` expressions.
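The bounding and provenance-tagging part could be sketched as below. `sanitize_log_entry` and `MAX_ENTRY_CHARS` are hypothetical, and the 200-character limit is an arbitrary assumption; this does not reflect the current `decision-log.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: turn raw agent text into one bounded, provenance-tagged
# line. $1 = phase, $2 = story id, $3 = raw entry.
MAX_ENTRY_CHARS=200   # assumed limit

sanitize_log_entry() {
  # Flatten newlines/tabs to spaces, drop remaining control characters,
  # then truncate to the limit.
  clean=$(printf '%s' "$3" | tr '\n\t' '  ' | tr -d '\000-\037\177' | cut -c1-"$MAX_ENTRY_CHARS")
  printf '[phase=%s story=%s] %s\n' "$1" "$2" "$clean"
}

raw=$(printf 'line one\nIGNORE PREVIOUS\tinstructions')
sanitize_log_entry dev 4-3 "$raw"
```

For the `yq` side, passing the string through an environment variable (mikefarah `yq` v4 provides `strenv(VAR)`) rather than interpolating it into the expression is one way to close the injection surface; treat that as a suggestion to verify against the installed `yq` version.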
 ---
 #### 5. Output contracts declared but not enforced (I/O Validation 3/5)
 JSON result schemas are prescribed to the model and parsed with `jq`, but malformed/missing output **silently falls back to grepping prose** (`json-output.sh:311-387`, `check_phase_completion_fuzzy` in `utils.sh`). The documented consequence: **9 stories were mis-marked failed** because the model didn't emit the exact `IMPLEMENTATION COMPLETE:` phrase, requiring manual correction (`epic-chain-execution-report.md:254-272`).
 **Fix:** Reject non-conforming output and force a bounded retry instead of degrading to regex. Make the JSON result block mandatory (fail the phase if absent). Relatedly, promote the advisory gates (arch / test-quality / traceability / regression — currently "proceed with documented concerns") to blocking where the risk warrants it, matching the deterministic behavior of the static-analysis and contract gates.
 ---
 ### MEDIUM Priority
 #### 6. Tool Scoping & sandboxing (Tool Scoping 2/5)
 `claude --dangerously-skip-permissions` grants the full unrestricted toolset with no per-phase allowlist or sandbox (`epic-execute.sh:524`, `:527`). Harness commands run via raw `eval` on YAML-derived strings — an injection surface — in `contract-exec.sh:43,53,86,168` and `contract-harness.sh:333,351,368`. The production-scope datastore guard is advisory (`log_warn`), not a block (`contract-harness.sh:194-213`).
 **Fix:** Run harness commands as argv arrays (no `eval`). Consider a per-phase tool allowlist and/or containerized execution. Promote the production-scope datastore guard from warning to hard block.
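A minimal sketch of running a YAML-derived command string as argv instead of through `eval`. `run_argv` is a hypothetical helper, and it deliberately has no quote handling, so arguments containing spaces are not supported in this form.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split a command string on whitespace and execute it as
# an argument vector. Shell metacharacters (;, |, $(...)) are never
# interpreted; they stay literal arguments.
run_argv() {
  set -f          # disable globbing while the string is split
  set -- $1       # word-split into positional parameters (the argv array)
  set +f
  [ "$#" -gt 0 ] || return 127
  "$@"
}

# With eval this string would run two commands; here the second one is just
# more arguments to the first.
run_argv 'echo ok; echo injected'
```

In bash-only code the same idea is usually written with `read -r -a argv <<< "$cmd"` followed by `"${argv[@]}"`; the positional-parameter form above behaves the same way and also runs under plain `sh`.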
 #### 7. Context summarization (Context Strategy 3/5)
 Per-phase context isolation is excellent (fresh `claude` process per phase, paths-not-contents handoff), but there is no **anchored iterative summarization**: cross-phase carryover is raw grep/sed-extracted text or tail-truncation (decision context truncated to 20KB at `epic-execute.sh:1466`), and the only control is a hard 150KB cap (`MAX_PROMPT_SIZE`, `:398`), not a utilization band.
 **Fix:** Add a summarization step between phases — hold an anchor block (story + ACs + constraints) constant while condensing completed-phase outcomes into a structured summary; target 60–80% utilization rather than a hard truncate.
 #### 8. Identity Management (2/5)
 Every `claude` call inherits the operator's ambient, long-lived credentials; harness secrets are consumed as ambient env vars (`contract-harness.sh:205`, `:254`). No unique per-task identity, short-lived tokens, or credential scoping.
 **Fix:** Where feasible, issue short-lived/scoped credentials per run; vault harness secrets rather than relying on ambient env.
 #### 9. Evaluation Metrics — derive rates, separate the judge (Metrics 3/5)
 The raw inputs exist (completed/failed/skipped, fix attempts, `max_retries_hit`) but are never computed into **Task Completion Rate / Escalation Rate / Tool Call Success Rate**. Rubric scores (test-quality ≥70, traceability P0=100%) are **self-graded by the same executing model** (`json-output.sh:473-496`) rather than by an independent calibrated judge.
 **Fix:** Compute and persist the derived rates in the metrics YAML. Introduce a separate, cheaper judge model (e.g., Haiku) for binary rubric scoring so the executor isn't grading its own work.
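The derived rates are plain arithmetic on counters that already exist. A sketch with made-up counter values (a real run would read them from the metrics YAML); `rate_pct` is a hypothetical helper:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print numerator/denominator as a percentage with one
# decimal place, or "n/a" when the denominator is zero.
rate_pct() {
  awk -v n="$1" -v d="$2" 'BEGIN { if (d == 0) print "n/a"; else printf "%.1f\n", 100 * n / d }'
}

# Illustrative counters only.
completed=18; failed=1; skipped=1; max_retries_hit=2
total=$((completed + failed + skipped))

task_completion_rate=$(rate_pct "$completed" "$total")
escalation_rate=$(rate_pct "$max_retries_hit" "$total")
echo "task_completion_rate=${task_completion_rate}% escalation_rate=${escalation_rate}%"
```

With these sample counters the completion rate is 18/20 and the escalation rate is 2/20.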
 ---
 ### LOW Priority
 #### 10. `yq`-dependent durability
 Metrics, sprint-status, and issue persistence silently degrade without `yq` installed (`epic-execute.sh:707`, `:790`). The otherwise-excellent memory layer is best-effort, not guaranteed.
 **Fix:** Either declare `yq` a hard prerequisite (fail fast at startup) or harden the sed/awk fallbacks to full parity.
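The fail-fast option could be as small as the sketch below. `require_cmds` is a hypothetical name, not an existing function in `utils.sh`.

```bash
#!/usr/bin/env bash
# Hypothetical startup check: report every missing prerequisite in one message,
# then fail once.
require_cmds() {
  missing=""
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || missing="$missing $cmd"
  done
  if [ -n "$missing" ]; then
    echo "missing required tools:$missing" >&2
    return 1
  fi
}

# At the top of the script this would read:
#   require_cmds yq || exit 1
# Demonstrated here with a tool that is always present:
require_cmds sh && echo "prerequisites satisfied"
```

Collecting all missing tools before failing saves the operator from fixing them one run at a time.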
 #### 11. Vestigial "full workflow YAML" priority tier
 `CONTENT_PRIORITY_LOW` is still described as "Full workflow YAML (truncate first)" (`epic-execute.sh:404`), a legacy fallback path no active builder uses. Remove to avoid confusion.
 #### 12. Gate status not persisted by standalone runs
 `validation.gate_status` is written by the **chain** wrapper (`epic-chain.sh:626-638`), not the inner execute loop, so a standalone `epic-execute.sh` run leaves `gate_status: PENDING`.
 ---
 ## Two Structural Themes
 1. **Prose where code belongs.** The recurring pattern — git rules, "do NOT pause", personas like *"You ARE an adversarial reviewer"* (`:1616`) — is the compensatory scaffolding the rubric flags: a rule told to the model that the harness could instead enforce. The codebase is mid-migration; the objective gates (contracts, real tooling, JSON results) are already constitutive, but the soft rules haven't caught up.
 2. **Built-but-unwired.** The retry/timeout resilience layer (`utils.sh:55-150`) is the clearest example — fully implemented, zero callers. The capability gap is often plumbing, not net-new code.
 ---
 ## Suggested Sequencing
 Per the rubric's "fix the lowest-scoring foundational layer first":
 1. **Observability** (#1) — trace/session ID + real token/cost/latency from `claude --output-format json`. Foundation for everything; currently 1/5.
 2. **Wire the existing retry/timeout layer** (#2) — pure plumbing, already-written code, large RALF payoff.
 3. **Governance** (#3, #4) — pre-commit hook + sensitive-file re-check on agent commits; validate decision-log/metrics writes before commit.
 4. **Enforce JSON contracts** (#5) — fail-and-retry on missing signal instead of fuzzy fallback; promote advisory gates to blocking.
 5. **Context summarization** (#7) — anchored iterative summarization targeting a utilization band.
 The smallest, highest-leverage starting points are **#2 (retry wiring)** and **#3 (pre-commit hook)**.
 ---
 ## References
 - Benchmark source: "Agent Loop Evaluation Framework (2026 Standard)," Manus AI — context engineering, tool design, OWASP Top 10 for Agentic Apps 2026, RALF loop, OpenTelemetry-first observability.
 - Prior review: [`epic-workflows-v1.md`](./epic-workflows-v1.md) (2026-01-02) — overlaps on the `--dangerously-skip-permissions` finding (#1/#3, #6 here).
 ---
 ## Appendix A — Observability Deep Dive
 **Added:** 2026-06-15 · Expands HIGH-priority item #1 and Evaluation Metrics (#9).
 ### A.1 Root cause: a single discard point
 Every LLM call routes through `run_claude_to_file` (`epic-execute.sh:518-529`), which uses the CLI's **default `text` output format**:
 ```bash
 claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
 ```
 Only rendered assistant text survives. The chain report generator does the same (`epic-chain.sh:884`). The 1/5 Tracing score is the consequence of this one choice — not an architectural limit. The telemetry is produced on every call and thrown away.
 ### A.2 What `claude --output-format json` already returns (verified)
 Tested against the installed CLI (v2.1.177). The result envelope contains every field 2026 observability requires:
 | Field (verified present) | Example | Rubric need it satisfies |
 |---|---|---|
 | `session_id` | `f6ff5b55-…` | Trace/session ID (today: PID `$$`) |
 | `total_cost_usd` | `0.0586` | Real cost (today: fabricated) |
 | `usage.input_tokens` / `output_tokens` | 2629 / 4 | Token spend |
 | `usage.cache_read_input_tokens` / `cache_creation_input_tokens` | 15362 / 3718 | Cache efficiency |
 | `modelUsage[model].costUSD` + per-model tokens | Opus + Haiku sub-agent | Per-model cost attribution |
 | `modelUsage[model].contextWindow` | 1000000 | Enables context-utilization % |
 | `duration_ms` / `duration_api_ms` / `ttft_ms` | 1757 / 2522 / 1754 | Per-call latency |
 | `num_turns`, `stop_reason`, `is_error`, `api_error_status`, `permission_denials` | 1 / end_turn / false / null / [] | Tool-call success / error telemetry |
 The CLI also exposes `--output-format stream-json` (live JSONL ending with the same `result` envelope) and `--json-schema <schema>` for structured-output enforcement. All three require `--print`, which the script already passes.
 ### A.3 The current report is actively misleading, not merely empty
 Because real telemetry is discarded, the chain report **fabricates** it (`epic-chain-execution-report.md:225-248`):
 - Token table derived from `Est. Calls = stories × 2` and `~16K input/call` assumptions — arithmetic on story counts, not measurement.
 - Cost table priced against **Claude Sonnet 3.5 ($3/$15)** and **Opus ($15/$75)** — neither is the model that ran (`claude-opus-4-8[1m]`); the real `total_cost_usd` was available and discarded.
 - Carries the disclaimer *"Actual usage may vary by 50-200%."*
 An authoritative-looking cost table that is invented is worse than a blank cell — it is unfalsifiable noise where ground truth was one flag away.
 ### A.4 Synergy: this fix lifts three other findings
 The same envelope partially closes gaps scored elsewhere:
 1. **Context Strategy (#7).** `contextWindow` + `input_tokens + cache_read + cache_creation` yields exact per-phase utilization, making the 60–80% target measurable and enforceable for free.
 2. **I/O Validation (#5) + the 9-mismark incident.** Parsing `.result` (clean final message) instead of scraping interleaved stdout, plus `--json-schema` to make the status field structurally mandatory, removes the fuzzy-regex fallback (`json-output.sh:311-387`) that mismarked 9 stories (`epic-chain-execution-report.md:254-272`). That incident is fundamentally an output-format problem.
 3. **Evaluation Metrics (#9).** Enables the rubric's business metrics: Task Completion Rate (`completed/total`), Escalation Rate (`max_retries_hit/stories` + real `is_error` rate), Tool Call Success Rate (`is_error=false` phases ÷ total).
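The utilization figure in item 1 is simple arithmetic on the envelope fields. A sketch using the token counts from the A.2 table; `ctx_util_pct` is a hypothetical helper and the 80% ceiling is taken from the target band discussed in this document:

```bash
#!/usr/bin/env bash
# Hypothetical helper: context utilization as a percentage.
# $1 = input_tokens, $2 = cache_read, $3 = cache_creation, $4 = contextWindow
ctx_util_pct() {
  awk -v i="$1" -v r="$2" -v c="$3" -v w="$4" \
    'BEGIN { if (w <= 0) exit 1; printf "%.2f\n", 100 * (i + r + c) / w }'
}

pct=$(ctx_util_pct 2629 15362 3718 1000000)
echo "ctx_util_pct=$pct"

# Flag calls above the 80% ceiling.
if awk -v p="$pct" 'BEGIN { exit !(p > 80) }'; then
  echo "warn: context utilization above 80%"
fi
```

All three input-side token counts are summed because cached tokens still occupy the context window.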
 ### A.5 Target design (fits the existing architecture)
 Constraints: preserve the memory-safe "pipe to file, read 32KB tail" pattern, and keep the live `tee` to the log.
 1. **Switch to `stream-json`, not plain `json`.** Plain `json` buffers the whole response, which kills the live tee. `--output-format stream-json --include-partial-messages` (which also needs `--verbose`, per the verified CLI facts in the implementation plan) streams live *and* makes the **last JSON line** the `result` envelope; `read_phase_tail` still captures it (parse the last line where `.type=="result"`). Memory-safety preserved.
 2. **One append-only trace file per epic**, using the OTel span data model (convertible to OTLP later):
   `docs/sprint-artifacts/traces/epic-<id>-trace.jsonl` — one span per phase:
   ```json
   {"trace_id":"<epic-uuid>","span_id":"<claude session_id>","parent":"<story_id>",
    "name":"dev","story_id":"4-3","model":"claude-opus-4-8[1m]",
    "input_tokens":2629,"output_tokens":4,"cache_read":15362,"cost_usd":0.058,
    "duration_ms":1757,"ttft_ms":1754,"num_turns":1,"is_error":false,
    "ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…"}
   ```
   Generate one epic-level `trace_id` (`uuidgen`) at startup; each call's `session_id` is the `span_id`, `story_id` the parent. This is the single correlating ID `$$` never provided.
 3. **Deterministic rollup into `metrics.yaml`** — add a `telemetry:` block summed from the JSONL (no model, no fabrication): `total_cost_usd`, `total_input_tokens`, `total_output_tokens`, `cache_read_tokens`, `by_phase`. The chain report then reads measured numbers; the `Estimated Token Usage` section is deleted.
 4. **OTel bridge (phase 2, optional).** JSONL-with-OTel-fields is the pragmatic 80%. A later post-processor converts spans → OTLP without touching the hot path.
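The rollup in item 3 can be sketched as a single `jq` pass over the trace file. This assumes `jq` is available and uses the span field names from the example above; `rollup_spans` is a hypothetical name and the two spans are illustrative data, not measurements.

```bash
#!/usr/bin/env bash
# Hypothetical rollup: sum the spans in a JSONL trace file into totals.
# $1 = trace file. Returns non-zero when jq is missing or the file is unreadable.
rollup_spans() {
  command -v jq >/dev/null 2>&1 || return 1
  jq -s -c '{
    total_cost_usd:      (map(.cost_usd // 0)      | add // 0),
    total_input_tokens:  (map(.input_tokens // 0)  | add // 0),
    total_output_tokens: (map(.output_tokens // 0) | add // 0),
    cache_read_tokens:   (map(.cache_read // 0)    | add // 0),
    by_phase: (group_by(.name)
               | map({key: .[0].name, value: (map(.cost_usd // 0) | add)})
               | from_entries)
  }' "$1"
}

trace_file=$(mktemp)
printf '%s\n' \
  '{"name":"dev","input_tokens":2629,"output_tokens":4,"cache_read":15362,"cost_usd":0.5}' \
  '{"name":"review","input_tokens":2705,"output_tokens":4,"cache_read":19080,"cost_usd":0.25}' \
  > "$trace_file"

if summary=$(rollup_spans "$trace_file"); then
  echo "$summary"
else
  echo "rollup skipped (jq missing or unreadable trace)" >&2
fi
rm -f "$trace_file"
```

The output is JSON; writing it into the `telemetry:` block of `metrics.yaml` is left out here because it depends on how the script already drives `yq`.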
 ### A.6 Caveats
 - **jq dependency** — telemetry parsing needs `jq` (same soft-dep fragility as `yq`, item #10). Degrade gracefully (skip span, don't crash); consider making `jq` a hard startup prerequisite.
 - **Cost includes sub-agents** — `modelUsage` surfaced an internal Haiku call inside an Opus phase. Record `modelUsage` verbatim; don't flatten to one model.
 - **Cache tokens dominate** — in testing, cache-read (15K) was 6× fresh input (2.6K). Report fresh vs. cache-read separately; compute utilization from the sum.
 - **`stream-json` is noisier on disk** — log grows faster (every partial chunk). The existing 64KB inter-story log truncation (`epic-execute.sh:3182`) mitigates; confirm it suffices.
 ### A.7 Why this is the right place to start
 Tracing Infrastructure is the lowest-scoring criterion (1/5), yet it is the cheapest high-priority fix and the only one that lifts three other findings. The data already exists on every call; the work is plumbing (a format switch + a `record_span` helper + a deterministic rollup), not building telemetry infrastructure.
--- a/docs/improvements/observability-implementation-plan.md
+++ b/docs/improvements/observability-implementation-plan.md
@@ -0,0 +1,118 @@
 ---
 title: Observability Implementation Plan
 ---
 # Observability Implementation Plan
 **Date:** 2026-06-15
 **Targets:** HIGH item #1 (Tracing 1/5), with synergistic wins on #5 (I/O Validation) and #9 (Evaluation Metrics)
 **Source analysis:** [`agent-loop-2026-gap-analysis.md`](./agent-loop-2026-gap-analysis.md) — Appendix A
 **Status:** Phases 0–2 implemented on branch `feat/observability-tracing` (gated behind `BMAD_TRACE=1`). Phase 3 (`--json-schema` enforcement) and Phase 4 (OTLP bridge) deferred to separate PRs. Chain-level trace id (decision #5) still deferred.
 **Locked decisions:** (1) ambient vars · (2) jq live renderer, capture tee placed upstream of jq so renderer failure is cosmetic-only · (3) separate PRs · (4) hard-fail on jq when `BMAD_TRACE=1` · (5) chain-level trace id deferred.
 **Added beyond original plan — intra-phase heartbeat.** Per request, `start_phase_heartbeat`/`emit_heartbeat`/`stop_phase_heartbeat` append a liveness beat to `epic-<id>-trace.live.jsonl` every `BMAD_TRACE_HEARTBEAT_INTERVAL` seconds (default 10) while a phase runs — so a `kill -9` mid-phase leaves a forensic trail (which phase, elapsed, context size, last assistant text). Reads only the latest assistant event (bounded I/O, no slurp); self-terminates within one interval if the main process dies; defensively stopped in `cleanup()`.
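The heartbeat mechanism can be illustrated with a small background loop. This is a sketch of the general technique, not the code in `observability.sh`: `hb_start` and `hb_stop` are hypothetical names, and the beat here carries only phase, elapsed time, and timestamp.

```bash
#!/usr/bin/env bash
# Hypothetical heartbeat: a background loop appends one JSON line per interval
# and exits on its own once the parent process is gone.
# $1 = phase name, $2 = output file, $3 = interval in seconds
hb_start() {
  hb_parent=$$
  (
    start=$(date +%s)
    while kill -0 "$hb_parent" 2>/dev/null; do
      printf '{"type":"heartbeat","phase":"%s","elapsed_s":%d,"ts":"%s"}\n' \
        "$1" "$(( $(date +%s) - start ))" "$(date -u +%Y-%m-%dT%H:%M:%SZ)" >> "$2"
      sleep "$3"
    done
  ) >/dev/null 2>&1 &
  HB_PID=$!
}

hb_stop() {
  [ -n "${HB_PID:-}" ] || return 0
  kill "$HB_PID" 2>/dev/null || true
  wait "$HB_PID" 2>/dev/null || true
  HB_PID=""
}

live_file=$(mktemp)
hb_start dev "$live_file" 1
sleep 2            # stand-in for the phase doing real work
hb_stop
beats=$(wc -l < "$live_file" | tr -d ' ')
echo "beats recorded: $beats"
```

The `kill -0` probe is what bounds the orphan lifetime: if the main process is killed with `kill -9`, the loop notices on its next iteration and stops, leaving the last beat as the forensic record.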
 **Implementation note:** all new tracing logic lives in `scripts/epic-execute-lib/observability.sh` (new module). `epic-execute.sh` changes are limited to: sourcing the module, a startup dep-check + `init_observability`, the `run_claude_to_file` rewrite, a `record_span` call inside that helper, `rollup_telemetry` in `cleanup()`, one temp-file cleanup line, and 9 one-line `set_span_context` calls. Verified against the live CLI: real span lands with correct model/tokens/cost/ctx%, clean-text compatibility holds (downstream `IMPLEMENTATION COMPLETE` detection still works), crash path writes a degraded error span, legacy path (`BMAD_TRACE` unset) is byte-for-byte unchanged, and the hard-fail exits 1 when jq is absent.
 ---
 ## Goal
 Stop discarding the telemetry the `claude` CLI already emits on every call. Capture it as OTel-shaped trace spans, roll it up into deterministic (non-fabricated) metrics, and — as a side effect of the same code-path change — replace fragile stdout-scraping with clean `.result` parsing.
 ## Verified CLI facts (claude 2.1.177)
 All confirmed against the installed CLI on 2026-06-15:
 1. `--output-format json` returns one result object with: `session_id`, `total_cost_usd`, `usage.{input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens}`, `modelUsage[model].{costUSD,inputTokens,outputTokens,contextWindow}`, `duration_ms`, `duration_api_ms`, `ttft_ms`, `num_turns`, `stop_reason`, `is_error`, `api_error_status`, `permission_denials`.
 2. `--output-format stream-json` **requires `--verbose`** (hard error otherwise). Emits JSONL: `system/init` → `rate_limit_event` → `assistant`(s)/`tool_use`/`tool_result` → **`result`** (always the last line).
 3. The final `result` line carries `.result` = the clean final assistant text, plus the full telemetry envelope from (1). It is a single line, so `tail -n 1 | jq` extracts it cheaply regardless of phase length — the memory-safe "read the tail" design is preserved.
 4. A stderr warning (`no stdin data received in 3s…`) leaks into `2>&1`; suppressed with `< /dev/null`.
 5. `modelUsage` attributes cost across models — internal sub-agents (e.g. a Haiku helper inside an Opus phase) show up separately. Record verbatim; do not flatten.
 6. `uuidgen`, `jq`, `yq` all present on the dev machine.
 ## Design decision: stream-json + `.result` extraction (compatibility-preserving)
 `run_claude_to_file` is the single chokepoint (`epic-execute.sh:518-529`); all 9 phase call sites follow `run_claude_to_file …; result=$(read_phase_tail)`. The downstream parsers (`extract_json_result`, `check_phase_completion`) expect `PHASE_OUTPUT_FILE` to hold the rendered assistant text.
 **Plan:** switch the invocation to `stream-json --verbose`, write the raw JSONL to a separate stream file, then post-process:
 - Extract `.result` (clean text) from the final `result` line → write to `PHASE_OUTPUT_FILE`. **All 9 call sites keep working unchanged**, and parsing improves (no interleaved tool output → fixes the 9-mismark incident, gap #5).
 - Extract telemetry from the same line → `record_span` (new) appends a JSONL span.
 - Graceful degradation: if `jq` is absent, no `result` line is found, or the call crashed, fall back to today's raw-text behavior so nothing regresses.
 This keeps the blast radius to one function plus two new helpers; the 9 call sites are untouched.
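The post-processing step, including its degraded path, could be sketched as follows. `extract_phase_output` is a hypothetical helper, and the sample stream lines are simplified stand-ins for real `stream-json` events.

```bash
#!/usr/bin/env bash
# Hypothetical post-processing step: derive the clean phase output from a raw
# stream-json capture, falling back to the raw stream when that is not possible.
# $1 = raw JSONL stream file, $2 = output file for the rendered text
extract_phase_output() {
  if command -v jq >/dev/null 2>&1 &&
     tail -n 1 "$1" | jq -e '.type == "result"' >/dev/null 2>&1; then
    tail -n 1 "$1" | jq -r '.result' > "$2"
    return 0
  fi
  cp "$1" "$2"     # degraded path: keep whatever was captured
  return 1
}

stream_file=$(mktemp); out_file=$(mktemp)
printf '%s\n' \
  '{"type":"system","subtype":"init"}' \
  '{"type":"result","result":"IMPLEMENTATION COMPLETE: 4-3","is_error":false}' \
  > "$stream_file"

if extract_phase_output "$stream_file" "$out_file"; then
  echo "clean result: $(cat "$out_file")"
else
  echo "degraded: raw stream kept in $out_file"
fi
```

Only the last line is ever parsed, so the cost is independent of how long the phase ran, and a crashed call (no `result` line) leaves the downstream parsers with the same raw text they would have seen before.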
 ---
 ## Work breakdown (phased)
 ### Phase 0 — Prerequisites & scaffolding (low risk)
 - [ ] Add `jq` to a startup capability check; decide hard-fail vs. graceful-degrade (see Open Decision 4). Mirror the existing `yq` check pattern.
 - [ ] Create `TRACES_DIR="$SPRINT_ARTIFACTS_DIR/traces"`; `mkdir -p` alongside the other artifact dirs (`epic-execute.sh:138-145`).
 - [ ] Generate an epic-level `TRACE_ID=$(uuidgen)` at startup (near `EPIC_START_SECONDS`, `:1283`). This is the single correlating ID `$$` never provided.
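The Phase 0 scaffolding might look like the sketch below. `new_trace_id` is a hypothetical helper, and the fallbacks for machines without `uuidgen` are an addition of this sketch, not part of the plan.

```bash
#!/usr/bin/env bash
# Hypothetical startup scaffolding: one trace id and one traces directory per run.
new_trace_id() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr 'A-Z' 'a-z'
  elif [ -r /proc/sys/kernel/random/uuid ]; then
    cat /proc/sys/kernel/random/uuid
  else
    # Last resort: 16 random bytes as 32 hex characters (not a formatted UUID).
    od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
    echo
  fi
}

# Assumption: the real script already sets SPRINT_ARTIFACTS_DIR; a temp dir
# stands in for it here.
SPRINT_ARTIFACTS_DIR=${SPRINT_ARTIFACTS_DIR:-$(mktemp -d)}
TRACES_DIR="$SPRINT_ARTIFACTS_DIR/traces"
mkdir -p "$TRACES_DIR"

# Mint the id once; an inherited TRACE_ID (for example from a parent process)
# is reused rather than replaced.
TRACE_ID=${TRACE_ID:-$(new_trace_id)}
export TRACE_ID
echo "trace_id=$TRACE_ID traces_dir=$TRACES_DIR"
```

Exporting the id means child processes see the same value without any extra plumbing.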
 ### Phase 1 — Instrument the invocation (core, additive)
 - [ ] Rewrite `run_claude_to_file` to use `stream-json --verbose`, raw JSONL → `$PHASE_STREAM_FILE`, live-render assistant text to the log (Open Decision 2), then derive `PHASE_OUTPUT_FILE` from `.result`.
 - [ ] Add `record_span <phase> <story_id>` — reads `tail -n 1 "$PHASE_STREAM_FILE"`, validates `.type=="result"`, appends one OTel-shaped span to `$TRACES_DIR/epic-<id>-trace.jsonl`.
 - [ ] Pass `phase` + `story_id` into `run_claude_to_file` (currently positional `$1`/`-f`); thread through the 9 call sites or set module-level `CURRENT_PHASE`/`CURRENT_STORY_ID` before each call (Open Decision 1 — signature vs. ambient vars).
 - [ ] Compute `ctx_util_pct` = 100 × (input + cache_read + cache_creation) / `contextWindow` and `log_warn` when > 80% (feeds Context Strategy, gap #7).
 Span schema (one line per phase):
 ```json
 {"trace_id":"…","span_id":"<session_id>","parent":"<story_id>","name":"dev",
 "story_id":"4-3","model":"claude-opus-4-8[1m]","input_tokens":2705,
 "output_tokens":4,"cache_read":19080,"cost_usd":0.0237,"duration_ms":1089,
 "ttft_ms":1012,"num_turns":1,"is_error":false,"api_error_status":null,
 "ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…Z"}
 ```
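A sketch of how one such span line could be built from the final `result` line. It assumes `jq` and ambient `TRACE_ID` / `CURRENT_PHASE` / `CURRENT_STORY_ID` variables; `record_span_sketch` is a hypothetical stand-in for `record_span`, covers only a subset of the schema, and uses an illustrative envelope.

```bash
#!/usr/bin/env bash
# Hypothetical span writer. $1 = raw stream file, $2 = trace file to append to.
# Context values are passed with --arg so they are never spliced into the jq
# program text. A last line that is not a result event appends nothing.
record_span_sketch() {
  command -v jq >/dev/null 2>&1 || return 1
  tail -n 1 "$1" | jq -c \
    --arg trace_id "$TRACE_ID" --arg name "$CURRENT_PHASE" --arg story "$CURRENT_STORY_ID" \
    'select(.type == "result") | {
       trace_id: $trace_id, span_id: .session_id, parent: $story, name: $name,
       story_id: $story,
       input_tokens: .usage.input_tokens, output_tokens: .usage.output_tokens,
       cache_read: .usage.cache_read_input_tokens, cost_usd: .total_cost_usd,
       duration_ms: .duration_ms, num_turns: .num_turns, is_error: .is_error
     }' >> "$2"
}

TRACE_ID="trace-demo"; CURRENT_PHASE="dev"; CURRENT_STORY_ID="4-3"
stream_file=$(mktemp); trace_file=$(mktemp)
printf '%s\n' '{"type":"result","session_id":"s-1","total_cost_usd":0.0237,"duration_ms":1089,"num_turns":1,"is_error":false,"usage":{"input_tokens":2705,"output_tokens":4,"cache_read_input_tokens":19080}}' > "$stream_file"

record_span_sketch "$stream_file" "$trace_file" || echo "span skipped (jq missing or malformed line)" >&2
cat "$trace_file"
```

Appending with `>>` keeps the trace file append-only, so a crash mid-run never corrupts spans that were already written.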
 ### Phase 2 — Deterministic rollup (replaces fabrication)
 - [ ] Add `rollup_telemetry` — sum the JSONL spans into a `telemetry:` block in `metrics.yaml` (`total_cost_usd`, `total_input_tokens`, `total_output_tokens`, `cache_read_tokens`, `by_phase`). Call from `finalize_metrics` (`:765`).
 - [ ] Add derived metrics (gap #9): Task Completion Rate, Escalation Rate (`max_retries_hit/stories` + real `is_error` rate), Tool Call Success Rate (`is_error=false` ÷ phases).
 - [ ] Chain: delete the fabricated `Estimated Token Usage` / `Cost Estimates` sections (`epic-chain-execution-report.md` generator, `epic-chain.sh:831-876`); have the report read measured `telemetry:` from each epic's metrics. Also instrument the chain's own report-generator `claude` call (`epic-chain.sh:884`).
 ### Phase 3 — Contract enforcement (synergistic, higher risk — gate separately)
 - [ ] Use `--json-schema` to make the completion `status` field structurally mandatory, removing the fuzzy-regex fallback (`json-output.sh:311-387`). Sequence **after** Phases 1–2 prove stable, since it changes completion semantics.
 ### Phase 4 — OTel bridge (optional, later)
 - [ ] Post-processor: JSONL spans → OTLP exporter to a collector. No hot-path change. Defer until someone needs a dashboard.
 ---
 ## Test strategy
 - **Unit (bats or shell asserts):** feed a captured `result` JSONL fixture into `record_span`/`rollup_telemetry`; assert span fields and YAML sums. Capture one real fixture now (we already have sample envelopes) and commit under `test/fixtures/`.
 - **Degradation:** run the new path with `jq` shadowed out of `PATH`; assert it falls back to raw text and does not crash.
 - **Crash path:** feed a truncated stream (no `result` line); assert graceful fallback + a span marked `is_error`/incomplete.
 - **End-to-end:** `--dry-run` won't call claude, so add a tiny real one-story epic to a scratch fixture and confirm a populated `trace.jsonl` + `telemetry:` block.
 - **Regression:** confirm all 9 phases still parse completion correctly (the `.result` text is a superset of what they saw before).
 ## Rollout / safety
 - Gate Phase 1 behind an env flag (e.g. `BMAD_TRACE=1`) for the first iteration so the old text path stays default until the new one is proven, then flip the default.
 - Phases 1–2 are observe-only (no behavior change beyond cleaner parsing). Phase 3 changes control flow — ship it on its own PR per the repo's "<800 lines, split larger changes" rule.
 ## Open decisions (resolved; see the locked decisions in the header)
 1. **Threading phase/story into the invocation** — change `run_claude_to_file`'s signature, or set ambient `CURRENT_PHASE`/`CURRENT_STORY_ID` before each call? Ambient is a smaller diff (no call-site edits) but more implicit. *Recommendation: ambient vars set in the phase functions.*
 2. **Live log format** — tee raw JSONL to the log (ugly for humans tailing it) vs. pipe through a small `jq` renderer that prints assistant text live. *Recommendation: jq renderer for the terminal/log, raw JSONL to the stream file.*
 3. **Bundle scope** — observability-only (Phases 0–2), or include the contract enforcement (Phase 3) in the same effort since it touches the same path? *Recommendation: separate PRs; land 0–2 first.*
 4. **`jq` dependency** — hard prerequisite (fail fast at startup) or graceful degradation? Telemetry is useless without it, but failing hard is a sharper UX change. *Recommendation: graceful-degrade with a loud one-time warning, consistent with the current `yq` handling.*
 5. **Trace ID propagation to chain** — should `epic-chain.sh` mint a chain-level trace ID that each epic inherits as a parent span, giving one trace across the whole chain? *Recommendation: yes, but defer to Phase 2.*
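
The ambient-variable option from decision 1 can be sketched as follows. `CURRENT_PHASE` and `CURRENT_STORY_ID` come from the text above; `fake_run_claude_to_file`, `run_dev_phase`, and `LAST_SPAN_LABEL` are hypothetical stand-ins so the example runs without the CLI.

```shell
#!/usr/bin/env bash
# Ambient span context, read by the invocation helper at call time.
CURRENT_PHASE=""
CURRENT_STORY_ID=""

# Set the context for the next invocation.
set_span_context() {
    CURRENT_PHASE="$1"
    CURRENT_STORY_ID="$2"
}

# Hypothetical stand-in for run_claude_to_file: its signature stays
# "prompt only", and it labels the span from the ambient variables.
fake_run_claude_to_file() {
    local prompt="$1"
    LAST_SPAN_LABEL="${CURRENT_PHASE}/${CURRENT_STORY_ID}"
}

# A phase function sets the context immediately before the call.
run_dev_phase() {
    local story_id="$1"
    set_span_context "dev" "$story_id"
    fake_run_claude_to_file "implement story $story_id"
}

run_dev_phase "3-2"
echo "span label: $LAST_SPAN_LABEL"
```

The implicit cost mentioned in decision 1 is visible here: a call made without a preceding `set_span_context` silently reuses the previous phase's context, so every phase function has to set it.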
 ---
 ## Effort estimate
 | Phase | Scope | Risk | Rough size |
 |-------|-------|------|-----------|
 | 0 | dirs, trace id, jq check | low | ~30 lines |
 | 1 | invocation + record_span | medium | ~120 lines (1 fn rewrite + 1 new) |
 | 2 | rollup + derived metrics + chain report | low–med | ~150 lines |
 | 3 | --json-schema enforcement | medium–high | separate PR |
 | 4 | OTLP bridge | low (isolated) | deferred |
 Phases 0–2 are the core observability win and are mostly additive plumbing.
--- a/scripts/epic-chain.sh
+++ b/scripts/epic-chain.sh
@ -240,6 +240,15 @@ EOF
 ---
 EOF
    # Measured telemetry (real numbers, or an honest "not captured")
    build_measured_telemetry_md >> "$CHAIN_REPORT_FILE"
    cat >> "$CHAIN_REPORT_FILE" << EOF
 ---
 ## Artifacts
 | Artifact | Location |
@ -259,6 +268,57 @@ EOF
    log_success "Basic report created: $CHAIN_REPORT_FILE"
 }
 # Build a MEASURED telemetry markdown section by aggregating the deterministic
 # `telemetry:` blocks that epic-execute writes into each epic's metrics.yaml
 # (populated when run with BMAD_TRACE=1). Emits real per-epic + total token/cost
 # numbers — never estimates. If no epic captured telemetry, says so plainly.
 # Output: markdown on stdout.
 build_measured_telemetry_md() {
    if ! command -v yq >/dev/null 2>&1; then
        echo "## Telemetry"
        echo ""
        echo "_Telemetry rollup requires \`yq\` (not found)._"
        return 0
    fi
    local any=false
    local total_cost=0 total_in=0 total_out=0
    local rows=""
    for epic_id in "${EPIC_IDS[@]}"; do
        local mf="$METRICS_DIR/epic-${epic_id}-metrics.yaml"
        [ -f "$mf" ] || continue
        local has
        has=$(yq '.telemetry // "" | tag' "$mf" 2>/dev/null)
        [ "$has" = "!!map" ] || continue
        any=true
        local cost in out
        cost=$(yq '.telemetry.total_cost_usd // 0' "$mf" 2>/dev/null)
        in=$(yq '.telemetry.total_input_tokens // 0' "$mf" 2>/dev/null)
        out=$(yq '.telemetry.total_output_tokens // 0' "$mf" 2>/dev/null)
        rows+="| $epic_id | $in | $out | \$$(printf '%.4f' "$cost" 2>/dev/null || echo "$cost") |"$'\n'
        total_cost=$(awk "BEGIN { printf \"%.6f\", $total_cost + $cost }")
        total_in=$(( total_in + ${in%.*} ))
        total_out=$(( total_out + ${out%.*} ))
    done
    echo "## Telemetry (Measured)"
    echo ""
    if [ "$any" != true ]; then
        echo "_No measured telemetry captured. Re-run with \`BMAD_TRACE=1\` to record"
        echo "real token usage and cost per epic (see docs/improvements/observability-implementation-plan.md)._"
        return 0
    fi
    echo "Actual usage captured from the Claude CLI result envelopes (not estimated):"
    echo ""
    echo "| Epic | Input Tokens | Output Tokens | Cost (USD) |"
    echo "|------|--------------|---------------|------------|"
    printf '%s' "$rows"
    echo "| **Total** | **$total_in** | **$total_out** | **\$$(printf '%.4f' "$total_cost")** |"
 }
 # =============================================================================
 # Argument Parsing
 # =============================================================================
@ -828,6 +888,10 @@ if [ "$GENERATE_REPORT" = true ] && [ "$DRY_RUN" = false ]; then
            WORKFLOW_PATH="$PROJECT_ROOT/src/bmm/workflows/4-implementation/epic-chain"
        fi
        # Precompute the MEASURED telemetry section in bash so the report uses
        # ground-truth numbers and the model has no room to invent token/cost data.
        measured_telemetry_md=$(build_measured_telemetry_md)
        # Build report generation prompt
        report_prompt="You are Bob, the Scrum Master, generating a chain execution report.
@ -835,6 +899,17 @@ if [ "$GENERATE_REPORT" = true ] && [ "$DRY_RUN" = false ]; then
 Generate a comprehensive chain execution report for the completed epic chain.
 ## CRITICAL: No Fabricated Metrics
 Do NOT estimate, infer, or invent token counts, costs, or call counts. Token
 and cost figures come ONLY from the measured telemetry section below. If a value
 is not present there, write \"not captured\" — never a guess or a typical-pattern
 estimate. Include the measured telemetry section verbatim in the report.
 ## Measured Telemetry (ground truth — include verbatim)
 ${measured_telemetry_md}
 ## Configuration
 - Chain Plan: $CHAIN_PLAN_FILE
--- a/scripts/epic-execute-lib/observability.sh
+++ b/scripts/epic-execute-lib/observability.sh
@ -0,0 +1,355 @@
 #!/bin/bash
 #
 # BMAD Epic Execute - Observability Module
 #
 # Captures the telemetry the `claude` CLI already emits (session id, token
 # usage, cost, latency, context window) as OTel-shaped trace spans, and rolls
 # them up into deterministic (non-fabricated) metrics.
 #
 # Source: docs/improvements/observability-implementation-plan.md
 #
 # Usage: Sourced by epic-execute.sh. Relies on jq (hard prerequisite, enforced
 # by require_observability_deps below), on yq for the rollup, and on the
 # following from the parent script: the globals SPRINT_ARTIFACTS_DIR, EPIC_ID,
 # METRICS_FILE, LOG_FILE, and VERBOSE, plus the log/log_warn/log_error helpers.
 #
 # Gated behind BMAD_TRACE: tracing is only active when BMAD_TRACE=1 (default
 # off during initial rollout). The invocation helper in epic-execute.sh falls
 # back to the legacy text path when tracing is disabled.
 #
 # =============================================================================
 # Observability State
 # =============================================================================
 # Epic-level trace id (one per run); each phase's claude session_id is a span id.
 TRACE_ID="${TRACE_ID:-}"
 # Path to the append-only span log for this epic (set by init_observability).
 TRACE_FILE=""
 # Ambient context for the next span (set by the phase functions before each
 # claude invocation — see Open Decision 1 in the implementation plan).
 CURRENT_PHASE=""
 CURRENT_STORY_ID=""
 # Whether tracing is active for this run.
 TRACE_ENABLED=false
 # Intra-phase heartbeat: appends a liveness beat to a .live.jsonl every N
 # seconds while a phase is running, so a hard kill (SIGKILL) mid-phase still
 # leaves a forensic trail of which phase was in flight and how far it got.
 HEARTBEAT_INTERVAL="${BMAD_TRACE_HEARTBEAT_INTERVAL:-10}"
 LIVE_TRACE_FILE=""
 HEARTBEAT_PID=""
 PHASE_START_SECONDS=""
 # Main script PID ($$ is the sourcing shell = the epic-execute process).
 MAIN_PID="$$"
 # =============================================================================
 # Dependency Enforcement (Open Decision 4: hard fail)
 # =============================================================================
 # jq is load-bearing for observability in three places: the live renderer,
 # span extraction, and clean .result extraction. When tracing is enabled it is
 # a hard prerequisite — too much rides on it to silently degrade.
 require_observability_deps() {
    if [ "${BMAD_TRACE:-}" != "1" ]; then
        return 0
    fi
    if ! command -v jq >/dev/null 2>&1; then
        log_error "BMAD_TRACE=1 requires 'jq' but it was not found on PATH."
        log_error "Install jq (https://jqlang.github.io/jq/) or unset BMAD_TRACE."
        exit 1
    fi
    return 0
 }
 # =============================================================================
 # Initialization
 # =============================================================================
 # Initialize the trace for this epic. Mints a trace id (if not already set) and
 # creates the span file. No-op unless BMAD_TRACE=1.
 init_observability() {
    if [ "${BMAD_TRACE:-}" != "1" ]; then
        TRACE_ENABLED=false
        return 0
    fi
    if [ -z "$SPRINT_ARTIFACTS_DIR" ] || [ -z "$EPIC_ID" ]; then
        log_warn "Cannot initialize tracing: SPRINT_ARTIFACTS_DIR or EPIC_ID not set"
        TRACE_ENABLED=false
        return 1
    fi
    if [ -z "$TRACE_ID" ]; then
        if command -v uuidgen >/dev/null 2>&1; then
            TRACE_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')
        else
            # Fallback id: epic + pid + start seconds (still unique per run)
            TRACE_ID="epic-${EPIC_ID}-$$-$(date +%s)"
        fi
    fi
    TRACES_DIR="${TRACES_DIR:-$SPRINT_ARTIFACTS_DIR/traces}"
    mkdir -p "$TRACES_DIR" 2>/dev/null || true
    TRACE_FILE="$TRACES_DIR/epic-${EPIC_ID}-trace.jsonl"
    LIVE_TRACE_FILE="$TRACES_DIR/epic-${EPIC_ID}-trace.live.jsonl"
    TRACE_ENABLED=true
    log "Tracing enabled (trace_id=$TRACE_ID) -> $TRACE_FILE"
    return 0
 }
 # Set the ambient context for the next span. Called by phase functions before
 # invoking claude (ambient-vars approach, Open Decision 1).
 set_span_context() {
    CURRENT_PHASE="$1"
    CURRENT_STORY_ID="$2"
 }
 # =============================================================================
 # Intra-phase Heartbeat (liveness for debugging hangs / hard kills)
 # =============================================================================
 # Append one heartbeat record describing the in-flight phase. Only the latest
 # assistant event is parsed (grep | tail -n 1, no jq slurp), so memory stays
 # bounded even for long phases with large tool output; the grep itself still
 # scans the whole stream file on each tick. Best-effort: a transient partial
 # line just skips this tick.
 emit_heartbeat() {
    local stream_file="$1"
    { [ "$TRACE_ENABLED" = true ] && [ -n "$LIVE_TRACE_FILE" ] && [ -f "$stream_file" ]; } || return 0
    local now elapsed ts events last_asst
    now=$(date +%s)
    elapsed=$(( now - ${PHASE_START_SECONDS:-$now} ))
    ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    events=$(wc -l < "$stream_file" 2>/dev/null | tr -d ' '); events="${events:-0}"
    last_asst=$(grep '"type":"assistant"' "$stream_file" 2>/dev/null | tail -n 1)
    if [ -n "$last_asst" ]; then
        printf '%s' "$last_asst" | jq -c \
            --arg trace_id "$TRACE_ID" --arg phase "$CURRENT_PHASE" --arg story "$CURRENT_STORY_ID" \
            --arg ts "$ts" --argjson elapsed "$elapsed" --argjson events "$events" '
            (.message.usage // {}) as $u
            | { trace_id:$trace_id, kind:"heartbeat", phase:$phase, story_id:$story,
                elapsed_s:$elapsed, ts:$ts, events:$events,
                ctx_input_tokens: ($u.input_tokens // 0),
                cache_read: ($u.cache_read_input_tokens // 0),
                out_tokens: ($u.output_tokens // 0),
                last_text: ((([ .message.content[]? | select(.type=="text") | .text ] | last) // "")[0:160]) }' \
            >> "$LIVE_TRACE_FILE" 2>/dev/null || true
    else
        # No assistant output yet — still emit a beat so you can see it started.
        jq -nc \
            --arg trace_id "$TRACE_ID" --arg phase "$CURRENT_PHASE" --arg story "$CURRENT_STORY_ID" \
            --arg ts "$ts" --argjson elapsed "$elapsed" --argjson events "$events" \
            '{trace_id:$trace_id, kind:"heartbeat", phase:$phase, story_id:$story,
              elapsed_s:$elapsed, ts:$ts, events:$events, ctx_input_tokens:0, cache_read:0, out_tokens:0, last_text:""}' \
            >> "$LIVE_TRACE_FILE" 2>/dev/null || true
    fi
 }
 # Start a background heartbeat for the current phase. Emits an immediate beat,
 # then one every HEARTBEAT_INTERVAL seconds until the phase ends or the main
 # process dies (self-terminates within one interval on SIGKILL).
 start_phase_heartbeat() {
    local stream_file="$1"
    { [ "$TRACE_ENABLED" = true ] && [ "${HEARTBEAT_INTERVAL:-0}" -gt 0 ] 2>/dev/null; } || return 0
    PHASE_START_SECONDS=$(date +%s)
    emit_heartbeat "$stream_file"
    (
        while kill -0 "$MAIN_PID" 2>/dev/null; do
            sleep "$HEARTBEAT_INTERVAL"
            emit_heartbeat "$stream_file"
        done
    ) &
    HEARTBEAT_PID=$!
 }
 # Stop the background heartbeat (called when the phase's claude call returns,
 # and defensively from cleanup()).
 stop_phase_heartbeat() {
    [ -n "${HEARTBEAT_PID:-}" ] || return 0
    kill "$HEARTBEAT_PID" 2>/dev/null || true
    wait "$HEARTBEAT_PID" 2>/dev/null || true
    HEARTBEAT_PID=""
 }
 # =============================================================================
 # Span Recording
 # =============================================================================
 # Append one OTel-shaped span for the just-completed claude phase.
 # Reads the final `result` envelope from the raw stream file (always the last
 # JSONL line) and derives the telemetry fields.
 #
 # Arguments:
 #   $1 - raw stream file (JSONL emitted by claude --output-format stream-json)
 #   $2 - phase status (optional; the downstream completion status, e.g. COMPLETE)
 record_span() {
    local stream_file="$1"
    local status="${2:-}"
    [ "$TRACE_ENABLED" = true ] || return 0
    [ -n "$TRACE_FILE" ] || return 0
    [ -f "$stream_file" ] || return 0
    # The result envelope is always the last line. A single JSONL line, so this
    # is memory-cheap regardless of phase length.
    local envelope
    envelope=$(tail -n 1 "$stream_file" 2>/dev/null)
    # Validate it is actually the result event; if the call crashed mid-stream
    # the last line won't be a result — record a degraded error span instead.
    local etype
    etype=$(printf '%s' "$envelope" | jq -r '.type // empty' 2>/dev/null) || true
    local ts
    ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
    if [ "$etype" != "result" ]; then
        jq -nc \
            --arg trace_id "$TRACE_ID" \
            --arg parent "$CURRENT_STORY_ID" \
            --arg name "$CURRENT_PHASE" \
            --arg story_id "$CURRENT_STORY_ID" \
            --arg status "${status:-UNKNOWN}" \
            --arg ts "$ts" \
            '{trace_id:$trace_id, span_id:null, parent:$parent, name:$name,
              story_id:$story_id, model:null, input_tokens:0, output_tokens:0,
              cache_read:0, cost_usd:0, duration_ms:0, ttft_ms:0, num_turns:0,
              is_error:true, api_error_status:"no_result_envelope",
              ctx_util_pct:0, status:$status, ts:$ts}' \
            >> "$TRACE_FILE" 2>/dev/null || true
        [ "$VERBOSE" = true ] && log_warn "record_span: no result envelope in stream (degraded span)"
        return 0
    fi
    # Headline model: when the envelope reports several models under
    # .modelUsage (e.g. sub-agents), take the one with the largest context
    # window and compute context utilization against that window. Token and
    # cost fields below are the envelope's call-level totals.
    printf '%s' "$envelope" | jq -c \
        --arg trace_id "$TRACE_ID" \
        --arg parent "$CURRENT_STORY_ID" \
        --arg name "$CURRENT_PHASE" \
        --arg story_id "$CURRENT_STORY_ID" \
        --arg status "$status" \
        --arg ts "$ts" '
        # Choose headline model = the one with the largest contextWindow
        (.modelUsage // {}) as $mu
        | ($mu | to_entries | sort_by(.value.contextWindow // 0) | last) as $primary
        | ($primary.key // null) as $model
        | (($primary.value.contextWindow) // 0) as $ctxwin
        | ((.usage.input_tokens // 0)
            + (.usage.cache_read_input_tokens // 0)
            + (.usage.cache_creation_input_tokens // 0)) as $ctx_used
        | (if $ctxwin > 0 then (($ctx_used / $ctxwin) * 100 * 10 | round / 10) else 0 end) as $ctx_pct
        | {
            trace_id: $trace_id,
            span_id: (.session_id // null),
            parent: $parent,
            name: $name,
            story_id: $story_id,
            model: $model,
            input_tokens: (.usage.input_tokens // 0),
            output_tokens: (.usage.output_tokens // 0),
            cache_read: (.usage.cache_read_input_tokens // 0),
            cost_usd: (.total_cost_usd // 0),
            duration_ms: (.duration_ms // 0),
            ttft_ms: (.ttft_ms // 0),
            num_turns: (.num_turns // 0),
            is_error: (.is_error // false),
            api_error_status: (.api_error_status // null),
            ctx_util_pct: $ctx_pct,
            status: $status,
            ts: $ts
          }' >> "$TRACE_FILE" 2>/dev/null || true
    # Context-utilization warning feeds the Context Strategy gap (#7).
    local ctx_pct
    ctx_pct=$(printf '%s' "$envelope" | jq -r '
        (.modelUsage // {} | to_entries | sort_by(.value.contextWindow // 0) | last) as $p
        | (($p.value.contextWindow) // 0) as $w
        | ((.usage.input_tokens // 0) + (.usage.cache_read_input_tokens // 0) + (.usage.cache_creation_input_tokens // 0)) as $u
        | if $w > 0 then (($u / $w) * 100 | floor) else 0 end' 2>/dev/null)
    if [ -n "$ctx_pct" ] && [ "$ctx_pct" -ge 80 ] 2>/dev/null; then
        log_warn "Context utilization high: ${ctx_pct}% (${CURRENT_PHASE} / ${CURRENT_STORY_ID})"
    fi
    return 0
 }
 # =============================================================================
 # Rollup (deterministic telemetry into metrics.yaml)
 # =============================================================================
 # Sum all spans for this epic and write a telemetry block into metrics.yaml.
 # Deterministic: no model involved, no fabrication. No-op without yq.
 rollup_telemetry() {
    [ "$TRACE_ENABLED" = true ] || return 0
    [ -n "$TRACE_FILE" ] && [ -f "$TRACE_FILE" ] || return 0
    [ -n "$METRICS_FILE" ] && [ -f "$METRICS_FILE" ] || return 0
    command -v yq >/dev/null 2>&1 || { log_warn "yq not found - skipping telemetry rollup"; return 0; }
    # Aggregate totals + per-phase breakdown from the JSONL spans.
    local totals_json
    totals_json=$(jq -s '
        {
          total_cost_usd: (map(.cost_usd) | add // 0),
          total_input_tokens: (map(.input_tokens) | add // 0),
          total_output_tokens: (map(.output_tokens) | add // 0),
          cache_read_tokens: (map(.cache_read) | add // 0),
          phases_total: length,
          phases_errored: (map(select(.is_error == true)) | length),
          by_phase: (group_by(.name) | map({
              key: (.[0].name // "unknown"),
              value: {
                  calls: length,
                  cost_usd: (map(.cost_usd) | add // 0),
                  input_tokens: (map(.input_tokens) | add // 0),
                  output_tokens: (map(.output_tokens) | add // 0)
              }
          }) | from_entries)
        }' "$TRACE_FILE" 2>/dev/null)
    [ -z "$totals_json" ] && return 0
    # Tool Call Success Rate (phase-level proxy): non-errored phases / total.
    local rate
    rate=$(printf '%s' "$totals_json" | jq -r '
        if .phases_total > 0
        then (((.phases_total - .phases_errored) / .phases_total) * 100 * 10 | round / 10)
        else 0 end' 2>/dev/null)
    # Write the telemetry block (yq reads the JSON via env to avoid quoting hell).
    TELEMETRY_JSON="$totals_json" yq -i '.telemetry = (strenv(TELEMETRY_JSON) | from_json) | (.telemetry | ..) style=""' "$METRICS_FILE" 2>/dev/null || {
        log_warn "Failed to write telemetry block to metrics"
        return 0
    }
    yq -i ".telemetry.trace_id = \"$TRACE_ID\"" "$METRICS_FILE" 2>/dev/null || true
    # 2026 business-outcome rates (gap #9). Derived deterministically from the
    # counters already in metrics.yaml — no model, no estimation.
    #   - Tool Call Success Rate (phase-level proxy): non-errored phases / total
    #   - Task Completion Rate: completed stories / total stories
    #   - Escalation Rate: stories that exhausted retries / total stories
    local total completed max_retries
    total=$(yq '.stories.total // 0' "$METRICS_FILE" 2>/dev/null); total="${total:-0}"
    completed=$(yq '.stories.completed // 0' "$METRICS_FILE" 2>/dev/null); completed="${completed:-0}"
    max_retries=$(yq '.fix_loop.max_retries_hit // 0' "$METRICS_FILE" 2>/dev/null); max_retries="${max_retries:-0}"
    local completion_rate=0 escalation_rate=0
    if [ "$total" -gt 0 ] 2>/dev/null; then
        completion_rate=$(awk "BEGIN { printf \"%.1f\", ($completed / $total) * 100 }")
        escalation_rate=$(awk "BEGIN { printf \"%.1f\", ($max_retries / $total) * 100 }")
    fi
    yq -i ".telemetry.tool_call_success_rate = $rate" "$METRICS_FILE" 2>/dev/null || true
    yq -i ".telemetry.task_completion_rate = $completion_rate" "$METRICS_FILE" 2>/dev/null || true
    yq -i ".telemetry.escalation_rate = $escalation_rate" "$METRICS_FILE" 2>/dev/null || true
    log "Telemetry rolled up into metrics: $METRICS_FILE"
    return 0
 }
--- a/scripts/epic-execute.sh
+++ b/scripts/epic-execute.sh
@ -43,6 +43,11 @@ cleanup() {
    # Disable trap during cleanup
    trap - EXIT INT TERM
    # Stop any in-flight heartbeat background process first
    if type stop_phase_heartbeat >/dev/null 2>&1; then
        stop_phase_heartbeat
    fi
    echo ""
    log "Cleaning up (exit code: $exit_code)..."
@ -58,6 +63,11 @@ cleanup() {
            finalize_metrics "${#STORIES[@]}" "${COMPLETED:-0}" "${FAILED:-0}" "${SKIPPED:-0}" "$duration"
            log "Metrics finalized: $METRICS_FILE"
        fi
        # Roll up trace spans into a deterministic telemetry block (no-op unless traced)
        if type rollup_telemetry >/dev/null 2>&1; then
            rollup_telemetry
        fi
    fi
    # Report git status
@ -97,8 +107,9 @@ cleanup() {
    # Kill orphaned node/test processes
    kill_orphaned_test_processes
-    # Clean up phase output temp file
+    # Clean up phase output temp files
    rm -f "$PHASE_OUTPUT_FILE" 2>/dev/null
    rm -f "$PHASE_STREAM_FILE" 2>/dev/null
    # Save log to repo before exiting
    save_log_to_repo
@ -134,6 +145,7 @@ LIB_DIR="$SCRIPT_DIR/epic-execute-lib"
 [ -f "$LIB_DIR/tdd-flow.sh" ] && source "$LIB_DIR/tdd-flow.sh"
 [ -f "$LIB_DIR/contract-harness.sh" ] && source "$LIB_DIR/contract-harness.sh"
 [ -f "$LIB_DIR/contract-exec.sh" ] && source "$LIB_DIR/contract-exec.sh"
 [ -f "$LIB_DIR/observability.sh" ] && source "$LIB_DIR/observability.sh"
 STORIES_DIR="$PROJECT_ROOT/docs/stories"
 SPRINT_ARTIFACTS_DIR="$PROJECT_ROOT/docs/sprint-artifacts"
@ -509,23 +521,92 @@ log_prompt_size() {
 # Temp file for current phase output (reused across phases, cleaned up on exit)
 PHASE_OUTPUT_FILE="/tmp/bmad-phase-output-$$.txt"
 # Raw stream-json (JSONL) for the current phase when tracing is enabled.
 PHASE_STREAM_FILE="/tmp/bmad-phase-stream-$$.jsonl"
-# Run claude and pipe output directly to file + LOG_FILE (no bash variable)
+# Run claude and pipe output directly to file + LOG_FILE (no bash variable).
 #
 # When BMAD_TRACE=1, uses --output-format stream-json so the per-call telemetry
 # envelope (session id, tokens, cost, latency) is captured. The raw JSONL is
 # tee'd to PHASE_STREAM_FILE *before* the live jq renderer, so a renderer
 # failure can never corrupt capture (Open Decision 2). The clean .result text
 # is then written to PHASE_OUTPUT_FILE, so all downstream parsers that read
 # PHASE_OUTPUT_FILE keep working unchanged — and parse cleaner text than before.
 #
 # When tracing is off, falls back to the legacy text path verbatim.
 #
 # Callers set CURRENT_PHASE/CURRENT_STORY_ID (via set_span_context) beforehand.
 #
 # Arguments:
 #   $1 - prompt text (use "-f" as first arg to use file-based prompt)
 #   $2 - prompt file path (only when $1 is "-f")
-# Sets: PHASE_OUTPUT_FILE with the output
+# Sets: PHASE_OUTPUT_FILE with the clean assistant text
 run_claude_to_file() {
    # Truncate phase output file
    : > "$PHASE_OUTPUT_FILE"
    # Legacy text path (tracing disabled) — unchanged behavior.
    if [ "${TRACE_ENABLED:-false}" != true ]; then
        if [ "$1" = "-f" ]; then
            local prompt_file="$2"
            claude --dangerously-skip-permissions -f "$prompt_file" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
        else
            local prompt="$1"
            claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
        fi
        return 0
    fi
    # Traced path: stream-json + telemetry capture.
    : > "$PHASE_STREAM_FILE"
    # Intra-phase heartbeat: liveness trail so a hard kill mid-phase is debuggable.
    start_phase_heartbeat "$PHASE_STREAM_FILE"
    # Live renderer prints assistant text to the terminal/log. It reads the
    # raw JSONL *after* the tee has already persisted it, so if jq chokes on a
    # partial line the capture in PHASE_STREAM_FILE is unaffected (cosmetic only).
    # stdin is redirected from /dev/null to suppress the CLI's "no stdin" warning.
    if [ "$1" = "-f" ]; then
        local prompt_file="$2"
-        claude --dangerously-skip-permissions -f "$prompt_file" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
+        claude --dangerously-skip-permissions --output-format stream-json --verbose -f "$prompt_file" </dev/null 2>>"$LOG_FILE" \
            | tee -a "$PHASE_STREAM_FILE" \
            | jq -rR --unbuffered 'fromjson? | select(.type=="assistant") | .message.content[]? | select(.type=="text") | .text' 2>/dev/null \
            | tee -a "$LOG_FILE" || true
    else
        local prompt="$1"
-        claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
+        claude --dangerously-skip-permissions --output-format stream-json --verbose -p "$prompt" </dev/null 2>>"$LOG_FILE" \
            | tee -a "$PHASE_STREAM_FILE" \
            | jq -rR --unbuffered 'fromjson? | select(.type=="assistant") | .message.content[]? | select(.type=="text") | .text' 2>/dev/null \
            | tee -a "$LOG_FILE" || true
    fi
    # Stop the heartbeat now that the phase's claude call has returned.
    stop_phase_heartbeat
    # Derive the clean assistant text from the final result envelope so that
    # all downstream parsers (extract_json_result, check_phase_completion) keep
    # reading PHASE_OUTPUT_FILE exactly as before. The result line is always last.
    local result_text=""
    if [ -s "$PHASE_STREAM_FILE" ]; then
        result_text=$(tail -n 1 "$PHASE_STREAM_FILE" 2>/dev/null | jq -r 'select(.type=="result") | .result // empty' 2>/dev/null) || true
    fi
    if [ -n "$result_text" ]; then
        printf '%s' "$result_text" > "$PHASE_OUTPUT_FILE"
    else
        # No result envelope (crash/timeout) — fall back to raw stream so the
        # downstream "unclear result" handling still has something to inspect.
        cp "$PHASE_STREAM_FILE" "$PHASE_OUTPUT_FILE" 2>/dev/null || true
    fi
    # Record one span for this phase. Status is best-effort from the JSON result
    # block in the clean text; the core telemetry comes from the envelope itself.
    local span_status=""
    if [ -n "$result_text" ] && type extract_json_result >/dev/null 2>&1; then
        extract_json_result "$result_text" >/dev/null 2>&1 || true
        span_status=$(get_result_status 2>/dev/null || echo "")
    fi
    record_span "$PHASE_STREAM_FILE" "$span_status"
 }
 # Read the tail of phase output for completion signal parsing.
@ -1279,6 +1360,15 @@ fi
 mkdir -p "$UAT_DIR"
 mkdir -p "$SPRINTS_DIR"
 # Observability: enforce jq when tracing is enabled (hard prerequisite), then
 # mint the epic-level trace id and create the span file. No-op unless BMAD_TRACE=1.
 if type require_observability_deps >/dev/null 2>&1; then
    require_observability_deps
 fi
 if type init_observability >/dev/null 2>&1; then
    init_observability
 fi
 # Initialize metrics collection
 EPIC_START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
 EPIC_START_SECONDS=$(date +%s)
@ -1543,6 +1633,7 @@ Do NOT use 'git add -A' or 'git add .' - only stage files you created or modifie
    fi
    # Execute in isolated context — pipe to file to avoid memory bloat
    set_span_context "dev" "$story_id"
    run_claude_to_file "$dev_prompt"
    local result
    result=$(read_phase_tail)
@ -1682,6 +1773,7 @@ Stage any fixes with explicit file paths: git add <file1> <file2> ..."
    fi
    # Execute in isolated context — pipe to file to avoid memory bloat
    set_span_context "review" "$story_id"
    run_claude_to_file "$review_prompt"
    local result
    result=$(read_phase_tail)
@ -1928,6 +2020,7 @@ Address all review findings now. This is attempt $attempt_num of 3."
        fi
        # Pipe to file to avoid memory bloat
        set_span_context "fix" "$story_id"
        run_claude_to_file "-f" "$temp_prompt_file"
        rm -f "$temp_prompt_file"
    else
@ -1937,6 +2030,7 @@ Address all review findings now. This is attempt $attempt_num of 3."
        fi
        # Execute in isolated context — pipe to file to avoid memory bloat
        set_span_context "fix" "$story_id"
        run_claude_to_file "$fix_prompt"
    fi
@ -2395,6 +2489,7 @@ Stage any fixes with: git add <file1> <file2> ..."
    fi
    # Pipe to file to avoid memory bloat
    set_span_context "arch" "$story_id"
    run_claude_to_file "$arch_prompt"
    local result
    result=$(read_phase_tail)
@ -2472,6 +2567,7 @@ Stage any fixes with: git add <file1> <file2> ..."
    fi
    # Pipe to file to avoid memory bloat
    set_span_context "test_quality" "$story_id"
    run_claude_to_file "$quality_prompt"
    local result
    result=$(read_phase_tail)
@ -2607,6 +2703,7 @@ Analyze traceability now. Read story files on-demand as needed."
    fi
    # Pipe to file to avoid memory bloat
    set_span_context "trace" "epic-$EPIC_ID"
    run_claude_to_file "$trace_prompt"
    local result
    result=$(read_phase_tail)
@ -2685,6 +2782,7 @@ Generate missing tests now."
    fi
    # Pipe to file to avoid memory bloat
    set_span_context "trace_fix" "epic-$EPIC_ID"
    run_claude_to_file "$fix_prompt"
    local result
    result=$(read_phase_tail)
@ -3039,6 +3137,7 @@ Generate the UAT document now. Read story files on-demand as needed."
    fi
    # Pipe to file to avoid memory bloat
    set_span_context "uat" "epic-$EPIC_ID"
    run_claude_to_file "$uat_prompt"
    local result
    result=$(read_phase_tail)