feat(epic-execute): add BMAD_TRACE observability + telemetry rollup

Capture the telemetry the claude CLI already emits (session id, tokens,
cost, latency, context window) as OTel-shaped trace spans and roll them
up into deterministic metrics. Gated behind BMAD_TRACE=1; the legacy
text path is unchanged when tracing is off.

- New scripts/epic-execute-lib/observability.sh: span recording, rollup,
  jq dep enforcement, and an intra-phase heartbeat for crash forensics
- epic-execute.sh: stream-json capture in run_claude_to_file with clean
  .result extraction, per-phase set_span_context calls, rollup in cleanup
- epic-chain.sh: measured (non-fabricated) telemetry section in reports
- Guard set -e aborts on malformed stream lines so crash/timeout paths
  degrade gracefully instead of killing the run
- Docs: gap analysis + observability implementation plan

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This commit is contained in:
Caleb 2026-06-15 06:07:48 -05:00
parent 3d6aae746c
commit c525d673fb
5 changed files with 932 additions and 5 deletions

View File

@ -0,0 +1,280 @@
---
title: Agent Loop 2026 Framework — Gap Analysis
---
# Agent Loop 2026 Framework — Gap Analysis
**Date:** 2026-06-15
**Workflows Reviewed:** epic-execute, epic-chain (+ `scripts/epic-execute-lib/`)
**Benchmark:** "Agent Loop Evaluation Framework (2026 Standard)" — Manus AI 5-category rubric
**Status:** Active
---
## Overview
This document scores the epic-execute / epic-chain shell automation against the 2026 Agent Loop Evaluation Framework (context engineering, tool design, runtime governance, error recovery, observability). Each criterion is rated on the rubric's Legacy↔2026 axis (15) with concrete `file:line` evidence, followed by a prioritized remediation roadmap.
The headline: the **objective machinery is already strong** — durable external state, per-phase context isolation, real-tooling gates, and deterministic state transitions are at or near 2026 standard. The gaps cluster in **governance, observability, and "prose where code belongs"** — places where a rule is told to the model instead of enforced in the harness.
**Scope note:** This is a point-in-time review of the scripts as of the date above. Line numbers reference the current `scripts/epic-execute.sh` (3,328 lines), `scripts/epic-chain.sh` (946 lines), and `scripts/epic-execute-lib/*.sh`.
---
## Scorecard
| # | Category | Criterion | Score (15) | Verdict |
|---|----------|-----------|:-----------:|---------|
| 1 | Context & Memory | Context Strategy | 3 | Partial — isolation yes, summarization no |
| 1 | Context & Memory | Tool Disclosure | 5 | 2026 — lean per-phase step templates |
| 1 | Context & Memory | Memory Persistence | 5 | 2026 — checkpoints / metrics / decision log |
| 1 | Context & Memory | Scaffolding Type | 3 | Partial — strong gates, residual personas |
| 2 | Tool Design | Tool Scoping | 2 | Legacy — `--dangerously-skip-permissions` + `eval` |
| 2 | Tool Design | Input/Output Validation | 3 | Partial — schema declared, not enforced |
| 2 | Tool Design | Deterministic Delegation | 5 | 2026 — state/math/writes all in bash |
| 2 | Tool Design | Failure Handling | 4 | 2026-leaning — self-heal loops |
| 3 | Governance (OWASP) | Policy Enforcement | 3 | Partial — one structural, one prompt-only |
| 3 | Governance (OWASP) | Privilege Tiers | 2 | Legacy — one standing autonomous tier, no HITL |
| 3 | Governance (OWASP) | Identity Management | 2 | Legacy — ambient long-lived creds |
| 3 | Governance (OWASP) | Memory Security | 2 | Legacy — unvalidated write-back + re-ingest |
| 4 | Error Recovery / RALF | Session Architecture | 3 | Partial — fresh process/phase, no watchdog |
| 4 | Error Recovery / RALF | Error Diagnosis | 3 | Partial — self-heal yes, retry wrapper dead code |
| 4 | Error Recovery / RALF | Loop Guardrails | 4 | 2026 — hard caps; missing time/token budgets |
| 4 | Error Recovery / RALF | Exit Conditions | 4 | 2026 — deterministic gates; some advisory |
| 5 | Observability | Tracing Infrastructure | 1 | Legacy — no trace ID, no token/cost/OTel |
| 5 | Observability | Evaluation Metrics | 3 | Partial — raw counters, self-graded scores |
| 5 | Observability | Evaluation Integration | 4 | 2026-leaning — embedded inline gates |
**Category averages:** Context **4.0** · Tools **3.75** · Governance **2.25** · Error Recovery **3.5** · Observability **2.67**
---
## What's Already at the 2026 Bar
### Memory Persistence (5/5)
Durable external state survives session death — the rubric's "external state files" pattern:
- Checkpoint files with 7-day expiry — `utils.sh:500-551`, written on exit `epic-execute.sh:74-82`
- Metrics YAML that **resumes and accumulates** counters across sessions — `epic-execute.sh:646-664`, `:779-785`
- Per-story design plans persisted for post-resume dev phases — `epic-execute.sh:144-145`
- Decision log + sprint-status.yaml as external workflow state — `decision-log.sh`, `epic-execute.sh:832-906`
### Tool Disclosure (5/5)
Progressive disclosure: loads lean ~49KB step templates per phase instead of embedding ~40KB workflow YAML — `epic-execute.sh:551-563`, per-phase template selection `:179-183`.
### Deterministic Delegation (5/5)
The LLM emits intent only (a status enum + findings); all counting, duration math, status writes, and pass/fail verdicts run in bash — `epic-execute.sh:693-797`, `:803-906`; `contract-exec.sh:144-170`. Playwright specs are **generated from the harness** (`contract-exec.sh:220-296`), not authored by the model.
### Embedded Probes (4/5)
Quality gates run **inline per story**, not offline batch: arch / test-quality / traceability / static-analysis / regression / contract. The static-analysis and contract gates run **real tooling** (`tsc`, lint, build, pytest, `curl`, `playwright`) whose actual output gates progression and can fail the epic's exit code — `epic-execute.sh:1986-2307`, `:3203-3208`.
---
## Improvement Areas
### HIGH Priority
#### 1. Observability foundation: no trace ID, no token/cost telemetry (Tracing 1/5)
The weakest foundational layer — and the rubric explicitly advises fixing the lowest foundational layer first.
There is no session/trace ID, no OpenTelemetry, and zero token/cost/latency capture. LLM calls are an opaque blob teed to a PID-named text log; the only correlation key is `$$`.
**Evidence:**
- Plain-text logging, no structured fields/spans — `epic-execute.sh:199-217`
- LLM call uninstrumented (no tokens/cost/latency/model captured) — `run_claude_to_file`, `epic-execute.sh:518-529`
- Metrics schema has no tokens/cost/trace ID — `epic-execute.sh:668-688`
- Cost figures in reports are openly **estimated** ("may vary 50200%") — `epic-chain-execution-report.md:225-248`
**Fix:** Generate a session/trace ID at startup and thread it through every phase + the metrics/log/decision-log writes. Capture real usage by invoking `claude` with `--output-format json` (or `--output-format stream-json`) and parsing the `usage` / `total_cost_usd` fields per phase. This unlocks every downstream metric.
---
#### 2. Resilience layer is fully built but never wired in (Error Recovery)
`execute_claude_with_retry`, `run_with_timeout`, and `CLAUDE_TIMEOUT` (600s default) all exist in `utils.sh:55-150` — but have **zero callers**. Every phase uses bare `run_claude_to_file` (`epic-execute.sh:518-529`) with `|| true`, so:
- A hung phase blocks forever (`CLAUDE_TIMEOUT` never applied to the real path)
- Transient errors (429/timeout/503) are not retried
- Crashes are swallowed and silently judged "incomplete"
There is also no watchdog/supervisor and no stuck-loop progress detection — fix loops burn all attempts even on an identical failure set, despite computing failure signatures in `test-failure-filter.sh:139-151`.
**Fix:** Route `run_claude_to_file` through the existing `run_with_timeout` + retry wrapper (mostly plumbing — the code already exists). Add a progress check that compares failure signatures across fix iterations and aborts early when the set is unchanged.
---
#### 3. Governance is structurally bypassable (Privilege 2/5, Policy partial)
The strong git policies — `check_sensitive_files` (`epic-execute.sh:351-391`), `git add -u` (`:2937`), `check_branch_protection` (`utils.sh:444-474`) — only guard the **final `commit_story` path**. Mid-phase, every agent runs `--dangerously-skip-permissions` (`:524`, `:527`) and can `git add -A` / commit anything itself. The "don't use git add -A" rule is injected as **prose in 7+ places** (`:597-598`, `:1772`, `:1875`, …) but enforced in code only at commit time. There is no read/write/destructive privilege separation and no HITL gate ("AUTOMATED… do NOT pause for user confirmation" — `:593-595`).
**Fix:** Move the staging policy into a git **pre-commit hook** (structural, not prose) so it governs agent-authored commits too; re-run `check_sensitive_files` on any commit. Introduce an approval tier (even a coarse env-gated one) for destructive operations.
---
#### 4. Memory poisoning loop is unguarded (Memory Security 2/5)
`append_to_decision_log` writes raw agent output straight to disk (`decision-log.sh:54-76`), and `get_decision_log_context` re-injects the whole log **verbatim** into the next phase's prompt (`decision-log.sh:80-87`) — the exact RAG/memory-poisoning loop OWASP 2026 (ASI06) warns against, with no validation, segmentation, or provenance tagging. `add_metrics_issue` / `record_fix_attempt` also interpolate agent-influenced strings directly into `yq` expressions (`epic-execute.sh:725`, `:745`) — a YAML/expression-injection surface.
**Fix:** Validate and length-bound decision-log entries before commit; tag provenance (which phase/story produced each entry); sanitize or parameterize strings before they enter `yq` expressions.
---
#### 5. Output contracts declared but not enforced (I/O Validation 3/5)
JSON result schemas are prescribed to the model and parsed with `jq`, but malformed/missing output **silently falls back to grepping prose** (`json-output.sh:311-387`, `check_phase_completion_fuzzy` in `utils.sh`). The documented consequence: **9 stories were mis-marked failed** because the model didn't emit the exact `IMPLEMENTATION COMPLETE:` phrase, requiring manual correction (`epic-chain-execution-report.md:254-272`).
**Fix:** Reject non-conforming output and force a bounded retry instead of degrading to regex. Make the JSON result block mandatory (fail the phase if absent). Relatedly, promote the advisory gates (arch / test-quality / traceability / regression — currently "proceed with documented concerns") to blocking where the risk warrants it, matching the deterministic behavior of the static-analysis and contract gates.
---
### MEDIUM Priority
#### 6. Tool Scoping & sandboxing (Tool Scoping 2/5)
`claude --dangerously-skip-permissions` grants the full unrestricted toolset with no per-phase allowlist or sandbox (`epic-execute.sh:524`, `:527`). Harness commands run via raw `eval` on YAML-derived strings — an injection surface — in `contract-exec.sh:43,53,86,168` and `contract-harness.sh:333,351,368`. The production-scope datastore guard is advisory (`log_warn`), not a block (`contract-harness.sh:194-213`).
**Fix:** Run harness commands as argv arrays (no `eval`). Consider a per-phase tool allowlist and/or containerized execution. Promote the production-scope datastore guard from warning to hard block.
#### 7. Context summarization (Context Strategy 3/5)
Per-phase context isolation is excellent (fresh `claude` process per phase, paths-not-contents handoff), but there is no **anchored iterative summarization**: cross-phase carryover is raw grep/sed-extracted text or tail-truncation (decision context truncated to 20KB at `epic-execute.sh:1466`), and the only control is a hard 150KB cap (`MAX_PROMPT_SIZE`, `:398`), not a utilization band.
**Fix:** Add a summarization step between phases — hold an anchor block (story + ACs + constraints) constant while condensing completed-phase outcomes into a structured summary; target 6080% utilization rather than a hard truncate.
#### 8. Identity Management (2/5)
Every `claude` call inherits the operator's ambient, long-lived credentials; harness secrets are consumed as ambient env vars (`contract-harness.sh:205`, `:254`). No unique per-task identity, short-lived tokens, or credential scoping.
**Fix:** Where feasible, issue short-lived/scoped credentials per run; vault harness secrets rather than relying on ambient env.
#### 9. Evaluation Metrics — derive rates, separate the judge (Metrics 3/5)
The raw inputs exist (completed/failed/skipped, fix attempts, `max_retries_hit`) but are never computed into **Task Completion Rate / Escalation Rate / Tool Call Success Rate**. Rubric scores (test-quality ≥70, traceability P0=100%) are **self-graded by the same executing model** (`json-output.sh:473-496`) rather than by an independent calibrated judge.
**Fix:** Compute and persist the derived rates in the metrics YAML. Introduce a separate, cheaper judge model (e.g., Haiku) for binary rubric scoring so the executor isn't grading its own work.
---
### LOW Priority
#### 10. `yq`-dependent durability
Metrics, sprint-status, and issue persistence silently degrade without `yq` installed (`epic-execute.sh:707`, `:790`). The otherwise-excellent memory layer is best-effort, not guaranteed.
**Fix:** Either declare `yq` a hard prerequisite (fail fast at startup) or harden the sed/awk fallbacks to full parity.
#### 11. Vestigial "full workflow YAML" priority tier
`CONTENT_PRIORITY_LOW` is still described as "Full workflow YAML (truncate first)" (`epic-execute.sh:404`), a legacy fallback path no active builder uses. Remove to avoid confusion.
#### 12. Gate status not persisted by standalone runs
`validation.gate_status` is written by the **chain** wrapper (`epic-chain.sh:626-638`), not the inner execute loop, so a standalone `epic-execute.sh` run leaves `gate_status: PENDING`.
---
## Two Structural Themes
1. **Prose where code belongs.** The recurring pattern — git rules, "do NOT pause", personas like *"You ARE an adversarial reviewer"* (`:1616`) — is the compensatory scaffolding the rubric flags: a rule told to the model that the harness could instead enforce. The codebase is mid-migration; the objective gates (contracts, real tooling, JSON results) are already constitutive, but the soft rules haven't caught up.
2. **Built-but-unwired.** The retry/timeout resilience layer (`utils.sh:55-150`) is the clearest example — fully implemented, zero callers. The capability gap is often plumbing, not net-new code.
---
## Suggested Sequencing
Per the rubric's "fix the lowest-scoring foundational layer first":
1. **Observability** (#1) — trace/session ID + real token/cost/latency from `claude --output-format json`. Foundation for everything; currently 1/5.
2. **Wire the existing retry/timeout layer** (#2) — pure plumbing, already-written code, large RALF payoff.
3. **Governance** (#3, #4) — pre-commit hook + sensitive-file re-check on agent commits; validate decision-log/metrics writes before commit.
4. **Enforce JSON contracts** (#5) — fail-and-retry on missing signal instead of fuzzy fallback; promote advisory gates to blocking.
5. **Context summarization** (#7) — anchored iterative summarization targeting a utilization band.
The smallest, highest-leverage starting points are **#2 (retry wiring)** and **#3 (pre-commit hook)**.
---
## References
- Benchmark source: "Agent Loop Evaluation Framework (2026 Standard)," Manus AI — context engineering, tool design, OWASP Top 10 for Agentic Apps 2026, RALF loop, OpenTelemetry-first observability.
- Prior review: [`epic-workflows-v1.md`](./epic-workflows-v1.md) (2026-01-02) — overlaps on the `--dangerously-skip-permissions` finding (#1/#3, #6 here).
---
## Appendix A — Observability Deep Dive
**Added:** 2026-06-15 · Expands HIGH-priority item #1 and Evaluation Metrics (#9).
### A.1 Root cause: a single discard point
Every LLM call routes through `run_claude_to_file` (`epic-execute.sh:518-529`), which uses the CLI's **default `text` output format**:
```bash
claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
```
Only rendered assistant text survives. The chain report generator does the same (`epic-chain.sh:884`). The 1/5 Tracing score is the consequence of this one choice — not an architectural limit. The telemetry is produced on every call and thrown away.
### A.2 What `claude --output-format json` already returns (verified)
Tested against the installed CLI (v2.1.177). The result envelope contains every field 2026 observability requires:
| Field (verified present) | Example | Rubric need it satisfies |
|---|---|---|
| `session_id` | `f6ff5b55-…` | Trace/session ID (today: PID `$$`) |
| `total_cost_usd` | `0.0586` | Real cost (today: fabricated) |
| `usage.input_tokens` / `output_tokens` | 2629 / 4 | Token spend |
| `usage.cache_read_input_tokens` / `cache_creation_input_tokens` | 15362 / 3718 | Cache efficiency |
| `modelUsage[model].costUSD` + per-model tokens | Opus + Haiku sub-agent | Per-model cost attribution |
| `modelUsage[model].contextWindow` | 1000000 | Enables context-utilization % |
| `duration_ms` / `duration_api_ms` / `ttft_ms` | 1757 / 2522 / 1754 | Per-call latency |
| `num_turns`, `stop_reason`, `is_error`, `api_error_status`, `permission_denials` | 1 / end_turn / false / null / [] | Tool-call success / error telemetry |
The CLI also exposes `--output-format stream-json` (live JSONL ending with the same `result` envelope) and `--json-schema <schema>` for structured-output enforcement. All three require `--print`, which the script already passes.
### A.3 The current report is actively misleading, not merely empty
Because real telemetry is discarded, the chain report **fabricates** it (`epic-chain-execution-report.md:225-248`):
- Token table derived from `Est. Calls = stories × 2` and `~16K input/call` assumptions — arithmetic on story counts, not measurement.
- Cost table priced against **Claude Sonnet 3.5 ($3/$15)** and **Opus ($15/$75)** — neither is the model that ran (`claude-opus-4-8[1m]`); the real `total_cost_usd` was available and discarded.
- Carries the disclaimer *"Actual usage may vary by 50-200%."*
An authoritative-looking cost table that is invented is worse than a blank cell — it is unfalsifiable noise where ground truth was one flag away.
### A.4 Synergy: this fix lifts three other findings
The same envelope partially closes gaps scored elsewhere:
1. **Context Strategy (#7).** `contextWindow` + `input_tokens + cache_read + cache_creation` yields exact per-phase utilization, making the 6080% target measurable and enforceable for free.
2. **I/O Validation (#5) + the 9-mismark incident.** Parsing `.result` (clean final message) instead of scraping interleaved stdout, plus `--json-schema` to make the status field structurally mandatory, removes the fuzzy-regex fallback (`json-output.sh:311-387`) that mismarked 9 stories (`epic-chain-execution-report.md:254-272`). That incident is fundamentally an output-format problem.
3. **Evaluation Metrics (#9).** Enables the rubric's business metrics: Task Completion Rate (`completed/total`), Escalation Rate (`max_retries_hit/stories` + real `is_error` rate), Tool Call Success Rate (`is_error=false` phases ÷ total).
### A.5 Target design (fits the existing architecture)
Constraints: preserve the memory-safe "pipe to file, read 32KB tail" pattern, and keep the live `tee` to the log.
1. **Switch to `stream-json`, not plain `json`.** Plain `json` buffers and kills the live tee. `--output-format stream-json --include-partial-messages` streams live *and* makes the **last JSON line** the `result` envelope; `read_phase_tail` still captures it (parse the last line where `.type=="result"`). Memory-safety preserved.
2. **One append-only trace file per epic**, using the OTel span data model (convertible to OTLP later):
`docs/sprint-artifacts/traces/epic-<id>-trace.jsonl` — one span per phase:
```json
{"trace_id":"<epic-uuid>","span_id":"<claude session_id>","parent":"<story_id>",
"name":"dev","story_id":"4-3","model":"claude-opus-4-8[1m]",
"input_tokens":2629,"output_tokens":4,"cache_read":15362,"cost_usd":0.058,
"duration_ms":1757,"ttft_ms":1754,"num_turns":1,"is_error":false,
"ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…"}
```
Generate one epic-level `trace_id` (`uuidgen`) at startup; each call's `session_id` is the `span_id`, `story_id` the parent. This is the single correlating ID `$$` never provided.
3. **Deterministic rollup into `metrics.yaml`** — add a `telemetry:` block summed from the JSONL (no model, no fabrication): `total_cost_usd`, `total_input_tokens`, `total_output_tokens`, `cache_read_tokens`, `by_phase`. The chain report then reads measured numbers; the `Estimated Token Usage` section is deleted.
4. **OTel bridge (phase 2, optional).** JSONL-with-OTel-fields is the pragmatic 80%. A later post-processor converts spans → OTLP without touching the hot path.
### A.6 Caveats
- **jq dependency** — telemetry parsing needs `jq` (same soft-dep fragility as `yq`, item #10). Degrade gracefully (skip span, don't crash); consider making `jq` a hard startup prerequisite.
- **Cost includes sub-agents**`modelUsage` surfaced an internal Haiku call inside an Opus phase. Record `modelUsage` verbatim; don't flatten to one model.
- **Cache tokens dominate** — in testing, cache-read (15K) was 6× fresh input (2.6K). Report fresh vs. cache-read separately; compute utilization from the sum.
- **`stream-json` is noisier on disk** — log grows faster (every partial chunk). The existing 64KB inter-story log truncation (`epic-execute.sh:3182`) mitigates; confirm it suffices.
### A.7 Why this is the right place to start
Observability scored lowest yet is the cheapest high-priority fix and the only one that drags three other findings upward. The data already exists on every call; the work is plumbing (a format switch + a `record_span` helper + a deterministic rollup), not building telemetry infrastructure.

View File

@ -0,0 +1,118 @@
---
title: Observability Implementation Plan
---
# Observability Implementation Plan
**Date:** 2026-06-15
**Targets:** HIGH item #1 (Tracing 1/5), with synergistic wins on #5 (I/O Validation) and #9 (Evaluation Metrics)
**Source analysis:** [`agent-loop-2026-gap-analysis.md`](./agent-loop-2026-gap-analysis.md) — Appendix A
**Status:** Phases 02 implemented on branch `feat/observability-tracing` (gated behind `BMAD_TRACE=1`). Phase 3 (`--json-schema` enforcement) and Phase 4 (OTLP bridge) deferred to separate PRs. Chain-level trace id (decision #5) still deferred.
**Locked decisions:** (1) ambient vars · (2) jq live renderer, capture tee placed upstream of jq so renderer failure is cosmetic-only · (3) separate PRs · (4) hard-fail on jq when `BMAD_TRACE=1` · (5) chain-level trace id deferred.
**Added beyond original plan — intra-phase heartbeat.** Per request, `start_phase_heartbeat`/`emit_heartbeat`/`stop_phase_heartbeat` append a liveness beat to `epic-<id>-trace.live.jsonl` every `BMAD_TRACE_HEARTBEAT_INTERVAL` seconds (default 10) while a phase runs — so a `kill -9` mid-phase leaves a forensic trail (which phase, elapsed, context size, last assistant text). Reads only the latest assistant event (bounded I/O, no slurp); self-terminates within one interval if the main process dies; defensively stopped in `cleanup()`.
**Implementation note:** all new tracing logic lives in `scripts/epic-execute-lib/observability.sh` (new module). `epic-execute.sh` changes are limited to: sourcing the module, a startup dep-check + `init_observability`, the `run_claude_to_file` rewrite, a `record_span` call inside that helper, `rollup_telemetry` in `cleanup()`, one temp-file cleanup line, and 9 one-line `set_span_context` calls. Verified against the live CLI: real span lands with correct model/tokens/cost/ctx%, clean-text compatibility holds (downstream `IMPLEMENTATION COMPLETE` detection still works), crash path writes a degraded error span, legacy path (`BMAD_TRACE` unset) is byte-for-byte unchanged, and the hard-fail exits 1 when jq is absent.
---
## Goal
Stop discarding the telemetry the `claude` CLI already emits on every call. Capture it as OTel-shaped trace spans, roll it up into deterministic (non-fabricated) metrics, and — as a side effect of the same code-path change — replace fragile stdout-scraping with clean `.result` parsing.
## Verified CLI facts (claude 2.1.177)
All confirmed against the installed CLI on 2026-06-15:
1. `--output-format json` returns one result object with: `session_id`, `total_cost_usd`, `usage.{input_tokens,output_tokens,cache_read_input_tokens,cache_creation_input_tokens}`, `modelUsage[model].{costUSD,inputTokens,outputTokens,contextWindow}`, `duration_ms`, `duration_api_ms`, `ttft_ms`, `num_turns`, `stop_reason`, `is_error`, `api_error_status`, `permission_denials`.
2. `--output-format stream-json` **requires `--verbose`** (hard error otherwise). Emits JSONL: `system/init``rate_limit_event``assistant`(s)/`tool_use`/`tool_result` → **`result`** (always the last line).
3. The final `result` line carries `.result` = the clean final assistant text, plus the full telemetry envelope from (1). It is a single line, so `tail -n 1 | jq` extracts it cheaply regardless of phase length — the memory-safe "read the tail" design is preserved.
4. A stderr warning (`no stdin data received in 3s…`) leaks into `2>&1`; suppressed with `< /dev/null`.
5. `modelUsage` attributes cost across models — internal sub-agents (e.g. a Haiku helper inside an Opus phase) show up separately. Record verbatim; do not flatten.
6. `uuidgen`, `jq`, `yq` all present on the dev machine.
## Design decision: stream-json + `.result` extraction (compatibility-preserving)
`run_claude_to_file` is the single chokepoint (`epic-execute.sh:518-529`); all 9 phase call sites follow `run_claude_to_file …; result=$(read_phase_tail)`. The downstream parsers (`extract_json_result`, `check_phase_completion`) expect `PHASE_OUTPUT_FILE` to hold the rendered assistant text.
**Plan:** switch the invocation to `stream-json --verbose`, write the raw JSONL to a separate stream file, then post-process:
- Extract `.result` (clean text) from the final `result` line → write to `PHASE_OUTPUT_FILE`. **All 9 call sites keep working unchanged**, and parsing improves (no interleaved tool output → fixes the 9-mismark incident, gap #5).
- Extract telemetry from the same line → `record_span` (new) appends a JSONL span.
- Graceful degradation: if `jq` is absent, no `result` line is found, or the call crashed, fall back to today's raw-text behavior so nothing regresses.
This keeps the blast radius to one function plus two new helpers; the 9 call sites are untouched.
---
## Work breakdown (phased)
### Phase 0 — Prerequisites & scaffolding (low risk)
- [ ] Add `jq` to a startup capability check; decide hard-fail vs. graceful-degrade (see Open Decision 4). Mirror the existing `yq` check pattern.
- [ ] Create `TRACES_DIR="$SPRINT_ARTIFACTS_DIR/traces"`; `mkdir -p` alongside the other artifact dirs (`epic-execute.sh:138-145`).
- [ ] Generate an epic-level `TRACE_ID=$(uuidgen)` at startup (near `EPIC_START_SECONDS`, `:1283`). This is the single correlating ID `$$` never provided.
### Phase 1 — Instrument the invocation (core, additive)
- [ ] Rewrite `run_claude_to_file` to use `stream-json --verbose`, raw JSONL → `$PHASE_STREAM_FILE`, live-render assistant text to the log (Open Decision 2), then derive `PHASE_OUTPUT_FILE` from `.result`.
- [ ] Add `record_span <phase> <story_id>` — reads `tail -n 1 "$PHASE_STREAM_FILE"`, validates `.type=="result"`, appends one OTel-shaped span to `$TRACES_DIR/epic-<id>-trace.jsonl`.
- [ ] Pass `phase` + `story_id` into `run_claude_to_file` (currently positional `$1`/`-f`); thread through the 9 call sites or set module-level `CURRENT_PHASE`/`CURRENT_STORY_ID` before each call (Open Decision 1 — signature vs. ambient vars).
- [ ] Compute `ctx_util_pct` = (input + cache_read + cache_creation) / `contextWindow` and `log_warn` when > 80% (feeds Context Strategy, gap #7).
Span schema (one line per phase):
```json
{"trace_id":"…","span_id":"<session_id>","parent":"<story_id>","name":"dev",
"story_id":"4-3","model":"claude-opus-4-8[1m]","input_tokens":2705,
"output_tokens":4,"cache_read":19080,"cost_usd":0.0237,"duration_ms":1089,
"ttft_ms":1012,"num_turns":1,"is_error":false,"api_error_status":null,
"ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…Z"}
```
### Phase 2 — Deterministic rollup (replaces fabrication)
- [ ] Add `rollup_telemetry` — sum the JSONL spans into a `telemetry:` block in `metrics.yaml` (`total_cost_usd`, `total_input_tokens`, `total_output_tokens`, `cache_read_tokens`, `by_phase`). Call from `finalize_metrics` (`:765`).
- [ ] Add derived metrics (gap #9): Task Completion Rate, Escalation Rate (`max_retries_hit/stories` + real `is_error` rate), Tool Call Success Rate (`is_error=false` ÷ phases).
- [ ] Chain: delete the fabricated `Estimated Token Usage` / `Cost Estimates` sections (`epic-chain-execution-report.md` generator, `epic-chain.sh:831-876`); have the report read measured `telemetry:` from each epic's metrics. Also instrument the chain's own report-generator `claude` call (`epic-chain.sh:884`).
### Phase 3 — Contract enforcement (synergistic, higher risk — gate separately)
- [ ] Use `--json-schema` to make the completion `status` field structurally mandatory, removing the fuzzy-regex fallback (`json-output.sh:311-387`). Sequence **after** Phases 12 prove stable, since it changes completion semantics.
### Phase 4 — OTel bridge (optional, later)
- [ ] Post-processor: JSONL spans → OTLP exporter to a collector. No hot-path change. Defer until someone needs a dashboard.
---
## Test strategy
- **Unit (bats or shell asserts):** feed a captured `result` JSONL fixture into `record_span`/`rollup_telemetry`; assert span fields and YAML sums. Capture one real fixture now (we already have sample envelopes) and commit under `test/fixtures/`.
- **Degradation:** run the new path with `jq` shadowed out of `PATH`; assert it falls back to raw text and does not crash.
- **Crash path:** feed a truncated stream (no `result` line); assert graceful fallback + a span marked `is_error`/incomplete.
- **End-to-end:** `--dry-run` won't call claude, so add a tiny real one-story epic to a scratch fixture and confirm a populated `trace.jsonl` + `telemetry:` block.
- **Regression:** confirm all 9 phases still parse completion correctly (the `.result` text is a superset of what they saw before).
## Rollout / safety
- Gate Phase 1 behind an env flag (e.g. `BMAD_TRACE=1`) for the first iteration so the old text path stays default until the new one is proven, then flip the default.
- Phases 12 are observe-only (no behavior change beyond cleaner parsing). Phase 3 changes control flow — ship it on its own PR per the repo's "<800 lines, split larger changes" rule.
## Open decisions (need sign-off before coding)
1. **Threading phase/story into the invocation** — change `run_claude_to_file`'s signature, or set ambient `CURRENT_PHASE`/`CURRENT_STORY_ID` before each call? Ambient is a smaller diff (no call-site edits) but more implicit. *Recommendation: ambient vars set in the phase functions.*
2. **Live log format** — tee raw JSONL to the log (ugly for humans tailing it) vs. pipe through a small `jq` renderer that prints assistant text live. *Recommendation: jq renderer for the terminal/log, raw JSONL to the stream file.*
3. **Bundle scope** — observability-only (Phases 02), or include the contract enforcement (Phase 3) in the same effort since it touches the same path? *Recommendation: separate PRs; land 02 first.*
4. **`jq` dependency** — hard prerequisite (fail fast at startup) or graceful degradation? Telemetry is useless without it, but failing hard is a sharper UX change. *Recommendation: graceful-degrade with a loud one-time warning, consistent with the current `yq` handling.*
5. **Trace ID propagation to chain** — should `epic-chain.sh` mint a chain-level trace ID that each epic inherits as a parent span, giving one trace across the whole chain? *Recommendation: yes, but defer to Phase 2.*
---
## Effort estimate
| Phase | Scope | Risk | Rough size |
|-------|-------|------|-----------|
| 0 | dirs, trace id, jq check | low | ~30 lines |
| 1 | invocation + record_span | medium | ~120 lines (1 fn rewrite + 1 new) |
| 2 | rollup + derived metrics + chain report | lowmed | ~150 lines |
| 3 | --json-schema enforcement | mediumhigh | separate PR |
| 4 | OTLP bridge | low (isolated) | deferred |
Phases 02 are the core observability win and are mostly additive plumbing.
</content>
</invoke>

View File

@ -240,6 +240,15 @@ EOF
--- ---
EOF
# Measured telemetry (real numbers, or an honest "not captured")
build_measured_telemetry_md >> "$CHAIN_REPORT_FILE"
cat >> "$CHAIN_REPORT_FILE" << EOF
---
## Artifacts ## Artifacts
| Artifact | Location | | Artifact | Location |
@ -259,6 +268,57 @@ EOF
log_success "Basic report created: $CHAIN_REPORT_FILE" log_success "Basic report created: $CHAIN_REPORT_FILE"
} }
# Build a MEASURED telemetry markdown section by aggregating the deterministic
# `telemetry:` blocks that epic-execute writes into each epic's metrics.yaml
# (populated when run with BMAD_TRACE=1). Emits real per-epic + total token/cost
# numbers — never estimates. If no epic captured telemetry, says so plainly.
# Output: markdown on stdout.
build_measured_telemetry_md() {
if ! command -v yq >/dev/null 2>&1; then
echo "## Telemetry"
echo ""
echo "_Telemetry rollup requires \`yq\` (not found)._"
return 0
fi
local any=false
local total_cost=0 total_in=0 total_out=0
local rows=""
for epic_id in "${EPIC_IDS[@]}"; do
local mf="$METRICS_DIR/epic-${epic_id}-metrics.yaml"
[ -f "$mf" ] || continue
local has
has=$(yq '.telemetry // "" | tag' "$mf" 2>/dev/null)
[ "$has" = "!!map" ] || continue
any=true
local cost in out
cost=$(yq '.telemetry.total_cost_usd // 0' "$mf" 2>/dev/null)
in=$(yq '.telemetry.total_input_tokens // 0' "$mf" 2>/dev/null)
out=$(yq '.telemetry.total_output_tokens // 0' "$mf" 2>/dev/null)
rows+="| $epic_id | $in | $out | \$$(printf '%.4f' "$cost" 2>/dev/null || echo "$cost") |"$'\n'
total_cost=$(awk "BEGIN { printf \"%.6f\", $total_cost + $cost }")
total_in=$(( total_in + ${in%.*} ))
total_out=$(( total_out + ${out%.*} ))
done
echo "## Telemetry (Measured)"
echo ""
if [ "$any" != true ]; then
echo "_No measured telemetry captured. Re-run with \`BMAD_TRACE=1\` to record"
echo "real token usage and cost per epic (see docs/improvements/observability-implementation-plan.md)._"
return 0
fi
echo "Actual usage captured from the Claude CLI result envelopes (not estimated):"
echo ""
echo "| Epic | Input Tokens | Output Tokens | Cost (USD) |"
echo "|------|--------------|---------------|------------|"
printf '%s' "$rows"
echo "| **Total** | **$total_in** | **$total_out** | **\$$(printf '%.4f' "$total_cost")** |"
}
# ============================================================================= # =============================================================================
# Argument Parsing # Argument Parsing
# ============================================================================= # =============================================================================
@ -828,6 +888,10 @@ if [ "$GENERATE_REPORT" = true ] && [ "$DRY_RUN" = false ]; then
WORKFLOW_PATH="$PROJECT_ROOT/src/bmm/workflows/4-implementation/epic-chain" WORKFLOW_PATH="$PROJECT_ROOT/src/bmm/workflows/4-implementation/epic-chain"
fi fi
# Precompute the MEASURED telemetry section in bash so the report uses
# ground-truth numbers and the model has no room to invent token/cost data.
measured_telemetry_md=$(build_measured_telemetry_md)
# Build report generation prompt # Build report generation prompt
report_prompt="You are Bob, the Scrum Master, generating a chain execution report. report_prompt="You are Bob, the Scrum Master, generating a chain execution report.
@ -835,6 +899,17 @@ if [ "$GENERATE_REPORT" = true ] && [ "$DRY_RUN" = false ]; then
Generate a comprehensive chain execution report for the completed epic chain. Generate a comprehensive chain execution report for the completed epic chain.
## CRITICAL: No Fabricated Metrics
Do NOT estimate, infer, or invent token counts, costs, or call counts. Token
and cost figures come ONLY from the measured telemetry section below. If a value
is not present there, write \"not captured\" — never a guess or a typical-pattern
estimate. Include the measured telemetry section verbatim in the report.
## Measured Telemetry (ground truth — include verbatim)
${measured_telemetry_md}
## Configuration ## Configuration
- Chain Plan: $CHAIN_PLAN_FILE - Chain Plan: $CHAIN_PLAN_FILE

View File

@ -0,0 +1,355 @@
#!/bin/bash
#
# BMAD Epic Execute - Observability Module
#
# Captures the telemetry the `claude` CLI already emits (session id, token
# usage, cost, latency, context window) as OTel-shaped trace spans, and rolls
# them up into deterministic (non-fabricated) metrics.
#
# Source: docs/improvements/observability-implementation-plan.md
#
# Usage: Sourced by epic-execute.sh. Relies on jq (hard prerequisite, enforced
# by require_observability_deps below) and the following globals from the
# parent script: SPRINT_ARTIFACTS_DIR, EPIC_ID, LOG_FILE, VERBOSE.
#
# Gated behind BMAD_TRACE: tracing is only active when BMAD_TRACE=1 (default
# off during initial rollout). The invocation helper in epic-execute.sh falls
# back to the legacy text path when tracing is disabled.
#
# =============================================================================
# Observability State
# =============================================================================
# Epic-level trace id (one per run); each phase's claude session_id is a span id.
TRACE_ID="${TRACE_ID:-}"
# Path to the append-only span log for this epic (set by init_observability).
TRACE_FILE=""
# Ambient context for the next span (set by the phase functions before each
# claude invocation — see Open Decision 1 in the implementation plan).
CURRENT_PHASE=""
CURRENT_STORY_ID=""
# Whether tracing is active for this run.
TRACE_ENABLED=false
# Intra-phase heartbeat: appends a liveness beat to a .live.jsonl every N
# seconds while a phase is running, so a hard kill (SIGKILL) mid-phase still
# leaves a forensic trail of which phase was in flight and how far it got.
HEARTBEAT_INTERVAL="${BMAD_TRACE_HEARTBEAT_INTERVAL:-10}"
LIVE_TRACE_FILE=""
HEARTBEAT_PID=""
PHASE_START_SECONDS=""
# Main script PID ($$ is the sourcing shell = the epic-execute process).
MAIN_PID="$$"
# =============================================================================
# Dependency Enforcement (Open Decision 4: hard fail)
# =============================================================================
# jq is load-bearing for observability in three places: the live renderer,
# span extraction, and clean .result extraction. When tracing is enabled it is
# a hard prerequisite — too much rides on it to silently degrade.
require_observability_deps() {
if [ "${BMAD_TRACE:-}" != "1" ]; then
return 0
fi
if ! command -v jq >/dev/null 2>&1; then
log_error "BMAD_TRACE=1 requires 'jq' but it was not found on PATH."
log_error "Install jq (https://jqlang.github.io/jq/) or unset BMAD_TRACE."
exit 1
fi
return 0
}
# =============================================================================
# Initialization
# =============================================================================
# Initialize the trace for this epic. Mints a trace id (if not already set) and
# creates the span file. No-op unless BMAD_TRACE=1.
init_observability() {
if [ "${BMAD_TRACE:-}" != "1" ]; then
TRACE_ENABLED=false
return 0
fi
if [ -z "$SPRINT_ARTIFACTS_DIR" ] || [ -z "$EPIC_ID" ]; then
log_warn "Cannot initialize tracing: SPRINT_ARTIFACTS_DIR or EPIC_ID not set"
TRACE_ENABLED=false
return 1
fi
if [ -z "$TRACE_ID" ]; then
if command -v uuidgen >/dev/null 2>&1; then
TRACE_ID=$(uuidgen | tr '[:upper:]' '[:lower:]')
else
# Fallback id: epic + pid + start seconds (still unique per run)
TRACE_ID="epic-${EPIC_ID}-$$-$(date +%s)"
fi
fi
TRACES_DIR="${TRACES_DIR:-$SPRINT_ARTIFACTS_DIR/traces}"
mkdir -p "$TRACES_DIR" 2>/dev/null || true
TRACE_FILE="$TRACES_DIR/epic-${EPIC_ID}-trace.jsonl"
LIVE_TRACE_FILE="$TRACES_DIR/epic-${EPIC_ID}-trace.live.jsonl"
TRACE_ENABLED=true
log "Tracing enabled (trace_id=$TRACE_ID) -> $TRACE_FILE"
return 0
}
# Set the ambient context for the next span. Called by phase functions before
# invoking claude (ambient-vars approach, Open Decision 1).
set_span_context() {
CURRENT_PHASE="$1"
CURRENT_STORY_ID="$2"
}
# =============================================================================
# Intra-phase Heartbeat (liveness for debugging hangs / hard kills)
# =============================================================================
# Append one heartbeat record describing the in-flight phase. Reads only the
# latest assistant event from the stream (bounded I/O — no slurp), so it stays
# cheap even for long phases with large tool output. Best-effort: a transient
# partial line just skips this tick.
emit_heartbeat() {
local stream_file="$1"
{ [ "$TRACE_ENABLED" = true ] && [ -n "$LIVE_TRACE_FILE" ] && [ -f "$stream_file" ]; } || return 0
local now elapsed ts events last_asst
now=$(date +%s)
elapsed=$(( now - ${PHASE_START_SECONDS:-$now} ))
ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
events=$(wc -l < "$stream_file" 2>/dev/null | tr -d ' '); events="${events:-0}"
last_asst=$(grep '"type":"assistant"' "$stream_file" 2>/dev/null | tail -n 1)
if [ -n "$last_asst" ]; then
printf '%s' "$last_asst" | jq -c \
--arg trace_id "$TRACE_ID" --arg phase "$CURRENT_PHASE" --arg story "$CURRENT_STORY_ID" \
--arg ts "$ts" --argjson elapsed "$elapsed" --argjson events "$events" '
(.message.usage // {}) as $u
| { trace_id:$trace_id, kind:"heartbeat", phase:$phase, story_id:$story,
elapsed_s:$elapsed, ts:$ts, events:$events,
ctx_input_tokens: ($u.input_tokens // 0),
cache_read: ($u.cache_read_input_tokens // 0),
out_tokens: ($u.output_tokens // 0),
last_text: ((([ .message.content[]? | select(.type=="text") | .text ] | last) // "")[0:160]) }' \
>> "$LIVE_TRACE_FILE" 2>/dev/null || true
else
# No assistant output yet — still emit a beat so you can see it started.
jq -nc \
--arg trace_id "$TRACE_ID" --arg phase "$CURRENT_PHASE" --arg story "$CURRENT_STORY_ID" \
--arg ts "$ts" --argjson elapsed "$elapsed" --argjson events "$events" \
'{trace_id:$trace_id, kind:"heartbeat", phase:$phase, story_id:$story,
elapsed_s:$elapsed, ts:$ts, events:$events, ctx_input_tokens:0, out_tokens:0, last_text:""}' \
>> "$LIVE_TRACE_FILE" 2>/dev/null || true
fi
}
# Start a background heartbeat for the current phase. Emits an immediate beat,
# then one every HEARTBEAT_INTERVAL seconds until the phase ends or the main
# process dies (self-terminates within one interval on SIGKILL).
start_phase_heartbeat() {
local stream_file="$1"
{ [ "$TRACE_ENABLED" = true ] && [ "${HEARTBEAT_INTERVAL:-0}" -gt 0 ] 2>/dev/null; } || return 0
PHASE_START_SECONDS=$(date +%s)
emit_heartbeat "$stream_file"
(
while kill -0 "$MAIN_PID" 2>/dev/null; do
sleep "$HEARTBEAT_INTERVAL"
emit_heartbeat "$stream_file"
done
) &
HEARTBEAT_PID=$!
}
# Stop the background heartbeat (called when the phase's claude call returns,
# and defensively from cleanup()).
stop_phase_heartbeat() {
[ -n "${HEARTBEAT_PID:-}" ] || return 0
kill "$HEARTBEAT_PID" 2>/dev/null || true
wait "$HEARTBEAT_PID" 2>/dev/null || true
HEARTBEAT_PID=""
}
# =============================================================================
# Span Recording
# =============================================================================
# Append one OTel-shaped span for the just-completed claude phase.
# Reads the final `result` envelope from the raw stream file (always the last
# JSONL line) and derives the telemetry fields.
#
# Arguments:
# $1 - raw stream file (JSONL emitted by claude --output-format stream-json)
# $2 - phase status (optional; the downstream completion status, e.g. COMPLETE)
record_span() {
local stream_file="$1"
local status="${2:-}"
[ "$TRACE_ENABLED" = true ] || return 0
[ -n "$TRACE_FILE" ] || return 0
[ -f "$stream_file" ] || return 0
# The result envelope is always the last line. A single JSONL line, so this
# is memory-cheap regardless of phase length.
local envelope
envelope=$(tail -n 1 "$stream_file" 2>/dev/null)
# Validate it is actually the result event; if the call crashed mid-stream
# the last line won't be a result — record a degraded error span instead.
local etype
etype=$(printf '%s' "$envelope" | jq -r '.type // empty' 2>/dev/null) || true
local ts
ts=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
if [ "$etype" != "result" ]; then
jq -nc \
--arg trace_id "$TRACE_ID" \
--arg parent "$CURRENT_STORY_ID" \
--arg name "$CURRENT_PHASE" \
--arg story_id "$CURRENT_STORY_ID" \
--arg status "${status:-UNKNOWN}" \
--arg ts "$ts" \
'{trace_id:$trace_id, span_id:null, parent:$parent, name:$name,
story_id:$story_id, model:null, input_tokens:0, output_tokens:0,
cache_read:0, cost_usd:0, duration_ms:0, ttft_ms:0, num_turns:0,
is_error:true, api_error_status:"no_result_envelope",
ctx_util_pct:0, status:$status, ts:$ts}' \
>> "$TRACE_FILE" 2>/dev/null || true
[ "$VERBOSE" = true ] && log_warn "record_span: no result envelope in stream (degraded span)"
return 0
fi
# Pull the primary (largest-context) model for ctx utilization. Sub-agent
# usage is preserved verbatim under model/cost via the envelope, but for the
# span's headline model we take the model with the largest context window.
printf '%s' "$envelope" | jq -c \
--arg trace_id "$TRACE_ID" \
--arg parent "$CURRENT_STORY_ID" \
--arg name "$CURRENT_PHASE" \
--arg story_id "$CURRENT_STORY_ID" \
--arg status "$status" \
--arg ts "$ts" '
# Choose headline model = the one with the largest contextWindow
(.modelUsage // {}) as $mu
| ($mu | to_entries | sort_by(.value.contextWindow // 0) | last) as $primary
| ($primary.key // .usage.service_tier // null) as $model
| (($primary.value.contextWindow) // 0) as $ctxwin
| ((.usage.input_tokens // 0)
+ (.usage.cache_read_input_tokens // 0)
+ (.usage.cache_creation_input_tokens // 0)) as $ctx_used
| (if $ctxwin > 0 then (($ctx_used / $ctxwin) * 100 * 10 | round / 10) else 0 end) as $ctx_pct
| {
trace_id: $trace_id,
span_id: (.session_id // null),
parent: $parent,
name: $name,
story_id: $story_id,
model: $model,
input_tokens: (.usage.input_tokens // 0),
output_tokens: (.usage.output_tokens // 0),
cache_read: (.usage.cache_read_input_tokens // 0),
cost_usd: (.total_cost_usd // 0),
duration_ms: (.duration_ms // 0),
ttft_ms: (.ttft_ms // 0),
num_turns: (.num_turns // 0),
is_error: (.is_error // false),
api_error_status: (.api_error_status // null),
ctx_util_pct: $ctx_pct,
status: $status,
ts: $ts
}' >> "$TRACE_FILE" 2>/dev/null || true
# Context-utilization warning feeds the Context Strategy gap (#7).
local ctx_pct
ctx_pct=$(printf '%s' "$envelope" | jq -r '
(.modelUsage // {} | to_entries | sort_by(.value.contextWindow // 0) | last) as $p
| (($p.value.contextWindow) // 0) as $w
| ((.usage.input_tokens // 0) + (.usage.cache_read_input_tokens // 0) + (.usage.cache_creation_input_tokens // 0)) as $u
| if $w > 0 then (($u / $w) * 100 | floor) else 0 end' 2>/dev/null)
if [ -n "$ctx_pct" ] && [ "$ctx_pct" -ge 80 ] 2>/dev/null; then
log_warn "Context utilization high: ${ctx_pct}% (${CURRENT_PHASE} / ${CURRENT_STORY_ID})"
fi
return 0
}
# =============================================================================
# Rollup (deterministic telemetry into metrics.yaml)
# =============================================================================
# Sum all spans for this epic and write a telemetry block into metrics.yaml.
# Deterministic: no model involved, no fabrication. No-op without yq.
rollup_telemetry() {
[ "$TRACE_ENABLED" = true ] || return 0
[ -n "$TRACE_FILE" ] && [ -f "$TRACE_FILE" ] || return 0
[ -n "$METRICS_FILE" ] && [ -f "$METRICS_FILE" ] || return 0
command -v yq >/dev/null 2>&1 || { log_warn "yq not found - skipping telemetry rollup"; return 0; }
# Aggregate totals + per-phase breakdown from the JSONL spans.
local totals_json
totals_json=$(jq -s '
{
total_cost_usd: (map(.cost_usd) | add // 0),
total_input_tokens: (map(.input_tokens) | add // 0),
total_output_tokens: (map(.output_tokens) | add // 0),
cache_read_tokens: (map(.cache_read) | add // 0),
phases_total: length,
phases_errored: (map(select(.is_error == true)) | length),
by_phase: (group_by(.name) | map({
key: (.[0].name // "unknown"),
value: {
calls: length,
cost_usd: (map(.cost_usd) | add // 0),
input_tokens: (map(.input_tokens) | add // 0),
output_tokens: (map(.output_tokens) | add // 0)
}
}) | from_entries)
}' "$TRACE_FILE" 2>/dev/null)
[ -z "$totals_json" ] && return 0
# Tool Call Success Rate (phase-level proxy): non-errored phases / total.
local rate
rate=$(printf '%s' "$totals_json" | jq -r '
if .phases_total > 0
then (((.phases_total - .phases_errored) / .phases_total) * 100 * 10 | round / 10)
else 0 end' 2>/dev/null)
# Write the telemetry block (yq reads the JSON via env to avoid quoting hell).
TELEMETRY_JSON="$totals_json" yq -i '.telemetry = (strenv(TELEMETRY_JSON) | from_json) | (.telemetry | ..) style=""' "$METRICS_FILE" 2>/dev/null || {
log_warn "Failed to write telemetry block to metrics"
return 0
}
yq -i ".telemetry.trace_id = \"$TRACE_ID\"" "$METRICS_FILE" 2>/dev/null || true
# 2026 business-outcome rates (gap #9). Derived deterministically from the
# counters already in metrics.yaml — no model, no estimation.
# - Tool Call Success Rate (phase-level proxy): non-errored phases / total
# - Task Completion Rate: completed stories / total stories
# - Escalation Rate: stories that exhausted retries / total stories
local total completed max_retries
total=$(yq '.stories.total // 0' "$METRICS_FILE" 2>/dev/null); total="${total:-0}"
completed=$(yq '.stories.completed // 0' "$METRICS_FILE" 2>/dev/null); completed="${completed:-0}"
max_retries=$(yq '.fix_loop.max_retries_hit // 0' "$METRICS_FILE" 2>/dev/null); max_retries="${max_retries:-0}"
local completion_rate=0 escalation_rate=0
if [ "$total" -gt 0 ] 2>/dev/null; then
completion_rate=$(awk "BEGIN { printf \"%.1f\", ($completed / $total) * 100 }")
escalation_rate=$(awk "BEGIN { printf \"%.1f\", ($max_retries / $total) * 100 }")
fi
yq -i ".telemetry.tool_call_success_rate = $rate" "$METRICS_FILE" 2>/dev/null || true
yq -i ".telemetry.task_completion_rate = $completion_rate" "$METRICS_FILE" 2>/dev/null || true
yq -i ".telemetry.escalation_rate = $escalation_rate" "$METRICS_FILE" 2>/dev/null || true
log "Telemetry rolled up into metrics: $METRICS_FILE"
return 0
}

View File

@ -43,6 +43,11 @@ cleanup() {
# Disable trap during cleanup # Disable trap during cleanup
trap - EXIT INT TERM trap - EXIT INT TERM
# Stop any in-flight heartbeat background process first
if type stop_phase_heartbeat >/dev/null 2>&1; then
stop_phase_heartbeat
fi
echo "" echo ""
log "Cleaning up (exit code: $exit_code)..." log "Cleaning up (exit code: $exit_code)..."
@ -58,6 +63,11 @@ cleanup() {
finalize_metrics "${#STORIES[@]}" "${COMPLETED:-0}" "${FAILED:-0}" "${SKIPPED:-0}" "$duration" finalize_metrics "${#STORIES[@]}" "${COMPLETED:-0}" "${FAILED:-0}" "${SKIPPED:-0}" "$duration"
log "Metrics finalized: $METRICS_FILE" log "Metrics finalized: $METRICS_FILE"
fi fi
# Roll up trace spans into a deterministic telemetry block (no-op unless traced)
if type rollup_telemetry >/dev/null 2>&1; then
rollup_telemetry
fi
fi fi
# Report git status # Report git status
@ -97,8 +107,9 @@ cleanup() {
# Kill orphaned node/test processes # Kill orphaned node/test processes
kill_orphaned_test_processes kill_orphaned_test_processes
# Clean up phase output temp file # Clean up phase output temp files
rm -f "$PHASE_OUTPUT_FILE" 2>/dev/null rm -f "$PHASE_OUTPUT_FILE" 2>/dev/null
rm -f "$PHASE_STREAM_FILE" 2>/dev/null
# Save log to repo before exiting # Save log to repo before exiting
save_log_to_repo save_log_to_repo
@ -134,6 +145,7 @@ LIB_DIR="$SCRIPT_DIR/epic-execute-lib"
[ -f "$LIB_DIR/tdd-flow.sh" ] && source "$LIB_DIR/tdd-flow.sh" [ -f "$LIB_DIR/tdd-flow.sh" ] && source "$LIB_DIR/tdd-flow.sh"
[ -f "$LIB_DIR/contract-harness.sh" ] && source "$LIB_DIR/contract-harness.sh" [ -f "$LIB_DIR/contract-harness.sh" ] && source "$LIB_DIR/contract-harness.sh"
[ -f "$LIB_DIR/contract-exec.sh" ] && source "$LIB_DIR/contract-exec.sh" [ -f "$LIB_DIR/contract-exec.sh" ] && source "$LIB_DIR/contract-exec.sh"
[ -f "$LIB_DIR/observability.sh" ] && source "$LIB_DIR/observability.sh"
STORIES_DIR="$PROJECT_ROOT/docs/stories" STORIES_DIR="$PROJECT_ROOT/docs/stories"
SPRINT_ARTIFACTS_DIR="$PROJECT_ROOT/docs/sprint-artifacts" SPRINT_ARTIFACTS_DIR="$PROJECT_ROOT/docs/sprint-artifacts"
@ -509,23 +521,92 @@ log_prompt_size() {
# Temp file for current phase output (reused across phases, cleaned up on exit) # Temp file for current phase output (reused across phases, cleaned up on exit)
PHASE_OUTPUT_FILE="/tmp/bmad-phase-output-$$.txt" PHASE_OUTPUT_FILE="/tmp/bmad-phase-output-$$.txt"
# Raw stream-json (JSONL) for the current phase when tracing is enabled.
PHASE_STREAM_FILE="/tmp/bmad-phase-stream-$$.jsonl"
# Run claude and pipe output directly to file + LOG_FILE (no bash variable) # Run claude and pipe output directly to file + LOG_FILE (no bash variable).
#
# When BMAD_TRACE=1, uses --output-format stream-json so the per-call telemetry
# envelope (session id, tokens, cost, latency) is captured. The raw JSONL is
# tee'd to PHASE_STREAM_FILE *before* the live jq renderer, so a renderer
# failure can never corrupt capture (Open Decision 2). The clean .result text
# is then written to PHASE_OUTPUT_FILE, so all downstream parsers that read
# PHASE_OUTPUT_FILE keep working unchanged — and parse cleaner text than before.
#
# When tracing is off, falls back to the legacy text path verbatim.
#
# Callers set CURRENT_PHASE/CURRENT_STORY_ID (via set_span_context) beforehand.
#
# Arguments: # Arguments:
# $1 - prompt text (use "-f" as first arg to use file-based prompt) # $1 - prompt text (use "-f" as first arg to use file-based prompt)
# $2 - prompt file path (only when $1 is "-f") # $2 - prompt file path (only when $1 is "-f")
# Sets: PHASE_OUTPUT_FILE with the output # Sets: PHASE_OUTPUT_FILE with the clean assistant text
run_claude_to_file() { run_claude_to_file() {
# Truncate phase output file # Truncate phase output file
: > "$PHASE_OUTPUT_FILE" : > "$PHASE_OUTPUT_FILE"
# Legacy text path (tracing disabled) — unchanged behavior.
if [ "${TRACE_ENABLED:-false}" != true ]; then
if [ "$1" = "-f" ]; then
local prompt_file="$2"
claude --dangerously-skip-permissions -f "$prompt_file" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
else
local prompt="$1"
claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
fi
return 0
fi
# Traced path: stream-json + telemetry capture.
: > "$PHASE_STREAM_FILE"
# Intra-phase heartbeat: liveness trail so a hard kill mid-phase is debuggable.
start_phase_heartbeat "$PHASE_STREAM_FILE"
# Live renderer prints assistant text to the terminal/log. It reads the
# raw JSONL *after* the tee has already persisted it, so if jq chokes on a
# partial line the capture in PHASE_STREAM_FILE is unaffected (cosmetic only).
# stdin is redirected from /dev/null to suppress the CLI's "no stdin" warning.
if [ "$1" = "-f" ]; then if [ "$1" = "-f" ]; then
local prompt_file="$2" local prompt_file="$2"
claude --dangerously-skip-permissions -f "$prompt_file" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true claude --dangerously-skip-permissions --output-format stream-json --verbose -f "$prompt_file" </dev/null 2>>"$LOG_FILE" \
| tee -a "$PHASE_STREAM_FILE" \
| jq -r --unbuffered 'select(.type=="assistant") | .message.content[]? | select(.type=="text") | .text' 2>/dev/null \
| tee -a "$LOG_FILE" || true
else else
local prompt="$1" local prompt="$1"
claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true claude --dangerously-skip-permissions --output-format stream-json --verbose -p "$prompt" </dev/null 2>>"$LOG_FILE" \
| tee -a "$PHASE_STREAM_FILE" \
| jq -r --unbuffered 'select(.type=="assistant") | .message.content[]? | select(.type=="text") | .text' 2>/dev/null \
| tee -a "$LOG_FILE" || true
fi fi
# Stop the heartbeat now that the phase's claude call has returned.
stop_phase_heartbeat
# Derive the clean assistant text from the final result envelope so that
# all downstream parsers (extract_json_result, check_phase_completion) keep
# reading PHASE_OUTPUT_FILE exactly as before. The result line is always last.
local result_text=""
if [ -s "$PHASE_STREAM_FILE" ]; then
result_text=$(tail -n 1 "$PHASE_STREAM_FILE" 2>/dev/null | jq -r 'select(.type=="result") | .result // empty' 2>/dev/null) || true
fi
if [ -n "$result_text" ]; then
printf '%s' "$result_text" > "$PHASE_OUTPUT_FILE"
else
# No result envelope (crash/timeout) — fall back to raw stream so the
# downstream "unclear result" handling still has something to inspect.
cp "$PHASE_STREAM_FILE" "$PHASE_OUTPUT_FILE" 2>/dev/null || true
fi
# Record one span for this phase. Status is best-effort from the JSON result
# block in the clean text; the core telemetry comes from the envelope itself.
local span_status=""
if [ -n "$result_text" ] && type extract_json_result >/dev/null 2>&1; then
extract_json_result "$result_text" >/dev/null 2>&1 || true
span_status=$(get_result_status 2>/dev/null || echo "")
fi
record_span "$PHASE_STREAM_FILE" "$span_status"
} }
# Read the tail of phase output for completion signal parsing. # Read the tail of phase output for completion signal parsing.
@ -1279,6 +1360,15 @@ fi
mkdir -p "$UAT_DIR" mkdir -p "$UAT_DIR"
mkdir -p "$SPRINTS_DIR" mkdir -p "$SPRINTS_DIR"
# Observability: enforce jq when tracing is enabled (hard prerequisite), then
# mint the epic-level trace id and create the span file. No-op unless BMAD_TRACE=1.
if type require_observability_deps >/dev/null 2>&1; then
require_observability_deps
fi
if type init_observability >/dev/null 2>&1; then
init_observability
fi
# Initialize metrics collection # Initialize metrics collection
EPIC_START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ") EPIC_START_TIME=$(date -u +"%Y-%m-%dT%H:%M:%SZ")
EPIC_START_SECONDS=$(date +%s) EPIC_START_SECONDS=$(date +%s)
@ -1543,6 +1633,7 @@ Do NOT use 'git add -A' or 'git add .' - only stage files you created or modifie
fi fi
# Execute in isolated context — pipe to file to avoid memory bloat # Execute in isolated context — pipe to file to avoid memory bloat
set_span_context "dev" "$story_id"
run_claude_to_file "$dev_prompt" run_claude_to_file "$dev_prompt"
local result local result
result=$(read_phase_tail) result=$(read_phase_tail)
@ -1682,6 +1773,7 @@ Stage any fixes with explicit file paths: git add <file1> <file2> ..."
fi fi
# Execute in isolated context — pipe to file to avoid memory bloat # Execute in isolated context — pipe to file to avoid memory bloat
set_span_context "review" "$story_id"
run_claude_to_file "$review_prompt" run_claude_to_file "$review_prompt"
local result local result
result=$(read_phase_tail) result=$(read_phase_tail)
@ -1928,6 +2020,7 @@ Address all review findings now. This is attempt $attempt_num of 3."
fi fi
# Pipe to file to avoid memory bloat # Pipe to file to avoid memory bloat
set_span_context "fix" "$story_id"
run_claude_to_file "-f" "$temp_prompt_file" run_claude_to_file "-f" "$temp_prompt_file"
rm -f "$temp_prompt_file" rm -f "$temp_prompt_file"
else else
@ -1937,6 +2030,7 @@ Address all review findings now. This is attempt $attempt_num of 3."
fi fi
# Execute in isolated context — pipe to file to avoid memory bloat # Execute in isolated context — pipe to file to avoid memory bloat
set_span_context "fix" "$story_id"
run_claude_to_file "$fix_prompt" run_claude_to_file "$fix_prompt"
fi fi
@ -2395,6 +2489,7 @@ Stage any fixes with: git add <file1> <file2> ..."
fi fi
# Pipe to file to avoid memory bloat # Pipe to file to avoid memory bloat
set_span_context "arch" "$story_id"
run_claude_to_file "$arch_prompt" run_claude_to_file "$arch_prompt"
local result local result
result=$(read_phase_tail) result=$(read_phase_tail)
@ -2472,6 +2567,7 @@ Stage any fixes with: git add <file1> <file2> ..."
fi fi
# Pipe to file to avoid memory bloat # Pipe to file to avoid memory bloat
set_span_context "test_quality" "$story_id"
run_claude_to_file "$quality_prompt" run_claude_to_file "$quality_prompt"
local result local result
result=$(read_phase_tail) result=$(read_phase_tail)
@ -2607,6 +2703,7 @@ Analyze traceability now. Read story files on-demand as needed."
fi fi
# Pipe to file to avoid memory bloat # Pipe to file to avoid memory bloat
set_span_context "trace" "epic-$EPIC_ID"
run_claude_to_file "$trace_prompt" run_claude_to_file "$trace_prompt"
local result local result
result=$(read_phase_tail) result=$(read_phase_tail)
@ -2685,6 +2782,7 @@ Generate missing tests now."
fi fi
# Pipe to file to avoid memory bloat # Pipe to file to avoid memory bloat
set_span_context "trace_fix" "epic-$EPIC_ID"
run_claude_to_file "$fix_prompt" run_claude_to_file "$fix_prompt"
local result local result
result=$(read_phase_tail) result=$(read_phase_tail)
@ -3039,6 +3137,7 @@ Generate the UAT document now. Read story files on-demand as needed."
fi fi
# Pipe to file to avoid memory bloat # Pipe to file to avoid memory bloat
set_span_context "uat" "epic-$EPIC_ID"
run_claude_to_file "$uat_prompt" run_claude_to_file "$uat_prompt"
local result local result
result=$(read_phase_tail) result=$(read_phase_tail)