BMAD-METHOD/docs/improvements/agent-loop-2026-gap-analysi...

281 lines
21 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: Agent Loop 2026 Framework — Gap Analysis
---
# Agent Loop 2026 Framework — Gap Analysis
**Date:** 2026-06-15
**Workflows Reviewed:** epic-execute, epic-chain (+ `scripts/epic-execute-lib/`)
**Benchmark:** "Agent Loop Evaluation Framework (2026 Standard)" — Manus AI 5-category rubric
**Status:** Active
---
## Overview
This document scores the epic-execute / epic-chain shell automation against the 2026 Agent Loop Evaluation Framework (context engineering, tool design, runtime governance, error recovery, observability). Each criterion is rated on the rubric's Legacy↔2026 axis (15) with concrete `file:line` evidence, followed by a prioritized remediation roadmap.
The headline: the **objective machinery is already strong** — durable external state, per-phase context isolation, real-tooling gates, and deterministic state transitions are at or near 2026 standard. The gaps cluster in **governance, observability, and "prose where code belongs"** — places where a rule is told to the model instead of enforced in the harness.
**Scope note:** This is a point-in-time review of the scripts as of the date above. Line numbers reference the current `scripts/epic-execute.sh` (3,328 lines), `scripts/epic-chain.sh` (946 lines), and `scripts/epic-execute-lib/*.sh`.
---
## Scorecard
| # | Category | Criterion | Score (15) | Verdict |
|---|----------|-----------|:-----------:|---------|
| 1 | Context & Memory | Context Strategy | 3 | Partial — isolation yes, summarization no |
| 1 | Context & Memory | Tool Disclosure | 5 | 2026 — lean per-phase step templates |
| 1 | Context & Memory | Memory Persistence | 5 | 2026 — checkpoints / metrics / decision log |
| 1 | Context & Memory | Scaffolding Type | 3 | Partial — strong gates, residual personas |
| 2 | Tool Design | Tool Scoping | 2 | Legacy — `--dangerously-skip-permissions` + `eval` |
| 2 | Tool Design | Input/Output Validation | 3 | Partial — schema declared, not enforced |
| 2 | Tool Design | Deterministic Delegation | 5 | 2026 — state/math/writes all in bash |
| 2 | Tool Design | Failure Handling | 4 | 2026-leaning — self-heal loops |
| 3 | Governance (OWASP) | Policy Enforcement | 3 | Partial — one structural, one prompt-only |
| 3 | Governance (OWASP) | Privilege Tiers | 2 | Legacy — one standing autonomous tier, no HITL |
| 3 | Governance (OWASP) | Identity Management | 2 | Legacy — ambient long-lived creds |
| 3 | Governance (OWASP) | Memory Security | 2 | Legacy — unvalidated write-back + re-ingest |
| 4 | Error Recovery / RALF | Session Architecture | 3 | Partial — fresh process/phase, no watchdog |
| 4 | Error Recovery / RALF | Error Diagnosis | 3 | Partial — self-heal yes, retry wrapper dead code |
| 4 | Error Recovery / RALF | Loop Guardrails | 4 | 2026 — hard caps; missing time/token budgets |
| 4 | Error Recovery / RALF | Exit Conditions | 4 | 2026 — deterministic gates; some advisory |
| 5 | Observability | Tracing Infrastructure | 1 | Legacy — no trace ID, no token/cost/OTel |
| 5 | Observability | Evaluation Metrics | 3 | Partial — raw counters, self-graded scores |
| 5 | Observability | Evaluation Integration | 4 | 2026-leaning — embedded inline gates |
**Category averages:** Context **4.0** · Tools **3.75** · Governance **2.25** · Error Recovery **3.5** · Observability **2.67**
---
## What's Already at the 2026 Bar
### Memory Persistence (5/5)
Durable external state survives session death — the rubric's "external state files" pattern:
- Checkpoint files with 7-day expiry — `utils.sh:500-551`, written on exit `epic-execute.sh:74-82`
- Metrics YAML that **resumes and accumulates** counters across sessions — `epic-execute.sh:646-664`, `:779-785`
- Per-story design plans persisted for post-resume dev phases — `epic-execute.sh:144-145`
- Decision log + sprint-status.yaml as external workflow state — `decision-log.sh`, `epic-execute.sh:832-906`
### Tool Disclosure (5/5)
Progressive disclosure: loads lean ~49KB step templates per phase instead of embedding ~40KB workflow YAML — `epic-execute.sh:551-563`, per-phase template selection `:179-183`.
### Deterministic Delegation (5/5)
The LLM emits intent only (a status enum + findings); all counting, duration math, status writes, and pass/fail verdicts run in bash — `epic-execute.sh:693-797`, `:803-906`; `contract-exec.sh:144-170`. Playwright specs are **generated from the harness** (`contract-exec.sh:220-296`), not authored by the model.
### Embedded Probes (4/5)
Quality gates run **inline per story**, not offline batch: arch / test-quality / traceability / static-analysis / regression / contract. The static-analysis and contract gates run **real tooling** (`tsc`, lint, build, pytest, `curl`, `playwright`) whose actual output gates progression and can fail the epic's exit code — `epic-execute.sh:1986-2307`, `:3203-3208`.
---
## Improvement Areas
### HIGH Priority
#### 1. Observability foundation: no trace ID, no token/cost telemetry (Tracing 1/5)
The weakest foundational layer — and the rubric explicitly advises fixing the lowest foundational layer first.
There is no session/trace ID, no OpenTelemetry, and zero token/cost/latency capture. LLM calls are an opaque blob teed to a PID-named text log; the only correlation key is `$$`.
**Evidence:**
- Plain-text logging, no structured fields/spans — `epic-execute.sh:199-217`
- LLM call uninstrumented (no tokens/cost/latency/model captured) — `run_claude_to_file`, `epic-execute.sh:518-529`
- Metrics schema has no tokens/cost/trace ID — `epic-execute.sh:668-688`
- Cost figures in reports are openly **estimated** ("may vary 50200%") — `epic-chain-execution-report.md:225-248`
**Fix:** Generate a session/trace ID at startup and thread it through every phase + the metrics/log/decision-log writes. Capture real usage by invoking `claude` with `--output-format json` (or `--output-format stream-json`) and parsing the `usage` / `total_cost_usd` fields per phase. This unlocks every downstream metric.
---
#### 2. Resilience layer is fully built but never wired in (Error Recovery)
`execute_claude_with_retry`, `run_with_timeout`, and `CLAUDE_TIMEOUT` (600s default) all exist in `utils.sh:55-150` — but have **zero callers**. Every phase uses bare `run_claude_to_file` (`epic-execute.sh:518-529`) with `|| true`, so:
- A hung phase blocks forever (`CLAUDE_TIMEOUT` never applied to the real path)
- Transient errors (429/timeout/503) are not retried
- Crashes are swallowed and silently judged "incomplete"
There is also no watchdog/supervisor and no stuck-loop progress detection — fix loops burn all attempts even on an identical failure set, despite computing failure signatures in `test-failure-filter.sh:139-151`.
**Fix:** Route `run_claude_to_file` through the existing `run_with_timeout` + retry wrapper (mostly plumbing — the code already exists). Add a progress check that compares failure signatures across fix iterations and aborts early when the set is unchanged.
---
#### 3. Governance is structurally bypassable (Privilege 2/5, Policy partial)
The strong git policies — `check_sensitive_files` (`epic-execute.sh:351-391`), `git add -u` (`:2937`), `check_branch_protection` (`utils.sh:444-474`) — only guard the **final `commit_story` path**. Mid-phase, every agent runs `--dangerously-skip-permissions` (`:524`, `:527`) and can `git add -A` / commit anything itself. The "don't use git add -A" rule is injected as **prose in 7+ places** (`:597-598`, `:1772`, `:1875`, …) but enforced in code only at commit time. There is no read/write/destructive privilege separation and no HITL gate ("AUTOMATED… do NOT pause for user confirmation" — `:593-595`).
**Fix:** Move the staging policy into a git **pre-commit hook** (structural, not prose) so it governs agent-authored commits too; re-run `check_sensitive_files` on any commit. Introduce an approval tier (even a coarse env-gated one) for destructive operations.
---
#### 4. Memory poisoning loop is unguarded (Memory Security 2/5)
`append_to_decision_log` writes raw agent output straight to disk (`decision-log.sh:54-76`), and `get_decision_log_context` re-injects the whole log **verbatim** into the next phase's prompt (`decision-log.sh:80-87`) — the exact RAG/memory-poisoning loop OWASP 2026 (ASI06) warns against, with no validation, segmentation, or provenance tagging. `add_metrics_issue` / `record_fix_attempt` also interpolate agent-influenced strings directly into `yq` expressions (`epic-execute.sh:725`, `:745`) — a YAML/expression-injection surface.
**Fix:** Validate and length-bound decision-log entries before commit; tag provenance (which phase/story produced each entry); sanitize or parameterize strings before they enter `yq` expressions.
---
#### 5. Output contracts declared but not enforced (I/O Validation 3/5)
JSON result schemas are prescribed to the model and parsed with `jq`, but malformed/missing output **silently falls back to grepping prose** (`json-output.sh:311-387`, `check_phase_completion_fuzzy` in `utils.sh`). The documented consequence: **9 stories were mis-marked failed** because the model didn't emit the exact `IMPLEMENTATION COMPLETE:` phrase, requiring manual correction (`epic-chain-execution-report.md:254-272`).
**Fix:** Reject non-conforming output and force a bounded retry instead of degrading to regex. Make the JSON result block mandatory (fail the phase if absent). Relatedly, promote the advisory gates (arch / test-quality / traceability / regression — currently "proceed with documented concerns") to blocking where the risk warrants it, matching the deterministic behavior of the static-analysis and contract gates.
---
### MEDIUM Priority
#### 6. Tool Scoping & sandboxing (Tool Scoping 2/5)
`claude --dangerously-skip-permissions` grants the full unrestricted toolset with no per-phase allowlist or sandbox (`epic-execute.sh:524`, `:527`). Harness commands run via raw `eval` on YAML-derived strings — an injection surface — in `contract-exec.sh:43,53,86,168` and `contract-harness.sh:333,351,368`. The production-scope datastore guard is advisory (`log_warn`), not a block (`contract-harness.sh:194-213`).
**Fix:** Run harness commands as argv arrays (no `eval`). Consider a per-phase tool allowlist and/or containerized execution. Promote the production-scope datastore guard from warning to hard block.
#### 7. Context summarization (Context Strategy 3/5)
Per-phase context isolation is excellent (fresh `claude` process per phase, paths-not-contents handoff), but there is no **anchored iterative summarization**: cross-phase carryover is raw grep/sed-extracted text or tail-truncation (decision context truncated to 20KB at `epic-execute.sh:1466`), and the only control is a hard 150KB cap (`MAX_PROMPT_SIZE`, `:398`), not a utilization band.
**Fix:** Add a summarization step between phases — hold an anchor block (story + ACs + constraints) constant while condensing completed-phase outcomes into a structured summary; target 6080% utilization rather than a hard truncate.
#### 8. Identity Management (2/5)
Every `claude` call inherits the operator's ambient, long-lived credentials; harness secrets are consumed as ambient env vars (`contract-harness.sh:205`, `:254`). No unique per-task identity, short-lived tokens, or credential scoping.
**Fix:** Where feasible, issue short-lived/scoped credentials per run; vault harness secrets rather than relying on ambient env.
#### 9. Evaluation Metrics — derive rates, separate the judge (Metrics 3/5)
The raw inputs exist (completed/failed/skipped, fix attempts, `max_retries_hit`) but are never computed into **Task Completion Rate / Escalation Rate / Tool Call Success Rate**. Rubric scores (test-quality ≥70, traceability P0=100%) are **self-graded by the same executing model** (`json-output.sh:473-496`) rather than by an independent calibrated judge.
**Fix:** Compute and persist the derived rates in the metrics YAML. Introduce a separate, cheaper judge model (e.g., Haiku) for binary rubric scoring so the executor isn't grading its own work.
---
### LOW Priority
#### 10. `yq`-dependent durability
Metrics, sprint-status, and issue persistence silently degrade without `yq` installed (`epic-execute.sh:707`, `:790`). The otherwise-excellent memory layer is best-effort, not guaranteed.
**Fix:** Either declare `yq` a hard prerequisite (fail fast at startup) or harden the sed/awk fallbacks to full parity.
#### 11. Vestigial "full workflow YAML" priority tier
`CONTENT_PRIORITY_LOW` is still described as "Full workflow YAML (truncate first)" (`epic-execute.sh:404`), a legacy fallback path no active builder uses. Remove to avoid confusion.
#### 12. Gate status not persisted by standalone runs
`validation.gate_status` is written by the **chain** wrapper (`epic-chain.sh:626-638`), not the inner execute loop, so a standalone `epic-execute.sh` run leaves `gate_status: PENDING`.
---
## Two Structural Themes
1. **Prose where code belongs.** The recurring pattern — git rules, "do NOT pause", personas like *"You ARE an adversarial reviewer"* (`:1616`) — is the compensatory scaffolding the rubric flags: a rule told to the model that the harness could instead enforce. The codebase is mid-migration; the objective gates (contracts, real tooling, JSON results) are already constitutive, but the soft rules haven't caught up.
2. **Built-but-unwired.** The retry/timeout resilience layer (`utils.sh:55-150`) is the clearest example — fully implemented, zero callers. The capability gap is often plumbing, not net-new code.
---
## Suggested Sequencing
Per the rubric's "fix the lowest-scoring foundational layer first":
1. **Observability** (#1) — trace/session ID + real token/cost/latency from `claude --output-format json`. Foundation for everything; currently 1/5.
2. **Wire the existing retry/timeout layer** (#2) — pure plumbing, already-written code, large RALF payoff.
3. **Governance** (#3, #4) — pre-commit hook + sensitive-file re-check on agent commits; validate decision-log/metrics writes before commit.
4. **Enforce JSON contracts** (#5) — fail-and-retry on missing signal instead of fuzzy fallback; promote advisory gates to blocking.
5. **Context summarization** (#7) — anchored iterative summarization targeting a utilization band.
The smallest, highest-leverage starting points are **#2 (retry wiring)** and **#3 (pre-commit hook)**.
---
## References
- Benchmark source: "Agent Loop Evaluation Framework (2026 Standard)," Manus AI — context engineering, tool design, OWASP Top 10 for Agentic Apps 2026, RALF loop, OpenTelemetry-first observability.
- Prior review: [`epic-workflows-v1.md`](./epic-workflows-v1.md) (2026-01-02) — overlaps on the `--dangerously-skip-permissions` finding (#1/#3, #6 here).
---
## Appendix A — Observability Deep Dive
**Added:** 2026-06-15 · Expands HIGH-priority item #1 and Evaluation Metrics (#9).
### A.1 Root cause: a single discard point
Every LLM call routes through `run_claude_to_file` (`epic-execute.sh:518-529`), which uses the CLI's **default `text` output format**:
```bash
claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
```
Only rendered assistant text survives. The chain report generator does the same (`epic-chain.sh:884`). The 1/5 Tracing score is the consequence of this one choice — not an architectural limit. The telemetry is produced on every call and thrown away.
### A.2 What `claude --output-format json` already returns (verified)
Tested against the installed CLI (v2.1.177). The result envelope contains every field 2026 observability requires:
| Field (verified present) | Example | Rubric need it satisfies |
|---|---|---|
| `session_id` | `f6ff5b55-…` | Trace/session ID (today: PID `$$`) |
| `total_cost_usd` | `0.0586` | Real cost (today: fabricated) |
| `usage.input_tokens` / `output_tokens` | 2629 / 4 | Token spend |
| `usage.cache_read_input_tokens` / `cache_creation_input_tokens` | 15362 / 3718 | Cache efficiency |
| `modelUsage[model].costUSD` + per-model tokens | Opus + Haiku sub-agent | Per-model cost attribution |
| `modelUsage[model].contextWindow` | 1000000 | Enables context-utilization % |
| `duration_ms` / `duration_api_ms` / `ttft_ms` | 1757 / 2522 / 1754 | Per-call latency |
| `num_turns`, `stop_reason`, `is_error`, `api_error_status`, `permission_denials` | 1 / end_turn / false / null / [] | Tool-call success / error telemetry |
The CLI also exposes `--output-format stream-json` (live JSONL ending with the same `result` envelope) and `--json-schema <schema>` for structured-output enforcement. All three require `--print`, which the script already passes.
### A.3 The current report is actively misleading, not merely empty
Because real telemetry is discarded, the chain report **fabricates** it (`epic-chain-execution-report.md:225-248`):
- Token table derived from `Est. Calls = stories × 2` and `~16K input/call` assumptions — arithmetic on story counts, not measurement.
- Cost table priced against **Claude Sonnet 3.5 ($3/$15)** and **Opus ($15/$75)** — neither is the model that ran (`claude-opus-4-8[1m]`); the real `total_cost_usd` was available and discarded.
- Carries the disclaimer *"Actual usage may vary by 50-200%."*
An authoritative-looking cost table that is invented is worse than a blank cell — it is unfalsifiable noise where ground truth was one flag away.
### A.4 Synergy: this fix lifts three other findings
The same envelope partially closes gaps scored elsewhere:
1. **Context Strategy (#7).** `contextWindow` + `input_tokens + cache_read + cache_creation` yields exact per-phase utilization, making the 6080% target measurable and enforceable for free.
2. **I/O Validation (#5) + the 9-mismark incident.** Parsing `.result` (clean final message) instead of scraping interleaved stdout, plus `--json-schema` to make the status field structurally mandatory, removes the fuzzy-regex fallback (`json-output.sh:311-387`) that mismarked 9 stories (`epic-chain-execution-report.md:254-272`). That incident is fundamentally an output-format problem.
3. **Evaluation Metrics (#9).** Enables the rubric's business metrics: Task Completion Rate (`completed/total`), Escalation Rate (`max_retries_hit/stories` + real `is_error` rate), Tool Call Success Rate (`is_error=false` phases ÷ total).
### A.5 Target design (fits the existing architecture)
Constraints: preserve the memory-safe "pipe to file, read 32KB tail" pattern, and keep the live `tee` to the log.
1. **Switch to `stream-json`, not plain `json`.** Plain `json` buffers and kills the live tee. `--output-format stream-json --include-partial-messages` streams live *and* makes the **last JSON line** the `result` envelope; `read_phase_tail` still captures it (parse the last line where `.type=="result"`). Memory-safety preserved.
2. **One append-only trace file per epic**, using the OTel span data model (convertible to OTLP later):
`docs/sprint-artifacts/traces/epic-<id>-trace.jsonl` — one span per phase:
```json
{"trace_id":"<epic-uuid>","span_id":"<claude session_id>","parent":"<story_id>",
"name":"dev","story_id":"4-3","model":"claude-opus-4-8[1m]",
"input_tokens":2629,"output_tokens":4,"cache_read":15362,"cost_usd":0.058,
"duration_ms":1757,"ttft_ms":1754,"num_turns":1,"is_error":false,
"ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…"}
```
Generate one epic-level `trace_id` (`uuidgen`) at startup; each call's `session_id` is the `span_id`, `story_id` the parent. This is the single correlating ID `$$` never provided.
3. **Deterministic rollup into `metrics.yaml`** — add a `telemetry:` block summed from the JSONL (no model, no fabrication): `total_cost_usd`, `total_input_tokens`, `total_output_tokens`, `cache_read_tokens`, `by_phase`. The chain report then reads measured numbers; the `Estimated Token Usage` section is deleted.
4. **OTel bridge (phase 2, optional).** JSONL-with-OTel-fields is the pragmatic 80%. A later post-processor converts spans → OTLP without touching the hot path.
### A.6 Caveats
- **jq dependency** — telemetry parsing needs `jq` (same soft-dep fragility as `yq`, item #10). Degrade gracefully (skip span, don't crash); consider making `jq` a hard startup prerequisite.
- **Cost includes sub-agents** — `modelUsage` surfaced an internal Haiku call inside an Opus phase. Record `modelUsage` verbatim; don't flatten to one model.
- **Cache tokens dominate** — in testing, cache-read (15K) was 6× fresh input (2.6K). Report fresh vs. cache-read separately; compute utilization from the sum.
- **`stream-json` is noisier on disk** — log grows faster (every partial chunk). The existing 64KB inter-story log truncation (`epic-execute.sh:3182`) mitigates; confirm it suffices.
### A.7 Why this is the right place to start
Observability scored lowest yet is the cheapest high-priority fix and the only one that drags three other findings upward. The data already exists on every call; the work is plumbing (a format switch + a `record_span` helper + a deterministic rollup), not building telemetry infrastructure.