---
title: Agent Loop 2026 Framework — Gap Analysis
---

# Agent Loop 2026 Framework — Gap Analysis

**Date:** 2026-06-15
**Workflows Reviewed:** epic-execute, epic-chain (+ `scripts/epic-execute-lib/`)
**Benchmark:** "Agent Loop Evaluation Framework (2026 Standard)" — Manus AI 5-category rubric
**Status:** Active

---

## Overview

This document scores the epic-execute / epic-chain shell automation against the 2026 Agent Loop Evaluation Framework (context engineering, tool design, runtime governance, error recovery, observability). Each criterion is rated on the rubric's Legacy↔2026 axis (1–5) with concrete `file:line` evidence, followed by a prioritized remediation roadmap.

The headline: the **objective machinery is already strong** — durable external state, per-phase context isolation, real-tooling gates, and deterministic state transitions are at or near 2026 standard. The gaps cluster in **governance, observability, and "prose where code belongs"** — places where a rule is told to the model instead of enforced in the harness.

**Scope note:** This is a point-in-time review of the scripts as of the date above. Line numbers reference the current `scripts/epic-execute.sh` (3,328 lines), `scripts/epic-chain.sh` (946 lines), and `scripts/epic-execute-lib/*.sh`.


---

## Scorecard

| # | Category | Criterion | Score (1–5) | Verdict |
|---|----------|-----------|:-----------:|---------|
| 1 | Context & Memory | Context Strategy | 3 | Partial — isolation yes, summarization no |
| 1 | Context & Memory | Tool Disclosure | 5 | 2026 — lean per-phase step templates |
| 1 | Context & Memory | Memory Persistence | 5 | 2026 — checkpoints / metrics / decision log |
| 1 | Context & Memory | Scaffolding Type | 3 | Partial — strong gates, residual personas |
| 2 | Tool Design | Tool Scoping | 2 | Legacy — `--dangerously-skip-permissions` + `eval` |
| 2 | Tool Design | Input/Output Validation | 3 | Partial — schema declared, not enforced |
| 2 | Tool Design | Deterministic Delegation | 5 | 2026 — state/math/writes all in bash |
| 2 | Tool Design | Failure Handling | 4 | 2026-leaning — self-heal loops |
| 3 | Governance (OWASP) | Policy Enforcement | 3 | Partial — one structural, one prompt-only |
| 3 | Governance (OWASP) | Privilege Tiers | 2 | Legacy — one standing autonomous tier, no HITL |
| 3 | Governance (OWASP) | Identity Management | 2 | Legacy — ambient long-lived creds |
| 3 | Governance (OWASP) | Memory Security | 2 | Legacy — unvalidated write-back + re-ingest |
| 4 | Error Recovery / RALF | Session Architecture | 3 | Partial — fresh process/phase, no watchdog |
| 4 | Error Recovery / RALF | Error Diagnosis | 3 | Partial — self-heal yes, retry wrapper dead code |
| 4 | Error Recovery / RALF | Loop Guardrails | 4 | 2026 — hard caps; missing time/token budgets |
| 4 | Error Recovery / RALF | Exit Conditions | 4 | 2026 — deterministic gates; some advisory |
| 5 | Observability | Tracing Infrastructure | 1 | Legacy — no trace ID, no token/cost/OTel |
| 5 | Observability | Evaluation Metrics | 3 | Partial — raw counters, self-graded scores |
| 5 | Observability | Evaluation Integration | 4 | 2026-leaning — embedded inline gates |

**Category averages:** Context **4.0** · Tools **3.5** · Governance **2.25** · Error Recovery **3.5** ·
Observability **2.67**

---

## What's Already at the 2026 Bar

### Memory Persistence (5/5)

Durable external state survives session death — the rubric's "external state files" pattern:

- Checkpoint files with 7-day expiry — `utils.sh:500-551`, written on exit `epic-execute.sh:74-82`
- Metrics YAML that **resumes and accumulates** counters across sessions — `epic-execute.sh:646-664`, `:779-785`
- Per-story design plans persisted for post-resume dev phases — `epic-execute.sh:144-145`
- Decision log + sprint-status.yaml as external workflow state — `decision-log.sh`, `epic-execute.sh:832-906`

### Tool Disclosure (5/5)

Progressive disclosure: loads lean ~4–9KB step templates per phase instead of embedding ~40KB workflow YAML — `epic-execute.sh:551-563`, per-phase template selection `:179-183`.

### Deterministic Delegation (5/5)

The LLM emits intent only (a status enum + findings); all counting, duration math, status writes, and pass/fail verdicts run in bash — `epic-execute.sh:693-797`, `:803-906`; `contract-exec.sh:144-170`. Playwright specs are **generated from the harness** (`contract-exec.sh:220-296`), not authored by the model.

### Embedded Probes (Evaluation Integration, 4/5)

Quality gates run **inline per story**, not offline batch: arch / test-quality / traceability / static-analysis / regression / contract. The static-analysis and contract gates run **real tooling** (`tsc`, lint, build, pytest, `curl`, `playwright`) whose actual output gates progression and can fail the epic's exit code — `epic-execute.sh:1986-2307`, `:3203-3208`.

---

## Improvement Areas

### HIGH Priority

#### 1. Observability foundation: no trace ID, no token/cost telemetry (Tracing 1/5)

The weakest foundational layer — and the rubric explicitly advises fixing the lowest foundational layer first. There is no session/trace ID, no OpenTelemetry, and zero token/cost/latency capture. LLM calls are an opaque blob teed to a PID-named text log; the only correlation key is `$$`.


**Evidence:**

- Plain-text logging, no structured fields/spans — `epic-execute.sh:199-217`
- LLM call uninstrumented (no tokens/cost/latency/model captured) — `run_claude_to_file`, `epic-execute.sh:518-529`
- Metrics schema has no tokens/cost/trace ID — `epic-execute.sh:668-688`
- Cost figures in reports are openly **estimated** ("may vary 50–200%") — `epic-chain-execution-report.md:225-248`

**Fix:** Generate a session/trace ID at startup and thread it through every phase + the metrics/log/decision-log writes. Capture real usage by invoking `claude` with `--output-format json` (or `--output-format stream-json`) and parsing the `usage` / `total_cost_usd` fields per phase. This unlocks every downstream metric.

---

#### 2. Resilience layer is fully built but never wired in (Error Recovery)

`execute_claude_with_retry`, `run_with_timeout`, and `CLAUDE_TIMEOUT` (600s default) all exist in `utils.sh:55-150` — but have **zero callers**. Every phase uses bare `run_claude_to_file` (`epic-execute.sh:518-529`) with `|| true`, so:

- A hung phase blocks forever (`CLAUDE_TIMEOUT` never applied to the real path)
- Transient errors (429/timeout/503) are not retried
- Crashes are swallowed and silently judged "incomplete"

There is also no watchdog/supervisor and no stuck-loop progress detection — fix loops burn all attempts even on an identical failure set, despite computing failure signatures in `test-failure-filter.sh:139-151`.

**Fix:** Route `run_claude_to_file` through the existing `run_with_timeout` + retry wrapper (mostly plumbing — the code already exists). Add a progress check that compares failure signatures across fix iterations and aborts early when the set is unchanged.

---

#### 3. Governance is structurally bypassable (Privilege 2/5, Policy partial)

The strong git policies — `check_sensitive_files` (`epic-execute.sh:351-391`), `git add -u` (`:2937`), `check_branch_protection` (`utils.sh:444-474`) — only guard the **final `commit_story` path**.
Mid-phase, every agent runs `--dangerously-skip-permissions` (`:524`, `:527`) and can `git add -A` / commit anything itself. The "don't use git add -A" rule is injected as **prose in 7+ places** (`:597-598`, `:1772`, `:1875`, …) but enforced in code only at commit time. There is no read/write/destructive privilege separation and no HITL gate ("AUTOMATED… do NOT pause for user confirmation" — `:593-595`).

**Fix:** Move the staging policy into a git **pre-commit hook** (structural, not prose) so it governs agent-authored commits too; re-run `check_sensitive_files` on any commit. Introduce an approval tier (even a coarse env-gated one) for destructive operations.

---

#### 4. Memory poisoning loop is unguarded (Memory Security 2/5)

`append_to_decision_log` writes raw agent output straight to disk (`decision-log.sh:54-76`), and `get_decision_log_context` re-injects the whole log **verbatim** into the next phase's prompt (`decision-log.sh:80-87`) — the exact RAG/memory-poisoning loop OWASP 2026 (ASI06) warns against, with no validation, segmentation, or provenance tagging. `add_metrics_issue` / `record_fix_attempt` also interpolate agent-influenced strings directly into `yq` expressions (`epic-execute.sh:725`, `:745`) — a YAML/expression-injection surface.

**Fix:** Validate and length-bound decision-log entries before commit; tag provenance (which phase/story produced each entry); sanitize or parameterize strings before they enter `yq` expressions.

---

#### 5. Output contracts declared but not enforced (I/O Validation 3/5)

JSON result schemas are prescribed to the model and parsed with `jq`, but malformed/missing output **silently falls back to grepping prose** (`json-output.sh:311-387`, `check_phase_completion_fuzzy` in `utils.sh`). The documented consequence: **9 stories were mis-marked failed** because the model didn't emit the exact `IMPLEMENTATION COMPLETE:` phrase, requiring manual correction (`epic-chain-execution-report.md:254-272`).


**Fix:** Reject non-conforming output and force a bounded retry instead of degrading to regex. Make the JSON result block mandatory (fail the phase if absent). Relatedly, promote the advisory gates (arch / test-quality / traceability / regression — currently "proceed with documented concerns") to blocking where the risk warrants it, matching the deterministic behavior of the static-analysis and contract gates.

---

### MEDIUM Priority

#### 6. Tool Scoping & sandboxing (Tool Scoping 2/5)

`claude --dangerously-skip-permissions` grants the full unrestricted toolset with no per-phase allowlist or sandbox (`epic-execute.sh:524`, `:527`). Harness commands run via raw `eval` on YAML-derived strings — an injection surface — in `contract-exec.sh:43,53,86,168` and `contract-harness.sh:333,351,368`. The production-scope datastore guard is advisory (`log_warn`), not a block (`contract-harness.sh:194-213`).

**Fix:** Run harness commands as argv arrays (no `eval`). Consider a per-phase tool allowlist and/or containerized execution. Promote the production-scope datastore guard from warning to hard block.

#### 7. Context summarization (Context Strategy 3/5)

Per-phase context isolation is excellent (fresh `claude` process per phase, paths-not-contents handoff), but there is no **anchored iterative summarization**: cross-phase carryover is raw grep/sed-extracted text or tail-truncation (decision context truncated to 20KB at `epic-execute.sh:1466`), and the only control is a hard 150KB cap (`MAX_PROMPT_SIZE`, `:398`), not a utilization band.

**Fix:** Add a summarization step between phases — hold an anchor block (story + ACs + constraints) constant while condensing completed-phase outcomes into a structured summary; target 60–80% utilization rather than a hard truncate.

#### 8. Identity Management (2/5)

Every `claude` call inherits the operator's ambient, long-lived credentials; harness secrets are consumed as ambient env vars (`contract-harness.sh:205`, `:254`).
No unique per-task identity, short-lived tokens, or credential scoping.

**Fix:** Where feasible, issue short-lived/scoped credentials per run; vault harness secrets rather than relying on ambient env.

#### 9. Evaluation Metrics — derive rates, separate the judge (Metrics 3/5)

The raw inputs exist (completed/failed/skipped, fix attempts, `max_retries_hit`) but are never computed into **Task Completion Rate / Escalation Rate / Tool Call Success Rate**. Rubric scores (test-quality ≥70, traceability P0=100%) are **self-graded by the same executing model** (`json-output.sh:473-496`) rather than by an independent calibrated judge.

**Fix:** Compute and persist the derived rates in the metrics YAML. Introduce a separate, cheaper judge model (e.g., Haiku) for binary rubric scoring so the executor isn't grading its own work.

---

### LOW Priority

#### 10. `yq`-dependent durability

Metrics, sprint-status, and issue persistence silently degrade without `yq` installed (`epic-execute.sh:707`, `:790`). The otherwise-excellent memory layer is best-effort, not guaranteed.

**Fix:** Either declare `yq` a hard prerequisite (fail fast at startup) or harden the sed/awk fallbacks to full parity.

#### 11. Vestigial "full workflow YAML" priority tier

`CONTENT_PRIORITY_LOW` is still described as "Full workflow YAML (truncate first)" (`epic-execute.sh:404`), a legacy fallback path no active builder uses. Remove to avoid confusion.

#### 12. Gate status not persisted by standalone runs

`validation.gate_status` is written by the **chain** wrapper (`epic-chain.sh:626-638`), not the inner execute loop, so a standalone `epic-execute.sh` run leaves `gate_status: PENDING`.

---

## Two Structural Themes

1. **Prose where code belongs.** The recurring pattern — git rules, "do NOT pause", personas like *"You ARE an adversarial reviewer"* (`:1616`) — is the compensatory scaffolding the rubric flags: a rule told to the model that the harness could instead enforce.
   The codebase is mid-migration; the objective gates (contracts, real tooling, JSON results) are already constitutive, but the soft rules haven't caught up.
2. **Built-but-unwired.** The retry/timeout resilience layer (`utils.sh:55-150`) is the clearest example — fully implemented, zero callers. The capability gap is often plumbing, not net-new code.

---

## Suggested Sequencing

Per the rubric's "fix the lowest-scoring foundational layer first":

1. **Observability** (#1) — trace/session ID + real token/cost/latency from `claude --output-format json`. Foundation for everything; currently 1/5.
2. **Wire the existing retry/timeout layer** (#2) — pure plumbing, already-written code, large RALF payoff.
3. **Governance** (#3, #4) — pre-commit hook + sensitive-file re-check on agent commits; validate decision-log/metrics writes before commit.
4. **Enforce JSON contracts** (#5) — fail-and-retry on missing signal instead of fuzzy fallback; promote advisory gates to blocking.
5. **Context summarization** (#7) — anchored iterative summarization targeting a utilization band.

The smallest, highest-leverage starting points are **#2 (retry wiring)** and **#3 (pre-commit hook)**.

---

## References

- Benchmark source: "Agent Loop Evaluation Framework (2026 Standard)," Manus AI — context engineering, tool design, OWASP Top 10 for Agentic Apps 2026, RALF loop, OpenTelemetry-first observability.
- Prior review: [`epic-workflows-v1.md`](./epic-workflows-v1.md) (2026-01-02) — overlaps on the `--dangerously-skip-permissions` finding (#1/#3, #6 here).

---

## Appendix A — Observability Deep Dive

**Added:** 2026-06-15 · Expands HIGH-priority item #1 and Evaluation Metrics (#9).

### A.1 Root cause: a single discard point

Every LLM call routes through `run_claude_to_file` (`epic-execute.sh:518-529`), which uses the CLI's **default `text` output format**:

```bash
claude --dangerously-skip-permissions -p "$prompt" 2>&1 \
  | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
```

Only rendered assistant text survives. The chain report generator does the same (`epic-chain.sh:884`). The 1/5 Tracing score is the consequence of this one choice — not an architectural limit. The telemetry is produced on every call and thrown away.

### A.2 What `claude --output-format json` already returns (verified)

Tested against the installed CLI (v2.1.177). The result envelope contains every field 2026 observability requires:

| Field (verified present) | Example | Rubric need it satisfies |
|---|---|---|
| `session_id` | `f6ff5b55-…` | Trace/session ID (today: PID `$$`) |
| `total_cost_usd` | `0.0586` | Real cost (today: fabricated) |
| `usage.input_tokens` / `output_tokens` | 2629 / 4 | Token spend |
| `usage.cache_read_input_tokens` / `cache_creation_input_tokens` | 15362 / 3718 | Cache efficiency |
| `modelUsage[model].costUSD` + per-model tokens | Opus + Haiku sub-agent | Per-model cost attribution |
| `modelUsage[model].contextWindow` | 1000000 | Enables context-utilization % |
| `duration_ms` / `duration_api_ms` / `ttft_ms` | 1757 / 2522 / 1754 | Per-call latency |
| `num_turns`, `stop_reason`, `is_error`, `api_error_status`, `permission_denials` | 1 / end_turn / false / null / [] | Tool-call success / error telemetry |

The CLI also exposes `--output-format stream-json` (live JSONL ending with the same `result` envelope) and `--json-schema ` for structured-output enforcement. All three require `--print`, which the script already passes.

### A.3 The current report is actively misleading, not merely empty

Because real telemetry is discarded, the chain report **fabricates** it (`epic-chain-execution-report.md:225-248`):

- Token table derived from `Est.
  Calls = stories × 2` and `~16K input/call` assumptions — arithmetic on story counts, not measurement.
- Cost table priced against **Claude Sonnet 3.5 ($3/$15)** and **Opus ($15/$75)** — neither is the model that ran (`claude-opus-4-8[1m]`); the real `total_cost_usd` was available and discarded.
- Carries the disclaimer *"Actual usage may vary by 50-200%."*

An authoritative-looking cost table that is invented is worse than a blank cell — it is unfalsifiable noise where ground truth was one flag away.

### A.4 Synergy: this fix lifts three other findings

The same envelope partially closes gaps scored elsewhere:

1. **Context Strategy (#7).** `contextWindow` + `input_tokens + cache_read + cache_creation` yields exact per-phase utilization, making the 60–80% target measurable and enforceable for free.
2. **I/O Validation (#5) + the 9-mismark incident.** Parsing `.result` (clean final message) instead of scraping interleaved stdout, plus `--json-schema` to make the status field structurally mandatory, removes the fuzzy-regex fallback (`json-output.sh:311-387`) that mismarked 9 stories (`epic-chain-execution-report.md:254-272`). That incident is fundamentally an output-format problem.
3. **Evaluation Metrics (#9).** Enables the rubric's business metrics: Task Completion Rate (`completed/total`), Escalation Rate (`max_retries_hit/stories` + real `is_error` rate), Tool Call Success Rate (`is_error=false` phases ÷ total).

### A.5 Target design (fits the existing architecture)

Constraints: preserve the memory-safe "pipe to file, read 32KB tail" pattern, and keep the live `tee` to the log.

1. **Switch to `stream-json`, not plain `json`.** Plain `json` buffers and kills the live tee. `--output-format stream-json --include-partial-messages` streams live *and* makes the **last JSON line** the `result` envelope; `read_phase_tail` still captures it (parse the last line where `.type=="result"`). Memory-safety preserved.
2.
   **One append-only trace file per epic**, using the OTel span data model (convertible to OTLP later): `docs/sprint-artifacts/traces/epic--trace.jsonl` — one span per phase:

   ```json
   {"trace_id":"","span_id":"","parent":"",
    "name":"dev","story_id":"4-3","model":"claude-opus-4-8[1m]",
    "input_tokens":2629,"output_tokens":4,"cache_read":15362,"cost_usd":0.058,
    "duration_ms":1757,"ttft_ms":1754,"num_turns":1,"is_error":false,
    "ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…"}
   ```

   Generate one epic-level `trace_id` (`uuidgen`) at startup; each call's `session_id` is the `span_id`, `story_id` the parent. This is the single correlating ID `$$` never provided.
3. **Deterministic rollup into `metrics.yaml`** — add a `telemetry:` block summed from the JSONL (no model, no fabrication): `total_cost_usd`, `total_input_tokens`, `total_output_tokens`, `cache_read_tokens`, `by_phase`. The chain report then reads measured numbers; the `Estimated Token Usage` section is deleted.
4. **OTel bridge (phase 2, optional).** JSONL-with-OTel-fields is the pragmatic 80%. A later post-processor converts spans → OTLP without touching the hot path.

### A.6 Caveats

- **jq dependency** — telemetry parsing needs `jq` (same soft-dep fragility as `yq`, item #10). Degrade gracefully (skip span, don't crash); consider making `jq` a hard startup prerequisite.
- **Cost includes sub-agents** — `modelUsage` surfaced an internal Haiku call inside an Opus phase. Record `modelUsage` verbatim; don't flatten to one model.
- **Cache tokens dominate** — in testing, cache-read (15K) was 6× fresh input (2.6K). Report fresh vs. cache-read separately; compute utilization from the sum.
- **`stream-json` is noisier on disk** — log grows faster (every partial chunk). The existing 64KB inter-story log truncation (`epic-execute.sh:3182`) mitigates; confirm it suffices.

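
The utilization arithmetic referenced in A.4 and in the cache-token caveat can be sketched in a few lines of bash. This is a minimal illustration, not code from the current scripts: `ctx_util_pct` and `ctx_util_band` are hypothetical helper names, and the caller is assumed to have already extracted the integer token counts and `contextWindow` from the result envelope.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not present in the reviewed scripts).

# Print context utilization as a percentage with two decimals.
# Args: input_tokens cache_read_tokens cache_creation_tokens context_window
ctx_util_pct() {
  local input_tokens=$1 cache_read=$2 cache_creation=$3 context_window=$4

  # Reject a missing, zero, or non-numeric window so we never divide by zero.
  if ! [[ $context_window =~ ^[1-9][0-9]*$ ]]; then
    echo "ctx_util_pct: invalid context window: '$context_window'" >&2
    return 1
  fi

  # Utilization is computed from the sum of fresh and cached input tokens.
  awk -v i="$input_tokens" -v r="$cache_read" -v c="$cache_creation" \
      -v w="$context_window" \
      'BEGIN { printf "%.2f\n", (i + r + c) / w * 100 }'
}

# Classify a percentage against the 60-80% band proposed in item #7.
ctx_util_band() {
  awk -v p="$1" 'BEGIN {
    if (p < 60)       print "under"
    else if (p <= 80) print "in-band"
    else              print "over"
  }'
}
```

With the sample values from A.2 (2629 fresh, 15362 cache-read, 3718 cache-creation, window 1000000), `ctx_util_pct 2629 15362 3718 1000000` prints `2.17`, which `ctx_util_band` classifies as `under`. Keeping the band check separate from the arithmetic lets the harness decide whether an out-of-band value is a warning or a blocking gate.
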
### A.7 Why this is the right place to start

Tracing Infrastructure scored lowest (1/5), yet it is the cheapest high-priority fix and the only one that lifts three other findings. The data already exists on every call; the work is plumbing (a format switch + a `record_span` helper + a deterministic rollup), not building telemetry infrastructure.
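
To make the size of that plumbing concrete, here is a hedged sketch of what a `record_span` helper and a rollup could look like. Everything in it is an assumption for illustration: the function signatures, the `rollup_telemetry` name, the subset of span fields, and the premise that the phase output file contains `stream-json` lines ending in a `result` envelope shaped as in A.2. It emits the rollup as JSON; merging that into the metrics YAML is left to the caller.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; names, arguments, and span fields are assumptions.

# Append one span per phase, built from the last {"type":"result"} line
# of that phase's stream-json output.
# Args: trace_id story_id phase output_file trace_file
record_span() {
  local trace_id=$1 story_id=$2 phase=$3 output_file=$4 trace_file=$5

  # Soft dependency: skip the span instead of crashing when jq is absent.
  command -v jq >/dev/null 2>&1 || return 0
  [ -r "$output_file" ] || return 0
  mkdir -p "$(dirname "$trace_file")"

  # -R reads raw lines; fromjson? silently drops non-JSON lines such as
  # stderr noise interleaved by 2>&1. tail keeps only the final envelope.
  jq -cR --arg trace_id "$trace_id" --arg story "$story_id" --arg phase "$phase" '
    fromjson?
    | select(type == "object" and .type == "result")
    | { trace_id: $trace_id, span_id: .session_id, parent: $story, name: $phase,
        input_tokens: .usage.input_tokens,
        output_tokens: .usage.output_tokens,
        cache_read: .usage.cache_read_input_tokens,
        cost_usd: .total_cost_usd, duration_ms: .duration_ms,
        is_error: .is_error, model_usage: .modelUsage }
  ' "$output_file" | tail -n 1 >> "$trace_file"
}

# Sum the trace file into a telemetry object; no model involvement.
rollup_telemetry() {
  local trace_file=$1
  command -v jq >/dev/null 2>&1 || return 0
  jq -s '{
    total_cost_usd:      (map(.cost_usd // 0)      | add // 0),
    total_input_tokens:  (map(.input_tokens // 0)  | add // 0),
    total_output_tokens: (map(.output_tokens // 0) | add // 0),
    cache_read_tokens:   (map(.cache_read // 0)    | add // 0),
    by_phase: (group_by(.name)
               | map({key: .[0].name, value: (map(.cost_usd // 0) | add)})
               | from_entries)
  }' "$trace_file"
}
```

A call such as `record_span "$TRACE_ID" "4-3" dev "$PHASE_OUTPUT_FILE" "$TRACE_FILE"` after each phase would be the only change on the hot path (`TRACE_ID` and `TRACE_FILE` are placeholder variable names). Both functions return success without doing anything when `jq` is missing, which follows the "skip span, don't crash" caveat in A.6; a hard startup check would be the stricter alternative.
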