Agent Loop 2026 Framework — Gap Analysis
Date: 2026-06-15
Workflows Reviewed: epic-execute, epic-chain (+ scripts/epic-execute-lib/)
Benchmark: "Agent Loop Evaluation Framework (2026 Standard)" — Manus AI 5-category rubric
Status: Active
Overview
This document scores the epic-execute / epic-chain shell automation against the 2026 Agent Loop Evaluation Framework (context engineering, tool design, runtime governance, error recovery, observability). Each criterion is rated on the rubric's Legacy↔2026 axis (1–5) with concrete file:line evidence, followed by a prioritized remediation roadmap.
The headline: the objective machinery is already strong — durable external state, per-phase context isolation, real-tooling gates, and deterministic state transitions are at or near 2026 standard. The gaps cluster in governance, observability, and "prose where code belongs" — places where a rule is told to the model instead of enforced in the harness.
Scope note: This is a point-in-time review of the scripts as of the date above. Line numbers reference the current scripts/epic-execute.sh (3,328 lines), scripts/epic-chain.sh (946 lines), and scripts/epic-execute-lib/*.sh.
Scorecard
| # | Category | Criterion | Score (1–5) | Verdict |
|---|---|---|---|---|
| 1 | Context & Memory | Context Strategy | 3 | Partial — isolation yes, summarization no |
| 1 | Context & Memory | Tool Disclosure | 5 | 2026 — lean per-phase step templates |
| 1 | Context & Memory | Memory Persistence | 5 | 2026 — checkpoints / metrics / decision log |
| 1 | Context & Memory | Scaffolding Type | 3 | Partial — strong gates, residual personas |
| 2 | Tool Design | Tool Scoping | 2 | Legacy — --dangerously-skip-permissions + eval |
| 2 | Tool Design | Input/Output Validation | 3 | Partial — schema declared, not enforced |
| 2 | Tool Design | Deterministic Delegation | 5 | 2026 — state/math/writes all in bash |
| 2 | Tool Design | Failure Handling | 4 | 2026-leaning — self-heal loops |
| 3 | Governance (OWASP) | Policy Enforcement | 3 | Partial — one structural, one prompt-only |
| 3 | Governance (OWASP) | Privilege Tiers | 2 | Legacy — one standing autonomous tier, no HITL |
| 3 | Governance (OWASP) | Identity Management | 2 | Legacy — ambient long-lived creds |
| 3 | Governance (OWASP) | Memory Security | 2 | Legacy — unvalidated write-back + re-ingest |
| 4 | Error Recovery / RALF | Session Architecture | 3 | Partial — fresh process/phase, no watchdog |
| 4 | Error Recovery / RALF | Error Diagnosis | 3 | Partial — self-heal yes, retry wrapper dead code |
| 4 | Error Recovery / RALF | Loop Guardrails | 4 | 2026 — hard caps; missing time/token budgets |
| 4 | Error Recovery / RALF | Exit Conditions | 4 | 2026 — deterministic gates; some advisory |
| 5 | Observability | Tracing Infrastructure | 1 | Legacy — no trace ID, no token/cost/OTel |
| 5 | Observability | Evaluation Metrics | 3 | Partial — raw counters, self-graded scores |
| 5 | Observability | Evaluation Integration | 4 | 2026-leaning — embedded inline gates |
Category averages: Context 4.0 · Tools 3.5 · Governance 2.25 · Error Recovery 3.5 · Observability 2.67
What's Already at the 2026 Bar
Memory Persistence (5/5)
Durable external state survives session death — the rubric's "external state files" pattern:
- Checkpoint files with 7-day expiry: utils.sh:500-551, written on exit at epic-execute.sh:74-82
- Metrics YAML that resumes and accumulates counters across sessions: epic-execute.sh:646-664, :779-785
- Per-story design plans persisted for post-resume dev phases: epic-execute.sh:144-145
- Decision log + sprint-status.yaml as external workflow state: decision-log.sh, epic-execute.sh:832-906
Tool Disclosure (5/5)
Progressive disclosure: loads lean ~4–9KB step templates per phase instead of embedding ~40KB workflow YAML — epic-execute.sh:551-563, per-phase template selection :179-183.
Deterministic Delegation (5/5)
The LLM emits intent only (a status enum + findings); all counting, duration math, status writes, and pass/fail verdicts run in bash — epic-execute.sh:693-797, :803-906; contract-exec.sh:144-170. Playwright specs are generated from the harness (contract-exec.sh:220-296), not authored by the model.
Embedded Probes (Evaluation Integration, 4/5)
Quality gates run inline per story, not offline batch: arch / test-quality / traceability / static-analysis / regression / contract. The static-analysis and contract gates run real tooling (tsc, lint, build, pytest, curl, playwright) whose actual output gates progression and can fail the epic's exit code — epic-execute.sh:1986-2307, :3203-3208.
Improvement Areas
HIGH Priority
1. Observability foundation: no trace ID, no token/cost telemetry (Tracing 1/5)
The weakest foundational layer — and the rubric explicitly advises fixing the lowest foundational layer first.
There is no session/trace ID, no OpenTelemetry, and zero token/cost/latency capture. LLM calls are an opaque blob teed to a PID-named text log; the only correlation key is $$.
Evidence:
- Plain-text logging, no structured fields/spans: epic-execute.sh:199-217
- LLM call uninstrumented (no tokens/cost/latency/model captured): run_claude_to_file, epic-execute.sh:518-529
- Metrics schema has no tokens/cost/trace ID: epic-execute.sh:668-688
- Cost figures in reports are openly estimated ("may vary 50–200%"): epic-chain-execution-report.md:225-248
Fix: Generate a session/trace ID at startup and thread it through every phase + the metrics/log/decision-log writes. Capture real usage by invoking claude with --output-format json (or --output-format stream-json) and parsing the usage / total_cost_usd fields per phase. This unlocks every downstream metric.
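A minimal sketch of the first half of this fix, generating one trace ID at startup and stamping it on every log line. The helper names new_trace_id and log_event and the key=value layout are illustrative assumptions, not code from epic-execute.sh.

```shell
#!/usr/bin/env bash
# Sketch only: new_trace_id and log_event are hypothetical helpers,
# not functions from epic-execute.sh.

# Produce a unique ID, trying progressively more portable sources.
new_trace_id() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr 'A-Z' 'a-z'
  elif [ -r /proc/sys/kernel/random/uuid ]; then
    cat /proc/sys/kernel/random/uuid
  else
    # Fallback: 16 random bytes as 32 hex chars (not a formatted UUID).
    od -An -N16 -tx1 /dev/urandom | tr -d ' \n'
  fi
}

# Emit one key=value log line that always carries the trace ID.
# Usage: log_event <level> <phase> <story_id> <message>
log_event() {
  printf 'ts=%s trace_id=%s level=%s phase=%s story=%s msg="%s"\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$TRACE_ID" "$1" "$2" "$3" "$4"
}

# Reuse an inherited ID if a parent process already set one.
TRACE_ID="${TRACE_ID:-$(new_trace_id)}"
export TRACE_ID

log_event info dev 4-3 "phase started"
```

Exporting the variable means a wrapper process and the inner run it launches would share one ID without any extra argument passing.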
2. Resilience layer is fully built but never wired in (Error Recovery)
execute_claude_with_retry, run_with_timeout, and CLAUDE_TIMEOUT (600s default) all exist in utils.sh:55-150 — but have zero callers. Every phase uses bare run_claude_to_file (epic-execute.sh:518-529) with || true, so:
- A hung phase blocks forever (CLAUDE_TIMEOUT is never applied to the real path)
- Transient errors (429/timeout/503) are not retried
- Crashes are swallowed and silently judged "incomplete"
There is also no watchdog/supervisor and no stuck-loop progress detection — fix loops burn all attempts even on an identical failure set, despite computing failure signatures in test-failure-filter.sh:139-151.
Fix: Route run_claude_to_file through the existing run_with_timeout + retry wrapper (mostly plumbing — the code already exists). Add a progress check that compares failure signatures across fix iterations and aborts early when the set is unchanged.
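The two mechanisms named in this fix can be sketched as follows. The function names are hypothetical and the real wrappers in utils.sh may have different signatures; the sketch assumes GNU coreutils timeout is available and, for brevity, retries on any non-zero exit rather than only on transient errors.

```shell
#!/usr/bin/env bash
# Sketch only: retry_with_timeout, failure_signature and is_stuck are
# hypothetical names, not the existing helpers in utils.sh.

# Run a command under a time limit, retrying with linear backoff.
# Usage: retry_with_timeout <max_attempts> <timeout_secs> <cmd> [args...]
retry_with_timeout() {
  local max_attempts=$1 limit=$2 attempt=1 rc=0
  shift 2
  while :; do
    timeout "$limit" "$@" && return 0
    rc=$?   # 124 means the time limit was hit (GNU coreutils timeout)
    [ "$attempt" -ge "$max_attempts" ] && return "$rc"
    sleep "$(( attempt * ${RETRY_BACKOFF_SECS:-5} ))"
    attempt=$(( attempt + 1 ))
  done
}

# Order-independent signature of a failure set (stdin, one failure per line).
failure_signature() {
  LC_ALL=C sort -u | cksum | cut -d' ' -f1
}

# Returns 0 (stuck) when the current signature equals the previous one.
is_stuck() {
  [ -n "$1" ] && [ "$1" = "$2" ]
}

# Two fix iterations that fail the same tests in a different order.
sig1=$(printf 'test_login\ntest_cart\n' | failure_signature)
sig2=$(printf 'test_cart\ntest_login\n' | failure_signature)
if is_stuck "$sig1" "$sig2"; then
  echo "no progress between fix attempts; aborting fix loop early"
fi
```

Sorting before hashing keeps the signature stable when the test runner reports the same failures in a different order, which is the case the early abort is meant to catch.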
3. Governance is structurally bypassable (Privilege 2/5, Policy partial)
The strong git policies — check_sensitive_files (epic-execute.sh:351-391), git add -u (:2937), check_branch_protection (utils.sh:444-474) — only guard the final commit_story path. Mid-phase, every agent runs --dangerously-skip-permissions (:524, :527) and can git add -A / commit anything itself. The "don't use git add -A" rule is injected as prose in 7+ places (:597-598, :1772, :1875, …) but enforced in code only at commit time. There is no read/write/destructive privilege separation and no HITL gate ("AUTOMATED… do NOT pause for user confirmation" — :593-595).
Fix: Move the staging policy into a git pre-commit hook (structural, not prose) so it governs agent-authored commits too; re-run check_sensitive_files on any commit. Introduce an approval tier (even a coarse env-gated one) for destructive operations.
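As a rough illustration of the structural variant, a pre-commit hook body could look like the sketch below. The pattern list is an assumption for illustration and is not the list check_sensitive_files uses; find_sensitive_paths and run_pre_commit are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch of a .git/hooks/pre-commit body. The patterns are illustrative
# assumptions, not the list used by check_sensitive_files.

SENSITIVE_PATTERNS=('.env' '.env.*' '*.pem' '*.key' 'id_rsa*' '*credentials*')

# Read staged paths on stdin (one per line); print offenders, return 1 if any.
find_sensitive_paths() {
  local path base pattern found=0
  while IFS= read -r path; do
    base=${path##*/}
    for pattern in "${SENSITIVE_PATTERNS[@]}"; do
      # Unquoted $pattern on the right side of == is matched as a glob.
      if [[ $base == $pattern ]]; then
        printf '%s\n' "$path"
        found=1
        break
      fi
    done
  done
  return "$found"
}

# Hook entry point: git supplies the staged set, the function decides.
run_pre_commit() {
  local offenders
  if ! offenders=$(git diff --cached --name-only --diff-filter=ACMR | find_sensitive_paths); then
    echo "pre-commit: refusing to commit sensitive files:" >&2
    printf '  %s\n' "$offenders" >&2
    return 1
  fi
}

# In the installed hook file the last line would be:
#   run_pre_commit || exit 1
```

Because the hook fires on every git commit in the repository, it covers commits an agent authors mid-phase as well as the final commit_story path. An agent running with unrestricted permissions could still pass --no-verify, so this narrows the gap rather than closing it.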
4. Memory poisoning loop is unguarded (Memory Security 2/5)
append_to_decision_log writes raw agent output straight to disk (decision-log.sh:54-76), and get_decision_log_context re-injects the whole log verbatim into the next phase's prompt (decision-log.sh:80-87) — the exact RAG/memory-poisoning loop OWASP 2026 (ASI06) warns against, with no validation, segmentation, or provenance tagging. add_metrics_issue / record_fix_attempt also interpolate agent-influenced strings directly into yq expressions (epic-execute.sh:725, :745) — a YAML/expression-injection surface.
Fix: Validate and length-bound decision-log entries before commit; tag provenance (which phase/story produced each entry); sanitize or parameterize strings before they enter yq expressions.
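One possible shape for the write-side guard is sketched here. The helper names and the 2000-byte bound are assumptions chosen for illustration; the yq note in the comments assumes the mikefarah yq v4 variant, which may not be the one the scripts use.

```shell
#!/usr/bin/env bash
# Sketch only: sanitize_log_entry and tag_entry are hypothetical helpers, and
# the 2000-byte bound is an arbitrary assumption.

MAX_ENTRY_BYTES=${MAX_ENTRY_BYTES:-2000}

# Strip control characters (keeping tab and newline) and bound the length.
sanitize_log_entry() {
  LC_ALL=C tr -d '\000-\010\013-\037\177' | head -c "$MAX_ENTRY_BYTES"
}

# Prefix an entry with provenance so later phases can see what produced it.
# Usage: tag_entry <phase> <story_id> < raw_text
tag_entry() {
  local phase=$1 story=$2
  printf '[source=agent phase=%s story=%s]\n' "$phase" "$story"
  sanitize_log_entry
  printf '\n'
}

printf 'Chose optimistic locking\033[31m for cart updates' | tag_entry dev 4-3

# For the yq surface, passing the string through the environment avoids
# building an expression from agent text. With mikefarah yq v4 (an assumption
# about which yq is installed) that would look like:
#   ISSUE_TEXT="$text" yq -i '.issues += [strenv(ISSUE_TEXT)]' metrics.yaml
```

The provenance line gives a reader of the log, human or model, a way to treat agent-authored entries as data rather than instructions.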
5. Output contracts declared but not enforced (I/O Validation 3/5)
JSON result schemas are prescribed to the model and parsed with jq, but malformed/missing output silently falls back to grepping prose (json-output.sh:311-387, check_phase_completion_fuzzy in utils.sh). The documented consequence: 9 stories were mis-marked failed because the model didn't emit the exact IMPLEMENTATION COMPLETE: phrase, requiring manual correction (epic-chain-execution-report.md:254-272).
Fix: Reject non-conforming output and force a bounded retry instead of degrading to regex. Make the JSON result block mandatory (fail the phase if absent). Relatedly, promote the advisory gates (arch / test-quality / traceability / regression — currently "proceed with documented concerns") to blocking where the risk warrants it, matching the deterministic behavior of the static-analysis and contract gates.
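A sketch of the fail-and-retry behavior, assuming jq is installed. The status enum shown is a stand-in for whatever the real result schema prescribes, and validate_result and run_phase_strict are hypothetical names.

```shell
#!/usr/bin/env bash
# Sketch only: validate_result and run_phase_strict are hypothetical, and the
# status enum is an assumed example rather than the script's actual schema.

# Succeed only if stdin is a JSON object whose .status is an allowed value.
validate_result() {
  jq -e 'type == "object" and
         (.status == "COMPLETE" or .status == "FAILED" or .status == "BLOCKED")' \
    >/dev/null 2>&1
}

# Re-run a phase command until its output validates, up to max_attempts.
# Usage: run_phase_strict <max_attempts> <cmd> [args...]
run_phase_strict() {
  local max_attempts=$1 attempt output
  shift
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    output=$("$@" 2>/dev/null) || true
    if printf '%s' "$output" | validate_result; then
      printf '%s\n' "$output"
      return 0
    fi
    echo "attempt $attempt/$max_attempts: output failed schema check" >&2
  done
  return 1   # fail the phase; never fall back to grepping prose
}
```

With this shape, a phase that prints only a completion phrase in prose fails validation and is retried or failed explicitly, instead of being classified by a regex.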
MEDIUM Priority
6. Tool Scoping & sandboxing (Tool Scoping 2/5)
claude --dangerously-skip-permissions grants the full unrestricted toolset with no per-phase allowlist or sandbox (epic-execute.sh:524, :527). Harness commands run via raw eval on YAML-derived strings — an injection surface — in contract-exec.sh:43,53,86,168 and contract-harness.sh:333,351,368. The production-scope datastore guard is advisory (log_warn), not a block (contract-harness.sh:194-213).
Fix: Run harness commands as argv arrays (no eval). Consider a per-phase tool allowlist and/or containerized execution. Promote the production-scope datastore guard from warning to hard block.
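The difference between eval and argv execution can be shown in a few lines. run_argv is a hypothetical helper, and the sketch assumes each harness command is available as a list of separate arguments rather than a single shell string.

```shell
#!/usr/bin/env bash
# Sketch only: run_argv is a hypothetical helper. It assumes each harness
# command is stored as a list of arguments rather than one shell string.

# Execute a command given as separate arguments. Nothing is re-parsed by the
# shell, so metacharacters inside an argument stay literal data.
run_argv() {
  if [ "$#" -eq 0 ]; then
    echo "run_argv: empty command" >&2
    return 2
  fi
  "$@"
}

# A value that would be dangerous under eval.
untrusted='http://localhost:3000/health; echo INJECTED'

# Unsafe pattern (shown, not executed): the shell would parse the ';'.
#   eval "curl -fsS $untrusted"

# Safe pattern: build an array, expand it quoted.
cmd=(printf 'GET %s\n' "$untrusted")
run_argv "${cmd[@]}"
```

The injected text arrives as part of a single argument and is printed, not executed. Adopting this would likely mean storing harness commands in YAML as sequences instead of strings, which is a schema change the sketch does not cover.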
7. Context summarization (Context Strategy 3/5)
Per-phase context isolation is excellent (fresh claude process per phase, paths-not-contents handoff), but there is no anchored iterative summarization: cross-phase carryover is raw grep/sed-extracted text or tail-truncation (decision context truncated to 20KB at epic-execute.sh:1466), and the only control is a hard 150KB cap (MAX_PROMPT_SIZE, :398), not a utilization band.
Fix: Add a summarization step between phases — hold an anchor block (story + ACs + constraints) constant while condensing completed-phase outcomes into a structured summary; target 60–80% utilization rather than a hard truncate.
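The sketch below covers only the control logic around such a step: when to summarize and what byte budget to hand the summarizer. The summarization call itself is omitted. It measures utilization against the prompt-size cap, assumes 150KB means 150 × 1024 bytes, and uses hypothetical function names.

```shell
#!/usr/bin/env bash
# Sketch only: context_action and history_target_bytes are hypothetical. The
# budget mirrors MAX_PROMPT_SIZE from the text (assumed to be 150 * 1024
# bytes); the 60-80% band is the target named in the fix.

MAX_PROMPT_SIZE=$(( 150 * 1024 ))
BAND_LOW_PCT=60
BAND_HIGH_PCT=80

# Decide what to do with carried-over context.
# Usage: context_action <anchor_bytes> <history_bytes>
# Prints: ok | summarize | anchor-too-large
context_action() {
  local anchor=$1 history=$2
  local high=$(( MAX_PROMPT_SIZE * BAND_HIGH_PCT / 100 ))
  if [ "$anchor" -gt "$high" ]; then
    echo "anchor-too-large"   # the anchor is never condensed, so flag it
  elif [ $(( anchor + history )) -gt "$high" ]; then
    echo "summarize"          # condense history only; keep the anchor verbatim
  else
    echo "ok"
  fi
}

# Byte budget for the condensed history so the total lands at the low edge
# of the band.
history_target_bytes() {
  local anchor=$1
  echo $(( MAX_PROMPT_SIZE * BAND_LOW_PCT / 100 - anchor ))
}

context_action 8000 120000      # prints: summarize
history_target_bytes 8000       # prints: 84160
```

Summarizing down to the low edge rather than just under the high edge leaves headroom, so the next phase does not immediately trigger another pass.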
8. Identity Management (2/5)
Every claude call inherits the operator's ambient, long-lived credentials; harness secrets are consumed as ambient env vars (contract-harness.sh:205, :254). No unique per-task identity, short-lived tokens, or credential scoping.
Fix: Where feasible, issue short-lived/scoped credentials per run; vault harness secrets rather than relying on ambient env.
9. Evaluation Metrics — derive rates, separate the judge (Metrics 3/5)
The raw inputs exist (completed/failed/skipped, fix attempts, max_retries_hit) but are never computed into Task Completion Rate / Escalation Rate / Tool Call Success Rate. Rubric scores (test-quality ≥70, traceability P0=100%) are self-graded by the same executing model (json-output.sh:473-496) rather than by an independent calibrated judge.
Fix: Compute and persist the derived rates in the metrics YAML. Introduce a separate, cheaper judge model (e.g., Haiku) for binary rubric scoring so the executor isn't grading its own work.
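Deriving the rates is plain arithmetic on counters the harness already holds. In this sketch rate_pct is a hypothetical helper and the counter values are made-up sample inputs, not figures from a real run.

```shell
#!/usr/bin/env bash
# Sketch only: rate_pct is hypothetical, and the counter values are made-up
# sample inputs, not figures from a real run.

# Print numerator/denominator as a percentage with one decimal place, or
# "n/a" when the denominator is zero.
rate_pct() {
  LC_ALL=C awk -v n="$1" -v d="$2" 'BEGIN {
    if (d == 0) { print "n/a"; exit }
    printf "%.1f\n", 100 * n / d
  }'
}

completed=18 failed=1 skipped=1 max_retries_hit=2
total=$(( completed + failed + skipped ))

echo "task_completion_rate_pct: $(rate_pct "$completed" "$total")"
echo "escalation_rate_pct: $(rate_pct "$max_retries_hit" "$total")"
```

Keeping the calculation in the harness, not in a prompt, matches the deterministic-delegation pattern the scripts already follow for counting and status writes.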
LOW Priority
10. yq-dependent durability
Metrics, sprint-status, and issue persistence silently degrade without yq installed (epic-execute.sh:707, :790). The otherwise-excellent memory layer is best-effort, not guaranteed.
Fix: Either declare yq a hard prerequisite (fail fast at startup) or harden the sed/awk fallbacks to full parity.
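The fail-fast option is small enough to sketch in full. require_tools is a hypothetical helper; the tool list in the comment is an example.

```shell
#!/usr/bin/env bash
# Sketch only: require_tools is a hypothetical helper.

# Report every missing tool, then return 1 if any are absent from PATH.
require_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required tool: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# At startup, before any state is written (left commented so the sketch
# runs on machines without these tools):
#   require_tools yq jq git || exit 1
```

Checking all tools before returning means one run reports every missing prerequisite instead of stopping at the first.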
11. Vestigial "full workflow YAML" priority tier
CONTENT_PRIORITY_LOW is still described as "Full workflow YAML (truncate first)" (epic-execute.sh:404), a legacy fallback path no active builder uses. Remove to avoid confusion.
12. Gate status not persisted by standalone runs
validation.gate_status is written by the chain wrapper (epic-chain.sh:626-638), not the inner execute loop, so a standalone epic-execute.sh run leaves gate_status: PENDING.
Two Structural Themes
- Prose where code belongs. The recurring pattern (git rules, "do NOT pause", personas like "You ARE an adversarial reviewer" at :1616) is the compensatory scaffolding the rubric flags: a rule told to the model that the harness could instead enforce. The codebase is mid-migration; the objective gates (contracts, real tooling, JSON results) are already constitutive, but the soft rules haven't caught up.
- Built-but-unwired. The retry/timeout resilience layer (utils.sh:55-150) is the clearest example: fully implemented, zero callers. The capability gap is often plumbing, not net-new code.
Suggested Sequencing
Per the rubric's "fix the lowest-scoring foundational layer first":
- Observability (#1): trace/session ID + real token/cost/latency from claude --output-format json. Foundation for everything; currently 1/5.
- Wire the existing retry/timeout layer (#2): pure plumbing, already-written code, large RALF payoff.
- Governance (#3, #4): pre-commit hook + sensitive-file re-check on agent commits; validate decision-log/metrics writes before commit.
- Enforce JSON contracts (#5): fail-and-retry on missing signal instead of fuzzy fallback; promote advisory gates to blocking.
- Context summarization (#7): anchored iterative summarization targeting a utilization band.
The smallest, highest-leverage starting points are #2 (retry wiring) and #3 (pre-commit hook).
References
- Benchmark source: "Agent Loop Evaluation Framework (2026 Standard)," Manus AI — context engineering, tool design, OWASP Top 10 for Agentic Apps 2026, RALF loop, OpenTelemetry-first observability.
- Prior review: epic-workflows-v1.md (2026-01-02), which overlaps on the --dangerously-skip-permissions finding (#1/#3, #6 here).
Appendix A — Observability Deep Dive
Added: 2026-06-15 · Expands HIGH-priority item #1 and Evaluation Metrics (#9).
A.1 Root cause: a single discard point
Every LLM call routes through run_claude_to_file (epic-execute.sh:518-529), which uses the CLI's default text output format:
claude --dangerously-skip-permissions -p "$prompt" 2>&1 | tee -a "$LOG_FILE" > "$PHASE_OUTPUT_FILE" || true
Only rendered assistant text survives. The chain report generator does the same (epic-chain.sh:884). The 1/5 Tracing score is the consequence of this one choice — not an architectural limit. The telemetry is produced on every call and thrown away.
A.2 What claude --output-format json already returns (verified)
Tested against the installed CLI (v2.1.177). The result envelope contains every field 2026 observability requires:
| Field (verified present) | Example | Rubric need it satisfies |
|---|---|---|
| session_id | f6ff5b55-… | Trace/session ID (today: PID $$) |
| total_cost_usd | 0.0586 | Real cost (today: fabricated) |
| usage.input_tokens / output_tokens | 2629 / 4 | Token spend |
| usage.cache_read_input_tokens / cache_creation_input_tokens | 15362 / 3718 | Cache efficiency |
| modelUsage[model].costUSD + per-model tokens | Opus + Haiku sub-agent | Per-model cost attribution |
| modelUsage[model].contextWindow | 1000000 | Enables context-utilization % |
| duration_ms / duration_api_ms / ttft_ms | 1757 / 2522 / 1754 | Per-call latency |
| num_turns, stop_reason, is_error, api_error_status, permission_denials | 1 / end_turn / false / null / [] | Tool-call success / error telemetry |
The CLI also exposes --output-format stream-json (live JSONL ending with the same result envelope) and --json-schema <schema> for structured-output enforcement. All three require --print, which the script already passes.
A.3 The current report is actively misleading, not merely empty
Because real telemetry is discarded, the chain report fabricates it (epic-chain-execution-report.md:225-248):
- Token table derived from Est. Calls = stories × 2 and ~16K input/call assumptions: arithmetic on story counts, not measurement.
- Cost table priced against Claude Sonnet 3.5 ($3/$15) and Opus ($15/$75); neither is the model that ran (claude-opus-4-8[1m]), and the real total_cost_usd was available and discarded.
- Carries the disclaimer "Actual usage may vary by 50-200%."
An authoritative-looking cost table that is invented is worse than a blank cell — it is unfalsifiable noise where ground truth was one flag away.
A.4 Synergy: this fix lifts three other findings
The same envelope partially closes gaps scored elsewhere:
- Context Strategy (#7). contextWindow combined with input_tokens + cache_read + cache_creation yields exact per-phase utilization, making the 60–80% target measurable and enforceable for free.
- I/O Validation (#5) + the 9-mismark incident. Parsing .result (clean final message) instead of scraping interleaved stdout, plus --json-schema to make the status field structurally mandatory, removes the fuzzy-regex fallback (json-output.sh:311-387) that mismarked 9 stories (epic-chain-execution-report.md:254-272). That incident is fundamentally an output-format problem.
- Evaluation Metrics (#9). Enables the rubric's business metrics: Task Completion Rate (completed/total), Escalation Rate (max_retries_hit/stories + real is_error rate), Tool Call Success Rate (is_error=false phases ÷ total).
A.5 Target design (fits the existing architecture)
Constraints: preserve the memory-safe "pipe to file, read 32KB tail" pattern, and keep the live tee to the log.
- Switch to stream-json, not plain json. Plain json buffers and kills the live tee. --output-format stream-json --include-partial-messages streams live and makes the last JSON line the result envelope; read_phase_tail still captures it (parse the last line where .type=="result"). Memory-safety preserved.
- One append-only trace file per epic, using the OTel span data model (convertible to OTLP later): docs/sprint-artifacts/traces/epic-<id>-trace.jsonl, one span per phase:

  ```json
  {"trace_id":"<epic-uuid>","span_id":"<claude session_id>","parent":"<story_id>",
   "name":"dev","story_id":"4-3","model":"claude-opus-4-8[1m]",
   "input_tokens":2629,"output_tokens":4,"cache_read":15362,"cost_usd":0.058,
   "duration_ms":1757,"ttft_ms":1754,"num_turns":1,"is_error":false,
   "ctx_util_pct":2.1,"status":"COMPLETE","ts":"2026-06-15T…"}
  ```

  Generate one epic-level trace_id (uuidgen) at startup; each call's session_id is the span_id, story_id the parent. This is the single correlating ID $$ never provided.
- Deterministic rollup into metrics.yaml: add a telemetry: block summed from the JSONL (no model, no fabrication): total_cost_usd, total_input_tokens, total_output_tokens, cache_read_tokens, by_phase. The chain report then reads measured numbers; the Estimated Token Usage section is deleted.
- OTel bridge (phase 2, optional). JSONL-with-OTel-fields is the pragmatic 80%. A later post-processor converts spans → OTLP without touching the hot path.
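The span capture and rollup steps above could be sketched as follows, assuming jq is installed. record_span and rollup_telemetry are hypothetical helpers; the field names follow the result envelope listed in A.2, and the rollup emits JSON rather than writing metrics.yaml directly.

```shell
#!/usr/bin/env bash
# Sketch only: record_span and rollup_telemetry are hypothetical helpers.
# Field names follow the result envelope in A.2; the file layout is assumed.

# Append one span to the trace file from a phase's stream-json output.
# Usage: record_span <trace_file> <trace_id> <story_id> <phase> < output.jsonl
record_span() {
  local trace_file=$1 trace_id=$2 story_id=$3 phase=$4
  command -v jq >/dev/null 2>&1 || return 0   # degrade gracefully: skip span
  # -R reads raw lines; fromjson? drops lines that are not valid JSON.
  jq -c -R --arg trace_id "$trace_id" --arg story "$story_id" --arg phase "$phase" '
    fromjson?
    | select(type == "object" and .type == "result")
    | {trace_id: $trace_id, span_id: .session_id, parent: $story, name: $phase,
       input_tokens: .usage.input_tokens, output_tokens: .usage.output_tokens,
       cache_read: .usage.cache_read_input_tokens, cost_usd: .total_cost_usd,
       duration_ms: .duration_ms, num_turns: .num_turns, is_error: .is_error}
  ' | tail -n 1 >> "$trace_file"
}

# Sum measured values across all spans in a trace file; prints a JSON object.
rollup_telemetry() {
  jq -s '{
    total_cost_usd:      (map(.cost_usd // 0)      | add // 0),
    total_input_tokens:  (map(.input_tokens // 0)  | add // 0),
    total_output_tokens: (map(.output_tokens // 0) | add // 0),
    cache_read_tokens:   (map(.cache_read // 0)    | add // 0)
  }' "$1"
}
```

A phase would pipe its captured output into record_span once it finishes, and the rollup would run at the end of the epic. Both steps read only what the CLI reported, so no figure in the resulting telemetry block is estimated.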
A.6 Caveats
- jq dependency: telemetry parsing needs jq (same soft-dep fragility as yq, item #10). Degrade gracefully (skip span, don't crash); consider making jq a hard startup prerequisite.
- Cost includes sub-agents: modelUsage surfaced an internal Haiku call inside an Opus phase. Record modelUsage verbatim; don't flatten to one model.
- Cache tokens dominate: in testing, cache-read (15K) was 6× fresh input (2.6K). Report fresh vs. cache-read separately; compute utilization from the sum.
- stream-json is noisier on disk: log grows faster (every partial chunk). The existing 64KB inter-story log truncation (epic-execute.sh:3182) mitigates; confirm it suffices.
A.7 Why this is the right place to start
Observability scored lowest yet is the cheapest high-priority fix and the only one that drags three other findings upward. The data already exists on every call; the work is plumbing (a format switch + a record_span helper + a deterministic rollup), not building telemetry infrastructure.