BMAD Epic-Execute Improvements Analysis v2
Date: 2026-01-26
Analyzed Script: scripts/epic-execute.sh
Purpose: Improve the performance and reliability of code generated by the epic-execute automation
Executive Summary
The current epic-execute script orchestrates AI agents to implement stories through multiple phases (dev, architecture compliance, code review, test quality, traceability, UAT generation). While the multi-phase approach with self-healing fix loops is sound, there are fundamental improvements that would significantly enhance the reliability of generated code.
Core Problem: The system relies on AI self-reporting ("tests pass", "build succeeds") rather than actual verification. This creates systematic blind spots where AI can hallucinate success.
Current Flow Analysis
Story Execution Pipeline:
1. DEV PHASE → AI implements story
2. ARCH COMPLIANCE → AI validates against architecture.md
3. CODE REVIEW → AI reviews in fresh context (adversarial)
4. TEST QUALITY → AI reviews tests for quality patterns
5. TRACEABILITY → AI maps acceptance criteria to tests
6. UAT GENERATION → AI creates manual test document
Each phase runs in an isolated context (good for adversarial review) but relies entirely on AI judgment for validation.
Identified Weaknesses
1. No Real Tooling Verification
The script trusts AI to report:
- "Tests pass" - but did they actually run?
- "Build succeeds" - no actual build command executed
- "No type errors" - no actual type checking
Impact: AI can hallucinate success. Code that doesn't compile can be marked as "complete".
2. No Baseline/Regression Testing
Each story is executed in isolation. There's no verification that:
- Story N doesn't break Story N-1
- Overall test count doesn't decrease
- Coverage doesn't regress
Impact: Later stories can silently break earlier work.
3. Tests Written After Implementation (Not TDD)
Current flow: Implement → Write tests → Review
Impact:
- Tests often test implementation, not requirements
- Tests may be written to pass, not to verify behavior
- Missing edge cases because dev already knows the code
4. No Pre-Implementation Design Review
The dev phase jumps straight into coding. For complex stories, this leads to:
- Architectural decisions made implicitly during coding
- Refactoring when initial approach doesn't work
- Inconsistent patterns within the epic
5. AI Marking Its Own Homework
The same system (Claude):
- Writes the code
- Reviews the code
- Says "tests pass"
Impact: Systematic blind spots get reinforced across phases.
6. Context Loss Between Phases
Each phase runs in a fresh context. This is good for adversarial review, but:
- Fix phase doesn't know WHY dev made certain decisions
- Review doesn't know about trade-offs considered
- Decisions get made, lost, and remade differently
7. No Incremental Validation
Complex stories are implemented all-at-once. If something fails at the end, the entire implementation might need redoing.
8. Completion Signal Parsing is Fragile
The script uses a regex to find "IMPLEMENTATION COMPLETE" in the output. If the AI emits this string in a different context (explaining what to output, quoting instructions), the match triggers a false positive.
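The false positive is easy to reproduce. The sketch below assumes the check is a plain substring match (the script's exact pattern is not shown here) and contrasts it with a stricter, hypothetical variant that accepts the signal only when it is the last non-empty line of the output. Both function names are illustrative.

```shell
#!/usr/bin/env bash
# Assumed current behavior: substring match anywhere in the output.
naive_is_complete() {
    grep -q "IMPLEMENTATION COMPLETE" <<< "$1"
}

# Hypothetical stricter check: the signal must be the last non-empty line.
strict_is_complete() {
    local last_line
    last_line=$(grep -v '^[[:space:]]*$' <<< "$1" | tail -n 1)
    [ "$last_line" = "IMPLEMENTATION COMPLETE" ]
}

quoted='I will print "IMPLEMENTATION COMPLETE" once the tests pass.'
naive_is_complete "$quoted" && echo "naive: false positive"
strict_is_complete "$quoted" || echo "strict: correctly rejected"
```

A positional check like this is still a heuristic, which is part of the case for the structured output proposed later in this document.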
9. Missing Dependency Validation
Before implementing a story, there is no check that:
- Required npm packages are installed
- Required services are running
- Prerequisite stories are complete
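A preflight check along these lines could run before the dev phase. This is a minimal sketch, not part of the current script: the function names are hypothetical, and the `Status: done` line used to detect a finished prerequisite story is an assumed file convention that would need to match the real story format.

```shell
#!/usr/bin/env bash
# Print each requested package that is not installed under node_modules.
# Returns non-zero if anything is missing.
missing_packages() {
    local project_root="$1"; shift
    local pkg missing=0
    for pkg in "$@"; do
        if [ ! -f "$project_root/node_modules/$pkg/package.json" ]; then
            echo "$pkg"
            missing=1
        fi
    done
    return "$missing"
}

# Succeed when a story file carries a "Status: done" line (assumed convention).
story_is_done() {
    [ -f "$1" ] && grep -qiE '^Status:[[:space:]]*done[[:space:]]*$' "$1"
}
```

A caller could log the missing items and mark the story as blocked instead of starting an implementation that cannot succeed.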
Recommended Improvements
1. Static Analysis Gate (HIGH IMPACT, LOW EFFORT) ✅ IMPLEMENTED
Add a real tooling validation step that runs actual commands between dev and review phases.
execute_static_analysis_gate() {
    local story_id="$1"
    local failures=0
    local test_status=0
    log ">>> STATIC ANALYSIS GATE: $story_id"
    # Detect project type and run appropriate checks
    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # TypeScript/JavaScript project
        # 1. Type checking (catches type errors AI might miss)
        if grep -q '"typecheck"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running type check..."
            # pipefail makes the pipeline report npm's exit status instead of tee's
            if ! (set -o pipefail; npm run typecheck 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Type check failed"
                # Plain assignment: ((failures++)) returns non-zero when failures is 0,
                # which aborts the script under set -e
                failures=$((failures + 1))
            fi
        fi
        # 2. Linting (catches code style/quality issues)
        if grep -q '"lint"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running lint..."
            if ! (set -o pipefail; npm run lint 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Lint failed"
                failures=$((failures + 1))
            fi
        fi
        # 3. Build (catches compilation errors)
        if grep -q '"build"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running build..."
            if ! (set -o pipefail; npm run build 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Build failed"
                failures=$((failures + 1))
            fi
        fi
        # 4. Tests (catches actual test failures)
        log "Running tests..."
        TEST_OUTPUT=$(npm test 2>&1) || test_status=$?
        echo "$TEST_OUTPUT" >> "$LOG_FILE"
        # Trust npm's exit status; grepping for "Error" also matches passing
        # tests whose names or logs mention errors
        if [ "$test_status" -ne 0 ]; then
            log_error "Tests failed"
            # Capture actual failures for fix phase
            ACTUAL_TEST_FAILURES=$(echo "$TEST_OUTPUT" | grep -A 10 "FAIL\|Error" || true)
            failures=$((failures + 1))
        fi
    fi
    if [ "$failures" -gt 0 ]; then
        log_error "Static analysis gate failed with $failures issue(s)"
        return 1
    fi
    log_success "Static analysis gate passed"
    return 0
}
Insertion Point: After execute_dev_phase(), before arch compliance.
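One detail matters whenever a gate pipes tool output through `tee`: in bash, a pipeline's exit status is that of its last command unless `pipefail` is set, so a failing check can be reported as success. The helper below is a hypothetical sketch (the name `run_logged` is not from the script) that logs a command's output and still returns the command's own exit status.

```shell
#!/usr/bin/env bash
# Run a command, append its combined output to a log file, and return the
# command's own exit status rather than tee's.
run_logged() {
    local log_file="$1"; shift
    "$@" 2>&1 | tee -a "$log_file"
    return "${PIPESTATUS[0]}"
}

# Demonstration with a stand-in for a failing check.
demo_log=$(mktemp)
run_logged "$demo_log" sh -c 'echo "type error"; exit 2' || echo "exit status: $?"
```

In the gate this would be called as `if ! run_logged "$LOG_FILE" npm run typecheck; then ...`. If the calling script enables both `set -e` and `pipefail`, the helper should be invoked from an `if` condition like this so the failing pipeline does not abort the script first.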
2. Regression Test Gate (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED
Track test baseline and verify no regressions after each story.
# Initialize at epic start
init_regression_baseline() {
    BASELINE_PASSING_TESTS=0
    BASELINE_COVERAGE=0
    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # Capture baseline test count from the runner's text summary ("N passing").
        # JSON reporter output does not contain that summary, so the plain run is used.
        BASELINE_TEST_OUTPUT=$(npm test 2>&1) || true
        BASELINE_PASSING_TESTS=$(echo "$BASELINE_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1 || true)
        # Default to 0 when no summary is found; a trailing '|| echo "0"' never
        # fires because the pipeline's status is head's
        BASELINE_PASSING_TESTS=${BASELINE_PASSING_TESTS:-0}
        # Capture baseline coverage if available
        if grep -q '"coverage"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            BASELINE_COVERAGE=$(npm run coverage -- --json 2>/dev/null | jq '.total.lines.pct' 2>/dev/null || echo "0")
        fi
        log "Regression baseline: $BASELINE_PASSING_TESTS passing tests, ${BASELINE_COVERAGE}% coverage"
    fi
}
execute_regression_gate() {
    local story_id="$1"
    # Remember the old baseline so the success message can report it
    local previous_baseline="$BASELINE_PASSING_TESTS"
    log ">>> REGRESSION GATE: $story_id"
    # Get current test count
    CURRENT_TEST_OUTPUT=$(npm test 2>&1) || true
    CURRENT_PASSING_TESTS=$(echo "$CURRENT_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1 || true)
    CURRENT_PASSING_TESTS=${CURRENT_PASSING_TESTS:-0}
    # Check for regression
    if [ "$CURRENT_PASSING_TESTS" -lt "$BASELINE_PASSING_TESTS" ]; then
        log_error "REGRESSION DETECTED: Test count decreased ($BASELINE_PASSING_TESTS -> $CURRENT_PASSING_TESTS)"
        add_metrics_issue "$story_id" "regression" "Test count decreased from $BASELINE_PASSING_TESTS to $CURRENT_PASSING_TESTS"
        return 1
    fi
    # Update baseline for next story
    BASELINE_PASSING_TESTS=$CURRENT_PASSING_TESTS
    log_success "Regression gate passed: $CURRENT_PASSING_TESTS tests passing (was $previous_baseline)"
    return 0
}
Insertion Point: After review passes, before marking story done.
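The count parsing and the coverage comparison can be isolated into small helpers that are testable without running npm. This is a sketch under assumptions: the helper names are hypothetical, and the `N passing` pattern matches Mocha-style summaries, so a project using a runner with a different summary line would need a different pattern.

```shell
#!/usr/bin/env bash
# Extract the first "<N> passing" count from test output; print 0 if absent.
extract_passing_count() {
    local count
    count=$(grep -oE '[0-9]+ passing' <<< "$1" | head -n 1 | grep -oE '[0-9]+') || true
    echo "${count:-0}"
}

# Succeed (exit 0) when current coverage is lower than the baseline.
# awk handles the decimal comparison that [ -lt ] cannot.
coverage_regressed() {
    awk -v base="$1" -v cur="$2" 'BEGIN { exit !(cur + 0 < base + 0) }'
}
```

With these, the gate could fail on either signal, for example `if coverage_regressed "$BASELINE_COVERAGE" "$CURRENT_COVERAGE"; then return 1; fi`, where `CURRENT_COVERAGE` is an assumed variable captured the same way as the baseline.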
3. Pre-Implementation Design Phase (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED
Add a design phase before implementation to catch architectural issues early.
execute_design_phase() {
    local story_file="$1"
    local story_id
    story_id=$(basename "$story_file" .md)
    log ">>> DESIGN PHASE: $story_id"
    local story_contents
    story_contents=$(cat "$story_file")
    local arch_contents=""
    if [ -f "$PROJECT_ROOT/docs/architecture.md" ]; then
        arch_contents=$(cat "$PROJECT_ROOT/docs/architecture.md")
    fi
    local design_prompt="You are a senior developer planning the implementation of a story.
## Your Task
Create an implementation plan for: $story_id
Do NOT write any code yet. Output only your design plan.
## Story
<story>
$story_contents
</story>
## Architecture Reference
<architecture>
$arch_contents
</architecture>
## Required Output
Output your implementation plan in this exact format:
\`\`\`
DESIGN START
files_to_modify:
- path: <file path>
action: create|modify
purpose: <why this file>
patterns_to_use:
- <pattern name>: <how it will be applied>
dependencies:
- <package>: <installed|needs-install>
acceptance_criteria_mapping:
- AC1: <which files/functions will implement this>
- AC2: <which files/functions will implement this>
risks:
- <potential issue and mitigation>
estimated_test_files:
- <test file path>: <what it will test>
DESIGN END
\`\`\`
Be specific. This plan will be validated against architecture before implementation begins."
    local result
    result=$(claude --dangerously-skip-permissions -p "$design_prompt" 2>&1) || true
    echo "$result" >> "$LOG_FILE"
    # Extract and save design for later phases
    LAST_DESIGN=$(echo "$result" | sed -n '/DESIGN START/,/DESIGN END/p')
    if [ -n "$LAST_DESIGN" ]; then
        # Save to decision log for context in later phases
        echo "## Design: $story_id - $(date)" >> "$DECISION_LOG"
        echo "$LAST_DESIGN" >> "$DECISION_LOG"
        log_success "Design phase complete: $story_id"
        return 0
    else
        log_error "Design phase did not produce valid output"
        return 1
    fi
}
Insertion Point: Before execute_dev_phase(). Pass the design to the dev prompt.
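Passing the design to the dev prompt could look like the sketch below. The function names and the `## Approved Design` section title are hypothetical; only the `DESIGN START` and `DESIGN END` markers come from the design prompt format.

```shell
#!/usr/bin/env bash
# Print only the lines between the DESIGN START and DESIGN END markers (inclusive).
extract_design() {
    sed -n '/DESIGN START/,/DESIGN END/p' <<< "$1"
}

# Append the design to a base dev prompt; print the prompt unchanged when
# no design is available (for example when the design phase was skipped).
with_design_context() {
    local base_prompt="$1" design="$2"
    if [ -z "$design" ]; then
        printf '%s\n' "$base_prompt"
        return 0
    fi
    printf '%s\n\n## Approved Design\n<design>\n%s\n</design>\n' "$base_prompt" "$design"
}
```

Leaving the prompt untouched for an empty design keeps the dev phase usable when the design phase is skipped.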
4. Real Test Output in Fix Loops (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED
Capture and pass real test output to fix phases instead of just AI-extracted findings.
# In execute_review_phase or after static analysis gate
capture_real_failures() {
    # Run tests and capture actual output
    TEST_OUTPUT=$(npm test 2>&1) || true
    # Extract actual failures with context
    ACTUAL_FAILURES=$(echo "$TEST_OUTPUT" | grep -B 2 -A 10 "FAIL\|Error\|AssertionError\|Expected\|Received" || true)
    # Extract lint errors if any
    LINT_OUTPUT=$(npm run lint 2>&1) || true
    LINT_ERRORS=$(echo "$LINT_OUTPUT" | grep -E "error|warning" || true)
    # Extract type errors if any
    TYPE_OUTPUT=$(npm run typecheck 2>&1) || true
    TYPE_ERRORS=$(echo "$TYPE_OUTPUT" | grep -E "error TS|Error:" || true)
}
# Then in execute_fix_phase, add to the prompt:
fix_prompt+="
## Actual Tooling Output
### Test Failures (from npm test)
\`\`\`
$ACTUAL_FAILURES
\`\`\`
### Lint Errors (from npm run lint)
\`\`\`
$LINT_ERRORS
\`\`\`
### Type Errors (from npm run typecheck)
\`\`\`
$TYPE_ERRORS
\`\`\`
Use this ACTUAL output to guide your fixes, not just the review findings.
"
5. Cumulative Decision Log (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED
Maintain a persistent decision log to preserve context across phases.
DECISION_LOG="$SPRINT_ARTIFACTS_DIR/epic-${EPIC_ID}-decisions.md"
init_decision_log() {
mkdir -p "$(dirname "$DECISION_LOG")"
cat > "$DECISION_LOG" << EOF
# Epic $EPIC_ID Decision Log
This file tracks implementation decisions for context continuity.
---
EOF
}
append_to_decision_log() {
local phase="$1"
local story_id="$2"
local content="$3"
cat >> "$DECISION_LOG" << EOF
## $phase: $story_id
**Timestamp:** $(date '+%Y-%m-%d %H:%M:%S')
$content
---
EOF
}
# Then pass to subsequent phases:
"## Previous Implementation Context
The following decisions have been made in this epic:
$(cat "$DECISION_LOG")
Respect these decisions unless you have a specific reason to deviate.
"
6. Test-First Enforcement (HIGH IMPACT, HIGH EFFORT)
Restructure the flow to enforce TDD principles.
Proposed New Flow:
1. DESIGN PHASE (Dev) → Plan implementation approach
2. TEST SPEC PHASE (TEA) → Write test specifications based on ACs
3. TEST IMPL PHASE (TEA) → Implement failing tests
4. VERIFICATION → Confirm tests fail appropriately
5. DEV PHASE (Dev) → Implement to make tests pass
6. STATIC ANALYSIS GATE → Real tooling verification
7. REVIEW PHASE → Adversarial review
8. REGRESSION GATE → Ensure no regressions
This ensures tests actually test requirements rather than implementation details.
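The VERIFICATION step (4) can be approximated with an exit-status check, sketched below with a hypothetical function name. Note the limitation: a non-zero exit shows only that the tests are red, not why, so a syntax or import error would also pass this check. Distinguishing assertion failures from broken tests would require inspecting the runner's output.

```shell
#!/usr/bin/env bash
# Succeed only when the given test command exits non-zero, i.e. the newly
# written tests fail before any implementation exists.
verify_tests_fail_first() {
    if "$@" >/dev/null 2>&1; then
        echo "tests already pass before implementation; they may not exercise the new behavior" >&2
        return 1
    fi
    return 0
}
```

A call such as `verify_tests_fail_first npm test` would sit between the TEST IMPL and DEV phases and block the dev phase when it returns non-zero.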
7. Structured Output Validation (MEDIUM IMPACT, MEDIUM EFFORT)
Replace fragile regex parsing with structured JSON output.
# Add to prompts:
"Output your result as JSON:
\`\`\`json
{
\"status\": \"COMPLETE\" | \"BLOCKED\" | \"FAILED\",
\"story_id\": \"...\",
\"summary\": \"...\",
\"files_changed\": [...],
\"tests_added\": N,
\"decisions\": [{\"what\": \"...\", \"why\": \"...\"}],
\"concerns\": [...]
}
\`\`\`"
# Parse with jq:
result_json=$(echo "$result" | sed -n '/```json/,/```/p' | sed '1d;$d')
status=$(echo "$result_json" | jq -r '.status')
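A slightly more defensive version of this parsing is sketched below. It is an illustration, not the script's implementation: the function names are hypothetical, only the first fenced JSON block is considered, and anything that is missing, malformed, or carries an unexpected status is treated as FAILED. It assumes `jq` is installed.

```shell
#!/usr/bin/env bash
# Print the body of the first fenced json block found in the given text.
extract_json_block() {
    awk '
        !seen && !in_block && /^```json[ \t\r]*$/ { in_block = 1; next }
        in_block && /^```[ \t\r]*$/               { in_block = 0; seen = 1; next }
        in_block                                  { print }
    ' <<< "$1"
}

# Print the status if it is one of the allowed values; otherwise print FAILED.
parse_status() {
    local json status
    json=$(extract_json_block "$1")
    status=$(jq -r '.status // empty' <<< "$json" 2>/dev/null) || status=""
    case "$status" in
        COMPLETE|BLOCKED|FAILED) echo "$status" ;;
        *) echo "FAILED" ;;
    esac
}
```

Because the status is read from a parsed field, prose that merely mentions a completion phrase no longer counts as completion.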
Implementation Priority Matrix
| Priority | Improvement | Impact | Effort | Status | Rationale |
|---|---|---|---|---|---|
| 1 | Static Analysis Gate | HIGH | LOW | ✅ DONE | Catches real errors AI misses |
| 2 | Real Test Output in Fix | MEDIUM | LOW | ✅ DONE | Quick win, better fixes |
| 3 | Decision Log | MEDIUM | LOW | ✅ DONE | Easy context preservation |
| 4 | Regression Gate | HIGH | MEDIUM | ✅ DONE | Prevents silent breakage |
| 5 | Design Phase | HIGH | MEDIUM | ✅ DONE | Catches issues early |
| 6 | Structured JSON Output | MEDIUM | MEDIUM | | Improves reliability |
| 7 | Test-First Flow | HIGH | HIGH | | Fundamental quality improvement |
Proposed Enhanced Flow
Epic Execution Pipeline v2:
SETUP
├── Validate workflows
├── Initialize metrics
├── Initialize decision log
└── Initialize regression baseline
FOR EACH STORY:
├── 1. DESIGN PHASE (NEW)
│ ├── Generate implementation plan
│ ├── Validate against architecture
│ └── Save to decision log
│
├── 2. DEV PHASE
│ ├── Pass design context
│ ├── Implement story
│ └── Stage changes
│
├── 3. STATIC ANALYSIS GATE (NEW)
│ ├── Run npm run typecheck
│ ├── Run npm run lint
│ ├── Run npm run build
│ ├── Run npm test
│ └── Capture real failures
│
├── 4. ARCH COMPLIANCE
│ └── (existing logic)
│
├── 5. CODE REVIEW + FIX LOOP
│ ├── Pass real test output to fix phase
│ └── (existing logic)
│
├── 6. TEST QUALITY
│ └── (existing logic)
│
├── 7. REGRESSION GATE (NEW)
│ ├── Compare test count to baseline
│ ├── Compare coverage to baseline
│ └── Fail if regression detected
│
└── 8. COMMIT
└── (existing logic)
POST-STORIES:
├── Traceability check
├── UAT generation
└── Finalize metrics
Quick Wins (Implement First) ✅ ALL COMPLETE
- Static Analysis Gate - Single most impactful change ✅
- Real Test Output in Fix Loops - Minimal code change, better fixes ✅
- Decision Log - Simple file append, preserves context ✅
These three changes alone would dramatically improve code reliability with minimal refactoring effort.
Additionally implemented:
- Regression Gate - Prevents silent breakage ✅
- Design Phase - Catches architectural issues early ✅
Implementation: All features are modularized in scripts/epic-execute-lib/ with graceful degradation and skip flags (--skip-design, --skip-regression).
Metrics to Track
After implementing these improvements, track:
- False Positive Rate: Stories marked "complete" that have real errors
- Fix Loop Efficiency: Average fix attempts per story (should decrease)
- Regression Rate: Stories that break previous functionality
- Time to First Working Implementation: Should decrease with design phase
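Two of these metrics reduce to simple arithmetic over per-story records. The sketch below assumes a hypothetical CSV format with one line per story, `story_id,fix_attempts,marked_complete,gate_passed`; the real metrics files may be structured differently. It treats a story as a false positive when it was marked complete but the real tooling gate did not pass.

```shell
#!/usr/bin/env bash
# Read hypothetical per-story records on stdin:
#   story_id,fix_attempts,marked_complete(yes|no),gate_passed(yes|no)
# and print the average fix attempts and the false positive rate.
summarize_metrics() {
    awk -F, '
        { attempts += $2; stories++ }
        $3 == "yes"               { complete++ }
        $3 == "yes" && $4 == "no" { false_pos++ }
        END {
            if (stories == 0) { print "no stories recorded"; exit }
            printf "avg_fix_attempts=%.2f\n", attempts / stories
            printf "false_positive_rate=%.2f\n", (complete ? false_pos / complete : 0)
        }
    '
}
```

Tracking both numbers per epic makes it possible to see whether fix loops are getting shorter and whether "complete" is becoming more trustworthy over time.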
Conclusion
The current epic-execute script has a solid foundation with multi-phase validation and self-healing fix loops. However, it fundamentally relies on AI self-reporting for validation. Adding real tooling verification (Static Analysis Gate) would catch the majority of "AI said it works but it doesn't" issues.
The recommended implementation order prioritizes high-impact, low-effort changes first, building toward a comprehensive TDD-based flow that would fundamentally improve code quality.