BMAD Epic-Execute Improvements Analysis v2


Date: 2026-01-26
Analyzed Script: scripts/epic-execute.sh
Purpose: Improve the performance and reliability of code generated by the epic-execute automation

Executive Summary

The current epic-execute script orchestrates AI agents to implement stories through multiple phases (dev, architecture compliance, code review, test quality, traceability, UAT generation). While the multi-phase approach with self-healing fix loops is sound, there are fundamental improvements that would significantly enhance the reliability of generated code.

Core Problem: The system relies on AI self-reporting ("tests pass", "build succeeds") rather than actual verification. This creates systematic blind spots where AI can hallucinate success.

Current Flow Analysis

Story Execution Pipeline:
1. DEV PHASE         → AI implements story
2. ARCH COMPLIANCE   → AI validates against architecture.md
3. CODE REVIEW       → AI reviews in fresh context (adversarial)
4. TEST QUALITY      → AI reviews tests for quality patterns
5. TRACEABILITY      → AI maps acceptance criteria to tests
6. UAT GENERATION    → AI creates manual test document

Each phase runs in isolated context (good for adversarial review) but relies entirely on AI judgment for validation.

Identified Weaknesses

1. No Real Tooling Verification

The script trusts AI to report:

"Tests pass" - but did they actually run?
"Build succeeds" - no actual build command executed
"No type errors" - no actual type checking

Impact: AI can hallucinate success. Code that doesn't compile can be marked as "complete".

2. No Baseline/Regression Testing

Each story is executed in isolation. There's no verification that:

Story N doesn't break Story N-1
Overall test count doesn't decrease
Coverage doesn't regress

Impact: Later stories can silently break earlier work.

3. Tests Written After Implementation (Not TDD)

Current flow: Implement → Write tests → Review

Impact:

Tests often test implementation, not requirements
Tests may be written to pass, not to verify behavior
Missing edge cases because dev already knows the code

4. No Pre-Implementation Design Review

The dev phase jumps straight into coding. For complex stories, this leads to:

Architectural decisions made implicitly during coding
Refactoring when initial approach doesn't work
Inconsistent patterns within the epic

5. AI Marking Its Own Homework

The same system (Claude):

Writes the code
Reviews the code
Says "tests pass"

Impact: Systematic blind spots get reinforced across phases.

6. Context Loss Between Phases

Each phase runs in fresh context. Good for adversarial review, but:

Fix phase doesn't know WHY dev made certain decisions
Review doesn't know about trade-offs considered
Decisions get made, lost, and remade differently

7. No Incremental Validation

Complex stories are implemented all at once. If something fails at the end, the entire implementation might need to be redone.

8. Completion Signal Parsing is Fragile

The script uses a regex to find "IMPLEMENTATION COMPLETE" in the output. If the AI emits this string in another context (explaining what it will output, or quoting its instructions), the match is a false positive and the phase is treated as complete.
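
To illustrate the failure mode, the sketch below compares a substring match with a stricter check that accepts the marker only when it is the entire last non-empty line. The function names and the last-line convention are assumptions made for illustration; they are not the script's actual parsing logic.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; epic-execute.sh's real parsing may differ.

# Naive check: succeeds if the marker appears anywhere in the output.
naive_is_complete() {
    printf '%s\n' "$1" | grep -q "IMPLEMENTATION COMPLETE"
}

# Stricter check: the marker must be the entire last non-empty line.
strict_is_complete() {
    local last_line
    last_line=$(printf '%s\n' "$1" | grep -v '^[[:space:]]*$' | tail -n 1)
    [ "$last_line" = "IMPLEMENTATION COMPLETE" ]
}

# An output that merely quotes the instruction.
quoted='I will print "IMPLEMENTATION COMPLETE" once the tests pass.
Still working on the failing test.'

naive_is_complete "$quoted" && echo "naive: complete (false positive)"
strict_is_complete "$quoted" || echo "strict: not complete"
```

A stricter text match narrows the problem but does not remove it, which is why a structured result is the more robust direction.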

9. Missing Dependency Validation

Before implementing a story, nothing checks that:

Required npm packages are installed
Required services are running
Prerequisite stories are complete
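
A preflight check along these lines could cover the package and prerequisite-story cases. This is a minimal sketch under stated assumptions: the function name, the `node_modules` lookup, and the `.done` marker-file convention are all hypothetical, and service health checks are left out because they depend on the project.

```shell
#!/usr/bin/env bash
# Hypothetical preflight check; names and layout are illustrative assumptions.

# Usage: check_story_prerequisites <project_root> <done_dir> <pkgs> <stories>
# <pkgs> and <stories> are space-separated lists. A story counts as complete
# when <done_dir>/<story>.done exists (an assumed marker-file convention).
check_story_prerequisites() {
    local project_root="$1" done_dir="$2" pkgs="$3" stories="$4"
    local missing=0 item

    for item in $pkgs; do
        if [ ! -d "$project_root/node_modules/$item" ]; then
            echo "missing package: $item" >&2
            missing=$((missing + 1))
        fi
    done

    for item in $stories; do
        if [ ! -f "$done_dir/$item.done" ]; then
            echo "prerequisite story not complete: $item" >&2
            missing=$((missing + 1))
        fi
    done

    # Succeed only when nothing was reported missing.
    [ "$missing" -eq 0 ]
}
```

Running such a check before the dev phase would turn a confusing mid-implementation failure into an early, explicit one.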

Recommended Improvements

1. Static Analysis Gate (HIGH IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Add a real tooling validation step that runs actual commands between dev and review phases.

execute_static_analysis_gate() {
    local story_id="$1"
    local failures=0

    log ">>> STATIC ANALYSIS GATE: $story_id"

    # Detect project type and run appropriate checks
    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # TypeScript/JavaScript project

        # 1. Type checking (catches type errors AI might miss)
        if grep -q '"typecheck"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running type check..."
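            # pipefail makes the pipeline report npm's exit status rather than tee's
            if ! ( set -o pipefail; npm run typecheck 2>&1 | tee -a "$LOG_FILE" ); then
                log_error "Type check failed"
                failures=$((failures + 1))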
            fi
        fi

        # 2. Linting (catches code style/quality issues)
        if grep -q '"lint"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running lint..."
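            # pipefail makes the pipeline report npm's exit status rather than tee's
            if ! ( set -o pipefail; npm run lint 2>&1 | tee -a "$LOG_FILE" ); then
                log_error "Lint failed"
                failures=$((failures + 1))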
            fi
        fi

        # 3. Build (catches compilation errors)
        if grep -q '"build"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running build..."
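            # pipefail makes the pipeline report npm's exit status rather than tee's
            if ! ( set -o pipefail; npm run build 2>&1 | tee -a "$LOG_FILE" ); then
                log_error "Build failed"
                failures=$((failures + 1))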
            fi
        fi

        # 4. Tests (catches actual test failures)
        log "Running tests..."
        # Use npm's exit status; grepping for "Error" or "failed" can also match passing output
        local test_status=0
        TEST_OUTPUT=$(npm test 2>&1) || test_status=$?
        echo "$TEST_OUTPUT" >> "$LOG_FILE"

        if [ "$test_status" -ne 0 ]; then
            log_error "Tests failed"
            # Capture actual failures for fix phase
            ACTUAL_TEST_FAILURES=$(echo "$TEST_OUTPUT" | grep -E -A 10 "FAIL|Error" || true)
            failures=$((failures + 1))
        fi
    fi

    if [ $failures -gt 0 ]; then
        log_error "Static analysis gate failed with $failures issue(s)"
        return 1
    fi

    log_success "Static analysis gate passed"
    return 0
}

Insertion Point: After execute_dev_phase(), before arch compliance.
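
The call order at that insertion point can be sketched as follows. `run_story` is a hypothetical wrapper, `execute_arch_compliance_phase` is an assumed name for the existing arch compliance step, and the three stub functions exist only so the sketch runs on its own.

```shell
#!/usr/bin/env bash
# Call-order sketch. run_story is a hypothetical wrapper and
# execute_arch_compliance_phase is an assumed name; the three stubs below
# stand in for the real phases so the sketch runs on its own.
execute_dev_phase()             { echo "dev: $1"; }
execute_static_analysis_gate()  { echo "gate: $1"; }
execute_arch_compliance_phase() { echo "arch: $1"; }

run_story() {
    local story_file="$1"
    local story_id
    story_id=$(basename "$story_file" .md)

    execute_dev_phase "$story_file" || return 1

    # Real tool exit codes decide whether any AI review runs at all.
    if ! execute_static_analysis_gate "$story_id"; then
        echo "static analysis gate failed for $story_id" >&2
        return 1
    fi

    execute_arch_compliance_phase "$story_file"
}

run_story "stories/1-1-login.md"
```

In the real script a gate failure would more likely feed a fix loop than abort the story; the early return here only keeps the ordering visible.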

2. Regression Test Gate (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Track test baseline and verify no regressions after each story.

# Initialize at epic start
init_regression_baseline() {
    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # Capture baseline test count
        BASELINE_TEST_OUTPUT=$(npm test -- --json 2>/dev/null || npm test 2>&1)
        # Default with ${VAR:-0}: without pipefail a trailing `|| echo "0"` does not fire,
        # because the pipeline's status is head's, so a missing match would leave the value empty
        BASELINE_PASSING_TESTS=$(echo "$BASELINE_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1 || true)
        BASELINE_PASSING_TESTS=${BASELINE_PASSING_TESTS:-0}

        # Capture baseline coverage if available
        if grep -q '"coverage"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            BASELINE_COVERAGE=$(npm run coverage -- --json 2>/dev/null | jq '.total.lines.pct' 2>/dev/null || echo "0")
        fi

        log "Regression baseline: $BASELINE_PASSING_TESTS passing tests, ${BASELINE_COVERAGE:-0}% coverage"
    fi
}

execute_regression_gate() {
    local story_id="$1"

    log ">>> REGRESSION GATE: $story_id"

    # Get current test count
    CURRENT_TEST_OUTPUT=$(npm test -- --json 2>/dev/null || npm test 2>&1)
    CURRENT_PASSING_TESTS=$(echo "$CURRENT_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1 || true)
    # Default to 0 so the numeric comparison below never sees an empty string
    CURRENT_PASSING_TESTS=${CURRENT_PASSING_TESTS:-0}

    # Check for regression
    if [ "$CURRENT_PASSING_TESTS" -lt "${BASELINE_PASSING_TESTS:-0}" ]; then
        log_error "REGRESSION DETECTED: Test count decreased ($BASELINE_PASSING_TESTS -> $CURRENT_PASSING_TESTS)"
        add_metrics_issue "$story_id" "regression" "Test count decreased from $BASELINE_PASSING_TESTS to $CURRENT_PASSING_TESTS"
        return 1
    fi

    # Log before updating so "was" reports the previous baseline
    log_success "Regression gate passed: $CURRENT_PASSING_TESTS tests passing (was $BASELINE_PASSING_TESTS)"
    # Update baseline for next story
    BASELINE_PASSING_TESTS=$CURRENT_PASSING_TESTS
    return 0
}

Insertion Point: After review passes, before marking story done.
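
The `[0-9]+ passing` pattern matches Mocha-style text summaries, so it finds nothing when the runner honors `--json` and emits a machine-readable report instead. The sketch below assumes a Jest-style report with a top-level `numPassedTests` field; the helper name is hypothetical, and other runners would need a different field or parser.

```shell
#!/usr/bin/env bash
# Assumes Jest-style JSON (a top-level "numPassedTests" field). Other runners
# use different formats, so treat this as illustrative.

# Prints the passing-test count found in the JSON text, or 0 if absent.
count_passing_from_json() {
    local json="$1" count
    count=$(printf '%s' "$json" \
        | grep -oE '"numPassedTests"[[:space:]]*:[[:space:]]*[0-9]+' \
        | grep -oE '[0-9]+$' \
        | head -n 1) || true
    echo "${count:-0}"
}

sample='{"numFailedTests":1,"numPassedTests":41,"numTotalTests":42,"success":false}'
count_passing_from_json "$sample"      # prints 41
count_passing_from_json "not json"     # prints 0
```

Where `jq` is available it is the more robust choice for reading the field; plain `grep` is used here only to keep the sketch dependency-free.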

3. Pre-Implementation Design Phase (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Add a design phase before implementation to catch architectural issues early.

execute_design_phase() {
    local story_file="$1"
    local story_id=$(basename "$story_file" .md)

    log ">>> DESIGN PHASE: $story_id"

    local story_contents=$(cat "$story_file")
    local arch_contents=""
    if [ -f "$PROJECT_ROOT/docs/architecture.md" ]; then
        arch_contents=$(cat "$PROJECT_ROOT/docs/architecture.md")
    fi

    local design_prompt="You are a senior developer planning the implementation of a story.

## Your Task

Create an implementation plan for: $story_id

Do NOT write any code yet. Output only your design plan.

## Story

<story>
$story_contents
</story>

## Architecture Reference

<architecture>
$arch_contents
</architecture>

## Required Output

Output your implementation plan in this exact format:

\`\`\`
DESIGN START
files_to_modify:
  - path: <file path>
    action: create|modify
    purpose: <why this file>

patterns_to_use:
  - <pattern name>: <how it will be applied>

dependencies:
  - <package>: <installed|needs-install>

acceptance_criteria_mapping:
  - AC1: <which files/functions will implement this>
  - AC2: <which files/functions will implement this>

risks:
  - <potential issue and mitigation>

estimated_test_files:
  - <test file path>: <what it will test>
DESIGN END
\`\`\`

Be specific. This plan will be validated against architecture before implementation begins."

    local result
    result=$(claude --dangerously-skip-permissions -p "$design_prompt" 2>&1) || true

    echo "$result" >> "$LOG_FILE"

    # Extract and save design for later phases
    LAST_DESIGN=$(echo "$result" | sed -n '/DESIGN START/,/DESIGN END/p')

    if [ -n "$LAST_DESIGN" ]; then
        # Save to decision log for context in later phases
        echo "## Design: $story_id - $(date)" >> "$DECISION_LOG"
        echo "$LAST_DESIGN" >> "$DECISION_LOG"
        log_success "Design phase complete: $story_id"
        return 0
    else
        log_error "Design phase did not produce valid output"
        return 1
    fi
}

Insertion Point: Before execute_dev_phase(). Pass the design to the dev prompt.
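
One way to pass the design along is to wrap the extracted block in its own prompt section. In the sketch below, `extract_design` mirrors the `sed` range used in the design phase, while `build_design_context`, the section heading, and the `<design>` tag are hypothetical choices rather than the script's actual prompt format.

```shell
#!/usr/bin/env bash
# Hypothetical helpers: build_design_context is not part of the source.

# Extracts the DESIGN START..DESIGN END block (markers included) from raw output.
extract_design() {
    printf '%s\n' "$1" | sed -n '/DESIGN START/,/DESIGN END/p'
}

# Wraps a design in a prompt section; prints nothing when the design is empty.
build_design_context() {
    local design="$1"
    [ -n "$design" ] || return 0
    printf '## Approved Design\n\n<design>\n%s\n</design>\n' "$design"
}

raw_output='Here is my plan.
DESIGN START
files_to_modify:
  - path: src/auth.ts
DESIGN END
Let me know if you want changes.'

LAST_DESIGN=$(extract_design "$raw_output")
dev_prompt="Implement the story."
dev_prompt+=$'\n\n'"$(build_design_context "$LAST_DESIGN")"
printf '%s\n' "$dev_prompt"
```

Only the text between the markers reaches the dev prompt, so conversational preamble from the design response is dropped.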

4. Real Test Output in Fix Loops (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Capture and pass real test output to fix phases instead of just AI-extracted findings.

# In execute_review_phase or after static analysis gate
capture_real_failures() {
    # Run tests and capture actual output
    TEST_OUTPUT=$(npm test 2>&1) || true

    # Extract actual failures with context
    ACTUAL_FAILURES=$(echo "$TEST_OUTPUT" | grep -E -B 2 -A 10 "FAIL|Error|AssertionError|Expected|Received" || true)

    # Extract lint errors if any
    LINT_OUTPUT=$(npm run lint 2>&1) || true
    LINT_ERRORS=$(echo "$LINT_OUTPUT" | grep -E "error|warning" || true)

    # Extract type errors if any
    TYPE_OUTPUT=$(npm run typecheck 2>&1) || true
    TYPE_ERRORS=$(echo "$TYPE_OUTPUT" | grep -E "error TS|Error:" || true)
}

# Then in execute_fix_phase, add to the prompt:
fix_prompt+="
## Actual Tooling Output

### Test Failures (from npm test)
\`\`\`
$ACTUAL_FAILURES
\`\`\`

### Lint Errors (from npm run lint)
\`\`\`
$LINT_ERRORS
\`\`\`

### Type Errors (from npm run typecheck)
\`\`\`
$TYPE_ERRORS
\`\`\`

Use this ACTUAL output to guide your fixes, not just the review findings.
"

5. Cumulative Decision Log (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Maintain a persistent decision log to preserve context across phases.

DECISION_LOG="$SPRINT_ARTIFACTS_DIR/epic-${EPIC_ID}-decisions.md"

init_decision_log() {
    mkdir -p "$(dirname "$DECISION_LOG")"
    cat > "$DECISION_LOG" << EOF
# Epic $EPIC_ID Decision Log

This file tracks implementation decisions for context continuity.

---
EOF
}

append_to_decision_log() {
    local phase="$1"
    local story_id="$2"
    local content="$3"

    cat >> "$DECISION_LOG" << EOF

## $phase: $story_id
**Timestamp:** $(date '+%Y-%m-%d %H:%M:%S')

$content

---
EOF
}

# Then pass to subsequent phases:
"## Previous Implementation Context

The following decisions have been made in this epic:

$(cat "$DECISION_LOG")

Respect these decisions unless you have a specific reason to deviate.
"

6. Test-First Enforcement (HIGH IMPACT, HIGH EFFORT) ✅ IMPLEMENTED

Restructure the flow to enforce TDD principles.

Implemented Flow:

1. DESIGN PHASE (Dev)       → Plan implementation approach
2. TEST SPEC PHASE (TEA)    → Write test specifications based on ACs
3. TEST IMPL PHASE (TEA)    → Implement failing tests
4. VERIFICATION             → Confirm tests fail appropriately
5. DEV PHASE (Dev)          → Implement to make tests pass
6. STATIC ANALYSIS GATE     → Real tooling verification
7. REVIEW PHASE             → Adversarial review
8. REGRESSION GATE          → Ensure no regressions

This ensures tests actually test requirements rather than implementation details.

Implementation: Module at scripts/epic-execute-lib/tdd-flow.sh with functions:

execute_test_spec_phase() - Generates BDD test specifications from acceptance criteria
execute_test_impl_phase() - Creates failing tests from specifications
execute_test_verification_phase() - Verifies tests fail correctly before implementation
build_test_spec_context_for_dev() - Provides test context to dev phase

Skip flags: --skip-tdd, --skip-test-spec, --skip-test-impl
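
The verification step (confirming that new tests fail before any implementation exists) reduces to checking the test command's exit status. The helper below is a hypothetical sketch of that idea; the real `execute_test_verification_phase()` is not shown in this document and may work differently.

```shell
#!/usr/bin/env bash
# Hypothetical sketch; the real execute_test_verification_phase may differ.

# Usage: verify_tests_fail_first <command> [args...]
# Returns 0 when the test command fails (the expected "red" state), and 1
# when it passes, which suggests the tests do not exercise unwritten code.
verify_tests_fail_first() {
    if "$@" >/dev/null 2>&1; then
        echo "tests passed before implementation; they may be vacuous" >&2
        return 1
    fi
    return 0
}

# Example call (not executed here):
#   verify_tests_fail_first npm test
```

An exit status alone cannot tell an assertion failure from a test file that does not even load, so a fuller check would also inspect the output for the reason the run failed.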

7. Structured Output Validation (MEDIUM IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Replace fragile regex parsing with structured JSON output.

Implementation: Module at scripts/epic-execute-lib/json-output.sh with functions:

extract_json_result() - Parse JSON from Claude output
get_result_status() - Extract status field
get_result_files() - Extract files_changed array
get_result_issues() - Extract issues for fix loops
check_phase_completion() - Unified completion detection with JSON + text fallback

# Prompts now request JSON output:
"Output your result as JSON:
\`\`\`json
{
  \"status\": \"COMPLETE\" | \"BLOCKED\" | \"FAILED\",
  \"story_id\": \"...\",
  \"summary\": \"...\",
  \"files_changed\": [...],
  \"tests_added\": N,
  \"decisions\": [{\"what\": \"...\", \"why\": \"...\"}],
  \"concerns\": [...]
}
\`\`\`"

# Parsing with fallback:
check_phase_completion "$result" "dev" "$story_id"
# Returns: 0 (complete), 1 (failed), 2 (unclear)

Skip flag: --legacy-output (disables JSON parsing, uses text-only detection)
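
The JSON-first, text-fallback idea can be sketched as below. The `sketch_*` functions are hypothetical stand-ins, not the code in json-output.sh, and mapping `BLOCKED` to return code 1 is an assumption. The fence string is built from single backticks so the example can sit inside a fenced block.

```shell
#!/usr/bin/env bash
# Illustrative only: sketch_* names are hypothetical and are not the functions
# in json-output.sh, whose implementation is not shown in this document.

# Prints the body of the first json-fenced block in the given text.
sketch_extract_json() {
    local tick='`'
    local fence="$tick$tick$tick"
    printf '%s\n' "$1" | awk -v fence_open="${fence}json" -v fence_close="$fence" '
        !done && !inside && $0 == fence_open { inside = 1; next }
        inside && $0 == fence_close          { inside = 0; done = 1; next }
        inside                               { print }
    '
}

# Returns 0 (complete), 1 (failed or blocked), 2 (unclear).
sketch_check_completion() {
    local json status
    json=$(sketch_extract_json "$1")
    status=$(printf '%s\n' "$json" \
        | sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([A-Z]*\)".*/\1/p' \
        | head -n 1)
    case "$status" in
        COMPLETE)       return 0 ;;
        FAILED|BLOCKED) return 1 ;;
    esac
    # Text fallback: accept the legacy marker only as a whole line.
    if printf '%s\n' "$1" | grep -qx 'IMPLEMENTATION COMPLETE'; then
        return 0
    fi
    return 2
}
```

Returning a distinct code for the unclear case lets the caller retry or escalate instead of guessing.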

Implementation Priority Matrix

Priority	Improvement	Impact	Effort	Status	Rationale
1	Static Analysis Gate	HIGH	LOW	✅ DONE	Catches real errors AI misses
2	Real Test Output in Fix	MEDIUM	LOW	✅ DONE	Quick win, better fixes
3	Decision Log	MEDIUM	LOW	✅ DONE	Easy context preservation
4	Regression Gate	HIGH	MEDIUM	✅ DONE	Prevents silent breakage
5	Design Phase	HIGH	MEDIUM	✅ DONE	Catches issues early
6	Structured JSON Output	MEDIUM	MEDIUM	✅ DONE	Improves reliability
7	Test-First Flow	HIGH	HIGH	✅ DONE	Fundamental quality improvement

Proposed Enhanced Flow

Epic Execution Pipeline v2:

SETUP
├── Validate workflows
├── Initialize metrics
├── Initialize decision log
└── Initialize regression baseline

FOR EACH STORY:
├── 1. DESIGN PHASE (NEW)
│   ├── Generate implementation plan
│   ├── Validate against architecture
│   └── Save to decision log
│
├── 2. DEV PHASE
│   ├── Pass design context
│   ├── Implement story
│   └── Stage changes
│
├── 3. STATIC ANALYSIS GATE (NEW)
│   ├── Run npm run typecheck
│   ├── Run npm run lint
│   ├── Run npm run build
│   ├── Run npm test
│   └── Capture real failures
│
├── 4. ARCH COMPLIANCE
│   └── (existing logic)
│
├── 5. CODE REVIEW + FIX LOOP
│   ├── Pass real test output to fix phase
│   └── (existing logic)
│
├── 6. TEST QUALITY
│   └── (existing logic)
│
├── 7. REGRESSION GATE (NEW)
│   ├── Compare test count to baseline
│   ├── Compare coverage to baseline
│   └── Fail if regression detected
│
└── 8. COMMIT
    └── (existing logic)

POST-STORIES:
├── Traceability check
├── UAT generation
└── Finalize metrics

Quick Wins (Implement First) ✅ ALL COMPLETE

1. Static Analysis Gate - Single most impactful change ✅
2. Real Test Output in Fix Loops - Minimal code change, better fixes ✅
3. Decision Log - Simple file append, preserves context ✅

These three changes alone would dramatically improve code reliability with minimal refactoring effort.

Additionally implemented:

4. Regression Gate - Prevents silent breakage ✅
5. Design Phase - Catches architectural issues early ✅
6. Structured JSON Output - Reliable completion signal parsing ✅
7. Test-First Flow - TDD workflow with test specs before implementation ✅

Implementation: All features are modularized in scripts/epic-execute-lib/ with graceful degradation and skip flags:

--skip-design - Skip pre-implementation design phase
--skip-regression - Skip regression test gate
--skip-tdd - Skip test-first development phases
--skip-test-spec - Skip test specification phase only
--skip-test-impl - Skip test implementation phase only
--legacy-output - Use legacy text-based output parsing (no JSON)
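
Flag handling for these options might look like the following sketch. The variable names are assumptions, as is treating `--skip-tdd` as shorthand for both narrower TDD flags; the actual script may parse its arguments differently.

```shell
#!/usr/bin/env bash
# Sketch of flag handling. Variable names are assumptions, and treating
# --skip-tdd as implying both narrower TDD flags is also an assumption.

SKIP_DESIGN=false
SKIP_REGRESSION=false
SKIP_TEST_SPEC=false
SKIP_TEST_IMPL=false
LEGACY_OUTPUT=false

parse_skip_flags() {
    local arg
    for arg in "$@"; do
        case "$arg" in
            --skip-design)     SKIP_DESIGN=true ;;
            --skip-regression) SKIP_REGRESSION=true ;;
            --skip-tdd)        SKIP_TEST_SPEC=true; SKIP_TEST_IMPL=true ;;
            --skip-test-spec)  SKIP_TEST_SPEC=true ;;
            --skip-test-impl)  SKIP_TEST_IMPL=true ;;
            --legacy-output)   LEGACY_OUTPUT=true ;;
            *) echo "unknown flag: $arg" >&2; return 1 ;;
        esac
    done
}
```

Each phase can then guard itself with a test such as `[ "$SKIP_DESIGN" = false ]`, which keeps the default behavior when no flags are passed.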

Metrics to Track

After implementing these improvements, track:

False Positive Rate: Stories marked "complete" that have real errors
Fix Loop Efficiency: Average fix attempts per story (should decrease)
Regression Rate: Stories that break previous functionality
Time to First Working Implementation: Should decrease with design phase
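
Two of these metrics are simple averages over per-story records. The sketch below assumes a hypothetical three-column input format (story ID, fix attempts, and a 0/1 flag for "marked complete but had real errors") and made-up sample rows; neither reflects the script's actual metrics file.

```shell
#!/usr/bin/env bash
# Input lines: "<story_id> <fix_attempts> <had_real_errors:0|1>".
# This format and the sample rows are assumptions for illustration.

summarize_metrics() {
    awk '
        NF == 3 { n++; fixes += $2; fp += $3 }
        END {
            if (n == 0) { print "stories=0"; exit }
            printf "stories=%d avg_fix_attempts=%.2f false_positive_rate=%.2f\n",
                   n, fixes / n, fp / n
        }
    '
}

printf '%s\n' "1-1 2 0" "1-2 0 1" "1-3 1 0" "1-4 1 0" | summarize_metrics
# prints: stories=4 avg_fix_attempts=1.00 false_positive_rate=0.25
```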

Conclusion

The current epic-execute script has a solid foundation with multi-phase validation and self-healing fix loops. However, it fundamentally relies on AI self-reporting for validation. Adding real tooling verification (Static Analysis Gate) would catch the majority of "AI said it works but it doesn't" issues.

The recommended implementation order prioritizes high-impact, low-effort changes first, building toward a comprehensive TDD-based flow that would fundamentally improve code quality.
