BMAD-METHOD/docs/bmad_improvements_v2.md

# BMAD Epic-Execute Improvements Analysis v2

**Date:** 2026-01-26
**Analyzed Script:** `scripts/epic-execute.sh`
**Purpose:** Improve the performance and reliability of code generated by the epic-execute automation

---

## Executive Summary

The current epic-execute script orchestrates AI agents to implement stories through multiple phases (dev, architecture compliance, code review, test quality, traceability, UAT generation). While the multi-phase approach with self-healing fix loops is sound, there are fundamental improvements that would significantly enhance the reliability of generated code.

**Core Problem:** The system relies on AI self-reporting ("tests pass", "build succeeds") rather than actual verification. This creates systematic blind spots where AI can hallucinate success.

---

## Current Flow Analysis

```
Story Execution Pipeline:
1. DEV PHASE         → AI implements story
2. ARCH COMPLIANCE   → AI validates against architecture.md
3. CODE REVIEW       → AI reviews in fresh context (adversarial)
4. TEST QUALITY      → AI reviews tests for quality patterns
5. TRACEABILITY      → AI maps acceptance criteria to tests
6. UAT GENERATION    → AI creates manual test document
```

Each phase runs in isolated context (good for adversarial review) but relies entirely on AI judgment for validation.

---

## Identified Weaknesses

### 1. No Real Tooling Verification

The script trusts AI to report:
- "Tests pass" - but did they actually run?
- "Build succeeds" - no actual build command executed
- "No type errors" - no actual type checking

**Impact:** AI can hallucinate success. Code that doesn't compile can be marked as "complete".

### 2. No Baseline/Regression Testing

Each story is executed in isolation. There's no verification that:
- Story N doesn't break Story N-1
- Overall test count doesn't decrease
- Coverage doesn't regress

**Impact:** Later stories can silently break earlier work.

### 3. Tests Written After Implementation (Not TDD)

Current flow: Implement → Write tests → Review

**Impact:**
- Tests often test implementation, not requirements
- Tests may be written to pass, not to verify behavior
- Missing edge cases because dev already knows the code

### 4. No Pre-Implementation Design Review

The dev phase jumps straight into coding. For complex stories, this leads to:
- Architectural decisions made implicitly during coding
- Refactoring when initial approach doesn't work
- Inconsistent patterns within the epic

### 5. AI Marking Its Own Homework

The same system (Claude) both:
- Writes the code
- Reviews the code
- Says "tests pass"

**Impact:** Systematic blind spots get reinforced across phases.

### 6. Context Loss Between Phases

Each phase runs in fresh context. Good for adversarial review, but:
- Fix phase doesn't know WHY dev made certain decisions
- Review doesn't know about trade-offs considered
- Decisions get made, lost, and remade differently

### 7. No Incremental Validation

Complex stories are implemented all-at-once. If something fails at the end, the entire implementation might need redoing.

### 8. Completion Signal Parsing is Fragile

The script uses regex to find "IMPLEMENTATION COMPLETE" in output. If AI outputs this string in a different context (explaining what to output, quoting instructions), it triggers false positive.

### 9. Missing Dependency Validation

Before implementing a story, no check that:
- Required npm packages are installed
- Required services are running
- Prerequisite stories are complete

---

## Recommended Improvements

### 1. Static Analysis Gate (HIGH IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Add a real tooling validation step that runs actual commands between dev and review phases.

```bash
execute_static_analysis_gate() {
    local story_id="$1"
    local failures=0

    log ">>> STATIC ANALYSIS GATE: $story_id"

    # Detect project type and run appropriate checks
    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # TypeScript/JavaScript project

        # 1. Type checking (catches type errors AI might miss)
        if grep -q '"typecheck"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running type check..."
            if ! npm run typecheck 2>&1 | tee -a "$LOG_FILE"; then
                log_error "Type check failed"
                ((failures++))
            fi
        fi

        # 2. Linting (catches code style/quality issues)
        if grep -q '"lint"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running lint..."
            if ! npm run lint 2>&1 | tee -a "$LOG_FILE"; then
                log_error "Lint failed"
                ((failures++))
            fi
        fi

        # 3. Build (catches compilation errors)
        if grep -q '"build"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running build..."
            if ! npm run build 2>&1 | tee -a "$LOG_FILE"; then
                log_error "Build failed"
                ((failures++))
            fi
        fi

        # 4. Tests (catches actual test failures)
        log "Running tests..."
        TEST_OUTPUT=$(npm test 2>&1) || true
        echo "$TEST_OUTPUT" >> "$LOG_FILE"

        if echo "$TEST_OUTPUT" | grep -qE "failed|FAIL|Error"; then
            log_error "Tests failed"
            # Capture actual failures for fix phase
            ACTUAL_TEST_FAILURES=$(echo "$TEST_OUTPUT" | grep -A 10 "FAIL\|Error" || true)
            ((failures++))
        fi
    fi

    if [ $failures -gt 0 ]; then
        log_error "Static analysis gate failed with $failures issue(s)"
        return 1
    fi

    log_success "Static analysis gate passed"
    return 0
}
```

**Insertion Point:** After `execute_dev_phase()`, before arch compliance.

---

### 2. Regression Test Gate (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Track test baseline and verify no regressions after each story.

```bash
# Initialize at epic start
init_regression_baseline() {
    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # Capture baseline test count
        BASELINE_TEST_OUTPUT=$(npm test -- --json 2>/dev/null || npm test 2>&1)
        BASELINE_PASSING_TESTS=$(echo "$BASELINE_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1 || echo "0")

        # Capture baseline coverage if available
        if grep -q '"coverage"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            BASELINE_COVERAGE=$(npm run coverage -- --json 2>/dev/null | jq '.total.lines.pct' 2>/dev/null || echo "0")
        fi

        log "Regression baseline: $BASELINE_PASSING_TESTS passing tests, ${BASELINE_COVERAGE}% coverage"
    fi
}

execute_regression_gate() {
    local story_id="$1"

    log ">>> REGRESSION GATE: $story_id"

    # Get current test count
    CURRENT_TEST_OUTPUT=$(npm test -- --json 2>/dev/null || npm test 2>&1)
    CURRENT_PASSING_TESTS=$(echo "$CURRENT_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1 || echo "0")

    # Check for regression
    if [ "$CURRENT_PASSING_TESTS" -lt "$BASELINE_PASSING_TESTS" ]; then
        log_error "REGRESSION DETECTED: Test count decreased ($BASELINE_PASSING_TESTS -> $CURRENT_PASSING_TESTS)"
        add_metrics_issue "$story_id" "regression" "Test count decreased from $BASELINE_PASSING_TESTS to $CURRENT_PASSING_TESTS"
        return 1
    fi

    # Update baseline for next story
    BASELINE_PASSING_TESTS=$CURRENT_PASSING_TESTS
    log_success "Regression gate passed: $CURRENT_PASSING_TESTS tests passing (was $BASELINE_PASSING_TESTS)"
    return 0
}
```

**Insertion Point:** After review passes, before marking story done.

---

### 3. Pre-Implementation Design Phase (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Add a design phase before implementation to catch architectural issues early.

```bash
execute_design_phase() {
    local story_file="$1"
    local story_id=$(basename "$story_file" .md)

    log ">>> DESIGN PHASE: $story_id"

    local story_contents=$(cat "$story_file")
    local arch_contents=""
    if [ -f "$PROJECT_ROOT/docs/architecture.md" ]; then
        arch_contents=$(cat "$PROJECT_ROOT/docs/architecture.md")
    fi

    local design_prompt="You are a senior developer planning the implementation of a story.

## Your Task

Create an implementation plan for: $story_id

Do NOT write any code yet. Output only your design plan.

## Story

<story>
$story_contents
</story>

## Architecture Reference

<architecture>
$arch_contents
</architecture>

## Required Output

Output your implementation plan in this exact format:

\`\`\`
DESIGN START
files_to_modify:
  - path: <file path>
    action: create|modify
    purpose: <why this file>

patterns_to_use:
  - <pattern name>: <how it will be applied>

dependencies:
  - <package>: <installed|needs-install>

acceptance_criteria_mapping:
  - AC1: <which files/functions will implement this>
  - AC2: <which files/functions will implement this>

risks:
  - <potential issue and mitigation>

estimated_test_files:
  - <test file path>: <what it will test>
DESIGN END
\`\`\`

Be specific. This plan will be validated against architecture before implementation begins."

    local result
    result=\$(claude --dangerously-skip-permissions -p "\$design_prompt" 2>&1) || true

    echo "\$result" >> "\$LOG_FILE"

    # Extract and save design for later phases
    LAST_DESIGN=\$(echo "\$result" | sed -n '/DESIGN START/,/DESIGN END/p')

    if [ -n "\$LAST_DESIGN" ]; then
        # Save to decision log for context in later phases
        echo "## Design: \$story_id - \$(date)" >> "\$DECISION_LOG"
        echo "\$LAST_DESIGN" >> "\$DECISION_LOG"
        log_success "Design phase complete: \$story_id"
        return 0
    else
        log_error "Design phase did not produce valid output"
        return 1
    fi
}
```

**Insertion Point:** Before `execute_dev_phase()`. Pass the design to the dev prompt.

---

### 4. Real Test Output in Fix Loops (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Capture and pass real test output to fix phases instead of just AI-extracted findings.

```bash
# In execute_review_phase or after static analysis gate
capture_real_failures() {
    # Run tests and capture actual output
    TEST_OUTPUT=$(npm test 2>&1) || true

    # Extract actual failures with context
    ACTUAL_FAILURES=$(echo "$TEST_OUTPUT" | grep -B 2 -A 10 "FAIL\|Error\|AssertionError\|Expected\|Received" || true)

    # Extract lint errors if any
    LINT_OUTPUT=$(npm run lint 2>&1) || true
    LINT_ERRORS=$(echo "$LINT_OUTPUT" | grep -E "error|warning" || true)

    # Extract type errors if any
    TYPE_OUTPUT=$(npm run typecheck 2>&1) || true
    TYPE_ERRORS=$(echo "$TYPE_OUTPUT" | grep -E "error TS|Error:" || true)
}

# Then in execute_fix_phase, add to the prompt:
fix_prompt+="
## Actual Tooling Output

### Test Failures (from npm test)
\`\`\`
$ACTUAL_FAILURES
\`\`\`

### Lint Errors (from npm run lint)
\`\`\`
$LINT_ERRORS
\`\`\`

### Type Errors (from npm run typecheck)
\`\`\`
$TYPE_ERRORS
\`\`\`

Use this ACTUAL output to guide your fixes, not just the review findings.
"
```

---

### 5. Cumulative Decision Log (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Maintain a persistent decision log to preserve context across phases.

```bash
DECISION_LOG="$SPRINT_ARTIFACTS_DIR/epic-${EPIC_ID}-decisions.md"

init_decision_log() {
    mkdir -p "$(dirname "$DECISION_LOG")"
    cat > "$DECISION_LOG" << EOF
# Epic $EPIC_ID Decision Log

This file tracks implementation decisions for context continuity.

---
EOF
}

append_to_decision_log() {
    local phase="$1"
    local story_id="$2"
    local content="$3"

    cat >> "$DECISION_LOG" << EOF

## $phase: $story_id
**Timestamp:** $(date '+%Y-%m-%d %H:%M:%S')

$content

---
EOF
}

# Then pass to subsequent phases:
"## Previous Implementation Context

The following decisions have been made in this epic:

$(cat "$DECISION_LOG")

Respect these decisions unless you have a specific reason to deviate.
"
```

---

### 6. Test-First Enforcement (HIGH IMPACT, HIGH EFFORT) ✅ IMPLEMENTED

Restructure the flow to enforce TDD principles.

**Implemented Flow:**

```
1. DESIGN PHASE (Dev)       → Plan implementation approach
2. TEST SPEC PHASE (TEA)    → Write test specifications based on ACs
3. TEST IMPL PHASE (TEA)    → Implement failing tests
4. VERIFICATION             → Confirm tests fail appropriately
5. DEV PHASE (Dev)          → Implement to make tests pass
6. STATIC ANALYSIS GATE     → Real tooling verification
7. REVIEW PHASE             → Adversarial review
8. REGRESSION GATE          → Ensure no regressions
```

This ensures tests actually test requirements rather than implementation details.

**Implementation:** Module at `scripts/epic-execute-lib/tdd-flow.sh` with functions:
- `execute_test_spec_phase()` - Generates BDD test specifications from acceptance criteria
- `execute_test_impl_phase()` - Creates failing tests from specifications
- `execute_test_verification_phase()` - Verifies tests fail correctly before implementation
- `build_test_spec_context_for_dev()` - Provides test context to dev phase

**Skip flags:** `--skip-tdd`, `--skip-test-spec`, `--skip-test-impl`

---

### 7. Structured Output Validation (MEDIUM IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Replace fragile regex parsing with structured JSON output.

**Implementation:** Module at `scripts/epic-execute-lib/json-output.sh` with functions:
- `extract_json_result()` - Parse JSON from Claude output
- `get_result_status()` - Extract status field
- `get_result_files()` - Extract files_changed array
- `get_result_issues()` - Extract issues for fix loops
- `check_phase_completion()` - Unified completion detection with JSON + text fallback

```bash
# Prompts now request JSON output:
"Output your result as JSON:
\`\`\`json
{
  \"status\": \"COMPLETE\" | \"BLOCKED\" | \"FAILED\",
  \"story_id\": \"...\",
  \"summary\": \"...\",
  \"files_changed\": [...],
  \"tests_added\": N,
  \"decisions\": [{\"what\": \"...\", \"why\": \"...\"}],
  \"concerns\": [...]
}
\`\`\`"

# Parsing with fallback:
check_phase_completion "$result" "dev" "$story_id"
# Returns: 0 (complete), 1 (failed), 2 (unclear)
```

**Skip flag:** `--legacy-output` (disables JSON parsing, uses text-only detection)

---

## Implementation Priority Matrix

| Priority | Improvement | Impact | Effort | Status | Rationale |
|----------|-------------|--------|--------|--------|-----------|
| 1 | Static Analysis Gate | HIGH | LOW | ✅ DONE | Catches real errors AI misses |
| 2 | Real Test Output in Fix | MEDIUM | LOW | ✅ DONE | Quick win, better fixes |
| 3 | Decision Log | MEDIUM | LOW | ✅ DONE | Easy context preservation |
| 4 | Regression Gate | HIGH | MEDIUM | ✅ DONE | Prevents silent breakage |
| 5 | Design Phase | HIGH | MEDIUM | ✅ DONE | Catches issues early |
| 6 | Structured JSON Output | MEDIUM | MEDIUM | ✅ DONE | Improves reliability |
| 7 | Test-First Flow | HIGH | HIGH | ✅ DONE | Fundamental quality improvement |

---

## Proposed Enhanced Flow

```
Epic Execution Pipeline v2:

SETUP
├── Validate workflows
├── Initialize metrics
├── Initialize decision log
└── Initialize regression baseline

FOR EACH STORY:
├── 1. DESIGN PHASE (NEW)
│   ├── Generate implementation plan
│   ├── Validate against architecture
│   └── Save to decision log
│
├── 2. DEV PHASE
│   ├── Pass design context
│   ├── Implement story
│   └── Stage changes
│
├── 3. STATIC ANALYSIS GATE (NEW)
│   ├── Run npm run typecheck
│   ├── Run npm run lint
│   ├── Run npm run build
│   ├── Run npm test
│   └── Capture real failures
│
├── 4. ARCH COMPLIANCE
│   └── (existing logic)
│
├── 5. CODE REVIEW + FIX LOOP
│   ├── Pass real test output to fix phase
│   └── (existing logic)
│
├── 6. TEST QUALITY
│   └── (existing logic)
│
├── 7. REGRESSION GATE (NEW)
│   ├── Compare test count to baseline
│   ├── Compare coverage to baseline
│   └── Fail if regression detected
│
└── 8. COMMIT
    └── (existing logic)

POST-STORIES:
├── Traceability check
├── UAT generation
└── Finalize metrics
```

---

## Quick Wins (Implement First) ✅ ALL COMPLETE

1. **Static Analysis Gate** - Single most impactful change ✅
2. **Real Test Output in Fix Loops** - Minimal code change, better fixes ✅
3. **Decision Log** - Simple file append, preserves context ✅

These three changes alone would dramatically improve code reliability with minimal refactoring effort.

**Additionally implemented:**
4. **Regression Gate** - Prevents silent breakage ✅
5. **Design Phase** - Catches architectural issues early ✅
6. **Structured JSON Output** - Reliable completion signal parsing ✅
7. **Test-First Flow** - TDD workflow with test specs before implementation ✅

**Implementation:** All features are modularized in `scripts/epic-execute-lib/` with graceful degradation and skip flags:
- `--skip-design` - Skip pre-implementation design phase
- `--skip-regression` - Skip regression test gate
- `--skip-tdd` - Skip test-first development phases
- `--skip-test-spec` - Skip test specification phase only
- `--skip-test-impl` - Skip test implementation phase only
- `--legacy-output` - Use legacy text-based output parsing (no JSON)

---

## Metrics to Track

After implementing these improvements, track:

- **False Positive Rate**: Stories marked "complete" that have real errors
- **Fix Loop Efficiency**: Average fix attempts per story (should decrease)
- **Regression Rate**: Stories that break previous functionality
- **Time to First Working Implementation**: Should decrease with design phase

---

## Conclusion

The current epic-execute script has a solid foundation with multi-phase validation and self-healing fix loops. However, it fundamentally relies on AI self-reporting for validation. Adding real tooling verification (Static Analysis Gate) would catch the majority of "AI said it works but it doesn't" issues.

The recommended implementation order prioritizes high-impact, low-effort changes first, building toward a comprehensive TDD-based flow that would fundamentally improve code quality.