---
title: BMAD Epic-Execute Improvements Analysis v2
---

# BMAD Epic-Execute Improvements Analysis v2

**Date:** 2026-01-26
**Analyzed Script:** `scripts/epic-execute.sh`
**Purpose:** Improve the performance and reliability of code generated by the epic-execute automation

---

## Executive Summary

The current epic-execute script orchestrates AI agents to implement stories through multiple phases (dev, architecture compliance, code review, test quality, traceability, UAT generation). While the multi-phase approach with self-healing fix loops is sound, there are fundamental improvements that would significantly enhance the reliability of generated code.

**Core Problem:** The system relies on AI self-reporting ("tests pass", "build succeeds") rather than actual verification. This creates systematic blind spots where AI can hallucinate success.

---

## Current Flow Analysis

```
Story Execution Pipeline:
1. DEV PHASE        → AI implements story
2. ARCH COMPLIANCE  → AI validates against architecture.md
3. CODE REVIEW      → AI reviews in fresh context (adversarial)
4. TEST QUALITY     → AI reviews tests for quality patterns
5. TRACEABILITY     → AI maps acceptance criteria to tests
6. UAT GENERATION   → AI creates manual test document
```

Each phase runs in isolated context (good for adversarial review) but relies entirely on AI judgment for validation.

---

## Identified Weaknesses

### 1. No Real Tooling Verification

The script trusts AI to report:
- "Tests pass" - but did they actually run?
- "Build succeeds" - no actual build command executed
- "No type errors" - no actual type checking

**Impact:** AI can hallucinate success. Code that doesn't compile can be marked as "complete".

### 2. No Baseline/Regression Testing

Each story is executed in isolation. There's no verification that:
- Story N doesn't break Story N-1
- Overall test count doesn't decrease
- Coverage doesn't regress

**Impact:** Later stories can silently break earlier work.

### 3. Tests Written After Implementation (Not TDD)

Current flow: Implement → Write tests → Review

**Impact:**
- Tests often test implementation, not requirements
- Tests may be written to pass, not to verify behavior
- Edge cases are missed because the dev already knows the code

### 4. No Pre-Implementation Design Review

The dev phase jumps straight into coding. For complex stories, this leads to:
- Architectural decisions made implicitly during coding
- Refactoring when the initial approach doesn't work
- Inconsistent patterns within the epic

### 5. AI Marking Its Own Homework

The same system (Claude):
- Writes the code
- Reviews the code
- Says "tests pass"

**Impact:** Systematic blind spots get reinforced across phases.

### 6. Context Loss Between Phases

Each phase runs in fresh context. Good for adversarial review, but:
- Fix phase doesn't know WHY dev made certain decisions
- Review doesn't know about trade-offs considered
- Decisions get made, lost, and remade differently

### 7. No Incremental Validation

Complex stories are implemented all at once. If something fails at the end, the entire implementation might need redoing.

### 8. Completion Signal Parsing is Fragile

The script uses regex to find "IMPLEMENTATION COMPLETE" in output. If AI outputs this string in a different context (explaining what to output, quoting instructions), it triggers a false positive.

### 9. Missing Dependency Validation

Before implementing a story, there is no check that:
- Required npm packages are installed
- Required services are running
- Prerequisite stories are complete

---

## Recommended Improvements

### 1. Static Analysis Gate (HIGH IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Add a real tooling validation step that runs actual commands between dev and review phases.
```bash
execute_static_analysis_gate() {
    local story_id="$1"
    local failures=0

    log ">>> STATIC ANALYSIS GATE: $story_id"

    # Detect project type and run appropriate checks
    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # TypeScript/JavaScript project
        # Each check runs in a subshell with pipefail so the pipeline reports
        # npm's exit status; otherwise tee's status would mask a failure.

        # 1. Type checking (catches type errors AI might miss)
        if grep -q '"typecheck"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running type check..."
            if ! (set -o pipefail; npm run typecheck 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Type check failed"
                # ((failures++)) returns status 1 when failures is 0, which trips set -e
                failures=$((failures + 1))
            fi
        fi

        # 2. Linting (catches code style/quality issues)
        if grep -q '"lint"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running lint..."
            if ! (set -o pipefail; npm run lint 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Lint failed"
                failures=$((failures + 1))
            fi
        fi

        # 3. Build (catches compilation errors)
        if grep -q '"build"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            log "Running build..."
            if ! (set -o pipefail; npm run build 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Build failed"
                failures=$((failures + 1))
            fi
        fi

        # 4. Tests (catches actual test failures)
        log "Running tests..."
        local test_status=0
        TEST_OUTPUT=$(npm test 2>&1) || test_status=$?
        echo "$TEST_OUTPUT" >> "$LOG_FILE"

        # Trust the exit status: grepping for "failed" also matches "0 failed"
        if [ "$test_status" -ne 0 ]; then
            log_error "Tests failed"
            # Capture actual failures for fix phase
            ACTUAL_TEST_FAILURES=$(echo "$TEST_OUTPUT" | grep -E -A 10 "FAIL|Error" || true)
            failures=$((failures + 1))
        fi
    fi

    if [ "$failures" -gt 0 ]; then
        log_error "Static analysis gate failed with $failures issue(s)"
        return 1
    fi

    log_success "Static analysis gate passed"
    return 0
}
```

**Insertion Point:** After `execute_dev_phase()`, before arch compliance.

---

### 2. Regression Test Gate (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Track test baseline and verify no regressions after each story.
```bash
# Count passing tests from the runner's text summary.
# Assumes a Mocha-style "N passing" line; adjust the pattern for other runners.
count_passing_tests() {
    local count
    count=$(echo "$1" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1) || true
    echo "${count:-0}"  # default to 0 when no summary line is found
}

# Initialize at epic start
init_regression_baseline() {
    BASELINE_PASSING_TESTS=0
    BASELINE_COVERAGE=0

    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # Capture baseline test count (plain text output, since the count is
        # parsed from the human-readable summary)
        BASELINE_TEST_OUTPUT=$(npm test 2>&1) || true
        BASELINE_PASSING_TESTS=$(count_passing_tests "$BASELINE_TEST_OUTPUT")

        # Capture baseline coverage if available
        if grep -q '"coverage"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            BASELINE_COVERAGE=$(npm run coverage -- --json 2>/dev/null | jq '.total.lines.pct' 2>/dev/null || echo "0")
        fi

        log "Regression baseline: $BASELINE_PASSING_TESTS passing tests, ${BASELINE_COVERAGE}% coverage"
    fi
}

execute_regression_gate() {
    local story_id="$1"
    # Remember the old baseline so the success message can report it
    local previous_baseline="$BASELINE_PASSING_TESTS"

    log ">>> REGRESSION GATE: $story_id"

    # Get current test count
    CURRENT_TEST_OUTPUT=$(npm test 2>&1) || true
    CURRENT_PASSING_TESTS=$(count_passing_tests "$CURRENT_TEST_OUTPUT")

    # Check for regression
    if [ "$CURRENT_PASSING_TESTS" -lt "$previous_baseline" ]; then
        log_error "REGRESSION DETECTED: Test count decreased ($previous_baseline -> $CURRENT_PASSING_TESTS)"
        add_metrics_issue "$story_id" "regression" "Test count decreased from $previous_baseline to $CURRENT_PASSING_TESTS"
        return 1
    fi

    # Update baseline for next story
    BASELINE_PASSING_TESTS=$CURRENT_PASSING_TESTS

    log_success "Regression gate passed: $CURRENT_PASSING_TESTS tests passing (was $previous_baseline)"
    return 0
}
```

**Insertion Point:** After review passes, before marking story done.

---

### 3. Pre-Implementation Design Phase (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Add a design phase before implementation to catch architectural issues early.
```bash
execute_design_phase() {
    local story_file="$1"
    local story_id
    story_id=$(basename "$story_file" .md)

    log ">>> DESIGN PHASE: $story_id"

    local story_contents
    story_contents=$(cat "$story_file")
    local arch_contents=""

    if [ -f "$PROJECT_ROOT/docs/architecture.md" ]; then
        arch_contents=$(cat "$PROJECT_ROOT/docs/architecture.md")
    fi

    local design_prompt="You are a senior developer planning the implementation of a story.

## Your Task

Create an implementation plan for: $story_id

Do NOT write any code yet. Output only your design plan.

## Story

$story_contents

## Architecture Reference

$arch_contents

## Required Output

Output your implementation plan in this exact format:

\`\`\`
DESIGN START
files_to_modify:
  - path:
    action: create|modify
    purpose:
patterns_to_use:
  - :
dependencies:
  - :
acceptance_criteria_mapping:
  - AC1:
  - AC2:
risks:
  -
estimated_test_files:
  - :
DESIGN END
\`\`\`

Be specific. This plan will be validated against architecture before implementation begins."

    local result
    result=$(claude --dangerously-skip-permissions -p "$design_prompt" 2>&1) || true

    echo "$result" >> "$LOG_FILE"

    # Extract and save design for later phases
    LAST_DESIGN=$(echo "$result" | sed -n '/DESIGN START/,/DESIGN END/p')

    if [ -n "$LAST_DESIGN" ]; then
        # Save to decision log for context in later phases
        echo "## Design: $story_id - $(date)" >> "$DECISION_LOG"
        echo "$LAST_DESIGN" >> "$DECISION_LOG"
        log_success "Design phase complete: $story_id"
        return 0
    else
        log_error "Design phase did not produce valid output"
        return 1
    fi
}
```

**Insertion Point:** Before `execute_dev_phase()`. Pass the design to the dev prompt.

---

### 4. Real Test Output in Fix Loops (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Capture and pass real test output to fix phases instead of just AI-extracted findings.
```bash
# In execute_review_phase or after static analysis gate
capture_real_failures() {
    # Run tests and capture actual output
    TEST_OUTPUT=$(npm test 2>&1) || true

    # Extract actual failures with context
    ACTUAL_FAILURES=$(echo "$TEST_OUTPUT" | grep -E -B 2 -A 10 "FAIL|Error|AssertionError|Expected|Received" || true)

    # Extract lint errors if any
    LINT_OUTPUT=$(npm run lint 2>&1) || true
    LINT_ERRORS=$(echo "$LINT_OUTPUT" | grep -E "error|warning" || true)

    # Extract type errors if any
    TYPE_OUTPUT=$(npm run typecheck 2>&1) || true
    TYPE_ERRORS=$(echo "$TYPE_OUTPUT" | grep -E "error TS|Error:" || true)
}

# Then in execute_fix_phase, add to the prompt:
fix_prompt+="

## Actual Tooling Output

### Test Failures (from npm test)
\`\`\`
$ACTUAL_FAILURES
\`\`\`

### Lint Errors (from npm run lint)
\`\`\`
$LINT_ERRORS
\`\`\`

### Type Errors (from npm run typecheck)
\`\`\`
$TYPE_ERRORS
\`\`\`

Use this ACTUAL output to guide your fixes, not just the review findings.
"
```

---

### 5. Cumulative Decision Log (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Maintain a persistent decision log to preserve context across phases.

```bash
DECISION_LOG="$SPRINT_ARTIFACTS_DIR/epic-${EPIC_ID}-decisions.md"

init_decision_log() {
    mkdir -p "$(dirname "$DECISION_LOG")"
    cat > "$DECISION_LOG" << EOF
# Epic $EPIC_ID Decision Log

This file tracks implementation decisions for context continuity.

---

EOF
}

append_to_decision_log() {
    local phase="$1"
    local story_id="$2"
    local content="$3"

    cat >> "$DECISION_LOG" << EOF
## $phase: $story_id
**Timestamp:** $(date '+%Y-%m-%d %H:%M:%S')

$content

---

EOF
}

# Then pass to subsequent phases
# (decision_context is an illustrative variable name):
decision_context="## Previous Implementation Context

The following decisions have been made in this epic:

$(cat "$DECISION_LOG")

Respect these decisions unless you have a specific reason to deviate.
"
```

---

### 6. Test-First Enforcement (HIGH IMPACT, HIGH EFFORT) ✅ IMPLEMENTED

Restructure the flow to enforce TDD principles.

**Implemented Flow:**
```
1. DESIGN PHASE (Dev)     → Plan implementation approach
2. TEST SPEC PHASE (TEA)  → Write test specifications based on ACs
3. TEST IMPL PHASE (TEA)  → Implement failing tests
4. VERIFICATION           → Confirm tests fail appropriately
5. DEV PHASE (Dev)        → Implement to make tests pass
6. STATIC ANALYSIS GATE   → Real tooling verification
7. REVIEW PHASE           → Adversarial review
8. REGRESSION GATE        → Ensure no regressions
```

This ensures tests actually test requirements rather than implementation details.

**Implementation:** Module at `scripts/epic-execute-lib/tdd-flow.sh` with functions:
- `execute_test_spec_phase()` - Generates BDD test specifications from acceptance criteria
- `execute_test_impl_phase()` - Creates failing tests from specifications
- `execute_test_verification_phase()` - Verifies tests fail correctly before implementation
- `build_test_spec_context_for_dev()` - Provides test context to dev phase

**Skip flags:** `--skip-tdd`, `--skip-test-spec`, `--skip-test-impl`

---

### 7. Structured Output Validation (MEDIUM IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Replace fragile regex parsing with structured JSON output.

**Implementation:** Module at `scripts/epic-execute-lib/json-output.sh` with functions:
- `extract_json_result()` - Parse JSON from Claude output
- `get_result_status()` - Extract status field
- `get_result_files()` - Extract files_changed array
- `get_result_issues()` - Extract issues for fix loops
- `check_phase_completion()` - Unified completion detection with JSON + text fallback

```bash
# Prompts now request JSON output
# (json_output_instructions is an illustrative variable name):
json_output_instructions="Output your result as JSON:
\`\`\`json
{
  \"status\": \"COMPLETE\" | \"BLOCKED\" | \"FAILED\",
  \"story_id\": \"...\",
  \"summary\": \"...\",
  \"files_changed\": [...],
  \"tests_added\": N,
  \"decisions\": [{\"what\": \"...\", \"why\": \"...\"}],
  \"concerns\": [...]
}
\`\`\`"

# Parsing with fallback:
check_phase_completion "$result" "dev" "$story_id"
# Returns: 0 (complete), 1 (failed), 2 (unclear)
```

**Skip flag:** `--legacy-output` (disables JSON parsing, uses text-only detection)

---

## Implementation Priority Matrix

| Priority | Improvement | Impact | Effort | Status | Rationale |
|----------|-------------|--------|--------|--------|-----------|
| 1 | Static Analysis Gate | HIGH | LOW | ✅ DONE | Catches real errors AI misses |
| 2 | Real Test Output in Fix | MEDIUM | LOW | ✅ DONE | Quick win, better fixes |
| 3 | Decision Log | MEDIUM | LOW | ✅ DONE | Easy context preservation |
| 4 | Regression Gate | HIGH | MEDIUM | ✅ DONE | Prevents silent breakage |
| 5 | Design Phase | HIGH | MEDIUM | ✅ DONE | Catches issues early |
| 6 | Structured JSON Output | MEDIUM | MEDIUM | ✅ DONE | Improves reliability |
| 7 | Test-First Flow | HIGH | HIGH | ✅ DONE | Fundamental quality improvement |

---

## Proposed Enhanced Flow

```
Epic Execution Pipeline v2:

SETUP
├── Validate workflows
├── Initialize metrics
├── Initialize decision log
└── Initialize regression baseline

FOR EACH STORY:
├── 1. DESIGN PHASE (NEW)
│   ├── Generate implementation plan
│   ├── Validate against architecture
│   └── Save to decision log
│
├── 2. DEV PHASE
│   ├── Pass design context
│   ├── Implement story
│   └── Stage changes
│
├── 3. STATIC ANALYSIS GATE (NEW)
│   ├── Run npm run typecheck
│   ├── Run npm run lint
│   ├── Run npm run build
│   ├── Run npm test
│   └── Capture real failures
│
├── 4. ARCH COMPLIANCE
│   └── (existing logic)
│
├── 5. CODE REVIEW + FIX LOOP
│   ├── Pass real test output to fix phase
│   └── (existing logic)
│
├── 6. TEST QUALITY
│   └── (existing logic)
│
├── 7. REGRESSION GATE (NEW)
│   ├── Compare test count to baseline
│   ├── Compare coverage to baseline
│   └── Fail if regression detected
│
└── 8. COMMIT
    └── (existing logic)

POST-STORIES:
├── Traceability check
├── UAT generation
└── Finalize metrics
```

---

## Quick Wins (Implement First) ✅ ALL COMPLETE

1. **Static Analysis Gate** - Single most impactful change ✅
2. **Real Test Output in Fix Loops** - Minimal code change, better fixes ✅
3. **Decision Log** - Simple file append, preserves context ✅

These three changes alone would dramatically improve code reliability with minimal refactoring effort.

**Additionally implemented:**

4. **Regression Gate** - Prevents silent breakage ✅
5. **Design Phase** - Catches architectural issues early ✅
6. **Structured JSON Output** - Reliable completion signal parsing ✅
7. **Test-First Flow** - TDD workflow with test specs before implementation ✅

**Implementation:** All features are modularized in `scripts/epic-execute-lib/` with graceful degradation and skip flags:
- `--skip-design` - Skip pre-implementation design phase
- `--skip-regression` - Skip regression test gate
- `--skip-tdd` - Skip test-first development phases
- `--skip-test-spec` - Skip test specification phase only
- `--skip-test-impl` - Skip test implementation phase only
- `--legacy-output` - Use legacy text-based output parsing (no JSON)

---

## Metrics to Track

After implementing these improvements, track:

- **False Positive Rate**: Stories marked "complete" that have real errors
- **Fix Loop Efficiency**: Average fix attempts per story (should decrease)
- **Regression Rate**: Stories that break previous functionality
- **Time to First Working Implementation**: Should decrease with design phase

---

## Conclusion

The current epic-execute script has a solid foundation with multi-phase validation and self-healing fix loops. However, it fundamentally relies on AI self-reporting for validation. Adding real tooling verification (Static Analysis Gate) would catch the majority of "AI said it works but it doesn't" issues.

The recommended implementation order prioritizes high-impact, low-effort changes first, building toward a comprehensive TDD-based flow that would fundamentally improve code quality.
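
As a rough illustration of how two of the metrics listed under Metrics to Track might be computed, the sketch below summarizes a per-story results file. The file layout (tab-separated `story_id`, `ai_status`, `gate_exit_code`, `fix_attempts`) and the function name `summarize_epic_metrics` are assumptions made for this example; they are not part of `epic-execute.sh` or its metrics module.

```bash
#!/usr/bin/env bash
# Hypothetical input: one tab-separated line per story:
#   story_id <TAB> ai_status <TAB> gate_exit_code <TAB> fix_attempts
# These fields are illustrative assumptions, not the script's real metrics format.
summarize_epic_metrics() {
    awk -F'\t' '
        { stories++; attempts += $4 }
        # A story the AI reported as complete
        $2 == "COMPLETE" { claimed++ }
        # A false positive: reported complete, but the real tooling gate failed
        $2 == "COMPLETE" && $3 != 0 { false_pos++ }
        END {
            if (stories == 0) { print "no stories recorded"; exit }
            fp_rate = (claimed > 0) ? 100 * false_pos / claimed : 0
            printf "stories=%d false_positive_rate=%.1f%% avg_fix_attempts=%.2f\n",
                stories, fp_rate, attempts / stories
        }
    ' "$1"
}
```

Calling `summarize_epic_metrics results.tsv` prints a single summary line. The false positive rate is taken over stories the AI claimed were complete, so it measures how often self-reporting disagreed with the tooling rather than how often stories failed overall.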