docs: mark completed improvements in bmad_improvements_v2.md

Update status for implemented features:
- Static Analysis Gate: ✅ DONE
- Real Test Output in Fix: ✅ DONE
- Decision Log: ✅ DONE
- Regression Gate: ✅ DONE
- Design Phase: ✅ DONE

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-26 14:24:59 -06:00 · 6a5b7b68e8
parent fd744c96f3
commit 6a5b7b68e8
1 changed file with 555 additions and 0 deletions
--- a/docs/bmad_improvements_v2.md
+++ b/docs/bmad_improvements_v2.md
@@ -0,0 +1,555 @@
+# BMAD Epic-Execute Improvements Analysis v2
+
+**Date:** 2026-01-26
+**Analyzed Script:** `scripts/epic-execute.sh`
+**Purpose:** Improve the performance and reliability of code generated by the epic-execute automation
+
+---
+
+## Executive Summary
+
+The current epic-execute script orchestrates AI agents to implement stories through multiple phases (dev, architecture compliance, code review, test quality, traceability, UAT generation). While the multi-phase approach with self-healing fix loops is sound, there are fundamental improvements that would significantly enhance the reliability of generated code.
+
+**Core Problem:** The system relies on AI self-reporting ("tests pass", "build succeeds") rather than actual verification. This creates systematic blind spots where AI can hallucinate success.
+
+---
+
+## Current Flow Analysis
+
+```
+Story Execution Pipeline:
+1. DEV PHASE         → AI implements story
+2. ARCH COMPLIANCE   → AI validates against architecture.md
+3. CODE REVIEW       → AI reviews in fresh context (adversarial)
+4. TEST QUALITY      → AI reviews tests for quality patterns
+5. TRACEABILITY      → AI maps acceptance criteria to tests
+6. UAT GENERATION    → AI creates manual test document
+```
+
+Each phase runs in an isolated context (good for adversarial review) but relies entirely on AI judgment for validation.
+
+---
+
+## Identified Weaknesses
+
+### 1. No Real Tooling Verification
+
+The script trusts AI to report:
+- "Tests pass" - but did they actually run?
+- "Build succeeds" - no actual build command executed
+- "No type errors" - no actual type checking
+
+**Impact:** AI can hallucinate success. Code that doesn't compile can be marked as "complete".
+
+### 2. No Baseline/Regression Testing
+
+Each story is executed in isolation. There's no verification that:
+- Story N doesn't break Story N-1
+- Overall test count doesn't decrease
+- Coverage doesn't regress
+
+**Impact:** Later stories can silently break earlier work.
+
+### 3. Tests Written After Implementation (Not TDD)
+
+Current flow: Implement → Write tests → Review
+
+**Impact:**
+- Tests often test implementation, not requirements
+- Tests may be written to pass, not to verify behavior
+- Missing edge cases because dev already knows the code
+
+### 4. No Pre-Implementation Design Review
+
+The dev phase jumps straight into coding. For complex stories, this leads to:
+- Architectural decisions made implicitly during coding
+- Refactoring when initial approach doesn't work
+- Inconsistent patterns within the epic
+
+### 5. AI Marking Its Own Homework
+
+The same system (Claude):
+- Writes the code
+- Reviews the code
+- Says "tests pass"
+
+**Impact:** Systematic blind spots get reinforced across phases.
+
+### 6. Context Loss Between Phases
+
+Each phase runs in a fresh context. That is good for adversarial review, but:
+- Fix phase doesn't know WHY dev made certain decisions
+- Review doesn't know about trade-offs considered
+- Decisions get made, lost, and remade differently
+
+### 7. No Incremental Validation
+
+Complex stories are implemented all at once. If something fails at the end, the entire implementation might need to be redone.
+
+### 8. Completion Signal Parsing is Fragile
+
+The script uses a regex to find "IMPLEMENTATION COMPLETE" in the output. If the AI emits this string in a different context (explaining what to output, quoting instructions), it triggers a false positive.
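One way to narrow the false-positive window described above, sketched under assumptions: the marker is accepted only when it is the last non-empty line of the output and stands alone on that line. The helper name `has_completion_signal` is hypothetical and does not come from `scripts/epic-execute.sh`; whether the real prompts ask for the marker on its own final line is also an assumption.

```bash
# Hypothetical helper (not part of epic-execute.sh): accept the marker only
# when it is the last non-empty line of the output, alone on that line.
has_completion_signal() {
    local output="$1"
    local last_line
    # Drop blank lines, then keep the final remaining line.
    last_line=$(printf '%s\n' "$output" | grep -v '^[[:space:]]*$' | tail -n 1 || true)
    [ "$last_line" = "IMPLEMENTATION COMPLETE" ]
}
```

A quoted instruction such as `print "IMPLEMENTATION COMPLETE" when done` no longer matches, because the marker is not the whole final line. Trailing whitespace or carriage returns would still need trimming in a real script.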
+
+### 9. Missing Dependency Validation
+
+Before implementing a story, there is no check that:
+- Required npm packages are installed
+- Required services are running
+- Prerequisite stories are complete
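A minimal sketch of what a pre-flight check for the first item could look like, assuming that an installed npm package always has a directory under `node_modules`. The function name `check_story_prerequisites` and its argument order are hypothetical; checks for running services and prerequisite stories are left out because the document does not say how either is recorded.

```bash
# Hypothetical pre-flight check (illustrative only, not from epic-execute.sh).
# Usage: check_story_prerequisites <project_root> <package>...
check_story_prerequisites() {
    local project_root="$1"
    shift
    local missing=0
    local pkg
    for pkg in "$@"; do
        # Assumption: an installed package has a directory under node_modules.
        if [ ! -d "$project_root/node_modules/$pkg" ]; then
            echo "Missing npm package: $pkg" >&2
            missing=$((missing + 1))
        fi
    done
    # Succeed only when every listed package was found.
    [ "$missing" -eq 0 ]
}
```

The package list could come from the `dependencies` section of a design plan, but that wiring is not shown here.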
+
+---
+
+## Recommended Improvements
+
+### 1. Static Analysis Gate (HIGH IMPACT, LOW EFFORT) ✅ IMPLEMENTED
+
+Add a real tooling validation step that runs actual commands between dev and review phases.
+
+```bash
+execute_static_analysis_gate() {
+    local story_id="$1"
+    local failures=0
+
+    log ">>> STATIC ANALYSIS GATE: $story_id"
+
+    # Detect project type and run appropriate checks
+    if [ -f "$PROJECT_ROOT/package.json" ]; then
+        # TypeScript/JavaScript project
+
+        # 1. Type checking (catches type errors AI might miss)
+        # Output goes through process substitution rather than a pipe so that
+        # `if !` sees npm's exit status, not tee's.
+        if grep -q '"typecheck"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
+            log "Running type check..."
+            if ! npm run typecheck > >(tee -a "$LOG_FILE") 2>&1; then
+                log_error "Type check failed"
+                # Plain assignment: ((failures++)) returns non-zero when failures is 0
+                failures=$((failures + 1))
+            fi
+        fi
+
+        # 2. Linting (catches code style/quality issues)
+        if grep -q '"lint"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
+            log "Running lint..."
+            if ! npm run lint > >(tee -a "$LOG_FILE") 2>&1; then
+                log_error "Lint failed"
+                failures=$((failures + 1))
+            fi
+        fi
+
+        # 3. Build (catches compilation errors)
+        if grep -q '"build"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
+            log "Running build..."
+            if ! npm run build > >(tee -a "$LOG_FILE") 2>&1; then
+                log_error "Build failed"
+                failures=$((failures + 1))
+            fi
+        fi
+
+        # 4. Tests (catches actual test failures)
+        # Use npm's exit status; grepping for "failed" or "Error" also matches
+        # passing output such as "0 failed".
+        log "Running tests..."
+        local test_status=0
+        TEST_OUTPUT=$(npm test 2>&1) || test_status=$?
+        echo "$TEST_OUTPUT" >> "$LOG_FILE"
+
+        if [ "$test_status" -ne 0 ]; then
+            log_error "Tests failed"
+            # Capture actual failures for fix phase
+            ACTUAL_TEST_FAILURES=$(echo "$TEST_OUTPUT" | grep -A 10 "FAIL\|Error" || true)
+            failures=$((failures + 1))
+        fi
+    fi
+
+    if [ $failures -gt 0 ]; then
+        log_error "Static analysis gate failed with $failures issue(s)"
+        return 1
+    fi
+
+    log_success "Static analysis gate passed"
+    return 0
+}
+```
+
+**Insertion Point:** After `execute_dev_phase()`, before arch compliance.
+
+---
+
+### 2. Regression Test Gate (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED
+
+Track test baseline and verify no regressions after each story.
+
+```bash
+# Initialize at epic start
+init_regression_baseline() {
+    if [ -f "$PROJECT_ROOT/package.json" ]; then
+        # Capture baseline test count. The pattern assumes a Mocha-style
+        # "N passing" summary line; adjust it for other test runners.
+        BASELINE_TEST_OUTPUT=$(npm test 2>&1 || true)
+        BASELINE_PASSING_TESTS=$(echo "$BASELINE_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1)
+        # Default to 0 when no summary line was found (head always exits 0,
+        # so a trailing `|| echo "0"` would never fire)
+        BASELINE_PASSING_TESTS=${BASELINE_PASSING_TESTS:-0}
+
+        # Capture baseline coverage if available
+        if grep -q '"coverage"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
+            BASELINE_COVERAGE=$(npm run coverage -- --json 2>/dev/null | jq '.total.lines.pct' 2>/dev/null || echo "0")
+        fi
+
+        # Mention coverage only when it was actually captured
+        local coverage_note="${BASELINE_COVERAGE:+, ${BASELINE_COVERAGE}% coverage}"
+        log "Regression baseline: $BASELINE_PASSING_TESTS passing tests$coverage_note"
+    fi
+}
+
+execute_regression_gate() {
+    local story_id="$1"
+
+    log ">>> REGRESSION GATE: $story_id"
+
+    # Get current test count (same "N passing" assumption as the baseline)
+    CURRENT_TEST_OUTPUT=$(npm test 2>&1 || true)
+    CURRENT_PASSING_TESTS=$(echo "$CURRENT_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1)
+    CURRENT_PASSING_TESTS=${CURRENT_PASSING_TESTS:-0}
+
+    # Check for regression
+    if [ "$CURRENT_PASSING_TESTS" -lt "${BASELINE_PASSING_TESTS:-0}" ]; then
+        log_error "REGRESSION DETECTED: Test count decreased ($BASELINE_PASSING_TESTS -> $CURRENT_PASSING_TESTS)"
+        add_metrics_issue "$story_id" "regression" "Test count decreased from $BASELINE_PASSING_TESTS to $CURRENT_PASSING_TESTS"
+        return 1
+    fi
+
+    # Update baseline for next story, keeping the old value for the log message
+    local previous_baseline="${BASELINE_PASSING_TESTS:-0}"
+    BASELINE_PASSING_TESTS=$CURRENT_PASSING_TESTS
+    log_success "Regression gate passed: $CURRENT_PASSING_TESTS tests passing (was $previous_baseline)"
+    return 0
+}
+```
+
+**Insertion Point:** After review passes, before marking story done.
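The gate above compares test counts only. A coverage comparison needs decimal arithmetic, which `[ -lt ]` cannot do. The sketch below delegates that to `awk`; the function name `coverage_regressed` and the optional tolerance argument are hypothetical, not part of the implemented gate.

```bash
# Hypothetical helper: succeeds (exit 0) when coverage dropped below the
# baseline by more than the allowed tolerance, in percentage points.
# Usage: coverage_regressed <baseline_pct> <current_pct> [tolerance_pct]
coverage_regressed() {
    local baseline="$1"
    local current="$2"
    local tolerance="${3:-0}"
    # awk does the floating-point comparison; exit 0 means "regressed".
    awk -v b="$baseline" -v c="$current" -v t="$tolerance" \
        'BEGIN { exit !((c + t) < b) }'
}
```

For example, `coverage_regressed "$BASELINE_COVERAGE" "$current_coverage" 0.5` would tolerate a drop of up to half a percentage point, where `current_coverage` is an assumed variable holding the freshly measured value.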
+
+---
+
+### 3. Pre-Implementation Design Phase (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED
+
+Add a design phase before implementation to catch architectural issues early.
+
+```bash
+execute_design_phase() {
+    local story_file="$1"
+    local story_id=$(basename "$story_file" .md)
+
+    log ">>> DESIGN PHASE: $story_id"
+
+    local story_contents=$(cat "$story_file")
+    local arch_contents=""
+    if [ -f "$PROJECT_ROOT/docs/architecture.md" ]; then
+        arch_contents=$(cat "$PROJECT_ROOT/docs/architecture.md")
+    fi
+
+    local design_prompt="You are a senior developer planning the implementation of a story.
+
+## Your Task
+
+Create an implementation plan for: $story_id
+
+Do NOT write any code yet. Output only your design plan.
+
+## Story
+
+<story>
+$story_contents
+</story>
+
+## Architecture Reference
+
+<architecture>
+$arch_contents
+</architecture>
+
+## Required Output
+
+Output your implementation plan in this exact format:
+
+\`\`\`
+DESIGN START
+files_to_modify:
+  - path: <file path>
+    action: create|modify
+    purpose: <why this file>
+
+patterns_to_use:
+  - <pattern name>: <how it will be applied>
+
+dependencies:
+  - <package>: <installed|needs-install>
+
+acceptance_criteria_mapping:
+  - AC1: <which files/functions will implement this>
+  - AC2: <which files/functions will implement this>
+
+risks:
+  - <potential issue and mitigation>
+
+estimated_test_files:
+  - <test file path>: <what it will test>
+DESIGN END
+\`\`\`
+
+Be specific. This plan will be validated against architecture before implementation begins."
+
+    local result
+    result=$(claude --dangerously-skip-permissions -p "$design_prompt" 2>&1) || true
+
+    echo "$result" >> "$LOG_FILE"
+
+    # Extract and save design for later phases
+    LAST_DESIGN=$(echo "$result" | sed -n '/DESIGN START/,/DESIGN END/p')
+
+    if [ -n "$LAST_DESIGN" ]; then
+        # Save to decision log for context in later phases
+        echo "## Design: $story_id - $(date)" >> "$DECISION_LOG"
+        echo "$LAST_DESIGN" >> "$DECISION_LOG"
+        log_success "Design phase complete: $story_id"
+        return 0
+    else
+        log_error "Design phase did not produce valid output"
+        return 1
+    fi
+}
+```
+
+**Insertion Point:** Before `execute_dev_phase()`. Pass the design to the dev prompt.
+
+---
+
+### 4. Real Test Output in Fix Loops (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED
+
+Capture and pass real test output to fix phases instead of just AI-extracted findings.
+
+```bash
+# In execute_review_phase or after static analysis gate
+capture_real_failures() {
+    # Run tests and capture actual output
+    TEST_OUTPUT=$(npm test 2>&1) || true
+
+    # Extract actual failures with context
+    ACTUAL_FAILURES=$(echo "$TEST_OUTPUT" | grep -B 2 -A 10 "FAIL\|Error\|AssertionError\|Expected\|Received" || true)
+
+    # Extract lint errors if any
+    LINT_OUTPUT=$(npm run lint 2>&1) || true
+    LINT_ERRORS=$(echo "$LINT_OUTPUT" | grep -E "error|warning" || true)
+
+    # Extract type errors if any
+    TYPE_OUTPUT=$(npm run typecheck 2>&1) || true
+    TYPE_ERRORS=$(echo "$TYPE_OUTPUT" | grep -E "error TS|Error:" || true)
+}
+
+# Then in execute_fix_phase, add to the prompt:
+fix_prompt+="
+## Actual Tooling Output
+
+### Test Failures (from npm test)
+\`\`\`
+$ACTUAL_FAILURES
+\`\`\`
+
+### Lint Errors (from npm run lint)
+\`\`\`
+$LINT_ERRORS
+\`\`\`
+
+### Type Errors (from npm run typecheck)
+\`\`\`
+$TYPE_ERRORS
+\`\`\`
+
+Use this ACTUAL output to guide your fixes, not just the review findings.
+"
+```
+
+---
+
+### 5. Cumulative Decision Log (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED
+
+Maintain a persistent decision log to preserve context across phases.
+
+```bash
+DECISION_LOG="$SPRINT_ARTIFACTS_DIR/epic-${EPIC_ID}-decisions.md"
+
+init_decision_log() {
+    mkdir -p "$(dirname "$DECISION_LOG")"
+    cat > "$DECISION_LOG" << EOF
+# Epic $EPIC_ID Decision Log
+
+This file tracks implementation decisions for context continuity.
+
+---
+EOF
+}
+
+append_to_decision_log() {
+    local phase="$1"
+    local story_id="$2"
+    local content="$3"
+
+    cat >> "$DECISION_LOG" << EOF
+
+## $phase: $story_id
+**Timestamp:** $(date '+%Y-%m-%d %H:%M:%S')
+
+$content
+
+---
+EOF
+}
+
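+# Then pass to subsequent phases by appending to each phase prompt
+# (phase_prompt is a placeholder for that phase's prompt variable):
+phase_prompt+="## Previous Implementation Context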
+
+The following decisions have been made in this epic:
+
+$(cat "$DECISION_LOG")
+
+Respect these decisions unless you have a specific reason to deviate.
+"
+```
+
+---
+
+### 6. Test-First Enforcement (HIGH IMPACT, HIGH EFFORT)
+
+Restructure the flow to enforce TDD principles.
+
+**Proposed New Flow:**
+
+```
+1. DESIGN PHASE (Dev)       → Plan implementation approach
+2. TEST SPEC PHASE (TEA)    → Write test specifications based on ACs
+3. TEST IMPL PHASE (TEA)    → Implement failing tests
+4. VERIFICATION             → Confirm tests fail appropriately
+5. DEV PHASE (Dev)          → Implement to make tests pass
+6. STATIC ANALYSIS GATE     → Real tooling verification
+7. REVIEW PHASE             → Adversarial review
+8. REGRESSION GATE          → Ensure no regressions
+```
+
+This ensures tests actually test requirements rather than implementation details.
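Step 4 (VERIFICATION) can be approximated by running the new tests before any implementation exists and requiring a non-zero exit status. This is a rough sketch: the function name `verify_tests_fail_first` is hypothetical, and the check cannot tell a meaningful assertion failure from a syntax error or a missing import, so "fail appropriately" still needs the output to be inspected.

```bash
# Hypothetical check for the VERIFICATION step: the given test command must
# fail before the implementation is written.
# Usage: verify_tests_fail_first <test command and arguments...>
verify_tests_fail_first() {
    if "$@" >/dev/null 2>&1; then
        # Tests that already pass cannot be exercising unwritten behavior.
        echo "New tests already pass before implementation" >&2
        return 1
    fi
    return 0
}
```

A call might look like `verify_tests_fail_first npm test`, with the test command narrowed to the newly written test files where the runner supports it.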
+
+---
+
+### 7. Structured Output Validation (MEDIUM IMPACT, MEDIUM EFFORT)
+
+Replace fragile regex parsing with structured JSON output.
+
+```bash
+# Add to prompts (prompt is a placeholder for each phase's prompt variable):
+prompt+="Output your result as JSON:
+\`\`\`json
+{
+  \"status\": \"COMPLETE\" | \"BLOCKED\" | \"FAILED\",
+  \"story_id\": \"...\",
+  \"summary\": \"...\",
+  \"files_changed\": [...],
+  \"tests_added\": N,
+  \"decisions\": [{\"what\": \"...\", \"why\": \"...\"}],
+  \"concerns\": [...]
+}
+\`\`\`"
+
+# Parse with jq; treat missing or malformed JSON as FAILED:
+result_json=$(echo "$result" | sed -n '/```json/,/```/p' | sed '1d;$d')
+status=$(echo "$result_json" | jq -r '.status // empty' 2>/dev/null)
+status=${status:-FAILED}
+```
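Parsing the JSON is only half of the validation: the extracted `status` should also be checked against the three values the prompt allows, so that a typo or an invented status is not treated as success. A small sketch, with the hypothetical helper name `is_valid_status`:

```bash
# Hypothetical helper: accept only the status values named in the prompt.
is_valid_status() {
    case "$1" in
        COMPLETE|BLOCKED|FAILED) return 0 ;;
        *) return 1 ;;
    esac
}
```

A caller could fall back to `FAILED` whenever `is_valid_status "$status"` returns non-zero, which keeps unknown output on the safe side.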
+
+---
+
+## Implementation Priority Matrix
+
+| Priority | Improvement | Impact | Effort | Status | Rationale |
+|----------|-------------|--------|--------|--------|-----------|
+| 1 | Static Analysis Gate | HIGH | LOW | ✅ DONE | Catches real errors AI misses |
+| 2 | Real Test Output in Fix | MEDIUM | LOW | ✅ DONE | Quick win, better fixes |
+| 3 | Decision Log | MEDIUM | LOW | ✅ DONE | Easy context preservation |
+| 4 | Regression Gate | HIGH | MEDIUM | ✅ DONE | Prevents silent breakage |
+| 5 | Design Phase | HIGH | MEDIUM | ✅ DONE | Catches issues early |
+| 6 | Structured JSON Output | MEDIUM | MEDIUM | | Improves reliability |
+| 7 | Test-First Flow | HIGH | HIGH | | Fundamental quality improvement |
+
+---
+
+## Proposed Enhanced Flow
+
+```
+Epic Execution Pipeline v2:
+
+SETUP
+├── Validate workflows
+├── Initialize metrics
+├── Initialize decision log
+└── Initialize regression baseline
+
+FOR EACH STORY:
+├── 1. DESIGN PHASE (NEW)
+│   ├── Generate implementation plan
+│   ├── Validate against architecture
+│   └── Save to decision log
+│
+├── 2. DEV PHASE
+│   ├── Pass design context
+│   ├── Implement story
+│   └── Stage changes
+│
+├── 3. STATIC ANALYSIS GATE (NEW)
+│   ├── Run npm run typecheck
+│   ├── Run npm run lint
+│   ├── Run npm run build
+│   ├── Run npm test
+│   └── Capture real failures
+│
+├── 4. ARCH COMPLIANCE
+│   └── (existing logic)
+│
+├── 5. CODE REVIEW + FIX LOOP
+│   ├── Pass real test output to fix phase
+│   └── (existing logic)
+│
+├── 6. TEST QUALITY
+│   └── (existing logic)
+│
+├── 7. REGRESSION GATE (NEW)
+│   ├── Compare test count to baseline
+│   ├── Compare coverage to baseline
+│   └── Fail if regression detected
+│
+└── 8. COMMIT
+    └── (existing logic)
+
+POST-STORIES:
+├── Traceability check
+├── UAT generation
+└── Finalize metrics
+```
+
+---
+
+## Quick Wins (Implement First) ✅ ALL COMPLETE
+
+1. **Static Analysis Gate** - Single most impactful change ✅
+2. **Real Test Output in Fix Loops** - Minimal code change, better fixes ✅
+3. **Decision Log** - Simple file append, preserves context ✅
+
+These three changes alone would dramatically improve code reliability with minimal refactoring effort.
+
+**Additionally implemented:**
+
+4. **Regression Gate** - Prevents silent breakage ✅
+5. **Design Phase** - Catches architectural issues early ✅
+
+**Implementation:** All features are modularized in `scripts/epic-execute-lib/` with graceful degradation and skip flags (`--skip-design`, `--skip-regression`).
+
+---
+
+## Metrics to Track
+
+After implementing these improvements, track:
+
+- **False Positive Rate**: Stories marked "complete" that have real errors
+- **Fix Loop Efficiency**: Average fix attempts per story (should decrease)
+- **Regression Rate**: Stories that break previous functionality
+- **Time to First Working Implementation**: Should decrease with design phase
+
+---
+
+## Conclusion
+
+The current epic-execute script has a solid foundation with multi-phase validation and self-healing fix loops. However, it fundamentally relies on AI self-reporting for validation. Adding real tooling verification (Static Analysis Gate) would catch the majority of "AI said it works but it doesn't" issues.
+
+The recommended implementation order prioritizes high-impact, low-effort changes first, building toward a comprehensive TDD-based flow that would fundamentally improve code quality.