docs: mark completed improvements in bmad_improvements_v2.md

Update status for implemented features:
- Static Analysis Gate: ✅ DONE
- Real Test Output in Fix: ✅ DONE
- Decision Log: ✅ DONE
- Regression Gate: ✅ DONE
- Design Phase: ✅ DONE

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
parent
fd744c96f3
commit
6a5b7b68e8
@@ -0,0 +1,555 @@
# BMAD Epic-Execute Improvements Analysis v2

**Date:** 2026-01-26
**Analyzed Script:** `scripts/epic-execute.sh`
**Purpose:** Improve the performance and reliability of code generated by the epic-execute automation

---

## Executive Summary

The current epic-execute script orchestrates AI agents to implement stories through multiple phases (dev, architecture compliance, code review, test quality, traceability, UAT generation). The multi-phase approach with self-healing fix loops is sound, but no phase runs real tooling, so the reliability of the generated code rests on what the agents report about it.

**Core Problem:** The system relies on AI self-reporting ("tests pass", "build succeeds") rather than actual verification. This creates systematic blind spots where AI can hallucinate success.

---

## Current Flow Analysis

```
Story Execution Pipeline:
1. DEV PHASE        → AI implements story
2. ARCH COMPLIANCE  → AI validates against architecture.md
3. CODE REVIEW      → AI reviews in fresh context (adversarial)
4. TEST QUALITY     → AI reviews tests for quality patterns
5. TRACEABILITY     → AI maps acceptance criteria to tests
6. UAT GENERATION   → AI creates manual test document
```

Each phase runs in an isolated context (good for adversarial review) but relies entirely on AI judgment for validation.

---

## Identified Weaknesses

### 1. No Real Tooling Verification

The script trusts AI to report:
- "Tests pass" - but did they actually run?
- "Build succeeds" - no actual build command executed
- "No type errors" - no actual type checking

**Impact:** AI can hallucinate success. Code that doesn't compile can be marked as "complete".

### 2. No Baseline/Regression Testing

Each story is executed in isolation. There's no verification that:
- Story N doesn't break Story N-1
- Overall test count doesn't decrease
- Coverage doesn't regress

**Impact:** Later stories can silently break earlier work.

### 3. Tests Written After Implementation (Not TDD)

Current flow: Implement → Write tests → Review

**Impact:**
- Tests often test implementation, not requirements
- Tests may be written to pass, not to verify behavior
- Edge cases get missed because the dev already knows the code

### 4. No Pre-Implementation Design Review

The dev phase jumps straight into coding. For complex stories, this leads to:
- Architectural decisions made implicitly during coding
- Refactoring when the initial approach doesn't work
- Inconsistent patterns within the epic

### 5. AI Marking Its Own Homework

The same system (Claude):
- Writes the code
- Reviews the code
- Says "tests pass"

**Impact:** Systematic blind spots get reinforced across phases.

### 6. Context Loss Between Phases

Each phase runs in a fresh context. Good for adversarial review, but:
- Fix phase doesn't know WHY dev made certain decisions
- Review doesn't know about trade-offs considered
- Decisions get made, lost, and remade differently

### 7. No Incremental Validation

Complex stories are implemented all at once. If something fails at the end, the entire implementation might need redoing.

### 8. Completion Signal Parsing is Fragile

The script uses regex to find "IMPLEMENTATION COMPLETE" in output. If the AI outputs this string in a different context (explaining what to output, quoting instructions), it triggers a false positive.
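
A minimal sketch of this failure mode, assuming the marker is meant to appear alone on its own line. The function names are hypothetical and are not taken from `epic-execute.sh`, whose actual matching code is not shown here. A substring match fires on quoted instructions, while a whole-line match does not:

```shell
#!/usr/bin/env bash
# Hypothetical helpers: the real matching code in epic-execute.sh is not shown in this document.

# Substring match: fires whenever the marker appears anywhere in the output.
has_marker_loose() {
    printf '%s\n' "$1" | grep -q 'IMPLEMENTATION COMPLETE'
}

# Whole-line match: fires only when a line consists of the marker alone
# (leading and trailing whitespace tolerated).
has_marker_strict() {
    printf '%s\n' "$1" | grep -qE '^[[:space:]]*IMPLEMENTATION COMPLETE[[:space:]]*$'
}

quoted='When you are done, print "IMPLEMENTATION COMPLETE" on its own line.'
finished=$'All acceptance criteria met.\nIMPLEMENTATION COMPLETE'

has_marker_loose  "$quoted"   && echo "loose:  false positive on quoted instructions"
has_marker_strict "$quoted"   || echo "strict: ignores quoted instructions"
has_marker_strict "$finished" && echo "strict: accepts a real completion line"
```

A whole-line match narrows the problem but does not remove it, since the AI can still echo the marker on its own line while explaining the protocol.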

### 9. Missing Dependency Validation

Before implementing a story, the script does not check that:
- Required npm packages are installed
- Required services are running
- Prerequisite stories are complete
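
A preflight check for the first item could look roughly like the sketch below. The helper names are hypothetical, and treating a `node_modules` directory as proof that packages are installed is a simplifying assumption. Service health and prerequisite-story checks are left out because they depend on project conventions this document does not describe.

```shell
#!/usr/bin/env bash
# Hypothetical preflight helpers: names and checks are assumptions, not part of epic-execute.sh.

# Print every command from the argument list that is not on PATH.
missing_commands() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || echo "$cmd"
    done
}

# Succeed only if the project has a package.json and an installed node_modules directory.
npm_packages_installed() {
    local project_root="$1"
    [ -f "$project_root/package.json" ] && [ -d "$project_root/node_modules" ]
}

# Example: report missing tools before starting a story.
missing=$(missing_commands sh no-such-tool-for-demo)
if [ -n "$missing" ]; then
    echo "Missing commands: $missing"
fi
```

In the real script, the argument list would be the tools the story actually needs (for example `node`, `npm`, `jq`).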

---

## Recommended Improvements

### 1. Static Analysis Gate (HIGH IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Add a real tooling validation step that runs actual commands between dev and review phases.

```bash
execute_static_analysis_gate() {
    local story_id="$1"
    local failures=0
    local pkg="$PROJECT_ROOT/package.json"

    log ">>> STATIC ANALYSIS GATE: $story_id"

    # Detect project type and run appropriate checks
    if [ -f "$pkg" ]; then
        # TypeScript/JavaScript project.
        # Each check runs in a subshell with pipefail so the pipeline reports
        # npm's exit status instead of tee's.

        # 1. Type checking (catches type errors AI might miss)
        if grep -q '"typecheck"' "$pkg" 2>/dev/null; then
            log "Running type check..."
            if ! (set -o pipefail; cd "$PROJECT_ROOT" && npm run typecheck 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Type check failed"
                failures=$((failures + 1))  # ((failures++)) returns 1 at zero and trips set -e
            fi
        fi

        # 2. Linting (catches code style/quality issues)
        if grep -q '"lint"' "$pkg" 2>/dev/null; then
            log "Running lint..."
            if ! (set -o pipefail; cd "$PROJECT_ROOT" && npm run lint 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Lint failed"
                failures=$((failures + 1))
            fi
        fi

        # 3. Build (catches compilation errors)
        if grep -q '"build"' "$pkg" 2>/dev/null; then
            log "Running build..."
            if ! (set -o pipefail; cd "$PROJECT_ROOT" && npm run build 2>&1 | tee -a "$LOG_FILE"); then
                log_error "Build failed"
                failures=$((failures + 1))
            fi
        fi

        # 4. Tests (catches actual test failures)
        log "Running tests..."
        local test_status=0
        TEST_OUTPUT=$(cd "$PROJECT_ROOT" && npm test 2>&1) || test_status=$?
        echo "$TEST_OUTPUT" >> "$LOG_FILE"

        # Trust the exit code: grepping for "Error" also matches passing
        # test names and ordinary log lines
        if [ "$test_status" -ne 0 ]; then
            log_error "Tests failed"
            # Capture actual failures for fix phase
            ACTUAL_TEST_FAILURES=$(echo "$TEST_OUTPUT" | grep -E -A 10 "FAIL|Error" || true)
            failures=$((failures + 1))
        fi
    fi

    if [ "$failures" -gt 0 ]; then
        log_error "Static analysis gate failed with $failures issue(s)"
        return 1
    fi

    log_success "Static analysis gate passed"
    return 0
}
```

**Insertion Point:** After `execute_dev_phase()`, before arch compliance.
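
The gate above pipes each check through `tee`, and a shell pipeline reports the exit status of its last command unless `pipefail` is set. The sketch below isolates that behavior in a hypothetical `run_and_log` helper (an illustration, not a function from `epic-execute.sh`) that returns the status of the command itself:

```shell
#!/usr/bin/env bash
# Hypothetical helper: run a command, append its output to a log, and
# return the command's own exit status rather than tee's.
run_and_log() {
    local log_file="$1"
    shift
    (
        set -o pipefail   # confined to this subshell
        "$@" 2>&1 | tee -a "$log_file"
    )
}

log_file=$(mktemp)

# Without pipefail the pipeline reports tee's status, so a failing command looks fine.
if false | tee -a "$log_file"; then
    echo "plain pipeline: failure hidden"
fi

if ! run_and_log "$log_file" false; then
    echo "run_and_log: failure detected"
fi
```

Keeping `pipefail` inside a subshell avoids changing the option for the rest of the script, which matters if other code relies on the default pipeline behavior.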

---

### 2. Regression Test Gate (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Track test baseline and verify no regressions after each story.

```bash
# Initialize at epic start
init_regression_baseline() {
    BASELINE_PASSING_TESTS=0
    BASELINE_COVERAGE=0

    if [ -f "$PROJECT_ROOT/package.json" ]; then
        # Capture baseline test count. The pattern matches "N passing" summaries
        # in plain-text output, so --json is not used here.
        BASELINE_TEST_OUTPUT=$(cd "$PROJECT_ROOT" && npm test 2>&1) || true
        BASELINE_PASSING_TESTS=$(echo "$BASELINE_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1)
        # head always succeeds, so "|| echo 0" never fires; default explicitly
        BASELINE_PASSING_TESTS=${BASELINE_PASSING_TESTS:-0}

        # Capture baseline coverage if available
        if grep -q '"coverage"' "$PROJECT_ROOT/package.json" 2>/dev/null; then
            BASELINE_COVERAGE=$(npm run coverage -- --json 2>/dev/null | jq '.total.lines.pct' 2>/dev/null || echo "0")
        fi

        log "Regression baseline: $BASELINE_PASSING_TESTS passing tests, ${BASELINE_COVERAGE}% coverage"
    fi
}

execute_regression_gate() {
    local story_id="$1"

    log ">>> REGRESSION GATE: $story_id"

    # Get current test count
    CURRENT_TEST_OUTPUT=$(cd "$PROJECT_ROOT" && npm test 2>&1) || true
    CURRENT_PASSING_TESTS=$(echo "$CURRENT_TEST_OUTPUT" | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -1)
    CURRENT_PASSING_TESTS=${CURRENT_PASSING_TESTS:-0}

    # Check for regression
    if [ "$CURRENT_PASSING_TESTS" -lt "$BASELINE_PASSING_TESTS" ]; then
        log_error "REGRESSION DETECTED: Test count decreased ($BASELINE_PASSING_TESTS -> $CURRENT_PASSING_TESTS)"
        add_metrics_issue "$story_id" "regression" "Test count decreased from $BASELINE_PASSING_TESTS to $CURRENT_PASSING_TESTS"
        return 1
    fi

    # Update baseline for next story, keeping the old value for the log line
    local previous_baseline="$BASELINE_PASSING_TESTS"
    BASELINE_PASSING_TESTS=$CURRENT_PASSING_TESTS
    log_success "Regression gate passed: $CURRENT_PASSING_TESTS tests passing (was $previous_baseline)"
    return 0
}
```

**Insertion Point:** After review passes, before marking story done.
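
The gate's count extraction only recognizes `N passing` summaries. A sketch of a parser that also accepts a second summary style is shown below. The two formats in the comments are assumptions about typical Mocha-style and Jest-style output and should be checked against the project's actual test runner; `extract_passing_count` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical parser. Assumed summary formats:
#   Mocha-style:  "  12 passing (34ms)"
#   Jest-style:   "Tests:       1 failed, 10 passed, 11 total"
extract_passing_count() {
    local output="$1"
    local count

    # Jest-style: read the "Tests:" line, not the "Test Suites:" line.
    count=$(printf '%s\n' "$output" | grep -E '^Tests:' \
        | grep -oE '[0-9]+ passed' | grep -oE '[0-9]+' | head -n 1)

    if [ -z "$count" ]; then
        # Mocha-style fallback.
        count=$(printf '%s\n' "$output" \
            | grep -oE '[0-9]+ passing' | grep -oE '[0-9]+' | head -n 1)
    fi

    # Default to 0 so numeric comparisons never see an empty string.
    echo "${count:-0}"
}

mocha_out=$'  12 passing (34ms)\n  1 failing'
jest_out=$'Test Suites: 1 failed, 2 passed, 3 total\nTests:       1 failed, 10 passed, 11 total'

echo "mocha: $(extract_passing_count "$mocha_out")"
echo "jest:  $(extract_passing_count "$jest_out")"
echo "none:  $(extract_passing_count "no summary here")"
```

Returning `0` for unrecognized output keeps the `-lt` comparison in the gate from failing on an empty operand, at the cost of reporting a regression when the runner's output format changes.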

---

### 3. Pre-Implementation Design Phase (HIGH IMPACT, MEDIUM EFFORT) ✅ IMPLEMENTED

Add a design phase before implementation to catch architectural issues early.

```bash
execute_design_phase() {
    local story_file="$1"
    local story_id
    story_id=$(basename "$story_file" .md)

    log ">>> DESIGN PHASE: $story_id"

    local story_contents
    story_contents=$(cat "$story_file")
    local arch_contents=""
    if [ -f "$PROJECT_ROOT/docs/architecture.md" ]; then
        arch_contents=$(cat "$PROJECT_ROOT/docs/architecture.md")
    fi

    local design_prompt="You are a senior developer planning the implementation of a story.

## Your Task

Create an implementation plan for: $story_id

Do NOT write any code yet. Output only your design plan.

## Story

<story>
$story_contents
</story>

## Architecture Reference

<architecture>
$arch_contents
</architecture>

## Required Output

Output your implementation plan in this exact format:

\`\`\`
DESIGN START
files_to_modify:
  - path: <file path>
    action: create|modify
    purpose: <why this file>

patterns_to_use:
  - <pattern name>: <how it will be applied>

dependencies:
  - <package>: <installed|needs-install>

acceptance_criteria_mapping:
  - AC1: <which files/functions will implement this>
  - AC2: <which files/functions will implement this>

risks:
  - <potential issue and mitigation>

estimated_test_files:
  - <test file path>: <what it will test>
DESIGN END
\`\`\`

Be specific. This plan will be validated against architecture before implementation begins."

    local result
    result=$(claude --dangerously-skip-permissions -p "$design_prompt" 2>&1) || true

    echo "$result" >> "$LOG_FILE"

    # Extract and save design for later phases
    LAST_DESIGN=$(echo "$result" | sed -n '/DESIGN START/,/DESIGN END/p')

    if [ -n "$LAST_DESIGN" ]; then
        # Save to decision log for context in later phases
        echo "## Design: $story_id - $(date)" >> "$DECISION_LOG"
        echo "$LAST_DESIGN" >> "$DECISION_LOG"
        log_success "Design phase complete: $story_id"
        return 0
    else
        log_error "Design phase did not produce valid output"
        return 1
    fi
}
```

**Insertion Point:** Before `execute_dev_phase()`. Pass the design to the dev prompt.
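
Passing the design to the dev prompt might look like the sketch below. The real dev prompt in `epic-execute.sh` is not shown in this document, so `build_dev_prompt`, the prompt wording, the story ID, and the file path are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: the real dev prompt in epic-execute.sh is not shown here.
build_dev_prompt() {
    local story_id="$1"
    local design="$2"
    local prompt="Implement story: $story_id"

    # Only add the section when the design phase produced something,
    # so a skipped design phase degrades gracefully.
    if [ -n "$design" ]; then
        prompt+="

## Approved Design

Follow this plan unless it conflicts with the story's acceptance criteria:

$design"
    fi

    printf '%s\n' "$prompt"
}

# Illustrative values only
LAST_DESIGN=$'DESIGN START\nfiles_to_modify:\n  - path: src/example.ts\nDESIGN END'
build_dev_prompt "story-1-1" "$LAST_DESIGN"
```

Building the prompt in a function keeps the "no design available" case explicit, which matches the skip-flag behavior described later in this document.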

---

### 4. Real Test Output in Fix Loops (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Capture and pass real test output to fix phases instead of just AI-extracted findings.

```bash
# In execute_review_phase or after static analysis gate
capture_real_failures() {
    # Run tests and capture actual output
    TEST_OUTPUT=$(npm test 2>&1) || true

    # Extract actual failures with context (-E so "|" means alternation portably)
    ACTUAL_FAILURES=$(echo "$TEST_OUTPUT" | grep -E -B 2 -A 10 "FAIL|Error|AssertionError|Expected|Received" || true)

    # Extract lint errors if any
    LINT_OUTPUT=$(npm run lint 2>&1) || true
    LINT_ERRORS=$(echo "$LINT_OUTPUT" | grep -E "error|warning" || true)

    # Extract type errors if any
    TYPE_OUTPUT=$(npm run typecheck 2>&1) || true
    TYPE_ERRORS=$(echo "$TYPE_OUTPUT" | grep -E "error TS|Error:" || true)
}

# Then in execute_fix_phase, add to the prompt:
fix_prompt+="
## Actual Tooling Output

### Test Failures (from npm test)
\`\`\`
$ACTUAL_FAILURES
\`\`\`

### Lint Errors (from npm run lint)
\`\`\`
$LINT_ERRORS
\`\`\`

### Type Errors (from npm run typecheck)
\`\`\`
$TYPE_ERRORS
\`\`\`

Use this ACTUAL output to guide your fixes, not just the review findings.
"
```

---

### 5. Cumulative Decision Log (MEDIUM IMPACT, LOW EFFORT) ✅ IMPLEMENTED

Maintain a persistent decision log to preserve context across phases.

```bash
DECISION_LOG="$SPRINT_ARTIFACTS_DIR/epic-${EPIC_ID}-decisions.md"

init_decision_log() {
    mkdir -p "$(dirname "$DECISION_LOG")"
    cat > "$DECISION_LOG" << EOF
# Epic $EPIC_ID Decision Log

This file tracks implementation decisions for context continuity.

---
EOF
}

append_to_decision_log() {
    local phase="$1"
    local story_id="$2"
    local content="$3"

    cat >> "$DECISION_LOG" << EOF

## $phase: $story_id
**Timestamp:** $(date '+%Y-%m-%d %H:%M:%S')

$content

---
EOF
}

# Then pass to subsequent phases by appending to the phase prompt
# ("prompt" stands for whichever prompt variable that phase builds):
prompt+="
## Previous Implementation Context

The following decisions have been made in this epic:

$(cat "$DECISION_LOG")

Respect these decisions unless you have a specific reason to deviate.
"
```

---

### 6. Test-First Enforcement (HIGH IMPACT, HIGH EFFORT)

Restructure the flow to enforce TDD principles.

**Proposed New Flow:**

```
1. DESIGN PHASE (Dev)     → Plan implementation approach
2. TEST SPEC PHASE (TEA)  → Write test specifications based on ACs
3. TEST IMPL PHASE (TEA)  → Implement failing tests
4. VERIFICATION           → Confirm tests fail appropriately
5. DEV PHASE (Dev)        → Implement to make tests pass
6. STATIC ANALYSIS GATE   → Real tooling verification
7. REVIEW PHASE           → Adversarial review
8. REGRESSION GATE        → Ensure no regressions
```

This ensures tests actually test requirements rather than implementation details.
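
Step 4 of the proposed flow (confirming that new tests fail before implementation) could be sketched as follows. `verify_tests_fail` is a hypothetical helper, and it only checks the exit status: it cannot tell a meaningful assertion failure from a syntax error, so "fail appropriately" would still need a closer look at the output.

```shell
#!/usr/bin/env bash
# Hypothetical red-phase check: succeeds only when the given test command FAILS.
verify_tests_fail() {
    local output
    local status=0
    output=$("$@" 2>&1) || status=$?

    if [ "$status" -eq 0 ]; then
        echo "Tests passed before implementation; they may not exercise the new behavior"
        return 1
    fi

    echo "Tests fail as expected (exit $status)"
    return 0
}

# Stand-ins for a real test command such as: verify_tests_fail npm test
verify_tests_fail false
verify_tests_fail true || echo "gate would block: nothing to implement against"
```

Passing the test command as arguments keeps the check independent of the test runner.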

---

### 7. Structured Output Validation (MEDIUM IMPACT, MEDIUM EFFORT)

Replace fragile regex parsing with structured JSON output.

```bash
# Add to prompts ("prompt" stands for whichever prompt variable the phase builds):
prompt+="
Output your result as JSON:
\`\`\`json
{
  \"status\": \"COMPLETE\" | \"BLOCKED\" | \"FAILED\",
  \"story_id\": \"...\",
  \"summary\": \"...\",
  \"files_changed\": [...],
  \"tests_added\": N,
  \"decisions\": [{\"what\": \"...\", \"why\": \"...\"}],
  \"concerns\": [...]
}
\`\`\`"

# Parse with jq; treat missing or unparseable JSON as a failure
result_json=$(echo "$result" | sed -n '/```json/,/```/p' | sed '1d;$d')
status=$(echo "$result_json" | jq -r '.status // "FAILED"' 2>/dev/null) || status="FAILED"
```

---

## Implementation Priority Matrix

| Priority | Improvement | Impact | Effort | Status | Rationale |
|----------|-------------|--------|--------|--------|-----------|
| 1 | Static Analysis Gate | HIGH | LOW | ✅ DONE | Catches real errors AI misses |
| 2 | Real Test Output in Fix | MEDIUM | LOW | ✅ DONE | Quick win, better fixes |
| 3 | Decision Log | MEDIUM | LOW | ✅ DONE | Easy context preservation |
| 4 | Regression Gate | HIGH | MEDIUM | ✅ DONE | Prevents silent breakage |
| 5 | Design Phase | HIGH | MEDIUM | ✅ DONE | Catches issues early |
| 6 | Structured JSON Output | MEDIUM | MEDIUM | | Improves reliability |
| 7 | Test-First Flow | HIGH | HIGH | | Fundamental quality improvement |

---

## Proposed Enhanced Flow

```
Epic Execution Pipeline v2:

SETUP
├── Validate workflows
├── Initialize metrics
├── Initialize decision log
└── Initialize regression baseline

FOR EACH STORY:
├── 1. DESIGN PHASE (NEW)
│   ├── Generate implementation plan
│   ├── Validate against architecture
│   └── Save to decision log
│
├── 2. DEV PHASE
│   ├── Pass design context
│   ├── Implement story
│   └── Stage changes
│
├── 3. STATIC ANALYSIS GATE (NEW)
│   ├── Run npm run typecheck
│   ├── Run npm run lint
│   ├── Run npm run build
│   ├── Run npm test
│   └── Capture real failures
│
├── 4. ARCH COMPLIANCE
│   └── (existing logic)
│
├── 5. CODE REVIEW + FIX LOOP
│   ├── Pass real test output to fix phase
│   └── (existing logic)
│
├── 6. TEST QUALITY
│   └── (existing logic)
│
├── 7. REGRESSION GATE (NEW)
│   ├── Compare test count to baseline
│   ├── Compare coverage to baseline
│   └── Fail if regression detected
│
└── 8. COMMIT
    └── (existing logic)

POST-STORIES:
├── Traceability check
├── UAT generation
└── Finalize metrics
```

---

## Quick Wins (Implement First) ✅ ALL COMPLETE

1. **Static Analysis Gate** - Single most impactful change ✅
2. **Real Test Output in Fix Loops** - Minimal code change, better fixes ✅
3. **Decision Log** - Simple file append, preserves context ✅

These three changes give the largest reliability gain for the least refactoring effort.

**Additionally implemented:**
4. **Regression Gate** - Prevents silent breakage ✅
5. **Design Phase** - Catches architectural issues early ✅

**Implementation:** All features are modularized in `scripts/epic-execute-lib/` with graceful degradation and skip flags (`--skip-design`, `--skip-regression`).

---

## Metrics to Track

After implementing these improvements, track:

- **False Positive Rate**: Stories marked "complete" that have real errors
- **Fix Loop Efficiency**: Average fix attempts per story (should decrease)
- **Regression Rate**: Stories that break previous functionality
- **Time to First Working Implementation**: Should decrease with design phase
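
The first two metrics reduce to simple ratios once per-story data is recorded. The sketch below assumes a hypothetical record format (one line per completed story: story ID, number of fix attempts, and a 0/1 flag for whether real errors were found later). The script's actual metrics format is not described in this document.

```shell
#!/usr/bin/env bash
# Hypothetical input, one line per story marked "complete":
#   <story_id> <fix_attempts> <had_real_errors: 0 or 1>
summarize_metrics() {
    awk '
        NF == 3 { stories++; attempts += $2; false_pos += $3 }
        END {
            if (stories == 0) { print "no stories"; exit }
            printf "stories=%d avg_fix_attempts=%.2f false_positive_rate=%.2f\n",
                   stories, attempts / stories, false_pos / stories
        }
    '
}

# Illustrative data only
printf '%s\n' "1-1 2 0" "1-2 0 0" "1-3 4 1" "1-4 2 0" | summarize_metrics
```

For the sample data this prints `stories=4 avg_fix_attempts=2.00 false_positive_rate=0.25`: eight fix attempts across four stories, one of which was marked complete despite real errors.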

---

## Conclusion

The current epic-execute script has a solid foundation with multi-phase validation and self-healing fix loops. However, it fundamentally relies on AI self-reporting for validation. Adding real tooling verification (Static Analysis Gate) would catch the majority of "AI said it works but it doesn't" issues.

The recommended implementation order prioritizes high-impact, low-effort changes first, building toward a comprehensive TDD-based flow that would fundamentally improve code quality.