diff --git a/src/core/tasks/workflow.xml b/src/core/tasks/workflow.xml
index c65370a9..46ab8b87 100644
--- a/src/core/tasks/workflow.xml
+++ b/src/core/tasks/workflow.xml
@@ -17,60 +17,48 @@
- 🚨 WORKFLOW EXECUTION ENFORCEMENT 🚨
+ 🚨 WORKFLOW EXECUTION CONTEXT ENFORCEMENT 🚨
-
-
- IF this workflow was invoked by USER via slash command (e.g., /bmad_bmm_batch-super-dev):
- ✅ ALLOWED in main context
- Reason: User-initiated workflows need interactive prompts
- Continue execution below
+
+ CHECK if this conversation has a tag (e.g., /bmad_bmm_batch-super-dev)
- ELSE IF this workflow was invoked by ANOTHER WORKFLOW (internal call):
- ❌ FORBIDDEN in main context
- MUST delegate to workflow-executor Task agent
+ IF tag exists:
+ ✅ This is a USER-INITIATED slash command
+ ✅ EXEMPTED from Task agent delegation
+ ✅ Continue execution in main context
- Display to user:
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
- 🔄 Delegating workflow to fresh context agent...
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+ Reason: Interactive workflows need user prompts and responses.
+ Task agents cannot interact with users.
- Workflow: {workflow_name}
- Reason: Internal workflow calls use Task agents (fresh context)
+ Proceed to Step 1 below.
+
- Spawning workflow-executor agent...
- ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+
+ IF NO tag found:
+ ❌ This is an INTERNAL workflow call (invoked by another workflow)
+ ❌ FORBIDDEN in main context
+ ✅ MUST delegate to workflow-executor Task agent
- Then spawn:
-
- subagent_type: general-purpose
- description: Execute {workflow_name}
- prompt: |
-   You are the workflow-executor agent.
+ Why: Internal calls should use fresh context to prevent degradation.
-   Execute workflow: {workflow_path}
+ Action:
+ 1. Display brief message to user:
+ "🔄 Delegating {workflow_name} to fresh context agent..."
-   LOAD ALL CONTEXT FIRST:
-   1. Read: {workflow_path}/workflow.yaml
-   2. Read: ALL files in {workflow_path}/steps/
-   3. Read: _bmad/core/tasks/workflow.xml
-   4. Read: Story file (if applicable)
-   5. Read: Project context
+ 2. Spawn Task agent:
+
+ subagent_type: general-purpose
+ description: Execute {workflow_name}
+ prompt: |
+   Execute workflow: {workflow_path}
-   THEN execute following workflow.xml rules EXACTLY.
+   Load all required files first, then execute following workflow.xml.
+
-   Report when complete.
-
+ 3. HALT - Task agent handles workflow
- HALT - Let Task agent handle workflow.
-
-
-
- How to detect invocation source:
- - Slash command: User message contains "/bmad_" or conversation has tag
- - Internal call: No slash command, invoked from within another workflow step
-
-
+ DO NOT proceed to Step 1.
+
diff --git a/src/modules/bmm/workflows/4-implementation/batch-super-dev/RESILIENCE-FIX.md b/src/modules/bmm/workflows/4-implementation/batch-super-dev/RESILIENCE-FIX.md
new file mode 100644
index 00000000..3fe3b5df
--- /dev/null
+++ b/src/modules/bmm/workflows/4-implementation/batch-super-dev/RESILIENCE-FIX.md
@@ -0,0 +1,256 @@
+# Batch-Super-Dev Resilience Fix
+
+**Problem:** Agents crash mid-execution, resume fails, no intermediate state saved
+
+---
+
+## Issues Observed
+
+**Story 18-4 → 18-5 Transition:**
+```
+✅ Story 18-4: Builder → Inspector → Fixer → Reviewer all complete
+❌ Story 18-5: Workflow crashed on "Error reading file"
+```
+
+**Evidence:**
+- Task output files empty (0 bytes)
+- Resume attempts failed (0 tools used, 0 tokens)
+- No state saved between stories
+- When agent crashes, all progress lost
+
+---
+
+## Root Cause
+
+**Sequential processing in main context has no resilience:**
+
+```
+Story 18-4:
+  ├─ Builder agent completes → outputs to temp file
+  ├─ Main reads output file → starts Inspector
+  ├─ Inspector completes → outputs to temp file
+  ├─ Main reads output → starts Fixer
+  └─ Fixer completes → Story 18-4 done
+
+Story 18-5:
+  ├─ Main tries to read Story 18-5 file
+  ├─ ❌ "Error reading file" (crash)
+  └─ All progress lost, no state saved
+```
+
+**Problem:** Main context doesn't save state between stories. If it crashes, the batch starts over.
+
+---
+
+## Solution: Save State After Each Story
+
+### Add state file tracking:
+
+```yaml
+# In batch-super-dev/workflow.yaml
+state_tracking:
+  enabled: true
+  state_file: "{sprint_artifacts}/batch-execution-state-{batch_id}.yaml"
+  save_after_each_story: true
+```
+
+### State file format:
+
+```yaml
+batch_id: "epic-18-2026-01-26"
+started: "2026-01-26T18:45:00Z"
+execution_mode: "fully_autonomous"
+strategy: "sequential"
+total_stories: 2
+
+stories:
+  - story_key: "18-4-billing-worker-retry-logic"
+    status: "completed"
+    started: "2026-01-26T18:46:00Z"
+    completed: "2026-01-26T19:05:00Z"
+    agents:
+      - phase: "builder"
+        agent_id: "ae3bd2b"
+        status: "completed"
+      - phase: "inspector"
+        agent_id: "a9f0d11"
+        status: "completed"
+      - phase: "fixer"
+        agent_id: "abc123"
+        status: "completed"
+      - phase: "reviewer"
+        agent_id: "def456"
+        status: "completed"
+
+  - story_key: "18-5-precharge-payment-validation"
+    status: "in_progress"
+    started: "2026-01-26T19:05:30Z"
+    last_checkpoint: "attempting_to_read_story_file"
+    error: "Error reading file"
+```
+
+### Resume logic:
+
+```bash
+# At batch-super-dev start, check for existing state file
+state_file="{sprint_artifacts}/batch-execution-state-*.yaml"
+
+if ls $state_file >/dev/null 2>&1; then
+  echo "🔄 Found interrupted batch execution"
+  echo "Resume from where it left off? (yes/no)"
+
+  # If the user answers "yes":
+  #   - Load state file
+  #   - Skip completed stories
+  #   - Start from next story
+  #   - Reuse agent IDs if resumable
+fi
+```
+
+### After each story completes:
+
+```bash
+# Update state file
+update_state_file() {
+  story_key="$1"
+  status="$2"  # <phase>_complete | completed | failed
+
+  # Update YAML
+  # Record the new status for the story
+  # Save timestamp
+  # Record agent IDs
+}
+
+# After Builder completes
+update_state_file "$story_key" "builder_complete"
+
+# After Inspector completes
+update_state_file "$story_key" "inspector_complete"
+
+# After Fixer completes
+update_state_file "$story_key" "fixer_complete"
+
+# After Reviewer completes
+update_state_file "$story_key" "reviewer_complete"
+
+# When entire story done
+update_state_file "$story_key" "completed"
+```
+
+### Error handling:
+
+```bash
+# Retry file reads before giving up (bash has no try/catch)
+read_with_retry() {
+  file_path="$1"
+  max_attempts=3
+
+  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
+    if content=$(cat "$file_path" 2>&1); then
+      echo "$content"
+      return 0
+    else
+      echo "⚠️ Failed to read $file_path (attempt $attempt/$max_attempts)" >&2
+      sleep 2
+    fi
+  done
+
+  echo "❌ Cannot read file after $max_attempts attempts: $file_path" >&2
+  return 1
+}
+
+# Use in workflow
+story_content=$(read_with_retry "$story_file") || {
+  echo "❌ Cannot proceed with Story $story_key - file read failed"
+  # Save state
+  # Skip this story
+  # Continue to next story (if continue_on_failure=true)
+}
+```
+
+---
+
+## Implementation
+
+Add to batch-super-dev Step 4-Sequential:
+
+```xml
+
+ Check for state file: batch-execution-state-*.yaml
+
+
+ 🔄 Found interrupted batch from {state.started}
+ Completed: {state.completed_count} stories
+ Failed: {state.failed_count} stories
+ In progress: {state.current_story}
+
+ Resume from where it left off? (yes/no)
+
+
+ Load state
+ Skip completed stories
+ Start from next story
+
+
+
+ Archive old state file
+ Start fresh batch
+
+
+
+
+
+ Save state: story started
+
+
+ Read story file with retry
+ Execute super-dev-pipeline
+ Save state: story completed
+
+
+
+ ⚠️ Cannot read story file for {story_key}
+ Save state: story failed (file read error)
+ Add to failed_stories list
+ Continue to next story if continue_on_failure=true
+
+
+
+ ⚠️ Agent crashed for {story_key}
+ Save state: story failed (agent crash)
+ Record partial progress in state file
+ Continue to next story if continue_on_failure=true
+
+
+```
+
+---
+
+## Expected Behavior After Fix
+
+**If a crash happens:**
+
+```
+Story 18-4: ✅ Complete (state saved)
+Story 18-5: ❌ Crashed (state saved with error)
+
+State file created: batch-execution-state-epic-18-2026-01-26.yaml
+
+User re-runs: /batch-super-dev
+
+Workflow: "🔄 Found interrupted batch. Resume? (yes/no)"
+User: "yes"
+Workflow: "✅ Skipping 18-4 (already complete)"
+Workflow: "🔄 Retrying 18-5 (was in_progress)"
+Workflow: Starts 18-5 from beginning
+```
+
+**Benefits:**
+- Completed stories are not re-run after a crash
+- Can resume after crashes
+- Intermediate state preserved
+- Failures don't block batch
+
+---
+
+Should I implement this resilience fix now?
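
Reviewer note: the patch leaves the body of `update_state_file` as comments. One way the "save state after each story" step could be made crash-safe is sketched below. This is only an illustration under stated assumptions: it uses a flat `story_key=status` text file instead of the YAML state file described in the patch, and the function names `save_story_status` and `story_status` are hypothetical, not part of the patch.

```shell
#!/usr/bin/env bash
# Hypothetical helpers (not from the patch): keep one "story_key=status"
# line per story in a plain-text state file.

# Record or update the status of a single story.
save_story_status() {
  local state_file="$1" story_key="$2" status="$3"
  local tmp_file

  # Build the new contents in a temp file next to the real state file.
  tmp_file="$(mktemp "${state_file}.XXXXXX")" || return 1

  # Carry over every existing entry except the story being updated.
  if [ -f "$state_file" ]; then
    awk -F= -v k="$story_key" '$1 != k' "$state_file" > "$tmp_file"
  fi
  printf '%s=%s\n' "$story_key" "$status" >> "$tmp_file"

  # Renaming within one directory replaces the file in a single step,
  # so a crash mid-write cannot leave a truncated state file behind.
  mv "$tmp_file" "$state_file"
}

# Print the recorded status of a story (prints nothing if unknown).
story_status() {
  local state_file="$1" story_key="$2"
  [ -f "$state_file" ] || return 1
  awk -F= -v k="$story_key" '$1 == k { print $2 }' "$state_file"
}
```

On resume, the batch loop could call `story_status` for each story and skip those reported as `completed`. Writing to a temporary file and renaming it is the design choice that matters here: the failure described in the patch is a crash between stories, so the state file itself must never be left half-written.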