fix: explicit slash command detection + document resilience improvements

**Slash Command Detection Fix:**
- Check for <command-name> tag explicitly (not pseudocode)
- If tag exists: Run in main context (interactive)
- If no tag: Delegate to Task agent (internal call)
- Reduces task nesting from 3 levels to 1 level

**Resilience Documentation:**
- Added RESILIENCE-FIX.md documenting state tracking
- Proposes state file for resume capability
- File read retry logic
- Error handling improvements

**Expected improvement:**
- Slash commands: No Task wrapper (1 level nesting)
- Fewer file access issues
- Can resume after crashes

To be implemented in future release.
2026-01-26 18:59:48 -05:00 · 2026-01-26 18:59:48 -05:00 · b4217ba65f
parent df22c71879
commit b4217ba65f
2 changed files with 287 additions and 43 deletions
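
The detection rule from the commit message (tag present: run in main context; tag absent: delegate) can be sketched as a small bash function. This is an illustration only: in workflow.xml the rule is an instruction the LLM follows, not a script, and the function name `execution_context` and its `conversation` argument are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical illustration: "conversation" stands in for the text the model
# can see. The real check is performed by the LLM reading workflow.xml.

# Print "main" for a user-initiated slash command, "delegate" otherwise.
execution_context() {
  local conversation="$1"

  # A slash command invocation carries a <command-name> tag.
  if [[ "$conversation" == *"<command-name>"* ]]; then
    echo "main"       # interactive: run in main context, no Task wrapper
  else
    echo "delegate"   # internal call: hand off to a Task agent
  fi
}
```

For example, `execution_context '<command-name>/bmad_bmm_batch-super-dev</command-name>'` selects the main context, while text without the tag selects delegation.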
--- a/src/core/tasks/workflow.xml
+++ b/src/core/tasks/workflow.xml
@@ -17,60 +17,48 @@
  </WORKFLOW-RULES>

  <llm critical="ABSOLUTE">
-    <mandate>🚨 WORKFLOW EXECUTION ENFORCEMENT 🚨</mandate>
+    <mandate>🚨 WORKFLOW EXECUTION CONTEXT ENFORCEMENT 🚨</mandate>

-    <enforcement>
-      <check type="invocation_source">
-        IF this workflow was invoked by USER via slash command (e.g., /bmad_bmm_batch-super-dev):
-          ✅ ALLOWED in main context
-          Reason: User-initiated workflows need interactive prompts
-          Continue execution below
+    <check name="interactive_workflow_exemption">
+      CHECK if this conversation has a <command-name> tag (e.g., <command-name>/bmad_bmm_batch-super-dev</command-name>)

-        ELSE IF this workflow was invoked by ANOTHER WORKFLOW (internal call):
+      IF <command-name> tag exists:
+        ✅ This is a USER-INITIATED slash command
+        ✅ EXEMPTED from Task agent delegation
+        ✅ Continue execution in main context
+
+        Reason: Interactive workflows need user prompts and responses.
+        Task agents cannot interact with users.
+
+        Proceed to Step 1 below.
+    </check>
+
+    <check name="internal_workflow_delegation" critical="true">
+      IF NO <command-name> tag found:
+        ❌ This is an INTERNAL workflow call (invoked by another workflow)
        ❌ FORBIDDEN in main context
-          MUST delegate to workflow-executor Task agent
+        ✅ MUST delegate to workflow-executor Task agent

-          Display to user:
-          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-          🔄 Delegating workflow to fresh context agent...
-          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+        Why: Internal calls should use fresh context to prevent degradation.

-          Workflow: {workflow_name}
-          Reason: Internal workflow calls use Task agents (fresh context)
+        Action:
+        1. Display brief message to user:
+           "🔄 Delegating {workflow_name} to fresh context agent..."

-          Spawning workflow-executor agent...
-          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-
-          Then spawn:
+        2. Spawn Task agent:
           <Task>
             subagent_type: general-purpose
             description: Execute {workflow_name}
             prompt: |
-              You are the workflow-executor agent.
-
               Execute workflow: {workflow_path}

-              LOAD ALL CONTEXT FIRST:
-              1. Read: {workflow_path}/workflow.yaml
-              2. Read: ALL files in {workflow_path}/steps/
-              3. Read: _bmad/core/tasks/workflow.xml
-              4. Read: Story file (if applicable)
-              5. Read: Project context
-
-              THEN execute following workflow.xml rules EXACTLY.
-
-              Report when complete.
+              Load all required files first, then execute the workflow following the workflow.xml rules.
           </Task>

-          HALT - Let Task agent handle workflow.
-      </check>
+        3. HALT - Task agent handles workflow

-      <detection>
-        How to detect invocation source:
-        - Slash command: User message contains "/bmad_" or conversation has <command-name> tag
-        - Internal call: No slash command, invoked from within another workflow step
-      </detection>
-    </enforcement>
+        DO NOT proceed to Step 1.
+    </check>
  </llm>

  <flow>
--- a/src/modules/bmm/workflows/4-implementation/batch-super-dev/RESILIENCE-FIX.md
+++ b/src/modules/bmm/workflows/4-implementation/batch-super-dev/RESILIENCE-FIX.md
@@ -0,0 +1,256 @@
+# Batch-Super-Dev Resilience Fix
+
+**Problem:** Agents crash mid-execution, resume fails, no intermediate state saved
+
+---
+
+## Issues Observed
+
+**Story 18-4 → 18-5 Transition:**
+```
+✅ Story 18-4: Builder → Inspector → Fixer → Reviewer all complete
+❌ Story 18-5: Workflow crashed on "Error reading file"
+```
+
+**Evidence:**
+- Task output files empty (0 bytes)
+- Resume attempts failed (0 tools used, 0 tokens)
+- No state saved between stories
+- When agent crashes, all progress lost
+
+---
+
+## Root Cause
+
+**Sequential processing in main context has no resilience:**
+
+```
+Story 18-4:
+  ├─ Builder agent completes → outputs to temp file
+  ├─ Main reads output file → starts Inspector
+  ├─ Inspector completes → outputs to temp file
+  ├─ Main reads output → starts Fixer
+  └─ Fixer completes → Story 18-4 done
+
+Story 18-5:
+  ├─ Main tries to read Story 18-5 file
+  ├─ ❌ "Error reading file" (crash)
+  └─ All progress lost, no state saved
+```
+
+**Problem:** Main context doesn't save state between stories. If it crashes, batch starts over.
+
+---
+
+## Solution: Save State After Each Story
+
+### Add state file tracking:
+
+```yaml
+# In batch-super-dev/workflow.yaml
+state_tracking:
+  enabled: true
+  state_file: "{sprint_artifacts}/batch-execution-state-{batch_id}.yaml"
+  save_after_each_story: true
+```
+
+### State file format:
+
+```yaml
+batch_id: "epic-18-2026-01-26"
+started: "2026-01-26T18:45:00Z"
+execution_mode: "fully_autonomous"
+strategy: "sequential"
+total_stories: 2
+
+stories:
+  - story_key: "18-4-billing-worker-retry-logic"
+    status: "completed"
+    started: "2026-01-26T18:46:00Z"
+    completed: "2026-01-26T19:05:00Z"
+    agents:
+      - phase: "builder"
+        agent_id: "ae3bd2b"
+        status: "completed"
+      - phase: "inspector"
+        agent_id: "a9f0d11"
+        status: "completed"
+      - phase: "fixer"
+        agent_id: "abc123"
+        status: "completed"
+      - phase: "reviewer"
+        agent_id: "def456"
+        status: "completed"
+
+  - story_key: "18-5-precharge-payment-validation"
+    status: "in_progress"
+    started: "2026-01-26T19:05:30Z"
+    last_checkpoint: "attempting_to_read_story_file"
+    error: "Error reading file"
+```
+
+### Resume logic:
+
+```bash
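+# At batch-super-dev start, check for an existing state file.
+# {sprint_artifacts} is a workflow placeholder resolved before this runs.
+state_files=( "{sprint_artifacts}"/batch-execution-state-*.yaml )
+
+# An unmatched glob stays literal, so confirm the first match really exists.
+if [ -e "${state_files[0]}" ]; then
+  echo "🔄 Found interrupted batch execution"
+  read -r -p "Resume from where it left off? (yes/no) " answer
+
+  if [ "$answer" = "yes" ]; then
+    # Load state file
+    # Skip completed stories
+    # Start from next story
+    # Reuse agent IDs if resumable
+    :  # placeholder until the steps above are implemented
+  fi
+fi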
+```
+
+### After each story completes:
+
+```bash
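+# Update state file
+update_state_file() {
+  local story_key="$1"
+  local status="$2"  # builder_complete | inspector_complete | fixer_complete | reviewer_complete | completed | failed
+
+  # Update YAML
+  # Record the new status for the story
+  # Save timestamp
+  # Record agent IDs
+  :  # placeholder: a bash function body cannot be empty
+}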
+
+# After Builder completes
+update_state_file "$story_key" "builder_complete"
+
+# After Inspector completes
+update_state_file "$story_key" "inspector_complete"
+
+# After Fixer completes
+update_state_file "$story_key" "fixer_complete"
+
+# After Reviewer completes
+update_state_file "$story_key" "reviewer_complete"
+
+# When entire story done
+update_state_file "$story_key" "completed"
+```
+
+### Error handling:
+
+```bash
+# Retry file reads (bash has no try-catch, so loop on the exit status)
+read_with_retry() {
+  local file_path="$1"
+  local max_attempts=3
+  local attempt content
+
+  # {1..$max_attempts} does not expand variables, so use an arithmetic loop
+  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
+    if content=$(cat "$file_path" 2>/dev/null); then
+      echo "$content"
+      return 0
+    else
+      echo "⚠️ Failed to read $file_path (attempt $attempt/$max_attempts)" >&2
+      sleep 2
+    fi
+  done
+
+  echo "❌ Cannot read file after $max_attempts attempts: $file_path" >&2
+  return 1
+}
+
+# Use in workflow
+story_content=$(read_with_retry "$story_file") || {
+  echo "❌ Cannot proceed with Story $story_key - file read failed"
+  # Save state
+  # Skip this story
+  # Continue to next story (if continue_on_failure=true)
+}
+```
+
+---
+
+## Implementation
+
+Add to batch-super-dev Step 4-Sequential:
+
+```xml
+<substep n="4s-0" title="Check for previous execution state">
+  <action>Check for state file: batch-execution-state-*.yaml</action>
+
+  <check if="state file exists">
+    <output>🔄 Found interrupted batch from {state.started}</output>
+    <output>Completed: {state.completed_count} stories</output>
+    <output>Failed: {state.failed_count} stories</output>
+    <output>In progress: {state.current_story}</output>
+
+    <ask>Resume from where it left off? (yes/no)</ask>
+
+    <check if="response == yes">
+      <action>Load state</action>
+      <action>Skip completed stories</action>
+      <action>Start from next story</action>
+    </check>
+
+    <check if="response == no">
+      <action>Archive old state file</action>
+      <action>Start fresh batch</action>
+    </check>
+  </check>
+</substep>
+
+<substep n="4s-a" title="Process individual story">
+  <action>Save state: story started</action>
+
+  <try>
+    <action>Read story file with retry</action>
+    <action>Execute super-dev-pipeline</action>
+    <action>Save state: story completed</action>
+  </try>
+
+  <catch error="file_read_error">
+    <output>⚠️ Cannot read story file for {story_key}</output>
+    <action>Save state: story failed (file read error)</action>
+    <action>Add to failed_stories list</action>
+    <action>Continue to next story if continue_on_failure=true</action>
+  </catch>
+
+  <catch error="agent_crash">
+    <output>⚠️ Agent crashed for {story_key}</output>
+    <action>Save state: story failed (agent crash)</action>
+    <action>Record partial progress in state file</action>
+    <action>Continue to next story if continue_on_failure=true</action>
+  </catch>
+</substep>
+```
+
+---
+
+## Expected Behavior After Fix
+
+**If crash happens:**
+
+```
+Story 18-4: ✅ Complete (state saved)
+Story 18-5: ❌ Crashed (state saved with error)
+
+State file created: batch-execution-state-epic-18-2026-01-26.yaml
+
+User re-runs: /bmad_bmm_batch-super-dev
+
+Workflow: "🔄 Found interrupted batch. Resume? (yes/no)"
+User: "yes"
+Workflow: "✅ Skipping 18-4 (already complete)"
+Workflow: "🔄 Retrying 18-5 (was in_progress)"
+Workflow: Starts 18-5 from beginning
+```
+
+**Benefits:**
+- No lost progress
+- Can resume after crashes
+- Intermediate state preserved
+- Failures don't block batch
+
+---
+
+Should I implement this resilience fix now?
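
The resume flow proposed in RESILIENCE-FIX.md (save state after each story, skip completed stories, retry the one that was in progress) can be sketched in bash. This is a minimal sketch under stated assumptions, not the workflow's implementation: it uses a simplified one-line-per-story state file (`story_key status`) instead of the proposed YAML format, and the helper names `save_story_state` and `stories_to_run` are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers. The state file here is a simplified
# "story_key status" list, not the YAML format from RESILIENCE-FIX.md.

# Record (or overwrite) the status of one story in the state file.
save_story_state() {
  local state_file="$1" story_key="$2" status="$3"
  local tmp="${state_file}.tmp"

  # Keep every other story's line, dropping any old entry for this story.
  if [ -f "$state_file" ]; then
    awk -v key="$story_key" '$1 != key' "$state_file" > "$tmp"
  else
    : > "$tmp"
  fi
  printf '%s %s\n' "$story_key" "$status" >> "$tmp"

  # Rename last so a crash mid-write leaves the previous state intact.
  mv "$tmp" "$state_file"
}

# Print the stories that still need to run: everything not marked "completed".
stories_to_run() {
  local state_file="$1"
  shift
  local story
  for story in "$@"; do
    if ! awk -v key="$story" \
        '$1 == key && $2 == "completed" { found = 1 } END { exit !found }' \
        "$state_file" 2>/dev/null; then
      echo "$story"
    fi
  done
}
```

Calling `save_story_state` after each story and `stories_to_run` at batch start would skip a completed story such as 18-4 and retry an `in_progress` story such as 18-5 from the beginning, matching the expected behavior described in the document.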