fix: explicit slash command detection + document resilience improvements

**Slash Command Detection Fix:** - Check for <command-name> tag explicitly (not pseudocode) - If tag exists: Run in main context (interactive) - If no tag: Delegate to Task agent (internal call) - Reduces task nesting from 3 levels to 1 level **Resilience Documentation:** - Added RESILIENCE-FIX.md documenting state tracking - Proposes state file for resume capability - File read retry logic - Error handling improvements **Expected improvement:** - Slash commands: No Task wrapper (1 level nesting) - Fewer file access issues - Can resume after crashes To be implemented in future release.
2026-01-26 18:59:48 -05:00 · 2026-01-26 18:59:48 -05:00 · b4217ba65f
parent df22c71879
commit b4217ba65f
2 changed files with 287 additions and 43 deletions
--- a/src/core/tasks/workflow.xml
+++ b/src/core/tasks/workflow.xml
@ -17,60 +17,48 @@
  </WORKFLOW-RULES>
  <llm critical="ABSOLUTE">
-    <mandate>🚨 WORKFLOW EXECUTION ENFORCEMENT 🚨</mandate>
+    <mandate>🚨 WORKFLOW EXECUTION CONTEXT ENFORCEMENT 🚨</mandate>
-    <enforcement>
+    <check name="interactive_workflow_exemption">
-      <check type="invocation_source">
+      CHECK if this conversation has a <command-name> tag (e.g., <command-name>/bmad_bmm_batch-super-dev</command-name>)
        IF this workflow was invoked by USER via slash command (e.g., /bmad_bmm_batch-super-dev):
          ✅ ALLOWED in main context
          Reason: User-initiated workflows need interactive prompts
          Continue execution below
-        ELSE IF this workflow was invoked by ANOTHER WORKFLOW (internal call):
+      IF <command-name> tag exists:
-          ❌ FORBIDDEN in main context
+        ✅ This is a USER-INITIATED slash command
-          MUST delegate to workflow-executor Task agent
+        ✅ EXEMPTED from Task agent delegation
        ✅ Continue execution in main context
-          Display to user:
+        Reason: Interactive workflows need user prompts and responses.
-          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+        Task agents cannot interact with users.
          🔄 Delegating workflow to fresh context agent...
          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
-          Workflow: {workflow_name}
+        Proceed to Step 1 below.
-          Reason: Internal workflow calls use Task agents (fresh context)
+    </check>
-          Spawning workflow-executor agent...
+    <check name="internal_workflow_delegation" critical="true">
-          ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
+      IF NO <command-name> tag found:
        ❌ This is an INTERNAL workflow call (invoked by another workflow)
        ❌ FORBIDDEN in main context
        ✅ MUST delegate to workflow-executor Task agent
-          Then spawn:
+        Why: Internal calls should use fresh context to prevent degradation.
          <Task>
            subagent_type: general-purpose
            description: Execute {workflow_name}
            prompt: |
              You are the workflow-executor agent.
-              Execute workflow: {workflow_path}
+        Action:
        1. Display brief message to user:
           "🔄 Delegating {workflow_name} to fresh context agent..."
-              LOAD ALL CONTEXT FIRST:
+        2. Spawn Task agent:
-              1. Read: {workflow_path}/workflow.yaml
+           <Task>
-              2. Read: ALL files in {workflow_path}/steps/
+             subagent_type: general-purpose
-              3. Read: _bmad/core/tasks/workflow.xml
+             description: Execute {workflow_name}
-              4. Read: Story file (if applicable)
+             prompt: |
-              5. Read: Project context
+               Execute workflow: {workflow_path}
-              THEN execute following workflow.xml rules EXACTLY.
+               Load all required files first, then execute following workflow.xml.
           </Task>
-              Report when complete.
+        3. HALT - Task agent handles workflow
          </Task>
-          HALT - Let Task agent handle workflow.
+        DO NOT proceed to Step 1.
-      </check>
+    </check>
      <detection>
        How to detect invocation source:
        - Slash command: User message contains "/bmad_" or conversation has <command-name> tag
        - Internal call: No slash command, invoked from within another workflow step
      </detection>
    </enforcement>
  </llm>
  <flow>
--- a/src/modules/bmm/workflows/4-implementation/batch-super-dev/RESILIENCE-FIX.md
+++ b/src/modules/bmm/workflows/4-implementation/batch-super-dev/RESILIENCE-FIX.md
@ -0,0 +1,256 @@
 # Batch-Super-Dev Resilience Fix
 **Problem:** Agents crash mid-execution, resume fails, no intermediate state saved
 ---
 ## Issues Observed
 **Story 18-4 → 18-5 Transition:**
 ```
 ✅ Story 18-4: Builder → Inspector → Fixer → Reviewer all complete
 ❌ Story 18-5: Workflow crashed on "Error reading file"
 ```
 **Evidence:**
 - Task output files empty (0 bytes)
 - Resume attempts failed (0 tools used, 0 tokens)
 - No state saved between stories
 - When agent crashes, all progress lost
 ---
 ## Root Cause
 **Sequential processing in main context has no resilience:**
 ```
 Story 18-4:
  ├─ Builder agent completes → outputs to temp file
  ├─ Main reads output file → starts Inspector
  ├─ Inspector completes → outputs to temp file
  ├─ Main reads output → starts Fixer
  └─ Fixer completes → Story 18-4 done
 Story 18-5:
  ├─ Main tries to read Story 18-5 file
  ├─ ❌ "Error reading file" (crash)
  └─ All progress lost, no state saved
 ```
 **Problem:** Main context doesn't save state between stories. If it crashes, batch starts over.
 ---
 ## Solution: Save State After Each Story
 ### Add state file tracking:
 ```yaml
 # In batch-super-dev/workflow.yaml
 state_tracking:
  enabled: true
  state_file: "{sprint_artifacts}/batch-execution-state-{batch_id}.yaml"
  save_after_each_story: true
 ```
 ### State file format:
 ```yaml
 batch_id: "epic-18-2026-01-26"
 started: "2026-01-26T18:45:00Z"
 execution_mode: "fully_autonomous"
 strategy: "sequential"
 total_stories: 2
 stories:
  - story_key: "18-4-billing-worker-retry-logic"
    status: "completed"
    started: "2026-01-26T18:46:00Z"
    completed: "2026-01-26T19:05:00Z"
    agents:
      - phase: "builder"
        agent_id: "ae3bd2b"
        status: "completed"
      - phase: "inspector"
        agent_id: "a9f0d11"
        status: "completed"
      - phase: "fixer"
        agent_id: "abc123"
        status: "completed"
      - phase: "reviewer"
        agent_id: "def456"
        status: "completed"
  - story_key: "18-5-precharge-payment-validation"
    status: "in_progress"
    started: "2026-01-26T19:05:30Z"
    last_checkpoint: "attempting_to_read_story_file"
    error: "Error reading file"
 ```
 ### Resume logic:
 ```bash
 # At batch-super-dev start, check for existing state file
 state_file="{sprint_artifacts}/batch-execution-state-*.yaml"
 if ls $state_file 2>/dev/null; then
  echo "🔄 Found interrupted batch execution"
  echo "Resume from where it left off? (yes/no)"
  if yes:
    # Load state file
    # Skip completed stories
    # Start from next story
    # Reuse agent IDs if resumable
 fi
 ```
 ### After each story completes:
 ```bash
 # Update state file
 update_state_file() {
  story_key="$1"
  status="$2"  # completed | failed
  # Update YAML
  # Mark story as completed
  # Save timestamp
  # Record agent IDs
 }
 # After Builder completes
 update_state_file "$story_key" "builder_complete"
 # After Inspector completes
 update_state_file "$story_key" "inspector_complete"
 # After Fixer completes
 update_state_file "$story_key" "fixer_complete"
 # After Reviewer completes
 update_state_file "$story_key" "reviewer_complete"
 # When entire story done
 update_state_file "$story_key" "completed"
 ```
 ### Error handling:
 ```bash
 # Wrap file reads in try-catch
 read_with_retry() {
  file_path="$1"
  max_attempts=3
  for attempt in {1..$max_attempts}; do
    if content=$(cat "$file_path" 2>&1); then
      echo "$content"
      return 0
    else
      echo "⚠️ Failed to read $file_path (attempt $attempt/$max_attempts)" >&2
      sleep 2
    fi
  done
  echo "❌ Cannot read file after $max_attempts attempts: $file_path" >&2
  return 1
 }
 # Use in workflow
 story_content=$(read_with_retry "$story_file") || {
  echo "❌ Cannot proceed with Story $story_key - file read failed"
  # Save state
  # Skip this story
  # Continue to next story (if continue_on_failure=true)
 }
 ```
 ---
 ## Implementation
 Add to batch-super-dev Step 4-Sequential:
 ```xml
 <substep n="4s-0" title="Check for previous execution state">
  <action>Check for state file: batch-execution-state-*.yaml</action>
  <check if="state file exists">
    <output>🔄 Found interrupted batch from {state.started}</output>
    <output>Completed: {state.completed_count} stories</output>
    <output>Failed: {state.failed_count} stories</output>
    <output>In progress: {state.current_story}</output>
    <ask>Resume from where it left off? (yes/no)</ask>
    <check if="response == yes">
      <action>Load state</action>
      <action>Skip completed stories</action>
      <action>Start from next story</action>
    </check>
    <check if="response == no">
      <action>Archive old state file</action>
      <action>Start fresh batch</action>
    </check>
  </check>
 </substep>
 <substep n="4s-a" title="Process individual story">
  <action>Save state: story started</action>
  <try>
    <action>Read story file with retry</action>
    <action>Execute super-dev-pipeline</action>
    <action>Save state: story completed</action>
  </try>
  <catch error="file_read_error">
    <output>⚠️ Cannot read story file for {story_key}</output>
    <action>Save state: story failed (file read error)</action>
    <action>Add to failed_stories list</action>
    <action>Continue to next story if continue_on_failure=true</action>
  </catch>
  <catch error="agent_crash">
    <output>⚠️ Agent crashed for {story_key}</output>
    <action>Save state: story failed (agent crash)</action>
    <action>Record partial progress in state file</action>
    <action>Continue to next story if continue_on_failure=true</action>
  </catch>
 </substep>
 ```
 ---
 ## Expected Behavior After Fix
 **If crash happens:**
 ```
 Story 18-4: ✅ Complete (state saved)
 Story 18-5: ❌ Crashed (state saved with error)
 State file created: batch-execution-state-epic-18.yaml
 User re-runs: /batch-super-dev
 Workflow: "🔄 Found interrupted batch. Resume? (yes/no)"
 User: "yes"
 Workflow: "✅ Skipping 18-4 (already complete)"
 Workflow: "🔄 Retrying 18-5 (was in_progress)"
 Workflow: Starts 18-5 from beginning
 ```
 **Benefits:**
 - No lost progress
 - Can resume after crashes
 - Intermediate state preserved
 - Failures don't block batch
 ---
 Should I implement this resilience fix now?