Monitoring Failure Fallback (v1.9.0)

Purpose: Recovery patterns when primary monitoring fails.


When Primary Monitoring Fails

Primary monitoring can fail in several ways:

  • Background task crashes (TaskOutput returns empty/error)
  • Network timeout during monitoring
  • Process killed unexpectedly
  • Output file missing or corrupted

Key insight: The tmux session may have completed successfully even if monitoring died.


Fallback Sequence

When story-automator monitor-session fails or the background monitoring task dies:

# STEP 1: Check if tmux session still exists
sessions=$(tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "sa-.*{story_pattern}" || true)

# STEP 2: If session exists, check its status directly
if [ -n "$sessions" ]; then
    while IFS= read -r session; do
        status=$("$scripts" tmux-status-check "$session")
        session_state=$(echo "$status" | cut -d',' -f6)
        # Act based on direct status
    done <<< "$sessions"
fi

# STEP 3: ALWAYS verify source of truth regardless of session status
# Story file check:
story_file=$(ls _bmad-output/implementation-artifacts/{story_prefix}-*.md 2>/dev/null | head -1)
if [ -f "$story_file" ]; then
    # Story file exists - check its status field
    :  # no-op placeholder: bash rejects an if body that contains only comments
fi

# Sprint-status check:
# Use a separate variable so the tmux status captured in STEP 2 is not overwritten
sprint_status=$("$scripts" orchestrator-helper sprint-status get "{story_key}")
is_done=$(echo "$sprint_status" | jq -r '.done')
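
The story file check in STEP 3 leaves the actual read of the status field open. A minimal sketch of one way to do it follows. It assumes the story file carries a line such as "Status: done"; the field name, its values, and the helper name read_story_status are illustrative assumptions, not something this document specifies.

```bash
# read_story_status FILE
# Prints the value of the first "Status:" line in FILE, lowercased and with
# trailing whitespace removed. Returns 1 if FILE does not exist.
# ASSUMPTION: the story file contains a line like "Status: done"; adjust the
# pattern to whatever format the real story files use.
read_story_status() {
    local file="$1"
    [ -f "$file" ] || return 1
    sed -n 's/^[[:space:]]*Status:[[:space:]]*//p' "$file" \
        | head -n 1 \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/[[:space:]]*$//'
}
```

Inside the `if [ -f "$story_file" ]` branch this could be called as `story_status=$(read_story_status "$story_file")`. An empty result means no status line was found, in which case the sprint-status check remains the deciding signal.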

Detection: Monitoring Task Crashed

Signs that your monitoring task has crashed:

| Signal | Meaning |
|---|---|
| TaskOutput returns empty 2+ times | Task may be dead |
| Output file path doesn't exist | Task never wrote results |
| "running" status but no progress | Task is stuck or dead |

Recovery:

  1. Do NOT wait indefinitely for a dead monitoring task
  2. After 2+ empty TaskOutput results, switch to direct verification
  3. Use tmux session checks + source of truth verification
  4. Resume workflow based on verified state, not monitoring state
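
The "2+ empty results" rule can be sketched as a small counter. This is an illustration only: the names record_monitor_result, empty_count, and MAX_EMPTY_RESULTS are hypothetical, and counting consecutive empty results (resetting on any non-empty output) is an assumption, since the rule above only says "2+".

```bash
# Hypothetical helper: track empty monitoring results and signal when to stop
# waiting and switch to direct verification.
MAX_EMPTY_RESULTS=2   # threshold taken from the recovery steps above
empty_count=0

# record_monitor_result OUTPUT
# Returns 0 (fall back now) once MAX_EMPTY_RESULTS consecutive empty results
# have been recorded; returns 1 (keep monitoring) otherwise.
# ASSUMPTION: a non-empty result resets the counter.
record_monitor_result() {
    local output="$1"
    if [ -z "$output" ]; then
        empty_count=$((empty_count + 1))
    else
        empty_count=0
    fi
    [ "$empty_count" -ge "$MAX_EMPTY_RESULTS" ]
}
```

A caller would pass in whatever the monitoring task last returned, for example `if record_monitor_result "$latest_output"; then ... fi`, and run the fallback sequence in the true branch.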

Integration with Retry Logic

If fallback verification shows the step succeeded:

  • Proceed to the next step (monitoring failed but the workflow succeeded)
  • Log: "Monitoring failed but direct verification confirmed success"

If fallback verification shows the step failed or is incomplete:

  • Apply normal retry/fallback strategy
  • Do NOT treat monitoring failure as step failure
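
The two branches above can be expressed as one decision function. This is a sketch under stated assumptions: the function name resolve_after_monitor_failure and the "proceed"/"retry" return strings are hypothetical, and is_done is assumed to be the string "true" when the source of truth reports the step finished.

```bash
# resolve_after_monitor_failure IS_DONE
# Prints the next action on stdout: "proceed" or "retry".
# ASSUMPTION: IS_DONE is "true" when direct verification confirmed success;
# any other value means failed or incomplete.
resolve_after_monitor_failure() {
    local is_done="$1"
    if [ "$is_done" = "true" ]; then
        # Log goes to stderr so stdout carries only the action.
        echo "Monitoring failed but direct verification confirmed success" >&2
        echo "proceed"
    else
        # Hand off to the normal retry/fallback strategy; the monitoring
        # failure itself is not counted as a step failure.
        echo "retry"
    fi
}
```

For example, `action=$(resolve_after_monitor_failure "$is_done")` keeps the decision based on verified state only, so a dead monitor never adds to the step's retry count.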

Key Principle

The tmux session is the source of truth for session state. The story file and sprint-status.yaml are the source of truth for workflow state.

Monitoring is only observation. If monitoring fails, verify from the source of truth and continue.