Monitoring Failure Fallback (v1.9.0)
Purpose: Recovery patterns when primary monitoring fails.
When Primary Monitoring Fails
Primary monitoring can fail in several ways:
- Background task crashes (TaskOutput returns empty/error)
- Network timeout during monitoring
- Process killed unexpectedly
- Output file missing or corrupted
Key insight: The tmux session may have completed successfully even if monitoring died.
Fallback Sequence
When story-automator monitor-session fails or the background monitoring task dies:
# STEP 1: Check if tmux session still exists
sessions=$(tmux list-sessions -F '#{session_name}' 2>/dev/null | grep "sa-.*{story_pattern}" || true)
# STEP 2: If session exists, check its status directly
if [ -n "$sessions" ]; then
    while IFS= read -r session; do
        status=$("$scripts" tmux-status-check "$session")
        session_state=$(echo "$status" | cut -d',' -f6)
        # Act based on direct status
    done <<< "$sessions"
fi
# STEP 3: ALWAYS verify source of truth regardless of session status
# Story file check:
story_file=$(ls _bmad-output/implementation-artifacts/{story_prefix}-*.md 2>/dev/null | head -1)
if [ -f "$story_file" ]; then
    # Story file exists - check its status field
    :  # no-op placeholder: bash rejects an if body that contains only a comment
fi
# Sprint-status check:
status=$("$scripts" orchestrator-helper sprint-status get "{story_key}")
is_done=$(echo "$status" | jq -r '.done')
Detection: Monitoring Task Crashed
Signs that your monitoring task has crashed:
| Signal | Meaning |
|---|---|
| TaskOutput returns empty 2+ times | Task may be dead |
| Output file path doesn't exist | Task never wrote results |
| "running" status but no progress | Task is stuck or dead |
Recovery:
- Do NOT wait indefinitely for a dead monitoring task
- After 2+ empty TaskOutput results, switch to direct verification
- Use tmux session checks + source of truth verification
- Resume workflow based on verified state, not monitoring state
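The "2+ empty results" rule above can be sketched as a small shell helper. This is a minimal sketch under stated assumptions: `monitoring_decision` is a hypothetical name, not part of story-automator, and it expects the caller to pass the most recent TaskOutput results as arguments, oldest first. How those results are captured is outside the sketch.

```shell
# Hypothetical helper (not part of story-automator): decide whether to keep
# waiting on the monitoring task or switch to direct verification.
# Arguments: recent TaskOutput results, oldest first; an empty string means
# the call returned nothing.
monitoring_decision() (
    empty_streak=0
    for result in "$@"; do
        if [ -z "$result" ]; then
            empty_streak=$((empty_streak + 1))
        else
            empty_streak=0   # any non-empty result resets the streak
        fi
    done
    if [ "$empty_streak" -ge 2 ]; then
        echo "fallback"      # 2+ consecutive empty results: stop waiting
    else
        echo "wait"
    fi
)
```

For example, `monitoring_decision "running" "" ""` prints `fallback`, while `monitoring_decision "" "" "ok"` prints `wait` because the latest non-empty result resets the streak. The function body runs in a subshell, so its variables do not leak into the calling script.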
Integration with Retry Logic
If fallback verification shows the step succeeded:
- Proceed to next step (monitoring failed but workflow succeeded)
- Log: "Monitoring failed but direct verification confirmed success"
If fallback verification shows the step failed or is incomplete:
- Apply normal retry/fallback strategy
- Do NOT treat monitoring failure as step failure
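The two branches above might be wired together as follows. This is a sketch only: `next_action` is a hypothetical helper, and it assumes the `.done` field read from sprint-status is a boolean that `jq -r` prints as `true` when the step is complete. Any other value (including `false`, `null`, or empty) is treated as failed or incomplete.

```shell
# Hypothetical helper: map the verified workflow state to the next action.
# $1: value of the .done field from sprint-status (assumed "true" when done).
next_action() (
    is_done=$1
    if [ "$is_done" = "true" ]; then
        # Log to stderr so stdout carries only the action keyword.
        echo "Monitoring failed but direct verification confirmed success" >&2
        echo "proceed"
    else
        # The step itself failed or is incomplete: hand off to the normal
        # retry/fallback strategy. The monitoring failure alone is not the cause.
        echo "retry"
    fi
)
```

A caller could branch on the result with `case "$(next_action "$is_done")" in proceed) ... ;; retry) ... ;; esac`, which keeps the monitoring failure separate from the step outcome.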
Key Principle
The tmux session is the source of truth for session state. The story file and sprint-status.yaml are the source of truth for workflow state.
Monitoring is just observation - if monitoring fails, verify from source of truth and continue.