BMAD-METHOD/bmad-agent/error_handling/error-recovery.md

# Error Recovery Procedures

## Purpose
Comprehensive error detection, graceful degradation, and self-recovery mechanisms for the memory-enhanced BMAD system.

## Common Error Scenarios & Resolutions

### 1. Configuration Errors

#### **Error**: `ide-bmad-orchestrator.cfg.md` not found
- **Detection**: Startup initialization failure
- **Recovery Steps**:
  1. Search for config file in parent directories (up to 3 levels)
  2. Check for alternative config file names (`config.md`, `orchestrator.cfg`)
  3. Create minimal config from built-in template
  4. Prompt user for project root confirmation
  5. Offer to download standard BMAD structure

**Recovery Implementation**:
```python
def recover_missing_config():
    search_paths = [
        "./ide-bmad-orchestrator.cfg.md",
        "../ide-bmad-orchestrator.cfg.md",
        "../../ide-bmad-orchestrator.cfg.md",
        "./bmad-agent/ide-bmad-orchestrator.cfg.md"
    ]

    for path in search_paths:
        if file_exists(path):
            return load_config(path)

    # Create minimal fallback config
    return create_minimal_config()
```

#### **Error**: Persona file referenced but missing
- **Detection**: Persona activation failure
- **Recovery Steps**:
  1. List available persona files in personas directory
  2. Suggest closest match by name similarity (fuzzy matching)
  3. Offer generic fallback persona with reduced functionality
  4. Provide download link for missing personas
  5. Log missing persona for later resolution

**Fallback Persona Selection**:
```python
def find_fallback_persona(missing_persona_name):
    available_personas = list_available_personas()

    # Fuzzy match by name similarity
    best_match = find_closest_match(missing_persona_name, available_personas)

    if similarity_score(missing_persona_name, best_match) > 0.7:
        return best_match

    # Use generic fallback based on persona type
    persona_type = extract_persona_type(missing_persona_name)
    return get_generic_fallback(persona_type)
```

### 2. Project Structure Errors

#### **Error**: `bmad-agent/` directory missing
- **Detection**: Path resolution failure during initialization
- **Recovery Steps**:
  1. Search for BMAD structure in parent directories (recursive search)
  2. Check for partial BMAD installation (some directories present)
  3. Offer to initialize BMAD structure in current directory
  4. Provide setup wizard for new installations
  5. Download missing components automatically

**Structure Recovery**:
```python
def recover_bmad_structure():
    # Search for existing BMAD components
    search_result = recursive_search_bmad_structure()

    if search_result.found:
        return use_existing_structure(search_result.path)

    if search_result.partial:
        return complete_partial_installation(search_result.missing_components)

    # No BMAD structure found - offer to create
    return offer_structure_creation()
```

#### **Error**: Task or template file missing during execution
- **Detection**: Task execution attempt with missing file
- **Recovery Steps**:
  1. Check for alternative task files with similar names
  2. Search for task file in backup locations
  3. Provide generic task template with reduced functionality
  4. Continue with reduced functionality, log limitation clearly
  5. Offer to download missing task files

**Missing File Fallback**:
```python
def handle_missing_task_file(missing_file):
    # Try alternative names/locations
    alternatives = find_alternative_task_files(missing_file)

    if alternatives:
        return use_alternative_task(alternatives[0])

    # Use generic fallback
    generic_task = create_generic_task_template(missing_file)
    log_limitation(f"Using generic fallback for {missing_file}")

    return generic_task
```

### 3. Memory System Errors

#### **Error**: OpenMemory MCP connection failure
- **Detection**: Memory search/add operations failing
- **Recovery Steps**:
  1. Attempt reconnection with exponential backoff
  2. Fall back to file-based context persistence
  3. Queue memory operations for later sync
  4. Notify user of reduced functionality
  5. Continue with session-only context

**Memory Fallback System**:
```python
def handle_memory_system_failure():
    # Try reconnection
    if attempt_memory_reconnection():
        return "reconnected"

    # Fall back to file-based context
    enable_file_based_context_fallback()

    # Queue pending operations
    queue_memory_operations_for_retry()

    # Notify user
    notify_user_of_memory_degradation()

    return "fallback_mode"
```

#### **Error**: Memory search returning no results unexpectedly
- **Detection**: Empty results for queries that should return data
- **Recovery Steps**:
  1. Verify memory connection and authentication
  2. Try alternative search queries with broader terms
  3. Check memory index integrity
  4. Fall back to session-only context
  5. Rebuild memory index if necessary

### 4. Session State Errors

#### **Error**: Corrupted session state file
- **Detection**: JSON/YAML parsing failure during state loading
- **Recovery Steps**:
  1. Create backup of corrupted file with timestamp
  2. Attempt partial recovery using regex parsing
  3. Initialize fresh session state with available information
  4. Attempt to recover key information from backup
  5. Notify user of reset and potential information loss

**Session State Recovery**:
```python
def recover_corrupted_session_state(corrupted_file):
    # Backup corrupted file
    backup_file = create_backup(corrupted_file)

    # Attempt partial recovery
    recovered_data = attempt_partial_recovery(corrupted_file)

    if recovered_data.success:
        return create_session_from_partial_data(recovered_data)

    # Create fresh session with basic info
    return create_fresh_session_with_backup_reference(backup_file)
```

#### **Error**: Session state write permission denied
- **Detection**: File system error during state saving
- **Recovery Steps**:
  1. Check file permissions and ownership
  2. Try alternative session state location
  3. Use memory-only session state temporarily
  4. Prompt user for permission fix
  5. Disable session persistence if unfixable

### 5. Resource Loading Errors

#### **Error**: Template or checklist file corrupted
- **Detection**: File parsing failure during task execution
- **Recovery Steps**:
  1. Use fallback generic template for the same purpose
  2. Check for template file in backup locations
  3. Download fresh template from repository
  4. Log specific error for user investigation
  5. Continue with warning about reduced functionality

**Template Recovery**:
```python
def recover_corrupted_template(template_name):
    # Try fallback templates
    fallback = get_fallback_template(template_name)

    if fallback:
        log_warning(f"Using fallback template for {template_name}")
        return fallback

    # Create minimal template
    minimal_template = create_minimal_template(template_name)
    log_limitation(f"Using minimal template for {template_name}")

    return minimal_template
```

#### **Error**: Persona file load timeout
- **Detection**: File loading exceeds timeout threshold
- **Recovery Steps**:
  1. Retry with extended timeout
  2. Check file size and complexity
  3. Use cached version if available
  4. Load persona in chunks if possible
  5. Fall back to simplified persona version

### 6. Consultation System Errors

#### **Error**: Multi-persona consultation initialization failure
- **Detection**: Failed to load multiple personas simultaneously
- **Recovery Steps**:
  1. Identify which specific personas failed to load
  2. Continue consultation with available personas
  3. Use fallback personas for missing ones
  4. Adjust consultation protocol for reduced participants
  5. Notify user of consultation limitations

**Consultation Recovery**:
```python
def recover_consultation_failure(requested_personas, failure_details):
    successful_personas = []
    fallback_personas = []

    for persona in requested_personas:
        if persona in failure_details.failed_personas:
            fallback = get_consultation_fallback(persona)
            if fallback:
                fallback_personas.append(fallback)
        else:
            successful_personas.append(persona)

    # Adjust consultation for available personas
    return adjust_consultation_protocol(successful_personas + fallback_personas)
```

## Error Reporting & Communication

### User-Friendly Error Messages
```python
def generate_user_friendly_error(error_type, technical_details):
    error_templates = {
        "config_missing": {
            "message": "BMAD configuration not found. Let me help you set up.",
            "actions": ["Create new config", "Search for existing config", "Download BMAD"],
            "severity": "warning"
        },
        "persona_missing": {
            "message": "The requested specialist isn't available. I can suggest alternatives.",
            "actions": ["Use similar specialist", "Download missing specialist", "Continue with generic"],
            "severity": "info"
        },
        "memory_failure": {
            "message": "Memory system temporarily unavailable. Using session-only context.",
            "actions": ["Retry connection", "Continue without memory", "Check system status"],
            "severity": "warning"
        }
    }

    template = error_templates.get(error_type, get_generic_error_template())
    return format_error_message(template, technical_details)
```

### Error Recovery Guidance
```markdown
# 🔧 System Recovery Guidance

## Issue Detected: {error_type}
**Severity**: {severity_level}
**Impact**: {functionality_impact}

## What Happened
{user_friendly_explanation}

## Recovery Actions Available
1. **{Primary Action}** (Recommended)
   - What it does: {action_description}
   - Expected outcome: {expected_result}

2. **{Alternative Action}**
   - What it does: {action_description}
   - When to use: {usage_scenario}

## Current System Status
✅ **Working**: {functional_components}
⚠️ **Limited**: {degraded_components}
❌ **Unavailable**: {failed_components}

## Next Steps
Choose an action above, or:
- `/diagnose` - Run comprehensive system health check
- `/recover` - Attempt automatic recovery
- `/fallback` - Switch to safe mode with basic functionality

Would you like me to attempt automatic recovery?
```

## Recovery Success Tracking

### Recovery Effectiveness Monitoring
```python
def track_recovery_effectiveness(error_type, recovery_action, outcome):
    recovery_memory = {
        "type": "error_recovery",
        "error_type": error_type,
        "recovery_action": recovery_action,
        "outcome": outcome,
        "success": outcome.success,
        "time_to_recovery": outcome.duration,
        "user_satisfaction": outcome.user_rating,
        "system_stability_after": assess_stability_post_recovery(),
        "lessons_learned": extract_recovery_lessons(outcome)
    }

    # Store in memory for learning
    add_memories(
        content=json.dumps(recovery_memory),
        tags=["error-recovery", error_type, recovery_action],
        metadata={"type": "recovery", "success": outcome.success}
    )
```

### Adaptive Recovery Learning
```python
def learn_from_recovery_patterns():
    recovery_memories = search_memory(
        "error_recovery outcome success failure",
        limit=50,
        threshold=0.5
    )

    patterns = analyze_recovery_patterns(recovery_memories)

    # Update recovery strategies based on success patterns
    for pattern in patterns.successful_approaches:
        update_recovery_strategy(pattern.error_type, pattern.approach)

    # Flag ineffective recovery approaches
    for pattern in patterns.failed_approaches:
        deprecate_recovery_strategy(pattern.error_type, pattern.approach)
```

## Proactive Error Prevention

### Health Monitoring
```python
def continuous_health_monitoring():
    health_checks = [
        check_config_file_integrity(),
        check_persona_file_availability(),
        check_memory_system_connectivity(),
        check_session_state_writability(),
        check_disk_space_availability(),
        check_file_permissions()
    ]

    for check in health_checks:
        if check.status == "warning":
            schedule_preemptive_action(check)
        elif check.status == "critical":
            trigger_immediate_recovery(check)
```

### Predictive Error Detection
```python
def predict_potential_errors(current_system_state):
    # Use memory patterns to predict likely failures
    similar_states = search_memory(
        f"system state {current_system_state.key_indicators}",
        limit=10,
        threshold=0.7
    )

    potential_errors = []
    for state in similar_states:
        if state.led_to_errors:
            potential_errors.append({
                "error_type": state.error_type,
                "probability": calculate_error_probability(state, current_system_state),
                "prevention_action": state.prevention_strategy,
                "early_warning_signs": state.warning_indicators
            })

    return rank_error_predictions(potential_errors)
```

This comprehensive error recovery system ensures that the BMAD orchestrator can gracefully handle failures while maintaining functionality and learning from each recovery experience.