# Error Recovery Procedures

## Purpose
Comprehensive error detection, graceful degradation, and self-recovery mechanisms for the memory-enhanced BMAD system.

## Common Error Scenarios & Resolutions

### 1. Configuration Errors

#### **Error**: `ide-bmad-orchestrator.cfg.md` not found
- **Detection**: Startup initialization failure
- **Recovery Steps**:
  1. Search for config file in parent directories (up to 3 levels)
  2. Check for alternative config file names (`config.md`, `orchestrator.cfg`)
  3. Create minimal config from built-in template
  4. Prompt user for project root confirmation
  5. Offer to download standard BMAD structure

**Recovery Implementation**:
```python
def recover_missing_config():
    search_paths = [
        "./ide-bmad-orchestrator.cfg.md",
        "../ide-bmad-orchestrator.cfg.md",
        "../../ide-bmad-orchestrator.cfg.md",
        "./bmad-agent/ide-bmad-orchestrator.cfg.md",
    ]

    for path in search_paths:
        if file_exists(path):
            return load_config(path)

    # Create minimal fallback config
    return create_minimal_config()
```
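
The first recovery step (searching parent directories, up to 3 levels) could be sketched with `pathlib` as follows. This is a minimal illustration under stated assumptions, not the orchestrator's actual implementation: `find_config_upwards` and its `max_levels` parameter are hypothetical names introduced here.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "ide-bmad-orchestrator.cfg.md"


def find_config_upwards(start: Path, name: str = CONFIG_NAME,
                        max_levels: int = 3) -> Optional[Path]:
    """Return the first config found in `start` or in up to `max_levels` parents.

    Hypothetical helper: at each level it checks the directory itself and a
    `bmad-agent/` subdirectory, mirroring the search paths listed above.
    """
    current = start.resolve()
    for _ in range(max_levels + 1):  # the start directory plus N parents
        for candidate in (current / name, current / "bmad-agent" / name):
            if candidate.is_file():
                return candidate
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    return None
```

Bounding the walk with `max_levels` keeps the search from picking up an unrelated config far up the directory tree.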

#### **Error**: Persona file referenced but missing
- **Detection**: Persona activation failure
- **Recovery Steps**:
  1. List available persona files in personas directory
  2. Suggest closest match by name similarity (fuzzy matching)
  3. Offer generic fallback persona with reduced functionality
  4. Provide download link for missing personas
  5. Log missing persona for later resolution

**Fallback Persona Selection**:
```python
def find_fallback_persona(missing_persona_name):
    available_personas = list_available_personas()

    # Fuzzy match by name similarity
    best_match = find_closest_match(missing_persona_name, available_personas)

    # best_match may be empty when no personas are available at all
    if best_match and similarity_score(missing_persona_name, best_match) > 0.7:
        return best_match

    # Use generic fallback based on persona type
    persona_type = extract_persona_type(missing_persona_name)
    return get_generic_fallback(persona_type)
```
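
One possible way to back the name-similarity step is the standard library's `difflib`. The sketch below is an assumption about how `similarity_score` and `find_closest_match` might behave; the document does not specify their implementation, and the persona names in the usage note are made up for illustration.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence


def similarity_score(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in the range [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_closest_match(name: str, candidates: Sequence[str]) -> Optional[str]:
    """Return the most similar candidate, or None when there are no candidates."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: similarity_score(name, c))
```

For example, `find_closest_match("architekt", ["architect", "analyst", "dev"])` selects `"architect"`, whose score clears the 0.7 threshold used above.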

### 2. Project Structure Errors

#### **Error**: `bmad-agent/` directory missing
- **Detection**: Path resolution failure during initialization
- **Recovery Steps**:
  1. Search for BMAD structure in parent directories (recursive search)
  2. Check for partial BMAD installation (some directories present)
  3. Offer to initialize BMAD structure in current directory
  4. Provide setup wizard for new installations
  5. Download missing components automatically

**Structure Recovery**:
```python
def recover_bmad_structure():
    # Search for existing BMAD components
    search_result = recursive_search_bmad_structure()

    if search_result.found:
        return use_existing_structure(search_result.path)

    if search_result.partial:
        return complete_partial_installation(search_result.missing_components)

    # No BMAD structure found - offer to create
    return offer_structure_creation()
```
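
Detecting a partial installation could look roughly like the sketch below. The result object mirrors the `found`, `partial`, `path`, and `missing_components` attributes used above; the list of expected subdirectories is an assumption made for illustration, since the document does not enumerate the required layout.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import List

# Assumed layout for illustration only; the real required set may differ.
EXPECTED_SUBDIRS = ("personas", "tasks", "templates", "checklists")


@dataclass
class StructureCheck:
    path: Path
    missing_components: List[str] = field(default_factory=list)

    @property
    def found(self) -> bool:
        # Complete installation: directory exists and nothing is missing
        return self.path.is_dir() and not self.missing_components

    @property
    def partial(self) -> bool:
        # Directory exists but at least one expected component is absent
        return self.path.is_dir() and bool(self.missing_components)


def check_bmad_structure(project_root: Path) -> StructureCheck:
    """Report which expected subdirectories of `bmad-agent/` are missing."""
    base = project_root / "bmad-agent"
    missing = [d for d in EXPECTED_SUBDIRS if not (base / d).is_dir()]
    return StructureCheck(path=base, missing_components=missing)
```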

#### **Error**: Task or template file missing during execution
- **Detection**: Task execution attempt with missing file
- **Recovery Steps**:
  1. Check for alternative task files with similar names
  2. Search for task file in backup locations
  3. Provide generic task template with reduced functionality
  4. Continue with reduced functionality, log limitation clearly
  5. Offer to download missing task files

**Missing File Fallback**:
```python
def handle_missing_task_file(missing_file):
    # Try alternative names/locations
    alternatives = find_alternative_task_files(missing_file)

    if alternatives:
        return use_alternative_task(alternatives[0])

    # Use generic fallback
    generic_task = create_generic_task_template(missing_file)
    log_limitation(f"Using generic fallback for {missing_file}")

    return generic_task
```

### 3. Memory System Errors

#### **Error**: OpenMemory MCP connection failure
- **Detection**: Memory search/add operations failing
- **Recovery Steps**:
  1. Attempt reconnection with exponential backoff
  2. Fall back to file-based context persistence
  3. Queue memory operations for later sync
  4. Notify user of reduced functionality
  5. Continue with session-only context

**Memory Fallback System**:
```python
def handle_memory_system_failure():
    # Try reconnection
    if attempt_memory_reconnection():
        return "reconnected"

    # Fall back to file-based context
    enable_file_based_context_fallback()

    # Queue pending operations
    queue_memory_operations_for_retry()

    # Notify user
    notify_user_of_memory_degradation()

    return "fallback_mode"
```
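
The reconnection and queuing steps above might be sketched as follows. All names here (`reconnect_with_backoff`, `PendingMemoryOps`, the `connect` and `execute` callables) are hypothetical; the real OpenMemory MCP client calls would be passed in by the caller, and the delay values are arbitrary defaults.

```python
import time
from collections import deque
from typing import Callable, Deque, Tuple


def reconnect_with_backoff(connect: Callable[[], bool], max_attempts: int = 4,
                           base_delay: float = 0.5,
                           sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        if connect():
            return True
        if attempt < max_attempts - 1:  # no point sleeping after the last try
            sleep(base_delay * (2 ** attempt))
    return False


class PendingMemoryOps:
    """FIFO queue of memory operations to replay once the connection returns."""

    def __init__(self) -> None:
        self._ops: Deque[Tuple[str, dict]] = deque()

    def enqueue(self, op_name: str, payload: dict) -> None:
        self._ops.append((op_name, payload))

    def flush(self, execute: Callable[[str, dict], None]) -> int:
        """Replay queued operations in order; return how many were sent."""
        sent = 0
        while self._ops:
            op_name, payload = self._ops[0]
            execute(op_name, payload)  # if this raises, the op stays queued
            self._ops.popleft()
            sent += 1
        return sent
```

Passing `sleep` as a parameter keeps the backoff logic testable without real waiting.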

#### **Error**: Memory search returning no results unexpectedly
- **Detection**: Empty results for queries that should return data
- **Recovery Steps**:
  1. Verify memory connection and authentication
  2. Try alternative search queries with broader terms
  3. Check memory index integrity
  4. Fall back to session-only context
  5. Rebuild memory index if necessary
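
As a rough sketch of the "broader terms" step, a query can be widened by dropping trailing terms until something matches. This heuristic is an assumption for illustration, not the system's documented behavior; `search` stands in for whatever memory search call is in use.

```python
from typing import Callable, List, Sequence


def broaden_query(query: str) -> List[str]:
    """Return progressively broader queries, made by dropping trailing terms."""
    terms = query.split()
    return [" ".join(terms[:n]) for n in range(len(terms), 0, -1)]


def search_with_broadening(query: str,
                           search: Callable[[str], Sequence]) -> Sequence:
    """Try the full query first, then broader variants, until results appear."""
    for candidate in broaden_query(query):
        results = search(candidate)
        if results:
            return results
    return []  # caller falls back to session-only context
```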

### 4. Session State Errors

#### **Error**: Corrupted session state file
- **Detection**: JSON/YAML parsing failure during state loading
- **Recovery Steps**:
  1. Create backup of corrupted file with timestamp
  2. Attempt partial recovery using regex parsing
  3. Initialize fresh session state with available information
  4. Attempt to recover key information from backup
  5. Notify user of reset and potential information loss

**Session State Recovery**:
```python
def recover_corrupted_session_state(corrupted_file):
    # Backup corrupted file
    backup_file = create_backup(corrupted_file)

    # Attempt partial recovery
    recovered_data = attempt_partial_recovery(corrupted_file)

    if recovered_data.success:
        return create_session_from_partial_data(recovered_data)

    # Create fresh session with basic info
    return create_fresh_session_with_backup_reference(backup_file)
```
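
The timestamped backup and regex-based partial recovery might be approached as in the sketch below, assuming a JSON state file. The helper names and the backup naming scheme are hypothetical, and the regex only salvages scalar `"key": value` pairs; nested structures are ignored and nested keys are flattened into one dictionary.

```python
import json
import re
import shutil
import time
from pathlib import Path
from typing import Dict

# Matches "key": followed by a JSON string, number, true, false, or null.
PAIR_RE = re.compile(
    r'"(\w+)"\s*:\s*("(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?|true|false|null)'
)


def backup_with_timestamp(path: Path) -> Path:
    """Copy the damaged file next to itself with a timestamp suffix."""
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup = path.with_name(f"{path.name}.{stamp}.bak")
    shutil.copy2(path, backup)
    return backup


def recover_scalars(text: str) -> Dict[str, object]:
    """Salvage scalar key/value pairs from damaged JSON text."""
    recovered: Dict[str, object] = {}
    for key, raw in PAIR_RE.findall(text):
        try:
            recovered[key] = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip values that still fail to parse
    return recovered
```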

#### **Error**: Session state write permission denied
- **Detection**: File system error during state saving
- **Recovery Steps**:
  1. Check file permissions and ownership
  2. Try alternative session state location
  3. Use memory-only session state temporarily
  4. Prompt user for permission fix
  5. Disable session persistence if unfixable
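
Trying alternative locations before dropping to memory-only state could be sketched like this. `save_session_state` and the idea of an ordered candidate list are assumptions for illustration; which locations to try is left to the caller.

```python
import json
from pathlib import Path
from typing import Iterable, Optional


def save_session_state(state: dict, candidates: Iterable[Path]) -> Optional[Path]:
    """Write state to the first writable candidate path.

    Returns the path that was written, or None when every location failed,
    in which case the caller keeps the session state in memory only.
    """
    payload = json.dumps(state, indent=2)
    for path in candidates:
        try:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(payload, encoding="utf-8")
            return path
        except OSError:  # covers PermissionError and similar failures
            continue
    return None
```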

### 5. Resource Loading Errors

#### **Error**: Template or checklist file corrupted
- **Detection**: File parsing failure during task execution
- **Recovery Steps**:
  1. Use fallback generic template for the same purpose
  2. Check for template file in backup locations
  3. Download fresh template from repository
  4. Log specific error for user investigation
  5. Continue with warning about reduced functionality

**Template Recovery**:
```python
def recover_corrupted_template(template_name):
    # Try fallback templates
    fallback = get_fallback_template(template_name)

    if fallback:
        log_warning(f"Using fallback template for {template_name}")
        return fallback

    # Create minimal template
    minimal_template = create_minimal_template(template_name)
    log_limitation(f"Using minimal template for {template_name}")

    return minimal_template
```

#### **Error**: Persona file load timeout
- **Detection**: File loading exceeds timeout threshold
- **Recovery Steps**:
  1. Retry with extended timeout
  2. Check file size and complexity
  3. Use cached version if available
  4. Load persona in chunks if possible
  5. Fall back to simplified persona version

### 6. Consultation System Errors

#### **Error**: Multi-persona consultation initialization failure
- **Detection**: Failed to load multiple personas simultaneously
- **Recovery Steps**:
  1. Identify which specific personas failed to load
  2. Continue consultation with available personas
  3. Use fallback personas for missing ones
  4. Adjust consultation protocol for reduced participants
  5. Notify user of consultation limitations

**Consultation Recovery**:
```python
def recover_consultation_failure(requested_personas, failure_details):
    successful_personas = []
    fallback_personas = []

    for persona in requested_personas:
        if persona in failure_details.failed_personas:
            fallback = get_consultation_fallback(persona)
            if fallback:
                fallback_personas.append(fallback)
        else:
            successful_personas.append(persona)

    # Adjust consultation for available personas
    return adjust_consultation_protocol(successful_personas + fallback_personas)
```

## Error Reporting & Communication

### User-Friendly Error Messages
```python
def generate_user_friendly_error(error_type, technical_details):
    error_templates = {
        "config_missing": {
            "message": "BMAD configuration not found. Let me help you set up.",
            "actions": ["Create new config", "Search for existing config", "Download BMAD"],
            "severity": "warning"
        },
        "persona_missing": {
            "message": "The requested specialist isn't available. I can suggest alternatives.",
            "actions": ["Use similar specialist", "Download missing specialist", "Continue with generic"],
            "severity": "info"
        },
        "memory_failure": {
            "message": "Memory system temporarily unavailable. Using session-only context.",
            "actions": ["Retry connection", "Continue without memory", "Check system status"],
            "severity": "warning"
        }
    }

    template = error_templates.get(error_type, get_generic_error_template())
    return format_error_message(template, technical_details)
```
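
The helper `format_error_message` is not defined in this document. A minimal sketch of what it might do, assuming templates shaped like the dictionaries above, is shown below; the output layout and the generic template's wording are illustrative choices, not the system's actual format.

```python
from typing import Any, Dict


def get_generic_error_template() -> Dict[str, Any]:
    """Hypothetical catch-all template for unrecognized error types."""
    return {
        "message": "Something went wrong. Recovery options are listed below.",
        "actions": ["Retry", "Run diagnostics"],
        "severity": "warning",
    }


def format_error_message(template: Dict[str, Any],
                         technical_details: str = "") -> str:
    """Render severity, message, numbered actions, and optional details."""
    lines = [f"[{template['severity'].upper()}] {template['message']}"]
    for number, action in enumerate(template["actions"], start=1):
        lines.append(f"  {number}. {action}")
    if technical_details:
        lines.append(f"Details: {technical_details}")
    return "\n".join(lines)
```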

### Error Recovery Guidance
```markdown
# 🔧 System Recovery Guidance

## Issue Detected: {error_type}
**Severity**: {severity_level}
**Impact**: {functionality_impact}

## What Happened
{user_friendly_explanation}

## Recovery Actions Available
1. **{Primary Action}** (Recommended)
   - What it does: {action_description}
   - Expected outcome: {expected_result}

2. **{Alternative Action}**
   - What it does: {action_description}
   - When to use: {usage_scenario}

## Current System Status
✅ **Working**: {functional_components}
⚠️ **Limited**: {degraded_components}
❌ **Unavailable**: {failed_components}

## Next Steps
Choose an action above, or:
- `/diagnose` - Run comprehensive system health check
- `/recover` - Attempt automatic recovery
- `/fallback` - Switch to safe mode with basic functionality

Would you like me to attempt automatic recovery?
```

## Recovery Success Tracking

### Recovery Effectiveness Monitoring
```python
import json


def track_recovery_effectiveness(error_type, recovery_action, outcome):
    recovery_memory = {
        "type": "error_recovery",
        "error_type": error_type,
        "recovery_action": recovery_action,
        "outcome": outcome,
        "success": outcome.success,
        "time_to_recovery": outcome.duration,
        "user_satisfaction": outcome.user_rating,
        "system_stability_after": assess_stability_post_recovery(),
        "lessons_learned": extract_recovery_lessons(outcome)
    }

    # Store in memory for learning.
    # default=str keeps json.dumps from failing on the non-serializable
    # outcome object by storing its string form instead.
    add_memories(
        content=json.dumps(recovery_memory, default=str),
        tags=["error-recovery", error_type, recovery_action],
        metadata={"type": "recovery", "success": outcome.success}
    )
```

### Adaptive Recovery Learning
```python
def learn_from_recovery_patterns():
    recovery_memories = search_memory(
        "error_recovery outcome success failure",
        limit=50,
        threshold=0.5
    )

    patterns = analyze_recovery_patterns(recovery_memories)

    # Update recovery strategies based on success patterns
    for pattern in patterns.successful_approaches:
        update_recovery_strategy(pattern.error_type, pattern.approach)

    # Flag ineffective recovery approaches
    for pattern in patterns.failed_approaches:
        deprecate_recovery_strategy(pattern.error_type, pattern.approach)
```
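
One simple way the pattern analysis could work is to compute a success rate per (error type, recovery action) pair and split strategies at a cutoff. The sketch below assumes records carry the `error_type`, `recovery_action`, and `success` fields stored by the tracking function; the function names and the 0.5 cutoff are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Strategy = Tuple[str, str]  # (error_type, recovery_action)


def success_rates(records: Iterable[dict]) -> Dict[Strategy, float]:
    """Fraction of successful recoveries for each strategy."""
    counts = defaultdict(lambda: [0, 0])  # strategy -> [successes, total]
    for record in records:
        key = (record["error_type"], record["recovery_action"])
        counts[key][0] += 1 if record["success"] else 0
        counts[key][1] += 1
    return {key: wins / total for key, (wins, total) in counts.items()}


def split_strategies(rates: Dict[Strategy, float],
                     min_rate: float = 0.5) -> Tuple[List[Strategy], List[Strategy]]:
    """Separate strategies to keep from those to deprecate."""
    keep = [key for key, rate in rates.items() if rate >= min_rate]
    drop = [key for key, rate in rates.items() if rate < min_rate]
    return keep, drop
```

With very few records per strategy the rates are noisy, so a real implementation would likely also require a minimum sample count before deprecating anything.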

## Proactive Error Prevention

### Health Monitoring
```python
def continuous_health_monitoring():
    health_checks = [
        check_config_file_integrity(),
        check_persona_file_availability(),
        check_memory_system_connectivity(),
        check_session_state_writability(),
        check_disk_space_availability(),
        check_file_permissions()
    ]

    for check in health_checks:
        if check.status == "warning":
            schedule_preemptive_action(check)
        elif check.status == "critical":
            trigger_immediate_recovery(check)
```
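
Two of these checks can be sketched with the standard library alone. The `HealthCheck` result type, the status strings beyond `"warning"` and `"critical"`, and the free-space thresholds below are assumptions for illustration; the document does not define them.

```python
import os
import shutil
from dataclasses import dataclass


@dataclass
class HealthCheck:
    name: str
    status: str  # "ok", "warning", or "critical"
    detail: str = ""


def classify_free_fraction(free_fraction: float, warn_below: float = 0.10,
                           critical_below: float = 0.02) -> str:
    """Map the fraction of free disk space to a status string."""
    if free_fraction < critical_below:
        return "critical"
    if free_fraction < warn_below:
        return "warning"
    return "ok"


def check_disk_space_availability(path: str = ".") -> HealthCheck:
    usage = shutil.disk_usage(path)
    # Guard against filesystems that report a total size of zero
    fraction = usage.free / usage.total if usage.total else 0.0
    return HealthCheck("disk_space", classify_free_fraction(fraction),
                       f"{fraction:.1%} free")


def check_session_state_writability(directory: str = ".") -> HealthCheck:
    writable = os.access(directory, os.W_OK)
    return HealthCheck("session_state", "ok" if writable else "critical",
                       directory)
```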

### Predictive Error Detection
```python
def predict_potential_errors(current_system_state):
    # Use memory patterns to predict likely failures
    similar_states = search_memory(
        f"system state {current_system_state.key_indicators}",
        limit=10,
        threshold=0.7
    )

    potential_errors = []
    for state in similar_states:
        if state.led_to_errors:
            potential_errors.append({
                "error_type": state.error_type,
                "probability": calculate_error_probability(state, current_system_state),
                "prevention_action": state.prevention_strategy,
                "early_warning_signs": state.warning_indicators
            })

    return rank_error_predictions(potential_errors)
```

Together, these procedures let the BMAD orchestrator handle failures gracefully, preserve as much functionality as possible, and learn from each recovery.