Error Recovery Procedures
Purpose
Define how the memory-enhanced BMAD system detects errors, degrades gracefully, and recovers on its own.
Common Error Scenarios & Resolutions
1. Configuration Errors
Error: ide-bmad-orchestrator.cfg.md not found
- Detection: Startup initialization failure
- Recovery Steps:
- Search for config file in parent directories (up to 3 levels)
- Check for alternative config file names (config.md, orchestrator.cfg)
- Create minimal config from built-in template
- Prompt user for project root confirmation
- Offer to download standard BMAD structure
Recovery Implementation:
def recover_missing_config():
    # Candidate locations: current directory, up to three parent levels,
    # and the bmad-agent/ subdirectory
    search_paths = [
        "./ide-bmad-orchestrator.cfg.md",
        "../ide-bmad-orchestrator.cfg.md",
        "../../ide-bmad-orchestrator.cfg.md",
        "../../../ide-bmad-orchestrator.cfg.md",
        "./bmad-agent/ide-bmad-orchestrator.cfg.md"
    ]
    for path in search_paths:
        if file_exists(path):
            return load_config(path)
    # Create minimal fallback config
    return create_minimal_config()
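The parent-directory search from the recovery steps can also be written as a loop instead of a fixed path list. The following is a minimal sketch using `pathlib`; the function name `find_config_upwards` and the `max_levels` parameter are illustrative assumptions and not part of BMAD.

```python
from pathlib import Path
from typing import Optional

CONFIG_NAME = "ide-bmad-orchestrator.cfg.md"

def find_config_upwards(start: Path, name: str = CONFIG_NAME,
                        max_levels: int = 3) -> Optional[Path]:
    """Look for `name` in `start`, then in up to `max_levels` parent directories.

    Hypothetical helper: each directory is checked directly and inside
    its bmad-agent/ subdirectory.
    """
    current = start.resolve()
    for _ in range(max_levels + 1):
        for candidate in (current / name, current / "bmad-agent" / name):
            if candidate.is_file():
                return candidate
        if current.parent == current:  # reached the filesystem root
            break
        current = current.parent
    return None
```

A caller would try `find_config_upwards(Path.cwd())` first and create the minimal config only when it returns `None`.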
Error: Persona file referenced but missing
- Detection: Persona activation failure
- Recovery Steps:
- List available persona files in personas directory
- Suggest closest match by name similarity (fuzzy matching)
- Offer generic fallback persona with reduced functionality
- Provide download link for missing personas
- Log missing persona for later resolution
Fallback Persona Selection:
def find_fallback_persona(missing_persona_name):
    available_personas = list_available_personas()
    # Fuzzy match by name similarity
    best_match = find_closest_match(missing_persona_name, available_personas)
    # Guard against an empty personas directory (no candidate to match)
    if best_match and similarity_score(missing_persona_name, best_match) > 0.7:
        return best_match
    # Use generic fallback based on persona type
    persona_type = extract_persona_type(missing_persona_name)
    return get_generic_fallback(persona_type)
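The `find_closest_match` and `similarity_score` helpers are not defined in this document. One possible way to sketch them is with the standard library's `difflib`; the choice of `SequenceMatcher` and the case-insensitive comparison are assumptions, not the actual BMAD implementation.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence

def similarity_score(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def find_closest_match(name: str, candidates: Sequence[str]) -> Optional[str]:
    """Return the candidate most similar to `name`, or None if there are none."""
    if not candidates:
        return None
    return max(candidates, key=lambda c: similarity_score(name, c))
```

With these definitions, a misspelled name such as `architct` still resolves to `architect` when that persona is available, while unrelated names stay below the 0.7 threshold and fall through to the generic fallback.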
2. Project Structure Errors
Error: bmad-agent/ directory missing
- Detection: Path resolution failure during initialization
- Recovery Steps:
- Search for BMAD structure in parent directories (recursive search)
- Check for partial BMAD installation (some directories present)
- Offer to initialize BMAD structure in current directory
- Provide setup wizard for new installations
- Download missing components automatically
Structure Recovery:
def recover_bmad_structure():
    # Search for existing BMAD components
    search_result = recursive_search_bmad_structure()
    if search_result.found:
        return use_existing_structure(search_result.path)
    if search_result.partial:
        return complete_partial_installation(search_result.missing_components)
    # No BMAD structure found - offer to create
    return offer_structure_creation()
Error: Task or template file missing during execution
- Detection: Task execution attempt with missing file
- Recovery Steps:
- Check for alternative task files with similar names
- Search for task file in backup locations
- Provide generic task template with reduced functionality
- Continue with reduced functionality, log limitation clearly
- Offer to download missing task files
Missing File Fallback:
def handle_missing_task_file(missing_file):
    # Try alternative names/locations
    alternatives = find_alternative_task_files(missing_file)
    if alternatives:
        return use_alternative_task(alternatives[0])
    # Use generic fallback
    generic_task = create_generic_task_template(missing_file)
    log_limitation(f"Using generic fallback for {missing_file}")
    return generic_task
3. Memory System Errors
Error: OpenMemory MCP connection failure
- Detection: Memory search/add operations failing
- Recovery Steps:
- Attempt reconnection with exponential backoff
- Fall back to file-based context persistence
- Queue memory operations for later sync
- Notify user of reduced functionality
- Continue with session-only context
Memory Fallback System:
def handle_memory_system_failure():
    # Try reconnection
    if attempt_memory_reconnection():
        return "reconnected"
    # Fall back to file-based context
    enable_file_based_context_fallback()
    # Queue pending operations
    queue_memory_operations_for_retry()
    # Notify user
    notify_user_of_memory_degradation()
    return "fallback_mode"
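The reconnection step calls for exponential backoff, which the snippet above leaves to `attempt_memory_reconnection`. A minimal sketch of that retry loop follows; the name `reconnect_with_backoff`, the `connect` callable, and the default delays are assumptions made for illustration, since the OpenMemory MCP client API is not shown in this document.

```python
import time
from typing import Callable

def reconnect_with_backoff(
    connect: Callable[[], bool],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `connect` until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            if connect():
                return True
        except ConnectionError:
            pass  # treat an exception as a failed attempt and retry
        if attempt < max_attempts - 1:
            # Delays grow as base_delay * 2**attempt, capped at max_delay
            sleep(min(base_delay * 2 ** attempt, max_delay))
    return False
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Capping the delay avoids very long pauses when `max_attempts` is raised.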
Error: Memory search returning no results unexpectedly
- Detection: Empty results for queries that should return data
- Recovery Steps:
- Verify memory connection and authentication
- Try alternative search queries with broader terms
- Check memory index integrity
- Fall back to session-only context
- Rebuild memory index if necessary
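The "broader terms" step could be sketched as a loop that retries with fewer query terms. This is only one possible broadening strategy: `search_with_broadening` is a hypothetical helper, `search` stands in for whatever memory search call is available, and dropping the trailing term first assumes the later terms are the most specific.

```python
from typing import Callable, List

def search_with_broadening(search: Callable[[str], List[str]],
                           query: str) -> List[str]:
    """Retry an empty search with progressively fewer (broader) terms."""
    terms = query.split()
    while terms:
        results = search(" ".join(terms))
        if results:
            return results
        terms = terms[:-1]  # drop the last term and try a broader query
    return []  # nothing found; caller falls back to session-only context
```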
4. Session State Errors
Error: Corrupted session state file
- Detection: JSON/YAML parsing failure during state loading
- Recovery Steps:
- Create backup of corrupted file with timestamp
- Attempt partial recovery using regex parsing
- Initialize fresh session state with available information
- Attempt to recover key information from backup
- Notify user of reset and potential information loss
Session State Recovery:
def recover_corrupted_session_state(corrupted_file):
    # Backup corrupted file
    backup_file = create_backup(corrupted_file)
    # Attempt partial recovery
    recovered_data = attempt_partial_recovery(corrupted_file)
    if recovered_data.success:
        return create_session_from_partial_data(recovered_data)
    # Create fresh session with basic info
    return create_fresh_session_with_backup_reference(backup_file)
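The timestamped backup and the regex-based partial recovery might look like the sketch below, assuming the session state is a JSON file. The helper names `backup_with_timestamp` and `salvage_scalar_fields`, the backup naming scheme, and the regular expression are illustrative assumptions. Only simple scalar fields are salvaged; nested structures are ignored.

```python
import json
import re
import shutil
from datetime import datetime
from pathlib import Path

def backup_with_timestamp(state_file: Path) -> Path:
    """Copy the corrupted file next to itself with a timestamp suffix."""
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    backup = state_file.with_name(f"{state_file.name}.corrupt-{stamp}")
    shutil.copy2(state_file, backup)
    return backup

# Matches simple pairs such as "active_persona": "dev" or "step": 3
_PAIR = re.compile(
    r'"(\w+)"\s*:\s*("(?:[^"\\]|\\.)*"|-?\d+(?:\.\d+)?|true|false|null)'
)

def salvage_scalar_fields(state_file: Path) -> dict:
    """Extract string, number, boolean, and null fields from damaged JSON."""
    text = state_file.read_text(encoding="utf-8", errors="replace")
    salvaged = {}
    for key, raw in _PAIR.findall(text):
        try:
            salvaged[key] = json.loads(raw)
        except json.JSONDecodeError:
            continue  # skip values that are themselves damaged
    return salvaged
```

If the same key appears more than once in the file, the last occurrence wins, so the result should be treated as a hint for rebuilding the session rather than as trusted state.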
Error: Session state write permission denied
- Detection: File system error during state saving
- Recovery Steps:
- Check file permissions and ownership
- Try alternative session state location
- Use memory-only session state temporarily
- Prompt user for permission fix
- Disable session persistence if unfixable
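Trying alternative locations before giving up on persistence can be sketched as follows. The function name `save_session_state` and the idea of passing an ordered list of candidate paths are assumptions; the actual session state locations are not specified in this document.

```python
import json
from pathlib import Path
from typing import Iterable, Optional

def save_session_state(state: dict, locations: Iterable[Path]) -> Optional[Path]:
    """Write `state` to the first writable location; return None if all fail."""
    payload = json.dumps(state, indent=2)
    for path in locations:
        try:
            path.parent.mkdir(parents=True, exist_ok=True)
            path.write_text(payload, encoding="utf-8")
            return path
        except OSError:
            continue  # permission denied, read-only filesystem, and similar
    return None  # caller keeps the state in memory only
```

A `None` result is the signal to switch to memory-only session state and prompt the user to fix permissions.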
5. Resource Loading Errors
Error: Template or checklist file corrupted
- Detection: File parsing failure during task execution
- Recovery Steps:
- Use fallback generic template for the same purpose
- Check for template file in backup locations
- Download fresh template from repository
- Log specific error for user investigation
- Continue with warning about reduced functionality
Template Recovery:
def recover_corrupted_template(template_name):
    # Try fallback templates
    fallback = get_fallback_template(template_name)
    if fallback:
        log_warning(f"Using fallback template for {template_name}")
        return fallback
    # Create minimal template
    minimal_template = create_minimal_template(template_name)
    log_limitation(f"Using minimal template for {template_name}")
    return minimal_template
Error: Persona file load timeout
- Detection: File loading exceeds timeout threshold
- Recovery Steps:
- Retry with extended timeout
- Check file size and complexity
- Use cached version if available
- Load persona in chunks if possible
- Fall back to simplified persona version
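Retrying with an extended timeout and then falling back to a cached copy could be sketched with `concurrent.futures`. The function name `load_with_timeouts`, the `loader` callable, and the timeout values are assumptions. Python threads cannot be cancelled once running, so a timed-out attempt keeps running in the background in this sketch.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout
from typing import Callable, Optional, Sequence

def load_with_timeouts(
    loader: Callable[[], str],
    timeouts: Sequence[float] = (1.0, 5.0),
    cached: Optional[str] = None,
) -> Optional[str]:
    """Try `loader` once per timeout (shortest first); fall back to `cached`."""
    # One worker per attempt so a stuck attempt does not block the next one
    executor = ThreadPoolExecutor(max_workers=max(1, len(timeouts)))
    try:
        for timeout in timeouts:
            future = executor.submit(loader)
            try:
                return future.result(timeout=timeout)
            except FutureTimeout:
                continue  # this attempt keeps running in the background
        return cached
    finally:
        executor.shutdown(wait=False)  # do not block on abandoned attempts
```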
6. Consultation System Errors
Error: Multi-persona consultation initialization failure
- Detection: Failed to load multiple personas simultaneously
- Recovery Steps:
- Identify which specific personas failed to load
- Continue consultation with available personas
- Use fallback personas for missing ones
- Adjust consultation protocol for reduced participants
- Notify user of consultation limitations
Consultation Recovery:
def recover_consultation_failure(requested_personas, failure_details):
    successful_personas = []
    fallback_personas = []
    for persona in requested_personas:
        if persona in failure_details.failed_personas:
            fallback = get_consultation_fallback(persona)
            if fallback:
                fallback_personas.append(fallback)
        else:
            successful_personas.append(persona)
    # Adjust consultation for available personas
    return adjust_consultation_protocol(successful_personas + fallback_personas)
Error Reporting & Communication
User-Friendly Error Messages
def generate_user_friendly_error(error_type, technical_details):
    error_templates = {
        "config_missing": {
            "message": "BMAD configuration not found. Let me help you set up.",
            "actions": ["Create new config", "Search for existing config", "Download BMAD"],
            "severity": "warning"
        },
        "persona_missing": {
            "message": "The requested specialist isn't available. I can suggest alternatives.",
            "actions": ["Use similar specialist", "Download missing specialist", "Continue with generic"],
            "severity": "info"
        },
        "memory_failure": {
            "message": "Memory system temporarily unavailable. Using session-only context.",
            "actions": ["Retry connection", "Continue without memory", "Check system status"],
            "severity": "warning"
        }
    }
    template = error_templates.get(error_type, get_generic_error_template())
    return format_error_message(template, technical_details)
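`format_error_message` is not defined in this document. A minimal sketch of one possible rendering is shown below; the output layout (severity prefix, numbered actions, trailing details line) is an assumption chosen for illustration.

```python
def format_error_message(template: dict, technical_details: str) -> str:
    """Render a template dict with message, actions, and severity as plain text."""
    lines = [f"[{template['severity'].upper()}] {template['message']}"]
    # Number the suggested actions so the user can pick one by index
    lines += [f"  {i}. {action}"
              for i, action in enumerate(template["actions"], start=1)]
    lines.append(f"Details: {technical_details}")
    return "\n".join(lines)
```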
Error Recovery Guidance
# 🔧 System Recovery Guidance
## Issue Detected: {error_type}
**Severity**: {severity_level}
**Impact**: {functionality_impact}
## What Happened
{user_friendly_explanation}
## Recovery Actions Available
1. **{Primary Action}** (Recommended)
   - What it does: {action_description}
   - Expected outcome: {expected_result}
2. **{Alternative Action}**
   - What it does: {action_description}
   - When to use: {usage_scenario}
## Current System Status
✅ **Working**: {functional_components}
⚠️ **Limited**: {degraded_components}
❌ **Unavailable**: {failed_components}
## Next Steps
Choose an action above, or:
- `/diagnose` - Run comprehensive system health check
- `/recover` - Attempt automatic recovery
- `/fallback` - Switch to safe mode with basic functionality
Would you like me to attempt automatic recovery?
Recovery Success Tracking
Recovery Effectiveness Monitoring
import json

def track_recovery_effectiveness(error_type, recovery_action, outcome):
    recovery_memory = {
        "type": "error_recovery",
        "error_type": error_type,
        "recovery_action": recovery_action,
        "success": outcome.success,
        "time_to_recovery": outcome.duration,
        "user_satisfaction": outcome.user_rating,
        "system_stability_after": assess_stability_post_recovery(),
        "lessons_learned": extract_recovery_lessons(outcome)
    }
    # Store in memory for learning (the outcome object itself is not
    # JSON-serializable, so only its individual fields are recorded)
    add_memories(
        content=json.dumps(recovery_memory),
        tags=["error-recovery", error_type, recovery_action],
        metadata={"type": "recovery", "success": outcome.success}
    )
Adaptive Recovery Learning
def learn_from_recovery_patterns():
    recovery_memories = search_memory(
        "error_recovery outcome success failure",
        limit=50,
        threshold=0.5
    )
    patterns = analyze_recovery_patterns(recovery_memories)
    # Update recovery strategies based on success patterns
    for pattern in patterns.successful_approaches:
        update_recovery_strategy(pattern.error_type, pattern.approach)
    # Flag ineffective recovery approaches
    for pattern in patterns.failed_approaches:
        deprecate_recovery_strategy(pattern.error_type, pattern.approach)
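The pattern analysis step could start from a simple success rate per error type and recovery action. The sketch below assumes each stored record is a dict with the `error_type`, `recovery_action`, and `success` fields used in the tracking snippet; the function name `success_rates` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def success_rates(records: Iterable[dict]) -> Dict[Tuple[str, str], float]:
    """Return the fraction of successful recoveries per (error_type, action)."""
    totals = defaultdict(lambda: [0, 0])  # key -> [successes, attempts]
    for record in records:
        key = (record["error_type"], record["recovery_action"])
        totals[key][0] += 1 if record["success"] else 0
        totals[key][1] += 1
    return {key: wins / attempts for key, (wins, attempts) in totals.items()}
```

Approaches with a high rate are candidates for promotion and those with a low rate for deprecation. A real implementation would likely also require a minimum number of attempts before acting on a rate.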
Proactive Error Prevention
Health Monitoring
def continuous_health_monitoring():
    health_checks = [
        check_config_file_integrity(),
        check_persona_file_availability(),
        check_memory_system_connectivity(),
        check_session_state_writability(),
        check_disk_space_availability(),
        check_file_permissions()
    ]
    for check in health_checks:
        if check.status == "warning":
            schedule_preemptive_action(check)
        elif check.status == "critical":
            trigger_immediate_recovery(check)
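Two of the checks above can be sketched with the standard library alone. The `HealthCheck` result type, the `classify_free_ratio` helper, and the 10% and 2% free-space thresholds are assumptions for illustration; the document only requires that each check expose a `status` of "warning" or "critical" when something is wrong.

```python
import os
import shutil
from dataclasses import dataclass

@dataclass
class HealthCheck:
    name: str
    status: str  # "ok", "warning", or "critical"
    detail: str = ""

def classify_free_ratio(free_ratio: float, warn_below: float = 0.10,
                        critical_below: float = 0.02) -> str:
    """Map the fraction of free disk space to a status string."""
    if free_ratio < critical_below:
        return "critical"
    if free_ratio < warn_below:
        return "warning"
    return "ok"

def check_disk_space_availability(path: str = ".") -> HealthCheck:
    usage = shutil.disk_usage(path)
    free_ratio = usage.free / usage.total if usage.total else 0.0
    return HealthCheck("disk_space", classify_free_ratio(free_ratio),
                       f"{free_ratio:.1%} free")

def check_session_state_writability(directory: str = ".") -> HealthCheck:
    # os.access reports whether the current user may write to the directory
    writable = os.access(directory, os.W_OK)
    return HealthCheck("session_state", "ok" if writable else "critical",
                       f"{directory} writable: {writable}")
```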
Predictive Error Detection
def predict_potential_errors(current_system_state):
    # Use memory patterns to predict likely failures
    similar_states = search_memory(
        f"system state {current_system_state.key_indicators}",
        limit=10,
        threshold=0.7
    )
    potential_errors = []
    for state in similar_states:
        if state.led_to_errors:
            potential_errors.append({
                "error_type": state.error_type,
                "probability": calculate_error_probability(state, current_system_state),
                "prevention_action": state.prevention_strategy,
                "early_warning_signs": state.warning_indicators
            })
    return rank_error_predictions(potential_errors)
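`rank_error_predictions` is left undefined above. A minimal sketch, assuming ranking simply means ordering by the `probability` field, is shown here; the optional `min_probability` cutoff is an added assumption.

```python
from typing import Dict, List

def rank_error_predictions(predictions: List[Dict],
                           min_probability: float = 0.0) -> List[Dict]:
    """Sort predictions by descending probability, dropping those below the cutoff."""
    kept = [p for p in predictions if p["probability"] >= min_probability]
    return sorted(kept, key=lambda p: p["probability"], reverse=True)
```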
Together, these mechanisms let the BMAD orchestrator handle failures gracefully, keep as much functionality available as possible, and learn from each recovery.