# Error Prevention System
## Mistake Tracking and Prevention for Claude Code
The Error Prevention System lets Claude Code record past mistakes and check new operations against them, so that similar errors are caught before they recur and the development environment becomes safer as the error catalog grows.
### Error Catalog and Learning Framework
#### Comprehensive Error Documentation
```yaml
error_entry:
  identification:
    id: "{uuid}"
    timestamp: "2024-01-15T14:30:00Z"
    severity: "critical|high|medium|low"
    category: "security|performance|logic|integration|deployment"
    error_signature: "unique_fingerprint_for_similar_errors"

  error_details:
    description: "Database connection pool exhaustion causing 503 errors"
    symptoms:
      - "HTTP 503 Service Unavailable responses"
      - "Database connection timeout errors in logs"
      - "Application hanging on database queries"
      - "Memory usage steadily increasing"
    impact:
      - user_experience: "Complete service unavailability"
      - business_impact: "Revenue loss during downtime"
      - technical_debt: "Required emergency hotfix"
      - team_impact: "Weekend emergency response required"
    affected_components:
      - "Database connection pool"
      - "API endpoints"
      - "User authentication service"
      - "Payment processing"

  context_information:
    project_phase: "production"
    technology_stack: ["nodejs", "postgresql", "docker", "kubernetes"]
    project_characteristics:
      size: "large"
      complexity: "high"
      team_size: "8"
      load_profile: "high_traffic"
    environmental_factors:
      - "Black Friday traffic spike"
      - "Recent deployment of new features"
      - "Database maintenance window completed day before"
    claude_code_context:
      files_involved: ["src/database/pool.js", "config/database.js"]
      tools_used_before_error: ["Edit", "Bash", "Write"]
      recent_changes: ["Increased connection timeout", "Added retry logic"]

  root_cause_analysis:
    immediate_cause: "Connection pool size insufficient for traffic spike"
    contributing_factors:
      - "Default pool size never adjusted for production load"
      - "No connection pool monitoring in place"
      - "Load testing didn't simulate realistic user behavior"
      - "Connection leak in error handling paths"
    root_cause: "Inadequate capacity planning and monitoring for database connections"
    analysis_method: "5 whys analysis + performance profiling"
    investigation_tools: ["APM traces", "Database logs", "Container metrics"]

  prevention_strategy:
    detection_rules:
      - rule: "Monitor connection pool utilization"
        trigger: "when pool_utilization > 80%"
        action: "Alert DevOps team immediately"
        automation_possible: true
      - rule: "Watch for connection timeout patterns"
        trigger: "when connection_timeouts > 5 in 1 minute"
        action: "Scale pool size automatically"
        automation_possible: true
      - rule: "Track connection pool growth rate"
        trigger: "when pool_size increases > 20% in 5 minutes"
        action: "Check for connection leaks"
        automation_possible: false
    prevention_steps:
      - step: "Implement connection pool monitoring"
        when: "during development phase"
        responsibility: "platform-engineer"
        tools_involved: ["monitoring setup", "alerting configuration"]
        effort_estimate: "4 hours"
      - step: "Add connection pool size auto-scaling"
        when: "before production deployment"
        responsibility: "dev"
        tools_involved: ["database configuration", "scaling logic"]
        effort_estimate: "8 hours"
      - step: "Implement proper connection cleanup"
        when: "during code review"
        responsibility: "dev"
        tools_involved: ["code review", "static analysis"]
        effort_estimate: "2 hours"
    validation_checks:
      - check: "Load test with connection pool monitoring"
        automation: "ci_cd_pipeline"
        frequency: "before_each_production_deployment"
      - check: "Review database connection usage patterns"
        automation: "static_analysis_tool"
        frequency: "with_each_code_change"
      - check: "Validate connection cleanup in error paths"
        automation: "integration_tests"
        frequency: "continuous"

  recovery_procedures:
    immediate_response:
      - "Scale database connection pool size"
      - "Restart application instances to clear stale connections"
      - "Enable database connection throttling"
      - "Redirect traffic to secondary regions if available"
    short_term_fixes:
      - "Implement connection pool monitoring dashboard"
      - "Add automated scaling for connection pool"
      - "Fix connection leaks in error handling"
    long_term_improvements:
      - "Implement comprehensive database capacity planning"
      - "Add chaos engineering tests for database failures"
      - "Create runbooks for database scaling scenarios"

  lessons_learned:
    - "Connection pool sizing must account for traffic spikes"
    - "Monitoring is essential for database resource management"
    - "Load testing scenarios should include realistic user patterns"
    - "Error handling paths need careful connection management"
    - "Automated scaling can prevent manual intervention delays"
```
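The `detection_rules` in the entry above are written as free-text triggers. As a rough illustration of how they might be evaluated against a metrics snapshot, here is a minimal sketch. The `DetectionRule` class, the `fired_actions` function, and the metric key names (`pool_utilization`, `connection_timeouts_last_minute`, `pool_size_growth_5m`) are all assumptions made for this example; the catalog format itself does not define them.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class DetectionRule:
    rule: str
    action: str
    automation_possible: bool
    # Predicate over a metrics snapshot. This is a hypothetical stand-in
    # for the free-text `trigger` field of the catalog entry.
    trigger: Callable[[Dict[str, float]], bool]


# The three rules from the catalog entry, with thresholds copied from it.
RULES: List[DetectionRule] = [
    DetectionRule(
        rule="Monitor connection pool utilization",
        action="Alert DevOps team immediately",
        automation_possible=True,
        trigger=lambda m: m["pool_utilization"] > 0.80,
    ),
    DetectionRule(
        rule="Watch for connection timeout patterns",
        action="Scale pool size automatically",
        automation_possible=True,
        trigger=lambda m: m["connection_timeouts_last_minute"] > 5,
    ),
    DetectionRule(
        rule="Track connection pool growth rate",
        action="Check for connection leaks",
        automation_possible=False,
        trigger=lambda m: m["pool_size_growth_5m"] > 0.20,
    ),
]


def fired_actions(metrics: Dict[str, float]) -> List[str]:
    """Return the actions of every rule whose trigger holds for `metrics`."""
    return [r.action for r in RULES if r.trigger(metrics)]


# Illustrative snapshot (made-up values): 92% utilization, 2 timeouts in
# the last minute, pool grew 25% in 5 minutes.
sample = {
    "pool_utilization": 0.92,
    "connection_timeouts_last_minute": 2,
    "pool_size_growth_5m": 0.25,
}
print(fired_actions(sample))
```

Keeping `automation_possible` on the rule object would let a caller decide whether to run the action directly or only surface it as a warning.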
### Proactive Error Detection for Claude Code
#### Claude Code Tool Integration for Error Prevention
```python
from datetime import datetime, timezone

# Helper coroutines such as get_relevant_error_patterns() and the
# claude_code_* tool wrappers are assumed to be provided elsewhere
# in the system; they are not defined in this document.


async def prevent_errors_in_claude_operations(operation_type, operation_context):
    """
    Prevent errors before Claude Code tool execution
    """
    # Get operation-specific error patterns
    relevant_errors = await get_relevant_error_patterns(
        operation_type,
        operation_context
    )

    error_prevention_result = {
        'operation_safe': True,
        'warnings': [],
        'preventive_actions': [],
        'risk_factors': []
    }

    # Analyze each relevant error pattern
    for error_pattern in relevant_errors:
        risk_assessment = assess_error_risk(
            error_pattern,
            operation_context
        )

        if risk_assessment.risk_level > 0.3:  # 30% risk threshold
            error_prevention_result['operation_safe'] = False
            error_prevention_result['warnings'].append({
                'error_type': error_pattern['category'],
                'description': error_pattern['description'],
                'risk_level': risk_assessment.risk_level,
                'similar_past_cases': risk_assessment.similar_cases
            })

            # Generate preventive actions
            preventive_actions = generate_preventive_actions(
                error_pattern,
                operation_context
            )
            error_prevention_result['preventive_actions'].extend(preventive_actions)

    return error_prevention_result


async def error_aware_file_edit(file_path, edit_content, current_context):
    """
    Edit files with error prevention based on historical patterns
    """
    # Pre-edit error analysis
    edit_risks = await analyze_edit_risks(file_path, edit_content, current_context)

    if edit_risks.has_high_risk_patterns:
        # Present warnings and suggest safer alternatives
        risk_warnings = []
        for risk in edit_risks.high_risk_patterns:
            warning = {
                'risk_type': risk.pattern_type,
                'description': risk.description,
                'historical_failures': risk.past_failures,
                'suggested_alternatives': risk.safer_alternatives
            }
            risk_warnings.append(warning)

        # Get user confirmation or apply safer alternatives
        prevention_response = await handle_edit_risk_warnings(
            risk_warnings,
            file_path,
            edit_content
        )

        if prevention_response.action == 'cancel':
            return {'status': 'cancelled', 'reason': 'high_risk_prevented'}
        elif prevention_response.action == 'modify':
            edit_content = prevention_response.safer_content

    # Execute edit with monitoring
    edit_result = await claude_code_edit(file_path, edit_content)

    # Post-edit validation
    post_edit_validation = await validate_edit_success(
        file_path,
        edit_content,
        edit_result,
        edit_risks
    )

    # Learn from edit outcome
    await learn_from_edit_outcome(
        file_path,
        edit_content,
        edit_result,
        post_edit_validation,
        current_context
    )

    return {
        'edit_result': edit_result,
        'risk_prevention': edit_risks,
        'validation': post_edit_validation
    }


async def error_aware_bash_execution(command, current_context):
    """
    Execute bash commands with error prevention
    """
    # Analyze command for known dangerous patterns
    command_risks = await analyze_command_risks(command, current_context)

    if command_risks.has_dangerous_patterns:
        # Check against error history
        similar_failures = await find_similar_command_failures(
            command,
            current_context
        )

        if similar_failures:
            # Provide warnings and safer alternatives
            safety_recommendations = generate_command_safety_recommendations(
                command,
                similar_failures,
                current_context
            )
            safer_command = await suggest_safer_command_alternative(
                command,
                safety_recommendations
            )
            if safer_command:
                command = safer_command

    # Execute with error monitoring
    execution_start = datetime.now(timezone.utc)
    try:
        result = await claude_code_bash(command)
        execution_duration = (datetime.now(timezone.utc) - execution_start).total_seconds()

        # Learn from successful execution
        await record_successful_command_execution(
            command,
            result,
            execution_duration,
            current_context
        )
        return result

    except Exception as e:
        execution_duration = (datetime.now(timezone.utc) - execution_start).total_seconds()

        # Learn from failed execution
        await record_failed_command_execution(
            command,
            str(e),
            execution_duration,
            current_context
        )

        # Try to provide recovery suggestions
        recovery_suggestions = await generate_recovery_suggestions(
            command,
            str(e),
            current_context
        )

        # Chain the original exception so its traceback is preserved
        raise RuntimeError(
            f"Command failed: {e}\nRecovery suggestions: {recovery_suggestions}"
        ) from e
```
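The code above relies on an `analyze_command_risks` step that is not spelled out. One simple way such a check could work is regular-expression matching against a list of known dangerous command shapes. The sketch below is only an illustration under that assumption: `CommandRisk`, `DANGEROUS_PATTERNS`, and `match_dangerous_patterns` are hypothetical names, and the two patterns are examples rather than a complete list.

```python
import re
from typing import List, NamedTuple


class CommandRisk(NamedTuple):
    name: str
    reason: str


# Hypothetical pattern list; a real system would presumably load these
# from the error catalog instead of hard-coding them.
DANGEROUS_PATTERNS = [
    (
        "recursive_force_delete",
        # rm with both -r and -f in one flag group, e.g. -rf or -fr
        re.compile(r"\brm\s+(-[a-zA-Z]*r[a-zA-Z]*f|-[a-zA-Z]*f[a-zA-Z]*r)\b"),
        "Recursively deletes files without confirmation",
    ),
    (
        "pipe_to_shell",
        # curl or wget output piped straight into sh or bash
        re.compile(r"\b(curl|wget)\b[^|]*\|\s*(sudo\s+)?(ba)?sh\b"),
        "Executes a downloaded script without inspection",
    ),
]


def match_dangerous_patterns(command: str) -> List[CommandRisk]:
    """Return one CommandRisk for every pattern that matches `command`."""
    return [
        CommandRisk(name, reason)
        for name, regex, reason in DANGEROUS_PATTERNS
        if regex.search(command)
    ]


for cmd in ("rm -rf node_modules", "rm notes.txt"):
    risks = match_dangerous_patterns(cmd)
    print(cmd, "->", [r.name for r in risks])
```

Regex matching is cheap but shallow: it does not understand quoting, aliases, or variable expansion, so it would likely serve as a first filter ahead of the history lookup in `find_similar_command_failures`.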
### Pattern-Based Error Prevention
#### Automatic Error Pattern Detection
```python
# The claude_code_* tool wrappers and helpers such as
# load_error_pattern_library() are assumed to be provided elsewhere.


async def detect_error_patterns_in_codebase(project_path):
    """
    Detect potential error patterns in codebase using Claude Code tools
    """
    # Use Glob to find all relevant files under project_path
    search_root = project_path.rstrip("/")
    code_files = await claude_code_glob(f"{search_root}/**/*.{{js,ts,py,java,go,rb}}")

    detected_patterns = {
        'high_risk': [],
        'medium_risk': [],
        'low_risk': []
    }

    # Load known error patterns
    error_patterns = await load_error_pattern_library()

    # Analyze each file for error patterns
    for file_path in code_files:
        file_content = await claude_code_read(file_path)

        for pattern in error_patterns:
            # Use Grep to find pattern matches
            pattern_matches = await claude_code_grep(pattern.search_regex, file_path)

            if pattern_matches.matches:
                for match in pattern_matches.matches:
                    risk_assessment = assess_pattern_risk(
                        pattern,
                        match,
                        file_content,
                        file_path
                    )

                    detected_pattern = {
                        'pattern_name': pattern.name,
                        'file_path': file_path,
                        'line_number': match.line_number,
                        'match_text': match.text,
                        'risk_level': risk_assessment.risk_level,
                        'potential_issues': risk_assessment.potential_issues,
                        'recommendations': risk_assessment.recommendations
                    }

                    # Bucket by risk level: >= 0.7 high, >= 0.4 medium, else low
                    if risk_assessment.risk_level >= 0.7:
                        detected_patterns['high_risk'].append(detected_pattern)
                    elif risk_assessment.risk_level >= 0.4:
                        detected_patterns['medium_risk'].append(detected_pattern)
                    else:
                        detected_patterns['low_risk'].append(detected_pattern)

    # Generate prevention recommendations
    prevention_plan = await generate_pattern_prevention_plan(detected_patterns)

    return {
        'detected_patterns': detected_patterns,
        'prevention_plan': prevention_plan,
        'risk_summary': {
            'high_risk_count': len(detected_patterns['high_risk']),
            'medium_risk_count': len(detected_patterns['medium_risk']),
            'low_risk_count': len(detected_patterns['low_risk'])
        }
    }


async def implement_error_prevention_fixes(prevention_plan, project_context):
    """
    Implement error prevention fixes using Claude Code tools
    """
    implementation_results = []

    for fix in prevention_plan.recommended_fixes:
        try:
            if fix.fix_type == 'code_modification':
                # Use Edit tool to apply code fixes
                fix_result = await apply_code_fix(fix, project_context)
            elif fix.fix_type == 'configuration_change':
                # Use Write tool to update configuration
                fix_result = await apply_configuration_fix(fix, project_context)
            elif fix.fix_type == 'dependency_update':
                # Use Bash tool to update dependencies
                fix_result = await apply_dependency_fix(fix, project_context)
            elif fix.fix_type == 'test_addition':
                # Use Write tool to add preventive tests
                fix_result = await add_preventive_tests(fix, project_context)
            else:
                # Without this branch, fix_result would be unbound (or stale
                # from the previous iteration) for an unknown fix type
                raise ValueError(f"Unknown fix type: {fix.fix_type}")

            implementation_results.append({
                'fix_id': fix.id,
                'status': 'success',
                'result': fix_result
            })

        except Exception as e:
            implementation_results.append({
                'fix_id': fix.id,
                'status': 'failed',
                'error': str(e)
            })

    # Validate fixes were applied correctly
    validation_results = await validate_prevention_fixes(
        implementation_results,
        project_context
    )

    return {
        'implementation_results': implementation_results,
        'validation_results': validation_results,
        'overall_success': all(r['status'] == 'success' for r in implementation_results)
    }
```
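To make the scan-and-bucket step concrete without the Claude Code tool wrappers, here is a self-contained sketch that scans an in-memory string with Python's `re` module and sorts matches into the same three buckets (0.7 and 0.4 thresholds). The `ErrorPattern` class, its fixed `base_risk` score, and the three sample patterns are assumptions for illustration; the system described above derives risk per match through `assess_pattern_risk`, which is not reproduced here.

```python
import re
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ErrorPattern:
    name: str
    search_regex: str
    base_risk: float  # hypothetical fixed score in [0, 1]


def bucket_for(risk_level: float) -> str:
    """Map a risk level to a bucket using the 0.7 / 0.4 thresholds."""
    if risk_level >= 0.7:
        return "high_risk"
    if risk_level >= 0.4:
        return "medium_risk"
    return "low_risk"


def scan_source(text: str, patterns: List[ErrorPattern]) -> Dict[str, List[dict]]:
    """Check every line against every pattern and bucket the matches."""
    detected: Dict[str, List[dict]] = {
        "high_risk": [], "medium_risk": [], "low_risk": [],
    }
    for line_number, line in enumerate(text.splitlines(), start=1):
        for pattern in patterns:
            if re.search(pattern.search_regex, line):
                detected[bucket_for(pattern.base_risk)].append({
                    "pattern_name": pattern.name,
                    "line_number": line_number,
                    "match_text": line.strip(),
                })
    return detected


# Example patterns and scores, chosen only for this sketch.
PATTERNS = [
    ErrorPattern("eval_call", r"\beval\(", 0.9),
    ErrorPattern("bare_except", r"^\s*except\s*:", 0.5),
    ErrorPattern("todo_marker", r"\bTODO\b", 0.1),
]

SOURCE = """\
try:
    value = eval(user_input)  # TODO: replace
except:
    value = None
"""

report = scan_source(SOURCE, PATTERNS)
for bucket, hits in report.items():
    print(bucket, [(h["pattern_name"], h["line_number"]) for h in hits])
```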
### Real-time Error Monitoring and Learning
#### Continuous Learning from Claude Code Operations
```python
import asyncio
from datetime import datetime, timezone

# The monitor classes and helper coroutines used below are assumed to be
# provided elsewhere in the system; they are not defined in this document.


async def monitor_claude_code_operations():
    """
    Continuously monitor Claude Code operations for error patterns and learning opportunities
    """
    operation_monitor = {
        'tool_usage_monitor': ToolUsageMonitor(),
        'error_detection_monitor': ErrorDetectionMonitor(),
        'performance_monitor': PerformanceMonitor(),
        'success_pattern_monitor': SuccessPatternMonitor()
    }

    async def monitoring_loop():
        while True:
            # Collect operation data
            operation_data = await collect_operation_data(operation_monitor)

            # Analyze for error patterns
            error_analysis = await analyze_for_error_patterns(operation_data)

            if error_analysis.new_patterns_detected:
                # Learn new error patterns
                await learn_new_error_patterns(error_analysis.new_patterns)

                # Update prevention rules
                await update_prevention_rules(error_analysis.new_patterns)

            # Analyze for success patterns
            success_analysis = await analyze_for_success_patterns(operation_data)

            if success_analysis.new_patterns_detected:
                # Learn new success patterns
                await learn_new_success_patterns(success_analysis.new_patterns)

                # Update recommendation engine
                await update_recommendation_engine(success_analysis.new_patterns)

            # Update error prevention database
            await update_error_prevention_database(
                error_analysis,
                success_analysis,
                operation_data
            )

            await asyncio.sleep(5)  # Monitor every 5 seconds

    # Start monitoring (runs until the task is cancelled)
    await monitoring_loop()


async def learn_from_error_occurrence(error_details, context):
    """
    Learn from actual error occurrences to improve prevention
    """
    # Create error entry
    error_entry = {
        'id': generate_uuid(),
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'error_details': error_details,
        'context': context,
        'severity': classify_error_severity(error_details),
        'category': classify_error_category(error_details)
    }

    # Perform root cause analysis
    root_cause_analysis = await perform_root_cause_analysis(
        error_details,
        context
    )
    error_entry['root_cause_analysis'] = root_cause_analysis

    # Generate prevention strategies
    prevention_strategies = await generate_prevention_strategies(
        error_entry,
        root_cause_analysis
    )
    error_entry['prevention_strategy'] = prevention_strategies

    # Store error entry
    await store_error_entry(error_entry)

    # Update prevention rules
    await update_prevention_rules_from_error(error_entry)

    # Notify relevant personas about new error pattern
    await notify_personas_of_new_error_pattern(error_entry)

    return {
        'error_learned': True,
        'prevention_strategies_generated': len(prevention_strategies['prevention_steps']),
        'detection_rules_created': len(prevention_strategies['detection_rules'])
    }
```
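Learning from repeated errors depends on recognizing that two occurrences are "the same" error, which is what the `error_signature` field in the catalog is for. The document does not say how that fingerprint is computed, so the following is one possible sketch: normalize the message by masking numbers and hex values, then hash it together with the category. The function names `error_signature` and `record_error`, the normalization rules, and the 16-character truncation are all assumptions.

```python
import hashlib
import re
from collections import Counter


def error_signature(category: str, message: str) -> str:
    """Fingerprint an error so that similar occurrences share a signature.

    Volatile details (hex addresses, numbers, spacing, letter case) are
    masked before hashing. These normalization rules are illustrative.
    """
    normalized = message.lower()
    normalized = re.sub(r"0x[0-9a-f]+", "<hex>", normalized)
    normalized = re.sub(r"\d+", "<n>", normalized)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    digest = hashlib.sha256(f"{category}:{normalized}".encode("utf-8")).hexdigest()
    return digest[:16]  # shortened for readability


# In-memory occurrence counter, standing in for the error database.
occurrences: Counter = Counter()


def record_error(category: str, message: str) -> int:
    """Record one occurrence and return how often this signature was seen."""
    signature = error_signature(category, message)
    occurrences[signature] += 1
    return occurrences[signature]


first = record_error("performance", "Connection timeout after 30s on pool 7")
second = record_error("performance", "connection timeout after 45s on pool 12")
print(first, second)  # both messages normalize to the same signature
```

A recurrence count like this could feed severity classification or decide when a pattern is common enough to deserve its own prevention rule.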
### Error Prevention Dashboard and Reporting
#### Comprehensive Error Prevention Analytics
```yaml
error_prevention_metrics:
  prevention_effectiveness:
    errors_prevented: "Count of errors caught before execution"
    false_positives: "Warnings that didn't lead to actual errors"
    false_negatives: "Errors that weren't caught by prevention"
    prevention_accuracy: "Percentage of accurate error predictions"

  learning_progress:
    new_patterns_learned: "Number of new error patterns identified"
    pattern_accuracy_improvement: "How pattern recognition has improved"
    prevention_rule_effectiveness: "Success rate of prevention rules"

  system_reliability:
    mean_time_between_errors: "MTBE for different error categories"
    error_severity_distribution: "Breakdown of error types caught"
    recovery_time_improvement: "How quickly errors are resolved"

  development_impact:
    development_velocity_impact: "How prevention affects speed"
    code_quality_improvement: "Measurable quality gains"
    developer_confidence: "Survey results on prevention helpfulness"
```
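The `prevention_effectiveness` counters above are enough to compute standard precision and recall figures. The sketch below treats `errors_prevented` as true positives, which is an interpretation rather than something the metric definitions state, and the `PreventionCounts` class and sample numbers are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class PreventionCounts:
    errors_prevented: int  # warnings that matched a real problem (assumed true positives)
    false_positives: int   # warnings that did not lead to actual errors
    false_negatives: int   # errors that prevention did not catch


def prevention_precision(c: PreventionCounts) -> float:
    """Share of raised warnings that were justified."""
    flagged = c.errors_prevented + c.false_positives
    return c.errors_prevented / flagged if flagged else 0.0


def prevention_recall(c: PreventionCounts) -> float:
    """Share of real errors that were caught before execution."""
    actual = c.errors_prevented + c.false_negatives
    return c.errors_prevented / actual if actual else 0.0


# Illustrative counts only; not measured data.
counts = PreventionCounts(errors_prevented=42, false_positives=8, false_negatives=14)
print(f"precision={prevention_precision(counts):.2f}")
print(f"recall={prevention_recall(counts):.2f}")
```

Reporting the two numbers separately shows the trade-off behind the warning threshold: lowering it tends to raise recall at the cost of more false positives.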
### Claude Code Integration Commands
```bash
# Error prevention and analysis
bmad prevent --analyze-risks --operation "database-migration"
bmad prevent --scan-patterns --project-path "src/"
bmad prevent --check-command "rm -rf node_modules" --suggest-safer
# Error learning and pattern management
bmad errors learn --from-incident "incident-report.md"
bmad errors patterns --list --category "security"
bmad errors rules --update --based-on-recent-failures
# Prevention implementation
bmad prevent implement --fixes-for "high-risk-patterns"
bmad prevent validate --applied-fixes --test-effectiveness
bmad prevent monitor --real-time --alert-on-risks
# Error prevention reporting
bmad prevent report --effectiveness --time-period "last-month"
bmad prevent dashboard --show-trends --error-categories
bmad prevent export --prevention-rules --format "yaml"
```
Taken together, the error catalog, pre-execution checks, pattern scanning, and continuous monitoring let Claude Code learn from recorded mistakes and check new operations against them, so the development environment becomes safer and more reliable as the catalog grows.