# Error Prevention System

## Mistake Tracking and Prevention for Claude Code

The Error Prevention System enables Claude Code to learn from past mistakes and proactively prevent similar errors. Each recorded incident feeds new detection rules and prevention steps, so the development environment becomes safer as the error catalog grows.

### Error Catalog and Learning Framework

#### Comprehensive Error Documentation

```yaml
error_entry:
  identification:
    id: "{uuid}"
    timestamp: "2024-01-15T14:30:00Z"
    severity: "critical|high|medium|low"
    category: "security|performance|logic|integration|deployment"
    error_signature: "unique_fingerprint_for_similar_errors"

  error_details:
    description: "Database connection pool exhaustion causing 503 errors"
    symptoms:
      - "HTTP 503 Service Unavailable responses"
      - "Database connection timeout errors in logs"
      - "Application hanging on database queries"
      - "Memory usage steadily increasing"
    impact:
      - user_experience: "Complete service unavailability"
      - business_impact: "Revenue loss during downtime"
      - technical_debt: "Required emergency hotfix"
      - team_impact: "Weekend emergency response required"
    affected_components:
      - "Database connection pool"
      - "API endpoints"
      - "User authentication service"
      - "Payment processing"

  context_information:
    project_phase: "production"
    technology_stack: ["nodejs", "postgresql", "docker", "kubernetes"]
    project_characteristics:
      size: "large"
      complexity: "high"
      team_size: "8"
      load_profile: "high_traffic"
    environmental_factors:
      - "Black Friday traffic spike"
      - "Recent deployment of new features"
      - "Database maintenance window completed day before"
    claude_code_context:
      files_involved: ["src/database/pool.js", "config/database.js"]
      tools_used_before_error: ["Edit", "Bash", "Write"]
      recent_changes: ["Increased connection timeout", "Added retry logic"]

  root_cause_analysis:
    immediate_cause: "Connection pool size insufficient for traffic spike"
    contributing_factors:
      - "Default pool size never adjusted for production load"
      - "No connection pool monitoring in place"
      - "Load testing didn't simulate realistic user behavior"
      - "Connection leak in error handling paths"
    root_cause: "Inadequate capacity planning and monitoring for database connections"
    analysis_method: "5 whys analysis + performance profiling"
    investigation_tools: ["APM traces", "Database logs", "Container metrics"]

  prevention_strategy:
    detection_rules:
      - rule: "Monitor connection pool utilization"
        trigger: "when pool_utilization > 80%"
        action: "Alert DevOps team immediately"
        automation_possible: true

      - rule: "Watch for connection timeout patterns"
        trigger: "when connection_timeouts > 5 in 1 minute"
        action: "Scale pool size automatically"
        automation_possible: true

      - rule: "Track connection pool growth rate"
        trigger: "when pool_size increases > 20% in 5 minutes"
        action: "Check for connection leaks"
        automation_possible: false

    prevention_steps:
      - step: "Implement connection pool monitoring"
        when: "during development phase"
        responsibility: "platform-engineer"
        tools_involved: ["monitoring setup", "alerting configuration"]
        effort_estimate: "4 hours"

      - step: "Add connection pool size auto-scaling"
        when: "before production deployment"
        responsibility: "dev"
        tools_involved: ["database configuration", "scaling logic"]
        effort_estimate: "8 hours"

      - step: "Implement proper connection cleanup"
        when: "during code review"
        responsibility: "dev"
        tools_involved: ["code review", "static analysis"]
        effort_estimate: "2 hours"

    validation_checks:
      - check: "Load test with connection pool monitoring"
        automation: "ci_cd_pipeline"
        frequency: "before_each_production_deployment"

      - check: "Review database connection usage patterns"
        automation: "static_analysis_tool"
        frequency: "with_each_code_change"

      - check: "Validate connection cleanup in error paths"
        automation: "integration_tests"
        frequency: "continuous"

  recovery_procedures:
    immediate_response:
      - "Scale database connection pool size"
      - "Restart application instances to clear stale connections"
      - "Enable database connection throttling"
      - "Redirect traffic to secondary regions if available"

    short_term_fixes:
      - "Implement connection pool monitoring dashboard"
      - "Add automated scaling for connection pool"
      - "Fix connection leaks in error handling"

    long_term_improvements:
      - "Implement comprehensive database capacity planning"
      - "Add chaos engineering tests for database failures"
      - "Create runbooks for database scaling scenarios"

  lessons_learned:
    - "Connection pool sizing must account for traffic spikes"
    - "Monitoring is essential for database resource management"
    - "Load testing scenarios should include realistic user patterns"
    - "Error handling paths need careful connection management"
    - "Automated scaling can prevent manual intervention delays"
```
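
The `error_signature` field above is described only as a fingerprint shared by similar errors. As a minimal sketch of how such a fingerprint could be derived, assuming that "similar" means the same category and the same description once volatile details are masked, the hypothetical helper below hashes a normalized description. The function name, the normalization rules, and the 16-character truncation are illustrative assumptions, not part of the catalog format.

```python
import hashlib
import re


def error_signature(category: str, description: str) -> str:
    """Return a short, stable fingerprint for grouping similar errors.

    Hypothetical helper: the normalization below is an assumption made
    for illustration, not the catalog's actual algorithm.
    """
    text = description.lower()
    # Mask hex identifiers first, then plain numbers, so volatile details
    # (addresses, status codes, counts) do not change the fingerprint.
    text = re.sub(r"0x[0-9a-f]+", "<hex>", text)
    text = re.sub(r"\d+", "<n>", text)
    # Collapse runs of whitespace into single spaces.
    text = " ".join(text.split())
    payload = f"{category.lower()}|{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

With this normalization, two reports that differ only in casing, spacing, or an HTTP status code map to the same signature, so they can be grouped under one catalog entry.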

### Proactive Error Detection for Claude Code

#### Claude Code Tool Integration for Error Prevention

```python
import time

# NOTE: the helpers called below (get_relevant_error_patterns, assess_error_risk,
# claude_code_edit, claude_code_bash, and so on) stand in for the system's
# pattern store and Claude Code tool wrappers; they are not defined here.


async def prevent_errors_in_claude_operations(operation_type, operation_context):
    """
    Prevent errors before Claude Code tool execution
    """
    # Get operation-specific error patterns
    relevant_errors = await get_relevant_error_patterns(
        operation_type,
        operation_context
    )

    error_prevention_result = {
        'operation_safe': True,
        'warnings': [],
        'preventive_actions': [],
        'risk_factors': []
    }

    # Analyze each relevant error pattern
    for error_pattern in relevant_errors:
        risk_assessment = assess_error_risk(
            error_pattern,
            operation_context
        )

        if risk_assessment.risk_level > 0.3:  # 30% risk threshold
            error_prevention_result['operation_safe'] = False
            error_prevention_result['warnings'].append({
                'error_type': error_pattern['category'],
                'description': error_pattern['description'],
                'risk_level': risk_assessment.risk_level,
                'similar_past_cases': risk_assessment.similar_cases
            })

            # Generate preventive actions
            preventive_actions = generate_preventive_actions(
                error_pattern,
                operation_context
            )
            error_prevention_result['preventive_actions'].extend(preventive_actions)

    return error_prevention_result


async def error_aware_file_edit(file_path, edit_content, current_context):
    """
    Edit files with error prevention based on historical patterns
    """
    # Pre-edit error analysis
    edit_risks = await analyze_edit_risks(file_path, edit_content, current_context)

    if edit_risks.has_high_risk_patterns:
        # Present warnings and suggest safer alternatives
        risk_warnings = []

        for risk in edit_risks.high_risk_patterns:
            warning = {
                'risk_type': risk.pattern_type,
                'description': risk.description,
                'historical_failures': risk.past_failures,
                'suggested_alternatives': risk.safer_alternatives
            }
            risk_warnings.append(warning)

        # Get user confirmation or apply safer alternatives
        prevention_response = await handle_edit_risk_warnings(
            risk_warnings,
            file_path,
            edit_content
        )

        if prevention_response.action == 'cancel':
            return {'status': 'cancelled', 'reason': 'high_risk_prevented'}
        elif prevention_response.action == 'modify':
            edit_content = prevention_response.safer_content

    # Execute edit with monitoring
    edit_result = await claude_code_edit(file_path, edit_content)

    # Post-edit validation
    post_edit_validation = await validate_edit_success(
        file_path,
        edit_content,
        edit_result,
        edit_risks
    )

    # Learn from edit outcome
    await learn_from_edit_outcome(
        file_path,
        edit_content,
        edit_result,
        post_edit_validation,
        current_context
    )

    return {
        'edit_result': edit_result,
        'risk_prevention': edit_risks,
        'validation': post_edit_validation
    }


async def error_aware_bash_execution(command, current_context):
    """
    Execute bash commands with error prevention
    """
    # Analyze command for known dangerous patterns
    command_risks = await analyze_command_risks(command, current_context)

    if command_risks.has_dangerous_patterns:
        # Check against error history
        similar_failures = await find_similar_command_failures(
            command,
            current_context
        )

        if similar_failures:
            # Provide warnings and safer alternatives
            safety_recommendations = generate_command_safety_recommendations(
                command,
                similar_failures,
                current_context
            )

            safer_command = await suggest_safer_command_alternative(
                command,
                safety_recommendations
            )

            if safer_command:
                command = safer_command

    # Execute with error monitoring; a monotonic clock is used for the
    # duration so that system clock adjustments cannot skew it
    execution_start = time.monotonic()

    try:
        result = await claude_code_bash(command)
        execution_duration = time.monotonic() - execution_start

        # Learn from successful execution
        await record_successful_command_execution(
            command,
            result,
            execution_duration,
            current_context
        )

        return result

    except Exception as e:
        execution_duration = time.monotonic() - execution_start

        # Learn from failed execution
        await record_failed_command_execution(
            command,
            str(e),
            execution_duration,
            current_context
        )

        # Try to provide recovery suggestions
        recovery_suggestions = await generate_recovery_suggestions(
            command,
            str(e),
            current_context
        )

        # Chain the original exception so its traceback is preserved
        raise RuntimeError(
            f"Command failed: {e}\nRecovery suggestions: {recovery_suggestions}"
        ) from e
```
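
`analyze_command_risks` is called above but not shown. A minimal sketch of its pattern-matching part, assuming a small hand-written list of regular expressions, might look like the following. `DANGEROUS_PATTERNS`, `CommandRisk`, and `scan_command` are hypothetical names, and the pattern list is deliberately short rather than exhaustive.

```python
import re
from dataclasses import dataclass, field

# Hypothetical pattern list: (compiled regex, human-readable reason).
DANGEROUS_PATTERNS = [
    # rm with both a recursive (-r/-R) and a force (-f) flag in one option group
    (re.compile(r"\brm\s+-(?=[a-zA-Z]*[rR])(?=[a-zA-Z]*f)[a-zA-Z]+"),
     "recursive forced delete"),
    (re.compile(r"\bgit\s+push\b.*(--force|-f)\b"), "force push"),
    (re.compile(r"\bchmod\s+(-R\s+)?777\b"), "world-writable permissions"),
    (re.compile(r"\bcurl\b.*\|\s*(sudo\s+)?(ba)?sh\b"),
     "piping a download into a shell"),
]


@dataclass
class CommandRisk:
    """Result of scanning one command line."""
    command: str
    reasons: list = field(default_factory=list)

    @property
    def has_dangerous_patterns(self) -> bool:
        return bool(self.reasons)


def scan_command(command: str) -> CommandRisk:
    """Collect the reason for every dangerous pattern the command matches."""
    risk = CommandRisk(command)
    for regex, reason in DANGEROUS_PATTERNS:
        if regex.search(command):
            risk.reasons.append(reason)
    return risk
```

Regex matching of this kind only catches textual forms it was written for; the history lookup in `find_similar_command_failures` is what would cover commands that look harmless but have failed before.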

### Pattern-Based Error Prevention

#### Automatic Error Pattern Detection

```python
import os

# NOTE: claude_code_glob, claude_code_read, claude_code_grep and the other
# helpers stand in for Claude Code tool wrappers; they are not defined here.


async def detect_error_patterns_in_codebase(project_path):
    """
    Detect potential error patterns in codebase using Claude Code tools
    """
    # Use Glob to find all relevant files under the project root
    code_files = await claude_code_glob(
        os.path.join(project_path, "**/*.{js,ts,py,java,go,rb}")
    )

    detected_patterns = {
        'high_risk': [],
        'medium_risk': [],
        'low_risk': []
    }

    # Load known error patterns
    error_patterns = await load_error_pattern_library()

    # Analyze each file for error patterns
    for file_path in code_files:
        file_content = await claude_code_read(file_path)

        for pattern in error_patterns:
            # Use Grep to find pattern matches
            pattern_matches = await claude_code_grep(pattern.search_regex, file_path)

            if pattern_matches.matches:
                for match in pattern_matches.matches:
                    risk_assessment = assess_pattern_risk(
                        pattern,
                        match,
                        file_content,
                        file_path
                    )

                    detected_pattern = {
                        'pattern_name': pattern.name,
                        'file_path': file_path,
                        'line_number': match.line_number,
                        'match_text': match.text,
                        'risk_level': risk_assessment.risk_level,
                        'potential_issues': risk_assessment.potential_issues,
                        'recommendations': risk_assessment.recommendations
                    }

                    # Bucket by risk score: >= 0.7 high, >= 0.4 medium, else low
                    if risk_assessment.risk_level >= 0.7:
                        detected_patterns['high_risk'].append(detected_pattern)
                    elif risk_assessment.risk_level >= 0.4:
                        detected_patterns['medium_risk'].append(detected_pattern)
                    else:
                        detected_patterns['low_risk'].append(detected_pattern)

    # Generate prevention recommendations
    prevention_plan = await generate_pattern_prevention_plan(detected_patterns)

    return {
        'detected_patterns': detected_patterns,
        'prevention_plan': prevention_plan,
        'risk_summary': {
            'high_risk_count': len(detected_patterns['high_risk']),
            'medium_risk_count': len(detected_patterns['medium_risk']),
            'low_risk_count': len(detected_patterns['low_risk'])
        }
    }


async def implement_error_prevention_fixes(prevention_plan, project_context):
    """
    Implement error prevention fixes using Claude Code tools
    """
    implementation_results = []

    for fix in prevention_plan.recommended_fixes:
        try:
            if fix.fix_type == 'code_modification':
                # Use Edit tool to apply code fixes
                fix_result = await apply_code_fix(fix, project_context)

            elif fix.fix_type == 'configuration_change':
                # Use Write tool to update configuration
                fix_result = await apply_configuration_fix(fix, project_context)

            elif fix.fix_type == 'dependency_update':
                # Use Bash tool to update dependencies
                fix_result = await apply_dependency_fix(fix, project_context)

            elif fix.fix_type == 'test_addition':
                # Use Write tool to add preventive tests
                fix_result = await add_preventive_tests(fix, project_context)

            else:
                # Record unknown fix types as failures instead of reusing a
                # stale (or undefined) fix_result from an earlier iteration
                raise ValueError(f"Unknown fix type: {fix.fix_type}")

            implementation_results.append({
                'fix_id': fix.id,
                'status': 'success',
                'result': fix_result
            })

        except Exception as e:
            implementation_results.append({
                'fix_id': fix.id,
                'status': 'failed',
                'error': str(e)
            })

    # Validate fixes were applied correctly
    validation_results = await validate_prevention_fixes(
        implementation_results,
        project_context
    )

    return {
        'implementation_results': implementation_results,
        'validation_results': validation_results,
        'overall_success': all(r['status'] == 'success' for r in implementation_results)
    }
```
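
The scan above depends on Claude Code tool wrappers that are not shown. To make the matching and bucketing logic concrete, here is a minimal self-contained sketch that scans a string with Python's `re` module instead. `ErrorPattern`, `risk_bucket`, and `scan_text` are hypothetical names, and using a fixed `base_risk` per pattern is a simplifying assumption in place of the context-aware `assess_pattern_risk`.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorPattern:
    name: str
    search_regex: str
    base_risk: float  # hypothetical fixed risk score in [0, 1]


def risk_bucket(risk_level: float) -> str:
    """Map a risk score to a bucket using the 0.7 and 0.4 thresholds."""
    if risk_level >= 0.7:
        return "high_risk"
    if risk_level >= 0.4:
        return "medium_risk"
    return "low_risk"


def scan_text(source: str, patterns: list[ErrorPattern]) -> dict[str, list[dict]]:
    """Scan source text line by line and group matches by risk bucket."""
    detected = {"high_risk": [], "medium_risk": [], "low_risk": []}
    compiled = [(p, re.compile(p.search_regex)) for p in patterns]
    for line_number, line in enumerate(source.splitlines(), start=1):
        for pattern, regex in compiled:
            if regex.search(line):
                detected[risk_bucket(pattern.base_risk)].append({
                    "pattern_name": pattern.name,
                    "line_number": line_number,
                    "match_text": line.strip(),
                })
    return detected
```

Compiling each regex once outside the line loop keeps the scan linear in the number of lines times the number of patterns.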

### Real-time Error Monitoring and Learning

#### Continuous Learning from Claude Code Operations

```python
import asyncio
import uuid
from datetime import datetime, timezone

# NOTE: the monitor classes and the analysis/learning helpers used below are
# placeholders for the system's own components; they are not defined here.


async def monitor_claude_code_operations():
    """
    Continuously monitor Claude Code operations for error patterns and learning opportunities
    """
    operation_monitor = {
        'tool_usage_monitor': ToolUsageMonitor(),
        'error_detection_monitor': ErrorDetectionMonitor(),
        'performance_monitor': PerformanceMonitor(),
        'success_pattern_monitor': SuccessPatternMonitor()
    }

    async def monitoring_loop():
        while True:
            # Collect operation data
            operation_data = await collect_operation_data(operation_monitor)

            # Analyze for error patterns
            error_analysis = await analyze_for_error_patterns(operation_data)

            if error_analysis.new_patterns_detected:
                # Learn new error patterns
                await learn_new_error_patterns(error_analysis.new_patterns)

                # Update prevention rules
                await update_prevention_rules(error_analysis.new_patterns)

            # Analyze for success patterns
            success_analysis = await analyze_for_success_patterns(operation_data)

            if success_analysis.new_patterns_detected:
                # Learn new success patterns
                await learn_new_success_patterns(success_analysis.new_patterns)

                # Update recommendation engine
                await update_recommendation_engine(success_analysis.new_patterns)

            # Update error prevention database
            await update_error_prevention_database(
                error_analysis,
                success_analysis,
                operation_data
            )

            await asyncio.sleep(5)  # Monitor every 5 seconds

    # Start monitoring (runs until the task is cancelled)
    await monitoring_loop()


async def learn_from_error_occurrence(error_details, context):
    """
    Learn from actual error occurrences to improve prevention
    """
    # Create error entry
    error_entry = {
        'id': str(uuid.uuid4()),
        # Timezone-aware UTC timestamp in ISO 8601 format
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'error_details': error_details,
        'context': context,
        'severity': classify_error_severity(error_details),
        'category': classify_error_category(error_details)
    }

    # Perform root cause analysis
    root_cause_analysis = await perform_root_cause_analysis(
        error_details,
        context
    )
    error_entry['root_cause_analysis'] = root_cause_analysis

    # Generate prevention strategies
    prevention_strategies = await generate_prevention_strategies(
        error_entry,
        root_cause_analysis
    )
    error_entry['prevention_strategy'] = prevention_strategies

    # Store error entry
    await store_error_entry(error_entry)

    # Update prevention rules
    await update_prevention_rules_from_error(error_entry)

    # Notify relevant personas about new error pattern
    await notify_personas_of_new_error_pattern(error_entry)

    return {
        'error_learned': True,
        'prevention_strategies_generated': len(prevention_strategies['prevention_steps']),
        'detection_rules_created': len(prevention_strategies['detection_rules'])
    }
```
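
The monitoring loop above runs until its task is cancelled, and one unhandled exception in any step would end monitoring altogether. As a hedged sketch of one way to harden it, the hypothetical `run_monitor` below polls at a fixed interval, counts failed polls instead of stopping, and exits promptly when an `asyncio.Event` is set. The function name, the `poll` callback, and the failure counter are assumptions for illustration.

```python
import asyncio
from typing import Awaitable, Callable


async def run_monitor(
    poll: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    interval: float = 5.0,
) -> int:
    """Call `poll` every `interval` seconds until `stop` is set.

    Returns the number of polls that raised, so callers can surface them.
    """
    failures = 0
    while not stop.is_set():
        try:
            await poll()
        except Exception:
            # Keep monitoring; a real system would also log the traceback.
            failures += 1
        try:
            # Sleep for the interval, but wake early if a stop is requested.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return failures
```

Waiting on the event with a timeout, rather than calling `asyncio.sleep`, means a shutdown request does not have to wait out the remainder of the interval.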

### Error Prevention Dashboard and Reporting

#### Comprehensive Error Prevention Analytics

```yaml
error_prevention_metrics:
  prevention_effectiveness:
    errors_prevented: "Count of errors caught before execution"
    false_positives: "Warnings that didn't lead to actual errors"
    false_negatives: "Errors that weren't caught by prevention"
    prevention_accuracy: "Percentage of accurate error predictions"

  learning_progress:
    new_patterns_learned: "Number of new error patterns identified"
    pattern_accuracy_improvement: "How pattern recognition has improved"
    prevention_rule_effectiveness: "Success rate of prevention rules"

  system_reliability:
    mean_time_between_errors: "MTBE for different error categories"
    error_severity_distribution: "Breakdown of error types caught"
    recovery_time_improvement: "How quickly errors are resolved"

  development_impact:
    development_velocity_impact: "How prevention affects speed"
    code_quality_improvement: "Measurable quality gains"
    developer_confidence: "Survey results on prevention helpfulness"
```
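
The metrics above name `prevention_accuracy` without giving a formula. One plausible reading, offered here as an assumption rather than the system's definition, is to report precision and recall from the three counts: precision is the share of warnings that corresponded to real errors, and recall is the share of real errors that were caught. `PreventionCounts` is a hypothetical name.

```python
from dataclasses import dataclass


@dataclass
class PreventionCounts:
    errors_prevented: int   # warnings that matched a real error
    false_positives: int    # warnings with no real error behind them
    false_negatives: int    # real errors that produced no warning

    @property
    def precision(self) -> float:
        """Fraction of warnings that were justified."""
        flagged = self.errors_prevented + self.false_positives
        return self.errors_prevented / flagged if flagged else 0.0

    @property
    def recall(self) -> float:
        """Fraction of real errors that were caught beforehand."""
        actual = self.errors_prevented + self.false_negatives
        return self.errors_prevented / actual if actual else 0.0
```

Reporting both numbers avoids a single accuracy figure hiding a trade-off: lowering the warning threshold usually raises recall at the cost of precision.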

### Claude Code Integration Commands

```bash
# Error prevention and analysis
bmad prevent --analyze-risks --operation "database-migration"
bmad prevent --scan-patterns --project-path "src/"
bmad prevent --check-command "rm -rf node_modules" --suggest-safer

# Error learning and pattern management
bmad errors learn --from-incident "incident-report.md"
bmad errors patterns --list --category "security"
bmad errors rules --update --based-on-recent-failures

# Prevention implementation
bmad prevent implement --fixes-for "high-risk-patterns"
bmad prevent validate --applied-fixes --test-effectiveness
bmad prevent monitor --real-time --alert-on-risks

# Error prevention reporting
bmad prevent report --effectiveness --time-period "last-month"
bmad prevent dashboard --show-trends --error-categories
bmad prevent export --prevention-rules --format "yaml"
```

This Error Prevention System transforms Claude Code into a proactive development assistant that learns from every mistake and continuously improves its ability to prevent errors, creating an increasingly safe and reliable development environment.