Error Prevention System
Mistake Tracking and Prevention for Claude Code
The Error Prevention System lets Claude Code record past mistakes and check new operations against them, so that similar errors are caught before they recur and the development environment becomes safer over time.
Error Catalog and Learning Framework
Comprehensive Error Documentation
error_entry:
  identification:
    id: "{uuid}"
    timestamp: "2024-01-15T14:30:00Z"
    severity: "critical|high|medium|low"
    category: "security|performance|logic|integration|deployment"
    error_signature: "unique_fingerprint_for_similar_errors"

  error_details:
    description: "Database connection pool exhaustion causing 503 errors"
    symptoms:
      - "HTTP 503 Service Unavailable responses"
      - "Database connection timeout errors in logs"
      - "Application hanging on database queries"
      - "Memory usage steadily increasing"
    impact:
      - user_experience: "Complete service unavailability"
      - business_impact: "Revenue loss during downtime"
      - technical_debt: "Required emergency hotfix"
      - team_impact: "Weekend emergency response required"
    affected_components:
      - "Database connection pool"
      - "API endpoints"
      - "User authentication service"
      - "Payment processing"

  context_information:
    project_phase: "production"
    technology_stack: ["nodejs", "postgresql", "docker", "kubernetes"]
    project_characteristics:
      size: "large"
      complexity: "high"
      team_size: "8"
      load_profile: "high_traffic"
    environmental_factors:
      - "Black Friday traffic spike"
      - "Recent deployment of new features"
      - "Database maintenance window completed day before"
    claude_code_context:
      files_involved: ["src/database/pool.js", "config/database.js"]
      tools_used_before_error: ["Edit", "Bash", "Write"]
      recent_changes: ["Increased connection timeout", "Added retry logic"]

  root_cause_analysis:
    immediate_cause: "Connection pool size insufficient for traffic spike"
    contributing_factors:
      - "Default pool size never adjusted for production load"
      - "No connection pool monitoring in place"
      - "Load testing didn't simulate realistic user behavior"
      - "Connection leak in error handling paths"
    root_cause: "Inadequate capacity planning and monitoring for database connections"
    analysis_method: "5 whys analysis + performance profiling"
    investigation_tools: ["APM traces", "Database logs", "Container metrics"]

  prevention_strategy:
    detection_rules:
      - rule: "Monitor connection pool utilization"
        trigger: "when pool_utilization > 80%"
        action: "Alert DevOps team immediately"
        automation_possible: true
      - rule: "Watch for connection timeout patterns"
        trigger: "when connection_timeouts > 5 in 1 minute"
        action: "Scale pool size automatically"
        automation_possible: true
      - rule: "Track connection pool growth rate"
        trigger: "when pool_size increases > 20% in 5 minutes"
        action: "Check for connection leaks"
        automation_possible: false

    prevention_steps:
      - step: "Implement connection pool monitoring"
        when: "during development phase"
        responsibility: "platform-engineer"
        tools_involved: ["monitoring setup", "alerting configuration"]
        effort_estimate: "4 hours"
      - step: "Add connection pool size auto-scaling"
        when: "before production deployment"
        responsibility: "dev"
        tools_involved: ["database configuration", "scaling logic"]
        effort_estimate: "8 hours"
      - step: "Implement proper connection cleanup"
        when: "during code review"
        responsibility: "dev"
        tools_involved: ["code review", "static analysis"]
        effort_estimate: "2 hours"

    validation_checks:
      - check: "Load test with connection pool monitoring"
        automation: "ci_cd_pipeline"
        frequency: "before_each_production_deployment"
      - check: "Review database connection usage patterns"
        automation: "static_analysis_tool"
        frequency: "with_each_code_change"
      - check: "Validate connection cleanup in error paths"
        automation: "integration_tests"
        frequency: "continuous"

  recovery_procedures:
    immediate_response:
      - "Scale database connection pool size"
      - "Restart application instances to clear stale connections"
      - "Enable database connection throttling"
      - "Redirect traffic to secondary regions if available"
    short_term_fixes:
      - "Implement connection pool monitoring dashboard"
      - "Add automated scaling for connection pool"
      - "Fix connection leaks in error handling"
    long_term_improvements:
      - "Implement comprehensive database capacity planning"
      - "Add chaos engineering tests for database failures"
      - "Create runbooks for database scaling scenarios"

  lessons_learned:
    - "Connection pool sizing must account for traffic spikes"
    - "Monitoring is essential for database resource management"
    - "Load testing scenarios should include realistic user patterns"
    - "Error handling paths need careful connection management"
    - "Automated scaling can prevent manual intervention delays"
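Each entry under `detection_rules` pairs a trigger condition with an action. The following is a minimal Python sketch of how two of those triggers might be evaluated. Only the thresholds (utilization above 80%, more than 5 timeouts in one minute) and the action strings come from the entry; `PoolStats`, `TimeoutWindow` and `triggered_rules` are hypothetical names introduced for illustration.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class PoolStats:
    """Snapshot of a connection pool (hypothetical structure)."""
    in_use: int
    max_size: int

    @property
    def utilization(self) -> float:
        return self.in_use / self.max_size if self.max_size else 0.0


class TimeoutWindow:
    """Counts timeout events that fall inside a sliding time window."""

    def __init__(self, window_seconds: float = 60.0) -> None:
        self.window_seconds = window_seconds
        self._events = deque()

    def record(self, timestamp: float) -> None:
        self._events.append(timestamp)

    def count(self, now: float) -> int:
        # Drop events that are older than the window before counting.
        while self._events and now - self._events[0] > self.window_seconds:
            self._events.popleft()
        return len(self._events)


def triggered_rules(stats: PoolStats, timeouts: TimeoutWindow, now: float) -> list:
    """Return the actions whose triggers fire for the current readings."""
    actions = []
    if stats.utilization > 0.80:
        actions.append("Alert DevOps team immediately")
    if timeouts.count(now) > 5:
        actions.append("Scale pool size automatically")
    return actions
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the rule logic easy to test against recorded incidents.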
Proactive Error Detection for Claude Code
Claude Code Tool Integration for Error Prevention
async def prevent_errors_in_claude_operations(operation_type, operation_context):
    """
    Prevent errors before Claude Code tool execution
    """
    # Get operation-specific error patterns
    relevant_errors = await get_relevant_error_patterns(
        operation_type,
        operation_context
    )

    error_prevention_result = {
        'operation_safe': True,
        'warnings': [],
        'preventive_actions': [],
        'risk_factors': []
    }

    # Analyze each relevant error pattern
    for error_pattern in relevant_errors:
        risk_assessment = assess_error_risk(
            error_pattern,
            operation_context
        )

        if risk_assessment.risk_level > 0.3:  # 30% risk threshold
            error_prevention_result['operation_safe'] = False
            error_prevention_result['warnings'].append({
                'error_type': error_pattern['category'],
                'description': error_pattern['description'],
                'risk_level': risk_assessment.risk_level,
                'similar_past_cases': risk_assessment.similar_cases
            })

            # Generate preventive actions
            preventive_actions = generate_preventive_actions(
                error_pattern,
                operation_context
            )
            error_prevention_result['preventive_actions'].extend(preventive_actions)

    return error_prevention_result
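The function above relies on `assess_error_risk`, which this document does not define. One possible shape for it, offered only as a sketch: score a past error pattern by how much its technology stack and files overlap with the current operation. The equal weighting, the `technology_stack` and `files_involved` keys on the pattern, and the `RiskAssessment` class are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RiskAssessment:
    """Result shape assumed from the attributes the caller reads."""
    risk_level: float
    similar_cases: list = field(default_factory=list)


def jaccard(a, b) -> float:
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def assess_error_risk(error_pattern: dict, operation_context: dict) -> RiskAssessment:
    stack_overlap = jaccard(
        error_pattern.get("technology_stack", []),
        operation_context.get("technology_stack", []),
    )
    file_overlap = jaccard(
        error_pattern.get("files_involved", []),
        operation_context.get("files_involved", []),
    )
    # Assumption: both signals count equally toward the risk score.
    risk_level = 0.5 * stack_overlap + 0.5 * file_overlap
    similar = [error_pattern["id"]] if risk_level > 0 and "id" in error_pattern else []
    return RiskAssessment(risk_level=risk_level, similar_cases=similar)
```

With this scoring, an operation touching `src/database/pool.js` in a Node.js and PostgreSQL project would score above the 0.3 threshold against the connection pool entry shown earlier, while an unrelated operation would score 0.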

async def error_aware_file_edit(file_path, edit_content, current_context):
    """
    Edit files with error prevention based on historical patterns
    """
    # Pre-edit error analysis
    edit_risks = await analyze_edit_risks(file_path, edit_content, current_context)

    if edit_risks.has_high_risk_patterns:
        # Present warnings and suggest safer alternatives
        risk_warnings = []
        for risk in edit_risks.high_risk_patterns:
            warning = {
                'risk_type': risk.pattern_type,
                'description': risk.description,
                'historical_failures': risk.past_failures,
                'suggested_alternatives': risk.safer_alternatives
            }
            risk_warnings.append(warning)

        # Get user confirmation or apply safer alternatives
        prevention_response = await handle_edit_risk_warnings(
            risk_warnings,
            file_path,
            edit_content
        )

        if prevention_response.action == 'cancel':
            return {'status': 'cancelled', 'reason': 'high_risk_prevented'}
        elif prevention_response.action == 'modify':
            edit_content = prevention_response.safer_content

    # Execute edit with monitoring
    edit_result = await claude_code_edit(file_path, edit_content)

    # Post-edit validation
    post_edit_validation = await validate_edit_success(
        file_path,
        edit_content,
        edit_result,
        edit_risks
    )

    # Learn from edit outcome
    await learn_from_edit_outcome(
        file_path,
        edit_content,
        edit_result,
        post_edit_validation,
        current_context
    )

    return {
        'edit_result': edit_result,
        'risk_prevention': edit_risks,
        'validation': post_edit_validation
    }
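The post-edit validation step is left abstract above. As one concrete illustration, and not the behavior of `validate_edit_success` itself, an edit to a Python file could be checked by confirming that the new content still parses. The function name `validate_python_source` and its return shape are hypothetical.

```python
import ast


def validate_python_source(source: str) -> dict:
    """Report whether the edited source is still syntactically valid Python."""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return {"valid": False, "error": f"line {exc.lineno}: {exc.msg}"}
    return {"valid": True, "error": None}
```

A syntax check only catches one class of broken edit; it says nothing about whether the change behaves as intended, so it would complement tests rather than replace them.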

import time


async def error_aware_bash_execution(command, current_context):
    """
    Execute bash commands with error prevention
    """
    # Analyze command for known dangerous patterns
    command_risks = await analyze_command_risks(command, current_context)

    if command_risks.has_dangerous_patterns:
        # Check against error history
        similar_failures = await find_similar_command_failures(
            command,
            current_context
        )

        if similar_failures:
            # Provide warnings and safer alternatives
            safety_recommendations = generate_command_safety_recommendations(
                command,
                similar_failures,
                current_context
            )

            safer_command = await suggest_safer_command_alternative(
                command,
                safety_recommendations
            )

            if safer_command:
                command = safer_command

    # Execute with error monitoring. A monotonic clock is used for the
    # duration because it is unaffected by wall-clock adjustments.
    execution_start = time.monotonic()

    try:
        result = await claude_code_bash(command)
        execution_duration = time.monotonic() - execution_start

        # Learn from successful execution
        await record_successful_command_execution(
            command,
            result,
            execution_duration,
            current_context
        )

        return result

    except Exception as e:
        execution_duration = time.monotonic() - execution_start

        # Learn from failed execution
        await record_failed_command_execution(
            command,
            str(e),
            execution_duration,
            current_context
        )

        # Try to provide recovery suggestions
        recovery_suggestions = await generate_recovery_suggestions(
            command,
            str(e),
            current_context
        )

        # Chain the original exception so its traceback is preserved
        raise RuntimeError(
            f"Command failed: {e}\nRecovery suggestions: {recovery_suggestions}"
        ) from e
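The dangerous-pattern check performed by `analyze_command_risks` is not spelled out in this document. A minimal sketch of what such a check could look like, using only the standard library, is shown below. The function name `find_dangerous_patterns` and the three rules are illustrative assumptions, not a complete or authoritative list.

```python
import shlex


def find_dangerous_patterns(command: str) -> list:
    """Return human-readable findings for a single shell command."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # shlex raises ValueError on unbalanced quotes.
        return ["unparseable command (unbalanced quotes)"]
    if not tokens:
        return []

    program, args = tokens[0], tokens[1:]
    # Collect single-letter flags, so "-rf" and "-r -f" are treated alike.
    short_flags = {
        ch
        for arg in args
        if arg.startswith("-") and not arg.startswith("--")
        for ch in arg[1:]
    }

    findings = []
    if program == "rm" and "f" in short_flags and short_flags & {"r", "R"}:
        findings.append("recursive forced delete")
    if program == "git" and "push" in args and ("--force" in args or "-f" in args):
        findings.append("force push")
    if program == "chmod" and "777" in args:
        findings.append("world-writable permissions")
    return findings
```

This sketch inspects only the first program in the command and does not follow pipes, `&&` chains or `sudo` prefixes, so it should be read as a starting point rather than a safety guarantee.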
Pattern-Based Error Prevention
Automatic Error Pattern Detection
async def detect_error_patterns_in_codebase(project_path):
    """
    Detect potential error patterns in codebase using Claude Code tools
    """
    # Use Glob to find all relevant files under project_path
    # (assumes the glob wrapper accepts a path-prefixed pattern)
    root = project_path.rstrip("/")
    code_files = await claude_code_glob(f"{root}/**/*.{{js,ts,py,java,go,rb}}")

    detected_patterns = {
        'high_risk': [],
        'medium_risk': [],
        'low_risk': []
    }

    # Load known error patterns
    error_patterns = await load_error_pattern_library()

    # Analyze each file for error patterns
    for file_path in code_files:
        file_content = await claude_code_read(file_path)

        for pattern in error_patterns:
            # Use Grep to find pattern matches
            pattern_matches = await claude_code_grep(pattern.search_regex, file_path)

            if pattern_matches.matches:
                for match in pattern_matches.matches:
                    risk_assessment = assess_pattern_risk(
                        pattern,
                        match,
                        file_content,
                        file_path
                    )

                    detected_pattern = {
                        'pattern_name': pattern.name,
                        'file_path': file_path,
                        'line_number': match.line_number,
                        'match_text': match.text,
                        'risk_level': risk_assessment.risk_level,
                        'potential_issues': risk_assessment.potential_issues,
                        'recommendations': risk_assessment.recommendations
                    }

                    # Bucket by risk level: >= 0.7 high, >= 0.4 medium, else low
                    if risk_assessment.risk_level >= 0.7:
                        detected_patterns['high_risk'].append(detected_pattern)
                    elif risk_assessment.risk_level >= 0.4:
                        detected_patterns['medium_risk'].append(detected_pattern)
                    else:
                        detected_patterns['low_risk'].append(detected_pattern)

    # Generate prevention recommendations
    prevention_plan = await generate_pattern_prevention_plan(detected_patterns)

    return {
        'detected_patterns': detected_patterns,
        'prevention_plan': prevention_plan,
        'risk_summary': {
            'high_risk_count': len(detected_patterns['high_risk']),
            'medium_risk_count': len(detected_patterns['medium_risk']),
            'low_risk_count': len(detected_patterns['low_risk'])
        }
    }
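The scan above depends on Claude Code's Grep tool and an unspecified pattern library. To make the two core steps concrete, here is a self-contained sketch that finds regex matches with line numbers using Python's `re` module and applies the same 0.7 and 0.4 thresholds. `Match`, `scan_text` and `risk_bucket` are hypothetical names, and the `eval(` pattern in the usage note is only an example.

```python
import re
from dataclasses import dataclass


@dataclass
class Match:
    line_number: int
    text: str


def scan_text(search_regex: str, content: str) -> list:
    """Return one Match per line of content that the regex matches."""
    pattern = re.compile(search_regex)
    return [
        Match(line_number, line.strip())
        for line_number, line in enumerate(content.splitlines(), start=1)
        if pattern.search(line)
    ]


def risk_bucket(risk_level: float) -> str:
    """Map a risk score to the bucket names used in detected_patterns."""
    if risk_level >= 0.7:
        return "high_risk"
    if risk_level >= 0.4:
        return "medium_risk"
    return "low_risk"
```

For example, `scan_text(r"\beval\(", source)` would report every line of `source` that calls `eval(`, and `risk_bucket(0.55)` would file such a finding under `medium_risk`.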

async def implement_error_prevention_fixes(prevention_plan, project_context):
    """
    Implement error prevention fixes using Claude Code tools
    """
    implementation_results = []

    for fix in prevention_plan.recommended_fixes:
        try:
            if fix.fix_type == 'code_modification':
                # Use Edit tool to apply code fixes
                fix_result = await apply_code_fix(fix, project_context)
            elif fix.fix_type == 'configuration_change':
                # Use Write tool to update configuration
                fix_result = await apply_configuration_fix(fix, project_context)
            elif fix.fix_type == 'dependency_update':
                # Use Bash tool to update dependencies
                fix_result = await apply_dependency_fix(fix, project_context)
            elif fix.fix_type == 'test_addition':
                # Use Write tool to add preventive tests
                fix_result = await add_preventive_tests(fix, project_context)
            else:
                # Without this branch fix_result would be unbound (or stale
                # from a previous iteration) for an unrecognized fix type
                raise ValueError(f"Unknown fix type: {fix.fix_type}")

            implementation_results.append({
                'fix_id': fix.id,
                'status': 'success',
                'result': fix_result
            })

        except Exception as e:
            implementation_results.append({
                'fix_id': fix.id,
                'status': 'failed',
                'error': str(e)
            })

    # Validate fixes were applied correctly
    validation_results = await validate_prevention_fixes(
        implementation_results,
        project_context
    )

    return {
        'implementation_results': implementation_results,
        'validation_results': validation_results,
        'overall_success': all(r['status'] == 'success' for r in implementation_results)
    }
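The chain of `fix_type` branches above can also be written as a lookup table, which makes an unrecognized type an explicit failure instead of a missing branch. The sketch below is an alternative layout, not the document's implementation. The two handlers are stand-ins that return strings, and fixes are plain dicts here rather than the objects with attributes used above.

```python
import asyncio


# Hypothetical stand-ins for the real fix handlers.
async def apply_code_fix(fix, project_context):
    return f"edited {fix['target']}"


async def apply_configuration_fix(fix, project_context):
    return f"reconfigured {fix['target']}"


FIX_HANDLERS = {
    "code_modification": apply_code_fix,
    "configuration_change": apply_configuration_fix,
}


async def apply_fix(fix, project_context):
    """Run the handler for one fix and report success or failure."""
    handler = FIX_HANDLERS.get(fix["fix_type"])
    if handler is None:
        return {
            "fix_id": fix["id"],
            "status": "failed",
            "error": f"unknown fix_type: {fix['fix_type']}",
        }
    try:
        result = await handler(fix, project_context)
    except Exception as exc:
        return {"fix_id": fix["id"], "status": "failed", "error": str(exc)}
    return {"fix_id": fix["id"], "status": "success", "result": result}
```

Adding a new fix type then means registering one more entry in `FIX_HANDLERS`, with no change to the control flow.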
Real-time Error Monitoring and Learning
Continuous Learning from Claude Code Operations
import asyncio


async def monitor_claude_code_operations():
    """
    Continuously monitor Claude Code operations for error patterns and learning opportunities
    """
    operation_monitor = {
        'tool_usage_monitor': ToolUsageMonitor(),
        'error_detection_monitor': ErrorDetectionMonitor(),
        'performance_monitor': PerformanceMonitor(),
        'success_pattern_monitor': SuccessPatternMonitor()
    }

    async def monitoring_loop():
        while True:
            # Collect operation data
            operation_data = await collect_operation_data(operation_monitor)

            # Analyze for error patterns
            error_analysis = await analyze_for_error_patterns(operation_data)

            if error_analysis.new_patterns_detected:
                # Learn new error patterns
                await learn_new_error_patterns(error_analysis.new_patterns)

                # Update prevention rules
                await update_prevention_rules(error_analysis.new_patterns)

            # Analyze for success patterns
            success_analysis = await analyze_for_success_patterns(operation_data)

            if success_analysis.new_patterns_detected:
                # Learn new success patterns
                await learn_new_success_patterns(success_analysis.new_patterns)

                # Update recommendation engine
                await update_recommendation_engine(success_analysis.new_patterns)

            # Update error prevention database
            await update_error_prevention_database(
                error_analysis,
                success_analysis,
                operation_data
            )

            await asyncio.sleep(5)  # Monitor every 5 seconds

    # Start monitoring
    await monitoring_loop()
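As written, the loop above has no stop condition and would end at the first unhandled exception in any step. A minimal sketch of a variant that can be stopped cleanly and that survives a failing iteration is shown below. `run_monitor`, its parameters and the `collect` callback are hypothetical; the callback stands in for the collect, analyze and update steps.

```python
import asyncio


async def run_monitor(collect, interval_seconds: float, stop: asyncio.Event) -> int:
    """Call collect() repeatedly until stop is set; return the iteration count."""
    iterations = 0
    while not stop.is_set():
        try:
            await collect()
        except Exception as exc:
            # One failed iteration should not end monitoring.
            print(f"monitor iteration failed: {exc}")
        iterations += 1
        try:
            # Sleep for the interval, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_seconds)
        except asyncio.TimeoutError:
            pass
    return iterations
```

Waiting on the event with a timeout, rather than calling `asyncio.sleep`, lets a shutdown request take effect without waiting out the remainder of the interval.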

from datetime import datetime, timezone


async def learn_from_error_occurrence(error_details, context):
    """
    Learn from actual error occurrences to improve prevention
    """
    # Create error entry (timezone-aware UTC timestamp; datetime.utcnow()
    # is deprecated as of Python 3.12)
    error_entry = {
        'id': generate_uuid(),
        'timestamp': datetime.now(timezone.utc).isoformat(),
        'error_details': error_details,
        'context': context,
        'severity': classify_error_severity(error_details),
        'category': classify_error_category(error_details)
    }

    # Perform root cause analysis
    root_cause_analysis = await perform_root_cause_analysis(
        error_details,
        context
    )
    error_entry['root_cause_analysis'] = root_cause_analysis

    # Generate prevention strategies
    prevention_strategies = await generate_prevention_strategies(
        error_entry,
        root_cause_analysis
    )
    error_entry['prevention_strategy'] = prevention_strategies

    # Store error entry
    await store_error_entry(error_entry)

    # Update prevention rules
    await update_prevention_rules_from_error(error_entry)

    # Notify relevant personas about new error pattern
    await notify_personas_of_new_error_pattern(error_entry)

    return {
        'error_learned': True,
        'prevention_strategies_generated': len(prevention_strategies['prevention_steps']),
        'detection_rules_created': len(prevention_strategies['detection_rules'])
    }
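The error catalog includes an `error_signature` field described as a fingerprint for similar errors, but the entry built above does not show how one would be computed. One plausible approach, sketched here under assumptions, is to strip the variable parts of a message and hash the rest. The normalization rules and the 16-character truncation are illustrative choices, and `error_signature` as a function name is hypothetical.

```python
import hashlib
import re


def error_signature(category: str, message: str) -> str:
    """Fingerprint an error so that similar messages share a signature."""
    normalized = message.lower()
    # Replace variable parts (hex addresses, then numbers) with placeholders.
    normalized = re.sub(r"0x[0-9a-f]+", "<hex>", normalized)
    normalized = re.sub(r"\d+", "<n>", normalized)
    # Collapse runs of whitespace.
    normalized = re.sub(r"\s+", " ", normalized).strip()
    digest = hashlib.sha256(f"{category}:{normalized}".encode("utf-8")).hexdigest()
    return digest[:16]
```

Under these rules, "Connection timeout after 30s on pool 7" and "connection timeout after 45s on pool 12" produce the same signature within a category, which is what lets a new occurrence be matched to a stored entry.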
Error Prevention Dashboard and Reporting
Comprehensive Error Prevention Analytics
error_prevention_metrics:
  prevention_effectiveness:
    errors_prevented: "Count of errors caught before execution"
    false_positives: "Warnings that didn't lead to actual errors"
    false_negatives: "Errors that weren't caught by prevention"
    prevention_accuracy: "Percentage of accurate error predictions"

  learning_progress:
    new_patterns_learned: "Number of new error patterns identified"
    pattern_accuracy_improvement: "How pattern recognition has improved"
    prevention_rule_effectiveness: "Success rate of prevention rules"

  system_reliability:
    mean_time_between_errors: "MTBE for different error categories"
    error_severity_distribution: "Breakdown of error types caught"
    recovery_time_improvement: "How quickly errors are resolved"

  development_impact:
    development_velocity_impact: "How prevention affects speed"
    code_quality_improvement: "Measurable quality gains"
    developer_confidence: "Survey results on prevention helpfulness"
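The effectiveness metrics above are described in words only. The sketch below shows one way the counts could be turned into numbers. Reading `prevention_accuracy` as precision (the share of warnings that were real errors) is an interpretation, not something the document states, and both function names are hypothetical.

```python
from typing import Optional


def prevention_metrics(errors_prevented: int, false_positives: int,
                       false_negatives: int) -> dict:
    """Precision and recall of the prevention warnings."""
    warnings_issued = errors_prevented + false_positives
    actual_errors = errors_prevented + false_negatives
    precision = errors_prevented / warnings_issued if warnings_issued else 0.0
    recall = errors_prevented / actual_errors if actual_errors else 0.0
    return {"precision": precision, "recall": recall}


def mean_time_between_errors(timestamps: list) -> Optional[float]:
    """Average gap between consecutive error timestamps, in the input's unit."""
    if len(timestamps) < 2:
        return None
    ordered = sorted(timestamps)
    # The mean of consecutive gaps equals the total span over the gap count.
    return (ordered[-1] - ordered[0]) / (len(ordered) - 1)
```

Tracking precision and recall separately matters here: a system that warns on every operation has perfect recall but would slow development, which is the trade-off the `development_velocity_impact` metric is meant to expose.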
Claude Code Integration Commands
# Error prevention and analysis
bmad prevent --analyze-risks --operation "database-migration"
bmad prevent --scan-patterns --project-path "src/"
bmad prevent --check-command "rm -rf node_modules" --suggest-safer
# Error learning and pattern management
bmad errors learn --from-incident "incident-report.md"
bmad errors patterns --list --category "security"
bmad errors rules --update --based-on-recent-failures
# Prevention implementation
bmad prevent implement --fixes-for "high-risk-patterns"
bmad prevent validate --applied-fixes --test-effectiveness
bmad prevent monitor --real-time --alert-on-risks
# Error prevention reporting
bmad prevent report --effectiveness --time-period "last-month"
bmad prevent dashboard --show-trends --error-categories
bmad prevent export --prevention-rules --format "yaml"
The Error Prevention System records each error, derives detection rules and prevention steps from it, and checks later Claude Code operations against those rules before they run, so that known failure modes are caught earlier as the catalog grows.