Error Prevention System

Mistake Tracking and Prevention for Claude Code

The Error Prevention System lets Claude Code record past mistakes in a structured catalog and check new operations against that catalog, so that errors resembling earlier ones are flagged before they recur.

Error Catalog and Learning Framework

Comprehensive Error Documentation

error_entry:
  identification:
    id: "{uuid}"
    timestamp: "2024-01-15T14:30:00Z"
    severity: "critical|high|medium|low"
    category: "security|performance|logic|integration|deployment"
    error_signature: "unique_fingerprint_for_similar_errors"
    
  error_details:
    description: "Database connection pool exhaustion causing 503 errors"
    symptoms: 
      - "HTTP 503 Service Unavailable responses"
      - "Database connection timeout errors in logs"
      - "Application hanging on database queries"
      - "Memory usage steadily increasing"
    impact:
      - user_experience: "Complete service unavailability"
      - business_impact: "Revenue loss during downtime"
      - technical_debt: "Required emergency hotfix"
      - team_impact: "Weekend emergency response required"
    affected_components: 
      - "Database connection pool"
      - "API endpoints"
      - "User authentication service"
      - "Payment processing"
      
  context_information:
    project_phase: "production"
    technology_stack: ["nodejs", "postgresql", "docker", "kubernetes"]
    project_characteristics:
      size: "large"
      complexity: "high"
      team_size: "8"
      load_profile: "high_traffic"
    environmental_factors:
      - "Black Friday traffic spike"
      - "Recent deployment of new features"
      - "Database maintenance window completed day before"
    claude_code_context:
      files_involved: ["src/database/pool.js", "config/database.js"]
      tools_used_before_error: ["Edit", "Bash", "Write"]
      recent_changes: ["Increased connection timeout", "Added retry logic"]
      
  root_cause_analysis:
    immediate_cause: "Connection pool size insufficient for traffic spike"
    contributing_factors:
      - "Default pool size never adjusted for production load"
      - "No connection pool monitoring in place"
      - "Load testing didn't simulate realistic user behavior"
      - "Connection leak in error handling paths"
    root_cause: "Inadequate capacity planning and monitoring for database connections"
    analysis_method: "5 whys analysis + performance profiling"
    investigation_tools: ["APM traces", "Database logs", "Container metrics"]
    
  prevention_strategy:
    detection_rules:
      - rule: "Monitor connection pool utilization"
        trigger: "when pool_utilization > 80%"
        action: "Alert DevOps team immediately"
        automation_possible: true
        
      - rule: "Watch for connection timeout patterns"
        trigger: "when connection_timeouts > 5 in 1 minute"
        action: "Scale pool size automatically"
        automation_possible: true
        
      - rule: "Track connection pool growth rate"
        trigger: "when pool_size increases > 20% in 5 minutes"
        action: "Check for connection leaks"
        automation_possible: false
        
    prevention_steps:
      - step: "Implement connection pool monitoring"
        when: "during development phase"
        responsibility: "platform-engineer"
        tools_involved: ["monitoring setup", "alerting configuration"]
        effort_estimate: "4 hours"
        
      - step: "Add connection pool size auto-scaling"
        when: "before production deployment"
        responsibility: "dev"
        tools_involved: ["database configuration", "scaling logic"]
        effort_estimate: "8 hours"
        
      - step: "Implement proper connection cleanup"
        when: "during code review"
        responsibility: "dev"
        tools_involved: ["code review", "static analysis"]
        effort_estimate: "2 hours"
        
    validation_checks:
      - check: "Load test with connection pool monitoring"
        automation: "ci_cd_pipeline"
        frequency: "before_each_production_deployment"
        
      - check: "Review database connection usage patterns"
        automation: "static_analysis_tool"
        frequency: "with_each_code_change"
        
      - check: "Validate connection cleanup in error paths"
        automation: "integration_tests"
        frequency: "continuous"
        
  recovery_procedures:
    immediate_response:
      - "Scale database connection pool size"
      - "Restart application instances to clear stale connections"
      - "Enable database connection throttling"
      - "Redirect traffic to secondary regions if available"
      
    short_term_fixes:
      - "Implement connection pool monitoring dashboard"
      - "Add automated scaling for connection pool"
      - "Fix connection leaks in error handling"
      
    long_term_improvements:
      - "Implement comprehensive database capacity planning"
      - "Add chaos engineering tests for database failures"
      - "Create runbooks for database scaling scenarios"
      
  lessons_learned:
    - "Connection pool sizing must account for traffic spikes"
    - "Monitoring is essential for database resource management"
    - "Load testing scenarios should include realistic user patterns"
    - "Error handling paths need careful connection management"
    - "Automated scaling can prevent manual intervention delays"
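
The `detection_rules` above pair a trigger condition with an action. As a rough illustration of how such rules could be evaluated against live metrics, here is a minimal sketch. The `DetectionRule` class, the `triggered_actions` helper, and the metric keys (`pool_utilization` as a fraction, `connection_timeouts_last_minute` as a count) are assumptions made for this example, not part of the catalog format.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class DetectionRule:
    """One catalog rule: a human-readable name, a trigger, and an action."""
    rule: str
    predicate: Callable[[dict], bool]  # hypothetical: trigger as a Python callable
    action: str


# Two of the rules from the catalog entry, with triggers written as predicates.
# The metric key names are assumptions for this sketch.
RULES = [
    DetectionRule(
        "Monitor connection pool utilization",
        lambda m: m.get("pool_utilization", 0.0) > 0.80,
        "Alert DevOps team immediately",
    ),
    DetectionRule(
        "Watch for connection timeout patterns",
        lambda m: m.get("connection_timeouts_last_minute", 0) > 5,
        "Scale pool size automatically",
    ),
]


def triggered_actions(metrics: dict) -> list[str]:
    """Return the actions of every rule whose trigger fires for these metrics."""
    return [r.action for r in RULES if r.predicate(metrics)]
```

Missing metrics default to values that do not fire a rule, so an incomplete sample produces no alerts rather than an exception.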

Proactive Error Detection for Claude Code

Claude Code Tool Integration for Error Prevention

async def prevent_errors_in_claude_operations(operation_type, operation_context):
    """
    Prevent errors before Claude Code tool execution
    """
    # Get operation-specific error patterns
    relevant_errors = await get_relevant_error_patterns(
        operation_type,
        operation_context
    )
    
    error_prevention_result = {
        'operation_safe': True,
        'warnings': [],
        'preventive_actions': [],
        'risk_factors': []
    }
    
    # Analyze each relevant error pattern
    for error_pattern in relevant_errors:
        risk_assessment = assess_error_risk(
            error_pattern,
            operation_context
        )
        
        if risk_assessment.risk_level > 0.3:  # 30% risk threshold
            error_prevention_result['operation_safe'] = False
            error_prevention_result['warnings'].append({
                'error_type': error_pattern['category'],
                'description': error_pattern['description'],
                'risk_level': risk_assessment.risk_level,
                'similar_past_cases': risk_assessment.similar_cases
            })
            
            # Generate preventive actions
            preventive_actions = generate_preventive_actions(
                error_pattern,
                operation_context
            )
            error_prevention_result['preventive_actions'].extend(preventive_actions)
    
    return error_prevention_result
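
The function above relies on `assess_error_risk`, which is not defined in this document. One possible shape for it, shown only as a sketch, scores risk by how much the operation's technology stack overlaps with the stack recorded for a past error, and raises the score for each past case that touched the same files. The dictionary keys (`technology_stack`, `files_involved`, `past_cases`) and the 0.1 increment are assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class RiskAssessment:
    risk_level: float                       # score in [0, 1]
    similar_cases: list[str] = field(default_factory=list)


def assess_error_risk(error_pattern: dict, operation_context: dict) -> RiskAssessment:
    """Hypothetical scoring: Jaccard overlap of technology tags plus a bump per similar case."""
    pattern_tags = set(error_pattern.get("technology_stack", []))
    context_tags = set(operation_context.get("technology_stack", []))
    union = pattern_tags | context_tags
    overlap = len(pattern_tags & context_tags) / len(union) if union else 0.0

    # A past case counts as similar when it touched any file in this operation.
    files = set(operation_context.get("files_involved", []))
    similar = [
        case["id"]
        for case in error_pattern.get("past_cases", [])
        if files & set(case.get("files_involved", []))
    ]

    # Each similar case adds 0.1 (an arbitrary choice); cap the score at 1.0.
    return RiskAssessment(min(1.0, overlap + 0.1 * len(similar)), similar)
```

With a score built this way, the 0.3 threshold is crossed when roughly a third of the combined technology tags are shared, or sooner when earlier incidents involved the same files.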

async def error_aware_file_edit(file_path, edit_content, current_context):
    """
    Edit files with error prevention based on historical patterns
    """
    # Pre-edit error analysis
    edit_risks = await analyze_edit_risks(file_path, edit_content, current_context)
    
    if edit_risks.has_high_risk_patterns:
        # Present warnings and suggest safer alternatives
        risk_warnings = []
        
        for risk in edit_risks.high_risk_patterns:
            warning = {
                'risk_type': risk.pattern_type,
                'description': risk.description,
                'historical_failures': risk.past_failures,
                'suggested_alternatives': risk.safer_alternatives
            }
            risk_warnings.append(warning)
        
        # Get user confirmation or apply safer alternatives
        prevention_response = await handle_edit_risk_warnings(
            risk_warnings,
            file_path,
            edit_content
        )
        
        if prevention_response.action == 'cancel':
            return {'status': 'cancelled', 'reason': 'high_risk_prevented'}
        elif prevention_response.action == 'modify':
            edit_content = prevention_response.safer_content
    
    # Execute edit with monitoring
    edit_result = await claude_code_edit(file_path, edit_content)
    
    # Post-edit validation
    post_edit_validation = await validate_edit_success(
        file_path,
        edit_content,
        edit_result,
        edit_risks
    )
    
    # Learn from edit outcome
    await learn_from_edit_outcome(
        file_path,
        edit_content,
        edit_result,
        post_edit_validation,
        current_context
    )
    
    return {
        'edit_result': edit_result,
        'risk_prevention': edit_risks,
        'validation': post_edit_validation
    }
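
The edit flow above expects `analyze_edit_risks` to return an object exposing `has_high_risk_patterns` and a list of `high_risk_patterns`. The following sketch shows one way such a result could be produced from a small regex catalog. It is synchronous and ignores `file_path` and context for brevity. The catalog entries, the `past_failures` counts, and the name `scan_edit_content` are placeholders invented for this example.

```python
import re
from dataclasses import dataclass, field


@dataclass
class RiskPattern:
    pattern_type: str
    description: str
    past_failures: int            # placeholder count, not real data
    safer_alternatives: list[str]


@dataclass
class EditRisks:
    high_risk_patterns: list[RiskPattern] = field(default_factory=list)

    @property
    def has_high_risk_patterns(self) -> bool:
        return bool(self.high_risk_patterns)


# Hypothetical catalog: each regex maps to the risk it signals.
KNOWN_EDIT_RISKS = [
    (re.compile(r"pool\.connect\("), RiskPattern(
        "connection_leak",
        "Manual pool checkout without a guaranteed release",
        3,
        ["Release the client in a finally block"],
    )),
    (re.compile(r"(password|secret|api_key)\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE), RiskPattern(
        "hardcoded_secret",
        "Credential literal committed to source",
        1,
        ["Read the value from an environment variable"],
    )),
]


def scan_edit_content(edit_content: str) -> EditRisks:
    """Return every catalog pattern whose regex occurs in the proposed edit."""
    return EditRisks([p for rx, p in KNOWN_EDIT_RISKS if rx.search(edit_content)])
```

Regex matching of this kind only sees text, so it will miss risks that depend on control flow. It is a cheap first filter, not a substitute for review.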

async def error_aware_bash_execution(command, current_context):
    """
    Execute bash commands with error prevention
    """
    # Analyze command for known dangerous patterns
    command_risks = await analyze_command_risks(command, current_context)
    
    if command_risks.has_dangerous_patterns:
        # Check against error history
        similar_failures = await find_similar_command_failures(
            command,
            current_context
        )
        
        if similar_failures:
            # Provide warnings and safer alternatives
            safety_recommendations = generate_command_safety_recommendations(
                command,
                similar_failures,
                current_context
            )
            
            safer_command = await suggest_safer_command_alternative(
                command,
                safety_recommendations
            )
            
            if safer_command:
                command = safer_command
    
    # Execute with error monitoring. Durations use a monotonic clock, which is
    # unaffected by wall-clock adjustments (datetime.utcnow() is also deprecated
    # as of Python 3.12).
    import time  # normally at module top; local so the snippet stands alone
    execution_start = time.monotonic()
    
    try:
        result = await claude_code_bash(command)
        execution_duration = time.monotonic() - execution_start
        
        # Learn from successful execution
        await record_successful_command_execution(
            command,
            result,
            execution_duration,
            current_context
        )
        
        return result
        
    except Exception as e:
        execution_duration = time.monotonic() - execution_start
        
        # Learn from failed execution
        await record_failed_command_execution(
            command,
            str(e),
            execution_duration,
            current_context
        )
        
        # Try to provide recovery suggestions
        recovery_suggestions = await generate_recovery_suggestions(
            command,
            str(e),
            current_context
        )
        
        # Chain the original exception so its traceback is preserved
        raise RuntimeError(
            f"Command failed: {e}\nRecovery suggestions: {recovery_suggestions}"
        ) from e
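
The function above depends on `analyze_command_risks` to flag dangerous commands. A minimal sketch of that kind of check, using only the standard library, might tokenize the command with `shlex` and look for a few well-known hazards. The function name `scan_command`, the risk labels, and the choice of hazards are assumptions for illustration. The sketch inspects a single simple command and does not understand pipelines, subshells, or long options such as `--recursive`.

```python
import shlex
from dataclasses import dataclass, field


@dataclass
class CommandRisks:
    dangerous_patterns: list[str] = field(default_factory=list)

    @property
    def has_dangerous_patterns(self) -> bool:
        return bool(self.dangerous_patterns)


def scan_command(command: str) -> CommandRisks:
    """Flag a few hazardous shapes in a single shell command (hypothetical rule set)."""
    try:
        tokens = shlex.split(command)
    except ValueError:
        # Unbalanced quotes: treat as risky rather than guess at the intent.
        return CommandRisks(["unparseable_command"])

    risks = []
    if tokens and tokens[0] == "rm":
        # Merge short flags so "-rf", "-fr", and "-r -f" are handled alike.
        flags = "".join(
            t.lstrip("-") for t in tokens[1:]
            if t.startswith("-") and not t.startswith("--")
        )
        targets = [t for t in tokens[1:] if not t.startswith("-")]
        if "r" in flags.lower() and "f" in flags:
            risks.append("recursive_force_delete")
        if any(t in ("/", "~", ".", "*") for t in targets):
            risks.append("broad_delete_target")
    if "sudo" in tokens:
        risks.append("privilege_escalation")
    return CommandRisks(risks)
```

Tokenizing before matching avoids false hits on strings that merely contain `rm -rf` inside a quoted argument, which a plain substring search would flag.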

Pattern-Based Error Prevention

Automatic Error Pattern Detection

async def detect_error_patterns_in_codebase(project_path):
    """
    Detect potential error patterns in codebase using Claude Code tools
    """
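    # Use Glob to find all relevant files under the project root
    # (the original pattern ignored the project_path argument)
    code_files = await claude_code_glob(
        f"{project_path}/**/*.{{js,ts,py,java,go,rb}}"
    )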
    
    detected_patterns = {
        'high_risk': [],
        'medium_risk': [],
        'low_risk': []
    }
    
    # Load known error patterns
    error_patterns = await load_error_pattern_library()
    
    # Analyze each file for error patterns
    for file_path in code_files:
        file_content = await claude_code_read(file_path)
        
        for pattern in error_patterns:
            # Use Grep to find pattern matches
            pattern_matches = await claude_code_grep(pattern.search_regex, file_path)
            
            if pattern_matches.matches:
                for match in pattern_matches.matches:
                    risk_assessment = assess_pattern_risk(
                        pattern,
                        match,
                        file_content,
                        file_path
                    )
                    
                    detected_pattern = {
                        'pattern_name': pattern.name,
                        'file_path': file_path,
                        'line_number': match.line_number,
                        'match_text': match.text,
                        'risk_level': risk_assessment.risk_level,
                        'potential_issues': risk_assessment.potential_issues,
                        'recommendations': risk_assessment.recommendations
                    }
                    
                    if risk_assessment.risk_level >= 0.7:
                        detected_patterns['high_risk'].append(detected_pattern)
                    elif risk_assessment.risk_level >= 0.4:
                        detected_patterns['medium_risk'].append(detected_pattern)
                    else:
                        detected_patterns['low_risk'].append(detected_pattern)
    
    # Generate prevention recommendations
    prevention_plan = await generate_pattern_prevention_plan(detected_patterns)
    
    return {
        'detected_patterns': detected_patterns,
        'prevention_plan': prevention_plan,
        'risk_summary': {
            'high_risk_count': len(detected_patterns['high_risk']),
            'medium_risk_count': len(detected_patterns['medium_risk']),
            'low_risk_count': len(detected_patterns['low_risk'])
        }
    }
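
To make the match and bucketing steps above concrete without the Claude Code tool wrappers, the sketch below scans a source string line by line and sorts a risk score into the same three buckets, using the 0.7 and 0.4 thresholds from the function above. `ErrorPattern`, `Match`, `find_matches`, and `risk_bucket` are names invented for this example. The real pattern library format is not specified in this document.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorPattern:
    name: str
    search_regex: str


@dataclass(frozen=True)
class Match:
    line_number: int   # 1-based, as grep reports it
    text: str


def find_matches(pattern: ErrorPattern, source: str) -> list[Match]:
    """Return one Match per line of source that the pattern's regex hits."""
    rx = re.compile(pattern.search_regex)
    return [
        Match(i, line.strip())
        for i, line in enumerate(source.splitlines(), start=1)
        if rx.search(line)
    ]


def risk_bucket(risk_level: float) -> str:
    """Map a score in [0, 1] to the bucket keys used by detected_patterns."""
    if not 0.0 <= risk_level <= 1.0:
        raise ValueError(f"risk_level out of range: {risk_level}")
    if risk_level >= 0.7:
        return "high_risk"
    if risk_level >= 0.4:
        return "medium_risk"
    return "low_risk"
```

For example, a pattern with the regex `^\s*except\s*:` would report a bare `except:` clause together with its line number, and a score of exactly 0.7 lands in `high_risk` because both thresholds are inclusive.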

async def implement_error_prevention_fixes(prevention_plan, project_context):
    """
    Implement error prevention fixes using Claude Code tools
    """
    implementation_results = []
    
    for fix in prevention_plan.recommended_fixes:
        try:
            if fix.fix_type == 'code_modification':
                # Use Edit tool to apply code fixes
                fix_result = await apply_code_fix(fix, project_context)
                
            elif fix.fix_type == 'configuration_change':
                # Use Write tool to update configuration
                fix_result = await apply_configuration_fix(fix, project_context)
                
            elif fix.fix_type == 'dependency_update':
                # Use Bash tool to update dependencies
                fix_result = await apply_dependency_fix(fix, project_context)
                
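            elif fix.fix_type == 'test_addition':
                # Use Write tool to add preventive tests
                fix_result = await add_preventive_tests(fix, project_context)
            
            else:
                # Without this branch, fix_result would be unbound (or left over
                # from the previous iteration) and a false success would be recorded
                raise ValueError(f"Unknown fix type: {fix.fix_type}")
            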
            implementation_results.append({
                'fix_id': fix.id,
                'status': 'success',
                'result': fix_result
            })
            
        except Exception as e:
            implementation_results.append({
                'fix_id': fix.id,
                'status': 'failed',
                'error': str(e)
            })
    
    # Validate fixes were applied correctly
    validation_results = await validate_prevention_fixes(
        implementation_results,
        project_context
    )
    
    return {
        'implementation_results': implementation_results,
        'validation_results': validation_results,
        'overall_success': all(r['status'] == 'success' for r in implementation_results)
    }
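
As the number of fix types grows, the branching in the function above can be expressed as a dispatch table instead. The sketch below shows that alternative under stated assumptions: the two handlers are stand-in stubs (the real `apply_code_fix` and `apply_configuration_fix` are not defined in this document), and `fix.target` is a hypothetical attribute used only to give the stubs something to return.

```python
import asyncio
from types import SimpleNamespace


# Stand-in handlers; real ones would call the Edit, Write, or Bash tools.
async def apply_code_fix(fix, project_context):
    return f"edited {fix.target}"


async def apply_configuration_fix(fix, project_context):
    return f"configured {fix.target}"


# Dispatch table: fix_type -> handler. New fix types are added here.
FIX_HANDLERS = {
    "code_modification": apply_code_fix,
    "configuration_change": apply_configuration_fix,
}


async def apply_fix(fix, project_context) -> dict:
    """Run the handler for one fix and report the outcome in the result format used above."""
    handler = FIX_HANDLERS.get(fix.fix_type)
    if handler is None:
        return {"fix_id": fix.id, "status": "failed",
                "error": f"unknown fix_type: {fix.fix_type}"}
    try:
        result = await handler(fix, project_context)
    except Exception as exc:
        return {"fix_id": fix.id, "status": "failed", "error": str(exc)}
    return {"fix_id": fix.id, "status": "success", "result": result}


if __name__ == "__main__":
    demo = SimpleNamespace(id="f1", fix_type="code_modification", target="pool.js")
    print(asyncio.run(apply_fix(demo, {})))
```

An unrecognized `fix_type` is reported as a failure rather than silently skipped, so a typo in a prevention plan shows up in the results.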

Real-time Error Monitoring and Learning

Continuous Learning from Claude Code Operations

async def monitor_claude_code_operations():
    """
    Continuously monitor Claude Code operations for error patterns and learning opportunities
    """
    operation_monitor = {
        'tool_usage_monitor': ToolUsageMonitor(),
        'error_detection_monitor': ErrorDetectionMonitor(),
        'performance_monitor': PerformanceMonitor(),
        'success_pattern_monitor': SuccessPatternMonitor()
    }
    
    async def monitoring_loop():
        while True:
            # Collect operation data
            operation_data = await collect_operation_data(operation_monitor)
            
            # Analyze for error patterns
            error_analysis = await analyze_for_error_patterns(operation_data)
            
            if error_analysis.new_patterns_detected:
                # Learn new error patterns
                await learn_new_error_patterns(error_analysis.new_patterns)
                
                # Update prevention rules
                await update_prevention_rules(error_analysis.new_patterns)
            
            # Analyze for success patterns
            success_analysis = await analyze_for_success_patterns(operation_data)
            
            if success_analysis.new_patterns_detected:
                # Learn new success patterns
                await learn_new_success_patterns(success_analysis.new_patterns)
                
                # Update recommendation engine
                await update_recommendation_engine(success_analysis.new_patterns)
            
            # Update error prevention database
            await update_error_prevention_database(
                error_analysis,
                success_analysis,
                operation_data
            )
            
            await asyncio.sleep(5)  # Monitor every 5 seconds
    
    # Start monitoring
    await monitoring_loop()
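
The loop above runs until the process exits, and a single exception inside it would end monitoring. A hedged sketch of a stoppable variant follows. `run_monitor`, `poll`, and `handle` are hypothetical names standing in for the collection and analysis steps. The sketch uses only `asyncio` from the standard library.

```python
import asyncio


async def run_monitor(poll, handle, interval: float, stop: asyncio.Event) -> int:
    """Poll until stop is set; return the number of cycles attempted."""
    cycles = 0
    while not stop.is_set():
        try:
            handle(await poll())
        except Exception as exc:
            # Log and keep going: one bad cycle should not end monitoring.
            print(f"monitor cycle failed: {exc!r}")
        cycles += 1
        try:
            # Sleep for `interval`, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            pass
    return cycles
```

Waiting on the event with a timeout, rather than calling `asyncio.sleep`, lets a shutdown request take effect without waiting out the remainder of the interval.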

async def learn_from_error_occurrence(error_details, context):
    """
    Learn from actual error occurrences to improve prevention
    """
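    # Timezone-aware UTC timestamp; datetime.utcnow() returns a naive value
    # and is deprecated as of Python 3.12
    from datetime import datetime, timezone  # normally at module top
    
    # Create error entry
    error_entry = {
        'id': generate_uuid(),
        'timestamp': datetime.now(timezone.utc).isoformat(),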
        'error_details': error_details,
        'context': context,
        'severity': classify_error_severity(error_details),
        'category': classify_error_category(error_details)
    }
    
    # Perform root cause analysis
    root_cause_analysis = await perform_root_cause_analysis(
        error_details,
        context
    )
    error_entry['root_cause_analysis'] = root_cause_analysis
    
    # Generate prevention strategies
    prevention_strategies = await generate_prevention_strategies(
        error_entry,
        root_cause_analysis
    )
    error_entry['prevention_strategy'] = prevention_strategies
    
    # Store error entry
    await store_error_entry(error_entry)
    
    # Update prevention rules
    await update_prevention_rules_from_error(error_entry)
    
    # Notify relevant personas about new error pattern
    await notify_personas_of_new_error_pattern(error_entry)
    
    return {
        'error_learned': True,
        'prevention_strategies_generated': len(prevention_strategies['prevention_steps']),
        'detection_rules_created': len(prevention_strategies['detection_rules'])
    }
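
The catalog schema defines an `error_signature` field for grouping similar errors, but the entry built above does not populate it. One way to derive such a fingerprint, sketched here as an assumption rather than the system's actual scheme, is to hash a normalized form of the category, description, and affected components. The normalization rules (lowercasing, replacing digit runs, sorting components) and the 16-character truncation are illustrative choices.

```python
import hashlib
import re


def error_signature(category: str, description: str, components: list[str]) -> str:
    """Return a short, stable fingerprint for similar errors (hypothetical scheme)."""
    # Replace digit runs so variants such as "503" and "504" collapse together,
    # then squeeze whitespace.
    normalized = re.sub(r"\d+", "<n>", description.lower())
    normalized = re.sub(r"\s+", " ", normalized).strip()

    # Sort components so their order does not change the fingerprint.
    parts = [category.lower(), normalized, *sorted(c.lower() for c in components)]
    payload = "|".join(parts)

    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

Two reports that differ only in status code, letter case, spacing, or component order receive the same signature, while a different category produces a different one.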

Error Prevention Dashboard and Reporting

Comprehensive Error Prevention Analytics

error_prevention_metrics:
  prevention_effectiveness:
    errors_prevented: "Count of errors caught before execution"
    false_positives: "Warnings that didn't lead to actual errors"
    false_negatives: "Errors that weren't caught by prevention"
    prevention_accuracy: "Percentage of accurate error predictions"
    
  learning_progress:
    new_patterns_learned: "Number of new error patterns identified"
    pattern_accuracy_improvement: "How pattern recognition has improved"
    prevention_rule_effectiveness: "Success rate of prevention rules"
    
  system_reliability:
    mean_time_between_errors: "MTBE for different error categories"
    error_severity_distribution: "Breakdown of error types caught"
    recovery_time_improvement: "How quickly errors are resolved"
    
  development_impact:
    development_velocity_impact: "How prevention affects speed"
    code_quality_improvement: "Measurable quality gains"
    developer_confidence: "Survey results on prevention helpfulness"
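
The `prevention_effectiveness` counts above can be combined into standard ratios. The sketch below reads `prevention_accuracy` as precision (the share of warnings that corresponded to real errors) and adds recall (the share of real errors that were caught). That reading is an assumption, since the document does not define the metric precisely, and the function name is hypothetical.

```python
from typing import Optional


def prevention_metrics(
    errors_prevented: int, false_positives: int, false_negatives: int
) -> dict[str, Optional[float]]:
    """Compute precision and recall from the three prevention counts."""
    warnings_issued = errors_prevented + false_positives   # everything flagged
    actual_errors = errors_prevented + false_negatives     # everything that was a real error
    return {
        # None when the ratio is undefined (no warnings, or no real errors).
        "precision": errors_prevented / warnings_issued if warnings_issued else None,
        "recall": errors_prevented / actual_errors if actual_errors else None,
    }
```

Tracking both ratios matters because a system that warns on every operation would reach perfect recall while its precision collapsed, which is the failure mode the `false_positives` count exists to expose.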

Claude Code Integration Commands

# Error prevention and analysis
bmad prevent --analyze-risks --operation "database-migration"
bmad prevent --scan-patterns --project-path "src/"
bmad prevent --check-command "rm -rf node_modules" --suggest-safer

# Error learning and pattern management
bmad errors learn --from-incident "incident-report.md"
bmad errors patterns --list --category "security"
bmad errors rules --update --based-on-recent-failures

# Prevention implementation
bmad prevent implement --fixes-for "high-risk-patterns"
bmad prevent validate --applied-fixes --test-effectiveness
bmad prevent monitor --real-time --alert-on-risks

# Error prevention reporting
bmad prevent report --effectiveness --time-period "last-month"
bmad prevent dashboard --show-trends --error-categories
bmad prevent export --prevention-rules --format "yaml"

The Error Prevention System is intended to make Claude Code a proactive development assistant: each recorded mistake yields detection rules and prevention steps, and later operations are checked against them before they run.