# Root Cause Analysis Task

## Purpose

To conduct comprehensive root cause analysis for complex technical issues, utilizing systematic methodologies to identify underlying causes and develop effective prevention strategies across all technology stacks.

## Task Overview

This task provides a structured approach to deep-dive analysis of technical problems, ensuring thorough investigation of root causes and development of comprehensive solutions that address underlying issues rather than just symptoms.

## Inputs for this Task

- Incident description and timeline
- System logs and diagnostic data
- Performance metrics and monitoring data
- Environmental configuration details
- Stakeholder interviews and observations
- Previous incident history and patterns

## Task Execution Instructions

### Phase 1: Incident Reconstruction and Data Collection

#### 1.1 Timeline Reconstruction

- **Chronological Analysis:**
  - Create detailed timeline of events leading to the incident
  - Identify trigger events and contributing factors
  - Map system changes and deployments to timeline
  - Correlate user actions with system behaviors
- **Data Point Collection:**
  - Gather all relevant logs from affected systems
  - Collect performance metrics before, during, and after incident
  - Document configuration changes and system modifications
  - Compile user reports and stakeholder observations

#### 1.2 System State Analysis

- **Pre-Incident State:**
  - Analyze system health and performance baselines
  - Review recent changes and deployments
  - Identify any warning signs or anomalies
  - Document normal operational parameters
- **Incident State:**
  - Capture system behavior during the incident
  - Document error conditions and failure modes
  - Analyze resource utilization and constraints
  - Record user impact and business consequences

### Phase 2: Systematic Root Cause Investigation

#### 2.1 5 Whys Analysis

- **Iterative Questioning:**
  - Start with the immediate problem statement
  - Ask "Why did this happen?" for each identified cause
  - Continue questioning until fundamental root cause is reached
  - Document each level of analysis with supporting evidence
- **Evidence Validation:**
  - Support each "why" with concrete evidence
  - Verify assumptions with data and testing
  - Eliminate speculation and focus on facts
  - Cross-reference findings with multiple data sources

#### 2.2 Fishbone Diagram Analysis

- **Category-Based Investigation:**
  - **People:** Human factors, training, procedures, communication
  - **Process:** Workflows, procedures, policies, standards
  - **Technology:** Hardware, software, infrastructure, tools
  - **Environment:** External factors, dependencies, constraints
- **Comprehensive Cause Mapping:**
  - Identify all potential contributing factors in each category
  - Analyze interactions between different categories
  - Prioritize causes based on impact and evidence
  - Validate cause relationships with data and testing

#### 2.3 Fault Tree Analysis

- **Top-Down Analysis:**
  - Start with the top-level failure event
  - Systematically break down into contributing events
  - Use logical gates (AND, OR) to show relationships
  - Continue decomposition until basic events are reached
- **Probability Assessment:**
  - Assign probability estimates to basic events
  - Calculate overall failure probability
  - Identify critical paths and high-impact factors
  - Prioritize mitigation efforts based on risk analysis

### Phase 3: Contributing Factor Analysis

#### 3.1 Technical Contributing Factors

- **System Design Issues:**
  - Architecture limitations and design flaws
  - Scalability constraints and bottlenecks
  - Integration weaknesses and dependencies
  - Performance limitations and resource constraints
- **Implementation Problems:**
  - Code defects and logic errors
  - Configuration mistakes and inconsistencies
  - Deployment issues and environment differences
  - Testing gaps and validation failures

#### 3.2 Process Contributing Factors

- **Operational Processes:**
  - Monitoring and alerting gaps
  - Incident response procedures
  - Change management processes
  - Capacity planning and resource management
- **Development Processes:**
  - Code review and quality assurance
  - Testing strategies and coverage
  - Deployment and release procedures
  - Documentation and knowledge management

#### 3.3 Human Contributing Factors

- **Knowledge and Training:**
  - Skill gaps and training needs
  - Knowledge transfer and documentation
  - Experience levels and expertise
  - Communication and collaboration
- **Decision Making:**
  - Risk assessment and management
  - Priority setting and resource allocation
  - Escalation procedures and authority
  - Information availability and quality

### Phase 4: Solution Development and Prevention Strategy

#### 4.1 Immediate Corrective Actions

- **Symptom Resolution:**
  - Address immediate symptoms and restore service
  - Implement temporary workarounds if needed
  - Ensure system stability and user access
  - Monitor for recurrence or side effects
- **Data Preservation:**
  - Preserve evidence for further analysis
  - Backup system states and configurations
  - Document all corrective actions taken
  - Maintain audit trail for compliance

#### 4.2 Root Cause Remediation

- **Fundamental Fixes:**
  - Address identified root causes directly
  - Implement systematic solutions rather than patches
  - Consider long-term sustainability and maintainability
  - Plan for comprehensive testing and validation
- **System Improvements:**
  - Enhance system design and architecture
  - Improve monitoring and observability
  - Strengthen error handling and resilience
  - Optimize performance and scalability

#### 4.3 Prevention Strategy Development

- **Proactive Measures:**
  - Implement monitoring and alerting for early detection
  - Develop automated testing and validation procedures
  - Create preventive maintenance and health checks
  - Establish capacity planning and resource management
- **Process Improvements:**
  - Enhance change management and deployment procedures
  - Improve incident response and escalation processes
  - Strengthen quality assurance and testing practices
  - Develop training and knowledge sharing programs

## Quality Validation

### Analysis Quality Checks

- [ ] Root cause analysis is evidence-based and thorough
- [ ] Multiple analysis methodologies were applied appropriately
- [ ] All contributing factors were identified and validated
- [ ] Cause relationships are logical and well-supported
- [ ] Analysis depth reaches fundamental root causes

### Solution Quality Checks

- [ ] Solutions address root causes, not just symptoms
- [ ] Prevention strategies are comprehensive and practical
- [ ] Implementation plans are detailed and realistic
- [ ] Risk assessment and mitigation are included
- [ ] Success criteria and metrics are defined

### Documentation Quality Checks

- [ ] Analysis process and findings are clearly documented
- [ ] Evidence and supporting data are properly referenced
- [ ] Recommendations are actionable and prioritized
- [ ] Lessons learned are captured and shareable
- [ ] Knowledge base is updated with findings

## Integration Points

### BMAD Method Integration

- Integration with troubleshooting and problem resolution workflows
- Cross-persona collaboration for complex multi-domain analysis
- Integration with quality validation and improvement processes
- Support for organizational learning and knowledge management

### Tool and Process Integration

- Integration with incident management and ticketing systems
- Support for monitoring and observability platforms
- Compatibility with quality assurance and testing frameworks
- Integration with change management and deployment processes

## Success Metrics

### Analysis Effectiveness

- Root cause identification accuracy
- Analysis completeness and thoroughness
- Time to root cause identification
- Stakeholder satisfaction with analysis quality

### Solution Effectiveness

- Problem recurrence rate
- Solution implementation success rate
- Prevention strategy effectiveness
- System reliability improvement

### Organizational Learning

- Knowledge base contribution and utilization
- Process improvement implementation rate
- Team skill development and knowledge transfer
- Incident prevention and early detection improvement

## Deliverables

### Primary Deliverables

- **Root Cause Analysis Report:** Comprehensive analysis with findings and evidence
- **Corrective Action Plan:** Detailed plan for addressing root causes
- **Prevention Strategy:** Comprehensive approach to preventing recurrence
- **Implementation Roadmap:** Prioritized plan for solution implementation

### Supporting Deliverables

- **Timeline Reconstruction:** Detailed chronology of events and factors
- **Contributing Factor Analysis:** Comprehensive analysis of all contributing elements
- **Risk Assessment:** Analysis of risks and mitigation strategies
- **Lessons Learned Document:** Insights and recommendations for organizational improvement

Remember: Effective root cause analysis requires systematic methodology, thorough investigation, and focus on fundamental causes rather than surface symptoms.
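
As an illustration of the probability assessment step in Fault Tree Analysis (section 2.3), the following is a minimal Python sketch of how an overall failure probability might be calculated from basic events combined through AND and OR gates. It assumes the basic events are statistically independent, which real incidents often violate (shared dependencies, common-cause failures), so the result is best treated as a rough prioritization aid. The class names, event names, and probability values are all hypothetical and chosen only for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from math import prod
from typing import Tuple, Union


@dataclass(frozen=True)
class BasicEvent:
    """A leaf of the fault tree with an estimated probability (hypothetical values)."""
    name: str
    probability: float


@dataclass(frozen=True)
class Gate:
    """An intermediate event combining child events with an AND or OR gate."""
    name: str
    kind: str  # "AND" or "OR"
    children: Tuple[Union[BasicEvent, "Gate"], ...]


def failure_probability(node: Union[BasicEvent, Gate]) -> float:
    """Return the probability of `node`, assuming all basic events are independent."""
    if isinstance(node, BasicEvent):
        if not 0.0 <= node.probability <= 1.0:
            raise ValueError(f"probability out of range for {node.name!r}")
        return node.probability

    child_probs = [failure_probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must occur: multiply their probabilities.
        return prod(child_probs)
    if node.kind == "OR":
        # At least one child occurs: complement of "none of them occur".
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind {node.kind!r} on {node.name!r}")


# Hypothetical tree: an outage occurs if the database fails,
# or if both the primary cache and its fallback are down.
top_event = Gate(
    name="service outage",
    kind="OR",
    children=(
        BasicEvent("database failure", 0.01),
        Gate(
            name="cache layer unavailable",
            kind="AND",
            children=(
                BasicEvent("primary cache down", 0.10),
                BasicEvent("fallback cache down", 0.20),
            ),
        ),
    ),
)

print(f"{failure_probability(top_event):.4f}")  # prints 0.0298
```

Comparing the branch probabilities in this example (0.01 for the database versus 0.02 for the cache layer) shows which path contributes most to the top-level event, which is one way to approach the "identify critical paths" and "prioritize mitigation efforts" bullets.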