
# Root Cause Analysis Task

## Purpose
To conduct root cause analysis of complex technical issues using systematic methods (5 Whys, fishbone diagrams, fault tree analysis), so that underlying causes are identified and prevention strategies are developed regardless of technology stack.

## Task Overview
This task provides a structured, four-phase approach to analyzing technical problems: reconstruct the incident, investigate root causes, analyze contributing factors, and develop solutions that address underlying issues rather than just symptoms.

## Inputs for this Task
- Incident description and timeline
- System logs and diagnostic data
- Performance metrics and monitoring data
- Environmental configuration details
- Stakeholder interviews and observations
- Previous incident history and patterns

## Task Execution Instructions

### Phase 1: Incident Reconstruction and Data Collection

#### 1.1 Timeline Reconstruction
- **Chronological Analysis:**
  - Create detailed timeline of events leading to the incident
  - Identify trigger events and contributing factors
  - Map system changes and deployments to timeline
  - Correlate user actions with system behaviors

- **Data Point Collection:**
  - Gather all relevant logs from affected systems
  - Collect performance metrics before, during, and after incident
  - Document configuration changes and system modifications
  - Compile user reports and stakeholder observations
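
As an illustration of the timeline steps above, the following is a minimal sketch, not a prescribed implementation. It merges events from several sources into one chronological list and picks out changes that landed shortly before the incident. The `Event` type, the source labels (`deploy`, `config`, `alert`, `user_report`), and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str        # hypothetical labels: "deploy", "config", "alert", "user_report"
    description: str


def build_timeline(*streams):
    """Merge any number of event streams into one chronologically ordered list."""
    merged = [event for stream in streams for event in stream]
    return sorted(merged, key=lambda e: e.timestamp)


def changes_before(timeline, incident_start, window, change_sources=("deploy", "config")):
    """Return change events that occurred within `window` before the incident began."""
    return [
        e for e in timeline
        if e.source in change_sources
        and incident_start - window <= e.timestamp < incident_start
    ]


# Hypothetical sample data for illustration only.
deploys = [Event(datetime(2024, 1, 10, 14, 2), "deploy", "release rolled out")]
alerts = [Event(datetime(2024, 1, 10, 14, 20), "alert", "error rate above threshold")]
reports = [Event(datetime(2024, 1, 10, 14, 25), "user_report", "checkout page times out")]

timeline = build_timeline(reports, alerts, deploys)
incident_start = datetime(2024, 1, 10, 14, 20)
suspects = changes_before(timeline, incident_start, timedelta(hours=1))

for event in timeline:
    print(event.timestamp.isoformat(), event.source, event.description)
```

A change that precedes the incident is only a candidate trigger; it still needs to be confirmed with the evidence gathered in Phase 2.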

#### 1.2 System State Analysis
- **Pre-Incident State:**
  - Analyze system health and performance baselines
  - Review recent changes and deployments
  - Identify any warning signs or anomalies
  - Document normal operational parameters

- **Incident State:**
  - Capture system behavior during the incident
  - Document error conditions and failure modes
  - Analyze resource utilization and constraints
  - Record user impact and business consequences
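
One possible way to compare the incident state against the pre-incident baseline is a simple z-score per metric. The sketch below assumes each metric has a handful of baseline samples and a single incident-time reading; the metric names, the sample values, and the threshold of 3 standard deviations are illustrative assumptions, not recommendations.

```python
from statistics import mean, stdev


def deviation_from_baseline(baseline, observed):
    """Return how many sample standard deviations `observed` lies from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # requires at least two baseline samples
    if sigma == 0:
        return 0.0 if observed == mu else float("inf")
    return (observed - mu) / sigma


def flag_anomalies(baselines, incident_values, threshold=3.0):
    """Return {metric: z_score} for metrics whose deviation meets the threshold."""
    flagged = {}
    for metric, observed in incident_values.items():
        z = deviation_from_baseline(baselines[metric], observed)
        if abs(z) >= threshold:
            flagged[metric] = z
    return flagged


# Hypothetical metrics and readings.
baselines = {
    "latency_ms": [100, 102, 98, 101, 99],
    "cpu_percent": [40, 42, 38, 41, 39],
}
incident_values = {"latency_ms": 180, "cpu_percent": 41}

flagged = flag_anomalies(baselines, incident_values)
print(sorted(flagged))
```

A z-score assumes roughly stable, symmetric baseline behavior; metrics with strong daily cycles or heavy tails may need a different comparison.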

### Phase 2: Systematic Root Cause Investigation

#### 2.1 5 Whys Analysis
- **Iterative Questioning:**
  - Start with the immediate problem statement
  - Ask "Why did this happen?" for each identified cause
  - Continue questioning until a fundamental root cause is reached; five iterations is a guideline, not a fixed limit
  - Document each level of analysis with supporting evidence

- **Evidence Validation:**
  - Support each "why" with concrete evidence
  - Verify assumptions with data and testing
  - Eliminate speculation and focus on facts
  - Cross-reference findings with multiple data sources
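
The pairing of each "why" with evidence can be made explicit in a small data structure. The sketch below is one hypothetical way to record a 5 Whys chain and to refuse to name a root cause while any step is still unsupported; the `Why` type, the helper names, and the example chain are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    question: str
    answer: str
    evidence: list = field(default_factory=list)  # references to logs, metrics, tests


def unsupported_steps(chain):
    """Return the 1-based positions of steps that have no supporting evidence."""
    return [i for i, step in enumerate(chain, start=1) if not step.evidence]


def root_cause(chain):
    """Return the final answer only if every step in the chain is evidenced."""
    if not chain or unsupported_steps(chain):
        return None
    return chain[-1].answer


# Hypothetical chain for illustration only.
chain = [
    Why("Why did checkout fail?",
        "The payment service returned errors.",
        ["gateway error logs"]),
    Why("Why did the payment service return errors?",
        "Its connection pool was exhausted.",
        ["pool utilization metrics"]),
    Why("Why was the pool exhausted?",
        "A new release leaked connections.",
        []),  # still speculation: no evidence attached yet
]

gaps_before = unsupported_steps(chain)
result_before = root_cause(chain)

# Once the claim is verified, attach the evidence and re-evaluate.
chain[2].evidence.append("leak reproduced in staging test")
result_after = root_cause(chain)
```

Here `root_cause` returns `None` until the third step is backed by evidence, which mirrors the rule that speculation should not be recorded as a finding.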

#### 2.2 Fishbone Diagram Analysis
- **Category-Based Investigation:**
  - **People:** Human factors, training, skills, communication
  - **Process:** Workflows, procedures, policies, standards
  - **Technology:** Hardware, software, infrastructure, tools
  - **Environment:** External factors, dependencies, constraints

- **Comprehensive Cause Mapping:**
  - Identify all potential contributing factors in each category
  - Analyze interactions between different categories
  - Prioritize causes based on impact and evidence
  - Validate cause relationships with data and testing
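
A fishbone diagram is, in data terms, a grouping of candidate causes by category plus a ranking. The sketch below uses the four categories listed above; the 1 to 5 scales and the `impact * evidence` score are assumed conventions for illustration, and a team may well prefer a different weighting.

```python
from dataclasses import dataclass

CATEGORIES = ("People", "Process", "Technology", "Environment")


@dataclass
class Cause:
    description: str
    category: str
    impact: int    # assumed scale: 1 (low) to 5 (high)
    evidence: int  # assumed scale: 1 (weak) to 5 (strong)

    @property
    def score(self):
        # Assumed scoring rule: weight impact by strength of evidence.
        return self.impact * self.evidence


def build_fishbone(causes):
    """Group causes under each category, highest score first within a category."""
    bones = {category: [] for category in CATEGORIES}
    for cause in causes:
        if cause.category not in bones:
            raise ValueError(f"unknown category: {cause.category}")
        bones[cause.category].append(cause)
    for members in bones.values():
        members.sort(key=lambda c: c.score, reverse=True)
    return bones


def prioritize(causes):
    """Rank all causes across categories by score, highest first."""
    return sorted(causes, key=lambda c: c.score, reverse=True)


# Hypothetical causes for illustration only.
causes = [
    Cause("Runbook step skipped during deploy", "People", impact=3, evidence=2),
    Cause("No canary stage in release process", "Process", impact=4, evidence=4),
    Cause("Connection pool size hard-coded", "Technology", impact=5, evidence=4),
    Cause("Upstream provider latency spike", "Environment", impact=2, evidence=3),
]

bones = build_fishbone(causes)
ranked = prioritize(causes)
for cause in ranked:
    print(cause.score, cause.category, cause.description)
```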

#### 2.3 Fault Tree Analysis
- **Top-Down Analysis:**
  - Start with the top-level failure event
  - Systematically break down into contributing events
  - Use logical gates (AND, OR) to show relationships
  - Continue decomposition until basic events are reached

- **Probability Assessment:**
  - Assign probability estimates to basic events
  - Calculate the overall failure probability by combining basic-event probabilities through the gates
  - Identify critical paths and high-impact factors
  - Prioritize mitigation efforts based on risk analysis
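
The probability calculation can be sketched as a recursive walk over the tree. This minimal example assumes that basic events are statistically independent and that each basic event appears only once in the tree; under those assumptions an AND gate multiplies its input probabilities and an OR gate yields one minus the product of the complements. The event names and probability values are hypothetical.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class BasicEvent:
    name: str
    probability: float


@dataclass
class Gate:
    name: str
    kind: str     # "AND" or "OR"
    inputs: list  # BasicEvent or Gate instances


def failure_probability(node):
    """Compute the probability of `node`, assuming independent basic events."""
    if isinstance(node, BasicEvent):
        return node.probability
    probs = [failure_probability(child) for child in node.inputs]
    if node.kind == "AND":
        # All inputs must occur.
        return prod(probs)
    if node.kind == "OR":
        # At least one input occurs.
        return 1 - prod(1 - p for p in probs)
    raise ValueError(f"unknown gate kind: {node.kind}")


# Hypothetical tree: the service is down if the database fails,
# or if the primary node fails AND failover also fails.
tree = Gate("service outage", "OR", [
    BasicEvent("database failure", 0.01),
    Gate("no healthy node", "AND", [
        BasicEvent("primary node failure", 0.1),
        BasicEvent("failover failure", 0.2),
    ]),
])

top_probability = failure_probability(tree)
print(round(top_probability, 4))
```

In this example the AND branch contributes 0.1 × 0.2 = 0.02 and the top event comes to 1 − (0.99 × 0.98) = 0.0298. When events share a common cause, the independence assumption does not hold and the result will understate or overstate the real risk.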

### Phase 3: Contributing Factor Analysis

#### 3.1 Technical Contributing Factors
- **System Design Issues:**
  - Architecture limitations and design flaws
  - Scalability constraints and bottlenecks
  - Integration weaknesses and dependencies
  - Performance limitations and resource constraints

- **Implementation Problems:**
  - Code defects and logic errors
  - Configuration mistakes and inconsistencies
  - Deployment issues and environment differences
  - Testing gaps and validation failures

#### 3.2 Process Contributing Factors
- **Operational Processes:**
  - Monitoring and alerting gaps
  - Incident response procedures
  - Change management processes
  - Capacity planning and resource management

- **Development Processes:**
  - Code review and quality assurance
  - Testing strategies and coverage
  - Deployment and release procedures
  - Documentation and knowledge management

#### 3.3 Human Contributing Factors
- **Knowledge and Training:**
  - Skill gaps and training needs
  - Knowledge transfer and documentation
  - Experience levels and expertise
  - Communication and collaboration

- **Decision Making:**
  - Risk assessment and management
  - Priority setting and resource allocation
  - Escalation procedures and authority
  - Information availability and quality

### Phase 4: Solution Development and Prevention Strategy

#### 4.1 Immediate Corrective Actions
- **Symptom Resolution:**
  - Address immediate symptoms and restore service
  - Implement temporary workarounds if needed
  - Ensure system stability and user access
  - Monitor for recurrence or side effects

- **Data Preservation:**
  - Preserve evidence for further analysis
  - Back up system states and configurations
  - Document all corrective actions taken
  - Maintain audit trail for compliance
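
One way to support evidence preservation and an audit trail is to record a cryptographic digest of each artifact at collection time, so later tampering or accidental modification can be detected. The sketch below uses SHA-256 from Python's standard `hashlib`; the manifest layout, field names, and sample data are assumptions, and real compliance requirements may demand more (access control, signed records, retention rules).

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 hex digest of an artifact's contents."""
    return hashlib.sha256(data).hexdigest()


def record_evidence(manifest, name, data, collected_by):
    """Append an entry describing the artifact to the manifest."""
    manifest.append({
        "name": name,
        "sha256": fingerprint(data),
        "collected_by": collected_by,
        "collected_at": datetime.now(timezone.utc).isoformat(),
    })


def verify_evidence(manifest, name, data):
    """Return True if `data` still matches the digest recorded for `name`."""
    for entry in manifest:
        if entry["name"] == name:
            return entry["sha256"] == fingerprint(data)
    raise KeyError(name)


# Hypothetical artifact for illustration only.
manifest = []
log_snapshot = b"2024-01-10T14:20:00Z ERROR pool exhausted\n"
record_evidence(manifest, "app.log", log_snapshot, collected_by="on-call engineer")

unchanged = verify_evidence(manifest, "app.log", log_snapshot)
tampered = verify_evidence(manifest, "app.log", log_snapshot + b"extra line\n")
```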

#### 4.2 Root Cause Remediation
- **Fundamental Fixes:**
  - Address identified root causes directly
  - Implement systematic solutions rather than patches
  - Consider long-term sustainability and maintainability
  - Plan for comprehensive testing and validation

- **System Improvements:**
  - Enhance system design and architecture
  - Improve monitoring and observability
  - Strengthen error handling and resilience
  - Optimize performance and scalability

#### 4.3 Prevention Strategy Development
- **Proactive Measures:**
  - Implement monitoring and alerting for early detection
  - Develop automated testing and validation procedures
  - Create preventive maintenance and health checks
  - Establish capacity planning and resource management

- **Process Improvements:**
  - Enhance change management and deployment procedures
  - Improve incident response and escalation processes
  - Strengthen quality assurance and testing practices
  - Develop training and knowledge sharing programs

## Quality Validation

### Analysis Quality Checks
- [ ] Root cause analysis is evidence-based and thorough
- [ ] Multiple analysis methodologies were applied appropriately
- [ ] All contributing factors were identified and validated
- [ ] Cause relationships are logical and well-supported
- [ ] Analysis depth reaches fundamental root causes

### Solution Quality Checks
- [ ] Solutions address root causes, not just symptoms
- [ ] Prevention strategies are comprehensive and practical
- [ ] Implementation plans are detailed and realistic
- [ ] Risk assessment and mitigation are included
- [ ] Success criteria and metrics are defined

### Documentation Quality Checks
- [ ] Analysis process and findings are clearly documented
- [ ] Evidence and supporting data are properly referenced
- [ ] Recommendations are actionable and prioritized
- [ ] Lessons learned are captured and shareable
- [ ] Knowledge base is updated with findings

## Integration Points

### BMAD Method Integration
- Integration with troubleshooting and problem resolution workflows
- Cross-persona collaboration for complex multi-domain analysis
- Integration with quality validation and improvement processes
- Support for organizational learning and knowledge management

### Tool and Process Integration
- Integration with incident management and ticketing systems
- Support for monitoring and observability platforms
- Compatibility with quality assurance and testing frameworks
- Integration with change management and deployment processes

## Success Metrics

### Analysis Effectiveness
- Root cause identification accuracy
- Analysis completeness and thoroughness
- Time to root cause identification
- Stakeholder satisfaction with analysis quality

### Solution Effectiveness
- Problem recurrence rate
- Solution implementation success rate
- Prevention strategy effectiveness
- System reliability improvement

### Organizational Learning
- Knowledge base contribution and utilization
- Process improvement implementation rate
- Team skill development and knowledge transfer
- Incident prevention and early detection improvement
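
Two of the metrics above, problem recurrence rate and time to root cause identification, can be computed from incident records. The definitions in this sketch are assumptions chosen for illustration: an incident counts as a recurrence if an earlier incident had the same root cause identifier, and time to root cause is measured from detection to identification. The record fields and sample data are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    incident_id: str
    root_cause_id: str
    detected_at: datetime
    root_cause_found_at: datetime


def recurrence_rate(records):
    """Fraction of incidents whose root cause was already seen in an earlier incident."""
    if not records:
        return 0.0
    seen = set()
    repeats = 0
    for record in sorted(records, key=lambda r: r.detected_at):
        if record.root_cause_id in seen:
            repeats += 1
        seen.add(record.root_cause_id)
    return repeats / len(records)


def mean_time_to_root_cause(records):
    """Average time between detection and root cause identification."""
    if not records:
        return timedelta(0)
    total = sum((r.root_cause_found_at - r.detected_at for r in records), timedelta(0))
    return total / len(records)


def _record(incident_id, cause, day, hours):
    # Helper that builds a hypothetical record detected at 09:00 on the given day.
    detected = datetime(2024, 1, day, 9, 0)
    return IncidentRecord(incident_id, cause, detected, detected + timedelta(hours=hours))


records = [
    _record("INC-1", "pool-leak", day=3, hours=2),
    _record("INC-2", "disk-full", day=8, hours=4),
    _record("INC-3", "pool-leak", day=15, hours=1),
    _record("INC-4", "bad-config", day=22, hours=5),
]

rate = recurrence_rate(records)
mttrc = mean_time_to_root_cause(records)
print(rate, mttrc)
```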

## Deliverables

### Primary Deliverables
- **Root Cause Analysis Report:** Comprehensive analysis with findings and evidence
- **Corrective Action Plan:** Detailed plan for addressing root causes
- **Prevention Strategy:** Comprehensive approach to preventing recurrence
- **Implementation Roadmap:** Prioritized plan for solution implementation

### Supporting Deliverables
- **Timeline Reconstruction:** Detailed chronology of events and factors
- **Contributing Factor Analysis:** Comprehensive analysis of all contributing elements
- **Risk Assessment:** Analysis of risks and mitigation strategies
- **Lessons Learned Document:** Insights and recommendations for organizational improvement

Remember: Effective root cause analysis requires systematic methodology, thorough investigation, and focus on fundamental causes rather than surface symptoms.