# Root Cause Analysis Task
## Purpose
Conduct root cause analysis for complex technical issues, using systematic methods to identify underlying causes and develop prevention strategies that apply across technology stacks.
## Task Overview
This task provides a structured approach to investigating technical problems in depth, so that the resulting solutions address underlying causes rather than just symptoms.
## Inputs for this Task
- Incident description and timeline
- System logs and diagnostic data
- Performance metrics and monitoring data
- Environmental configuration details
- Stakeholder interviews and observations
- Previous incident history and patterns
## Task Execution Instructions
### Phase 1: Incident Reconstruction and Data Collection
#### 1.1 Timeline Reconstruction
- **Chronological Analysis:**
  - Create detailed timeline of events leading to the incident
  - Identify trigger events and contributing factors
  - Map system changes and deployments to timeline
  - Correlate user actions with system behaviors
- **Data Point Collection:**
  - Gather all relevant logs from affected systems
  - Collect performance metrics before, during, and after incident
  - Document configuration changes and system modifications
  - Compile user reports and stakeholder observations
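
The timeline step above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the `Event` type, its `source` labels, and the `build_timeline` and `events_before` helpers are all hypothetical names introduced here. It assumes every source already reports timestamps in the same time zone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Event:
    """One timeline entry (hypothetical structure for illustration)."""
    timestamp: datetime
    source: str        # e.g. "deploy", "app-log", "user-report"
    description: str


def build_timeline(*sources: list[Event]) -> list[Event]:
    """Merge events from several sources into one chronological list."""
    merged = [event for source in sources for event in source]
    # sorted() is stable, so events with equal timestamps keep their input order.
    return sorted(merged, key=lambda e: e.timestamp)


def events_before(timeline: list[Event], incident_start: datetime,
                  window: timedelta) -> list[Event]:
    """Return events inside the window that leads up to the incident start."""
    earliest = incident_start - window
    return [e for e in timeline if earliest <= e.timestamp < incident_start]
```

Filtering with `events_before` is one way to shortlist candidate trigger events, such as deployments shortly before the incident began.
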
#### 1.2 System State Analysis
- **Pre-Incident State:**
  - Analyze system health and performance baselines
  - Review recent changes and deployments
  - Identify any warning signs or anomalies
  - Document normal operational parameters
- **Incident State:**
  - Capture system behavior during the incident
  - Document error conditions and failure modes
  - Analyze resource utilization and constraints
  - Record user impact and business consequences
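
One way to compare the incident state against the pre-incident baseline is to express each metric's shift in units of baseline standard deviation. The sketch below is an assumption-laden example: the function names, the metric dictionary layout, and the default threshold of 3.0 are illustrative choices, not part of this task definition.

```python
from statistics import mean, stdev


def deviation_from_baseline(baseline: list[float], incident: list[float]) -> float:
    """Return how many baseline standard deviations separate the two means."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    if not incident:
        raise ValueError("need at least one incident sample")
    sigma = stdev(baseline)
    if sigma == 0:
        raise ValueError("baseline has no variance; compare raw values instead")
    return (mean(incident) - mean(baseline)) / sigma


def flag_anomalies(baseline_metrics: dict[str, list[float]],
                   incident_metrics: dict[str, list[float]],
                   threshold: float = 3.0) -> dict[str, float]:
    """Return metrics whose incident mean deviates by at least `threshold` sigmas."""
    flagged = {}
    for name, baseline in baseline_metrics.items():
        if name not in incident_metrics:
            continue  # metric was not captured during the incident
        score = deviation_from_baseline(baseline, incident_metrics[name])
        if abs(score) >= threshold:
            flagged[name] = score
    return flagged
```

A flagged metric is only a lead for investigation. It shows where behavior changed, not why.
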
### Phase 2: Systematic Root Cause Investigation
#### 2.1 5 Whys Analysis
- **Iterative Questioning:**
  - Start with the immediate problem statement
  - Ask "Why did this happen?" for each identified cause
  - Continue questioning until fundamental root cause is reached
  - Document each level of analysis with supporting evidence
- **Evidence Validation:**
  - Support each "why" with concrete evidence
  - Verify assumptions with data and testing
  - Eliminate speculation and focus on facts
  - Cross-reference findings with multiple data sources
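
The requirement that every "why" carries evidence can be made checkable by recording the chain as data. The following is a small sketch under stated assumptions: the `Why` record and both helper functions are hypothetical, and "evidence" is modeled simply as a list of free-text references such as log excerpts or ticket IDs.

```python
from dataclasses import dataclass, field


@dataclass
class Why:
    """One level of a 5 Whys chain (hypothetical structure)."""
    question: str
    answer: str
    evidence: list[str] = field(default_factory=list)


def unsupported_levels(chain: list[Why]) -> list[int]:
    """Return the 1-based levels of the chain that have no supporting evidence."""
    return [level for level, why in enumerate(chain, start=1) if not why.evidence]


def root_cause(chain: list[Why]) -> str:
    """Return the deepest answer, but only if every level is backed by evidence."""
    if not chain:
        raise ValueError("the chain is empty")
    gaps = unsupported_levels(chain)
    if gaps:
        raise ValueError(f"levels without evidence: {gaps}")
    return chain[-1].answer
```

Refusing to name a root cause while any level lacks evidence keeps speculation out of the final report.
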
#### 2.2 Fishbone Diagram Analysis
- **Category-Based Investigation:**
  - **People:** Human factors, training, procedures, communication
  - **Process:** Workflows, procedures, policies, standards
  - **Technology:** Hardware, software, infrastructure, tools
  - **Environment:** External factors, dependencies, constraints
- **Comprehensive Cause Mapping:**
  - Identify all potential contributing factors in each category
  - Analyze interactions between different categories
  - Prioritize causes based on impact and evidence
  - Validate cause relationships with data and testing
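
The category mapping and prioritization above can be sketched as follows. The four categories come from this section. Everything else is an assumption made for illustration: the `Cause` record, the 1 to 5 scoring scales, and the choice to rank by `impact * evidence`.

```python
from dataclasses import dataclass

CATEGORIES = ("People", "Process", "Technology", "Environment")


@dataclass
class Cause:
    """A candidate cause on the fishbone diagram (hypothetical structure)."""
    category: str
    description: str
    impact: int    # assumed scale: 1 (low) to 5 (high)
    evidence: int  # assumed scale: 1 (weak) to 5 (strong)


def group_by_category(causes: list[Cause]) -> dict[str, list[Cause]]:
    """Place each cause on its branch; unknown categories are rejected."""
    diagram: dict[str, list[Cause]] = {category: [] for category in CATEGORIES}
    for cause in causes:
        if cause.category not in diagram:
            raise ValueError(f"unknown category: {cause.category}")
        diagram[cause.category].append(cause)
    return diagram


def prioritize(causes: list[Cause]) -> list[Cause]:
    """Order causes by impact weighted by strength of evidence, highest first."""
    return sorted(causes, key=lambda c: c.impact * c.evidence, reverse=True)
```

Multiplying the two scores is only one possible weighting. Teams that trust their impact estimates more than their evidence ratings may prefer a different rule.
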
#### 2.3 Fault Tree Analysis
- **Top-Down Analysis:**
  - Start with the top-level failure event
  - Systematically break down into contributing events
  - Use logical gates (AND, OR) to show relationships
  - Continue decomposition until basic events are reached
- **Probability Assessment:**
  - Assign probability estimates to basic events
  - Calculate overall failure probability
  - Identify critical paths and high-impact factors
  - Prioritize mitigation efforts based on risk analysis
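
To make the probability calculation concrete: if the basic events are assumed to be statistically independent, an AND gate fails with probability equal to the product of its inputs, and an OR gate fails with probability 1 minus the product of (1 - p) over its inputs. The sketch below applies those two rules recursively. The `BasicEvent` and `Gate` types are hypothetical, and the independence assumption often does not hold for events that share a common cause.

```python
from dataclasses import dataclass
from math import prod
from typing import Union


@dataclass
class BasicEvent:
    """A leaf of the fault tree with an estimated failure probability."""
    name: str
    probability: float

    def __post_init__(self) -> None:
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")


@dataclass
class Gate:
    """An intermediate or top-level event combining its children."""
    name: str
    kind: str  # "AND" or "OR"
    children: list[Union["Gate", BasicEvent]]


def failure_probability(node: Union[Gate, BasicEvent]) -> float:
    """Compute the node's failure probability, assuming independent events."""
    if isinstance(node, BasicEvent):
        return node.probability
    child_probs = [failure_probability(child) for child in node.children]
    if node.kind == "AND":
        # All children must fail.
        return prod(child_probs)
    if node.kind == "OR":
        # At least one child fails: complement of "no child fails".
        return 1.0 - prod(1.0 - p for p in child_probs)
    raise ValueError(f"unknown gate kind: {node.kind}")
```

For example, an AND gate over events with probabilities 0.1 and 0.2 yields 0.02. Combining that result with a 0.05 event through an OR gate yields 1 - (0.98 × 0.95) = 0.069.
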
### Phase 3: Contributing Factor Analysis
#### 3.1 Technical Contributing Factors
- **System Design Issues:**
  - Architecture limitations and design flaws
  - Scalability constraints and bottlenecks
  - Integration weaknesses and dependencies
  - Performance limitations and resource constraints
- **Implementation Problems:**
  - Code defects and logic errors
  - Configuration mistakes and inconsistencies
  - Deployment issues and environment differences
  - Testing gaps and validation failures
#### 3.2 Process Contributing Factors
- **Operational Processes:**
  - Monitoring and alerting gaps
  - Incident response procedures
  - Change management processes
  - Capacity planning and resource management
- **Development Processes:**
  - Code review and quality assurance
  - Testing strategies and coverage
  - Deployment and release procedures
  - Documentation and knowledge management
#### 3.3 Human Contributing Factors
- **Knowledge and Training:**
  - Skill gaps and training needs
  - Knowledge transfer and documentation
  - Experience levels and expertise
  - Communication and collaboration
- **Decision Making:**
  - Risk assessment and management
  - Priority setting and resource allocation
  - Escalation procedures and authority
  - Information availability and quality
### Phase 4: Solution Development and Prevention Strategy
#### 4.1 Immediate Corrective Actions
- **Symptom Resolution:**
  - Address immediate symptoms and restore service
  - Implement temporary workarounds if needed
  - Ensure system stability and user access
  - Monitor for recurrence or side effects
- **Data Preservation:**
  - Preserve evidence for further analysis
  - Back up system states and configurations
  - Document all corrective actions taken
  - Maintain audit trail for compliance
#### 4.2 Root Cause Remediation
- **Fundamental Fixes:**
  - Address identified root causes directly
  - Implement systematic solutions rather than patches
  - Consider long-term sustainability and maintainability
  - Plan for comprehensive testing and validation
- **System Improvements:**
  - Enhance system design and architecture
  - Improve monitoring and observability
  - Strengthen error handling and resilience
  - Optimize performance and scalability
#### 4.3 Prevention Strategy Development
- **Proactive Measures:**
  - Implement monitoring and alerting for early detection
  - Develop automated testing and validation procedures
  - Create preventive maintenance and health checks
  - Establish capacity planning and resource management
- **Process Improvements:**
  - Enhance change management and deployment procedures
  - Improve incident response and escalation processes
  - Strengthen quality assurance and testing practices
  - Develop training and knowledge sharing programs
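
As one illustration of alerting for early detection, the sketch below raises an alert only after a metric exceeds its threshold for several consecutive samples, which reduces noise from isolated spikes. The function name, the strict `>` comparison, and the default of three consecutive breaches are assumptions chosen for the example, not requirements of this task.

```python
def should_alert(samples: list[float], threshold: float, consecutive: int = 3) -> bool:
    """Return True if `consecutive` samples in a row exceed `threshold`."""
    if consecutive < 1:
        raise ValueError("consecutive must be at least 1")
    streak = 0
    for value in samples:
        # A sample at or below the threshold resets the streak.
        streak = streak + 1 if value > threshold else 0
        if streak >= consecutive:
            return True
    return False
```

In practice the threshold would typically be derived from the baselines documented during the analysis, for example the normal operational parameters recorded in Phase 1.
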
## Quality Validation
### Analysis Quality Checks
- [ ] Root cause analysis is evidence-based and thorough
- [ ] Multiple analysis methodologies were applied appropriately
- [ ] All contributing factors were identified and validated
- [ ] Cause relationships are logical and well-supported
- [ ] Analysis depth reaches fundamental root causes
### Solution Quality Checks
- [ ] Solutions address root causes, not just symptoms
- [ ] Prevention strategies are comprehensive and practical
- [ ] Implementation plans are detailed and realistic
- [ ] Risk assessment and mitigation are included
- [ ] Success criteria and metrics are defined
### Documentation Quality Checks
- [ ] Analysis process and findings are clearly documented
- [ ] Evidence and supporting data are properly referenced
- [ ] Recommendations are actionable and prioritized
- [ ] Lessons learned are captured and shareable
- [ ] Knowledge base is updated with findings
## Integration Points
### BMAD Method Integration
- Integration with troubleshooting and problem resolution workflows
- Cross-persona collaboration for complex multi-domain analysis
- Integration with quality validation and improvement processes
- Support for organizational learning and knowledge management
### Tool and Process Integration
- Integration with incident management and ticketing systems
- Support for monitoring and observability platforms
- Compatibility with quality assurance and testing frameworks
- Integration with change management and deployment processes
## Success Metrics
### Analysis Effectiveness
- Root cause identification accuracy
- Analysis completeness and thoroughness
- Time to root cause identification
- Stakeholder satisfaction with analysis quality
### Solution Effectiveness
- Problem recurrence rate
- Solution implementation success rate
- Prevention strategy effectiveness
- System reliability improvement
### Organizational Learning
- Knowledge base contribution and utilization
- Process improvement implementation rate
- Team skill development and knowledge transfer
- Incident prevention and early detection improvement
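
Two of the metrics above, problem recurrence rate and time to root cause identification, can be computed directly from incident records. The sketch below assumes a hypothetical `IncidentRecord` shape. Real incident management systems will expose different fields.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class IncidentRecord:
    """Minimal incident record (hypothetical fields for illustration)."""
    detected_at: datetime
    root_cause_found_at: datetime
    recurred: bool  # True if the same problem reappeared after the fix


def recurrence_rate(records: list[IncidentRecord]) -> float:
    """Fraction of incidents whose problem recurred after remediation."""
    if not records:
        raise ValueError("no incident records")
    return sum(1 for r in records if r.recurred) / len(records)


def mean_time_to_root_cause(records: list[IncidentRecord]) -> timedelta:
    """Average time between detection and root cause identification."""
    if not records:
        raise ValueError("no incident records")
    total = sum((r.root_cause_found_at - r.detected_at for r in records), timedelta())
    return total / len(records)
```

Tracking both values over successive review periods gives a rough signal of whether the analysis process is improving.
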
## Deliverables
### Primary Deliverables
- **Root Cause Analysis Report:** Comprehensive analysis with findings and evidence
- **Corrective Action Plan:** Detailed plan for addressing root causes
- **Prevention Strategy:** Comprehensive approach to preventing recurrence
- **Implementation Roadmap:** Prioritized plan for solution implementation
### Supporting Deliverables
- **Timeline Reconstruction:** Detailed chronology of events and factors
- **Contributing Factor Analysis:** Comprehensive analysis of all contributing elements
- **Risk Assessment:** Analysis of risks and mitigation strategies
- **Lessons Learned Document:** Insights and recommendations for organizational improvement

Remember: Effective root cause analysis requires a systematic methodology, thorough investigation, and a focus on fundamental causes rather than surface symptoms.