# Advanced Troubleshooting Analysis Task
## Purpose
To provide comprehensive troubleshooting analysis for complex technical issues across React, TypeScript, Node.js, ASP.NET, and Python technology stacks, using systematic debugging methodologies and root cause analysis techniques.
## Task Overview
This task guides the Advanced Troubleshooting Specialist through a structured approach to diagnosing and resolving sophisticated technical problems, ensuring thorough analysis, effective solutions, and comprehensive documentation.
## Inputs for this Task
- Problem description and symptoms
- System logs and error messages
- Performance metrics and monitoring data
- Environment configuration details
- Reproduction steps and conditions
- Impact assessment and urgency level
## Task Execution Instructions
### Phase 1: Problem Assessment and Information Gathering
#### 1.1 Initial Problem Analysis
- **Problem Definition:**
  - Clearly define the issue, symptoms, and observable behaviors
  - Identify affected systems, components, and user groups
  - Assess business impact and urgency level
  - Determine problem scope and boundaries
- **Information Collection:**
  - Gather system logs, error messages, and stack traces
  - Collect performance metrics and monitoring data
  - Document environment configuration and recent changes
  - Obtain reproduction steps and conditions
  - Interview stakeholders and affected users
#### 1.2 Environmental Assessment
- **System Health Check:**
  - Verify system resource utilization (CPU, memory, disk, network)
  - Check service status and connectivity
  - Validate configuration settings and dependencies
  - Review recent deployments and changes
- **Technology Stack Analysis:**
  - Identify all components in the technology stack
  - Verify version compatibility and dependencies
  - Check for known issues or vulnerabilities
  - Assess integration points and data flows
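
As a rough illustration of the System Health Check step, the sketch below gathers disk usage, load average, and TCP reachability using only the Python standard library. The function names (`disk_usage_percent`, `port_is_reachable`, `health_report`) and the shape of the `services` mapping are assumptions made for this example, not part of the BMAD method; a real check would more likely query an existing monitoring platform.

```python
import os
import shutil
import socket


def disk_usage_percent(path: str = ".") -> float:
    """Return the percentage of space used on the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def port_is_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def health_report(services: dict[str, tuple[str, int]], disk_path: str = ".") -> dict:
    """Build a small snapshot; `services` maps a label to a (host, port) pair."""
    report = {
        "disk_used_percent": round(disk_usage_percent(disk_path), 1),
        "services": {
            name: port_is_reachable(host, port)
            for name, (host, port) in services.items()
        },
    }
    try:
        # os.getloadavg exists only on Unix-like systems.
        report["load_average_1m"] = os.getloadavg()[0]
    except (AttributeError, OSError):
        report["load_average_1m"] = None
    return report
```

Calling `health_report({"api": ("localhost", 8080)})` (a hypothetical service entry) returns a dictionary that can be attached to the incident record as the baseline for later comparison.
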
### Phase 2: Systematic Analysis and Root Cause Investigation
#### 2.1 Log Analysis and Pattern Recognition
- **Log Examination:**
  - Analyze application logs for error patterns and anomalies
  - Examine system logs for infrastructure issues
  - Review security logs for potential security incidents
  - Correlate logs across multiple systems and timeframes
- **Error Pattern Analysis:**
  - Identify recurring error patterns and frequencies
  - Analyze error correlation with system events
  - Map errors to specific components or operations
  - Determine error propagation paths
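
Identifying recurring error patterns and mapping them to components can be sketched as a small grouping pass over log lines. The example assumes a log line shape of `<timestamp> <LEVEL> <component>: <message>`, which is an illustrative format rather than one prescribed by this task; the regular expression would need adjusting for real logs.

```python
import re
from collections import Counter

# Assumed line shape: "<timestamp> <LEVEL> <component>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<level>[A-Z]+)\s+(?P<component>[\w.-]+):\s+(?P<msg>.*)$"
)


def normalize(message: str) -> str:
    """Replace variable parts (hex ids, numbers) so similar errors group together."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def error_patterns(lines):
    """Count ERROR/FATAL messages per (component, normalized message), most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group("level") in {"ERROR", "FATAL"}:
            key = (match.group("component"), normalize(match.group("msg")))
            counts[key] += 1
    return counts.most_common()


sample = [
    "2024-01-01T10:00:00Z ERROR db: timeout after 3000 ms on conn 12",
    "2024-01-01T10:00:05Z INFO api: request ok",
    "2024-01-01T10:00:09Z ERROR db: timeout after 3100 ms on conn 7",
    "2024-01-01T10:01:00Z ERROR api: upstream returned 502",
]
for (component, pattern), count in error_patterns(sample):
    print(f"{count:>3}  {component:<5} {pattern}")
```

Normalizing numbers before counting is the key design choice here: without it, the two `db` timeouts would be counted as distinct errors and the recurring pattern would be hidden.
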
#### 2.2 Performance Analysis
- **Metrics Evaluation:**
  - Analyze response times, throughput, and latency metrics
  - Examine resource utilization patterns and trends
  - Identify performance bottlenecks and constraints
  - Assess scalability and capacity issues
- **Profiling and Tracing:**
  - Conduct application profiling for performance hotspots
  - Implement distributed tracing for request flows
  - Analyze database query performance and identify optimization opportunities
  - Examine memory usage patterns and garbage collection
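
For the Metrics Evaluation step, tail latency is usually more revealing than the average. The following is a minimal sketch of a nearest-rank percentile summary over response-time samples; the function names and the choice of p50/p95/p99 are assumptions for illustration, and a monitoring platform may use a different interpolation method.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize response times (milliseconds) with median and tail percentiles."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


samples = [12, 15, 11, 14, 13, 250, 16, 12, 13, 14]
print(latency_summary(samples))
```

In this sample the median is 13 ms while p95 is 250 ms, the kind of gap that points to an intermittent bottleneck rather than uniformly slow requests.
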
#### 2.3 Root Cause Analysis
- **Hypothesis Formation:**
  - Develop multiple hypotheses for potential root causes
  - Prioritize hypotheses based on evidence and probability
  - Design tests to validate or eliminate hypotheses
  - Consider both technical and process-related causes
- **Systematic Investigation:**
  - Apply 5 Whys methodology for deep analysis
  - Use fishbone diagrams for comprehensive cause mapping
  - Apply fault tree analysis for complex systems
  - Conduct timeline reconstruction for incident analysis
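
Timeline reconstruction amounts to merging events from several sources into one chronological view and then looking at what happened shortly before the incident. The sketch below assumes events have already been parsed into a hypothetical `Event` record with timezone-aware timestamps; the source names and sample events are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Event:
    timestamp: datetime
    source: str        # e.g. "deploy", "app-log", "alert" (illustrative labels)
    description: str


def build_timeline(*sources):
    """Merge event lists from several sources into one chronologically sorted list."""
    merged = [event for source in sources for event in source]
    return sorted(merged, key=lambda event: event.timestamp)


def events_before(timeline, incident_start, window):
    """Return events within `window` leading up to (and including) the incident start."""
    earliest = incident_start - window
    return [e for e in timeline if earliest <= e.timestamp <= incident_start]


def at(hour, minute):
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


deploys = [Event(at(9, 50), "deploy", "release 1.4.2 rolled out")]
app_logs = [
    Event(at(8, 0), "app-log", "nightly job finished"),
    Event(at(9, 58), "app-log", "connection pool exhausted"),
]
alerts = [Event(at(10, 0), "alert", "error rate above threshold")]

timeline = build_timeline(deploys, app_logs, alerts)
for event in events_before(timeline, at(10, 0), timedelta(minutes=30)):
    print(event.timestamp.isoformat(), event.source, event.description)
```

Keeping all timestamps in a single timezone (UTC here) before merging avoids misordered events when sources log in different local times.
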
### Phase 3: Solution Development and Strategy Planning
#### 3.1 Solution Strategy Development
- **Multiple Approach Development:**
  - Design immediate workarounds for urgent issues
  - Develop short-term fixes for quick resolution
  - Plan long-term solutions for permanent resolution
  - Consider preventive measures and improvements
- **Risk Assessment:**
  - Evaluate risks associated with each solution approach
  - Assess potential side effects and system impacts
  - Determine rollback procedures and contingency plans
  - Consider resource requirements and timelines
#### 3.2 Implementation Planning
- **Solution Prioritization:**
  - Rank solutions by effectiveness and feasibility
  - Consider implementation complexity and resource requirements
  - Assess business impact and user experience implications
  - Plan phased implementation for complex solutions
- **Testing Strategy:**
  - Design comprehensive testing procedures
  - Plan validation criteria and success metrics
  - Implement monitoring and alerting for solution effectiveness
  - Prepare rollback procedures and emergency responses
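
One way to make the Solution Prioritization ranking explicit is a simple weighted score. The sketch below is an assumption-laden illustration: the 1 to 5 scales, the `Candidate` fields, and the default weights are all hypothetical and should be agreed with stakeholders rather than taken from this example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    effectiveness: int  # 1 (low) to 5 (high), assumed scale
    feasibility: int    # 1 (hard) to 5 (easy), assumed scale
    risk: int           # 1 (low) to 5 (high), assumed scale


def score(candidate, w_effectiveness=0.5, w_feasibility=0.3, w_risk=0.2):
    """Weighted score: effectiveness and feasibility add, risk subtracts."""
    return (
        w_effectiveness * candidate.effectiveness
        + w_feasibility * candidate.feasibility
        - w_risk * candidate.risk
    )


def rank_solutions(candidates, **weights):
    """Return candidates ordered from highest to lowest weighted score."""
    return sorted(candidates, key=lambda c: score(c, **weights), reverse=True)


candidates = [
    Candidate("restart workaround", effectiveness=2, feasibility=5, risk=1),
    Candidate("connection pool fix", effectiveness=4, feasibility=4, risk=2),
    Candidate("service rewrite", effectiveness=5, feasibility=1, risk=4),
]
for candidate in rank_solutions(candidates):
    print(f"{score(candidate):.1f}  {candidate.name}")
```

A score like this is only a starting point for discussion; an urgent incident may still justify applying the lower-ranked workaround first while the higher-ranked fix is prepared.
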
### Phase 4: Implementation, Validation, and Documentation
#### 4.1 Solution Implementation
- **Controlled Deployment:**
  - Implement solutions in controlled environments first
  - Monitor system behavior and performance during implementation
  - Validate solution effectiveness against defined criteria
  - Ensure proper backup and rollback capabilities
- **Monitoring and Validation:**
  - Implement comprehensive monitoring for solution effectiveness
  - Track key performance indicators and success metrics
  - Monitor for side effects or unintended consequences
  - Validate user experience and business impact improvements
#### 4.2 Documentation and Knowledge Sharing
- **Comprehensive Documentation:**
  - Document problem description, analysis, and root cause
  - Record solution implementation steps and procedures
  - Create troubleshooting runbooks for similar issues
  - Document lessons learned and improvement recommendations
- **Knowledge Base Integration:**
  - Add findings to organizational knowledge base
  - Create searchable documentation for future reference
  - Share insights with relevant teams and stakeholders
  - Update procedures and best practices based on learnings
## Quality Validation
### Technical Quality Checks
- [ ] Root cause analysis is thorough and evidence-based
- [ ] Solutions address underlying causes, not just symptoms
- [ ] Implementation includes proper testing and validation
- [ ] Monitoring and alerting are implemented for ongoing detection
- [ ] Documentation is comprehensive and actionable
### Process Quality Checks
- [ ] Systematic troubleshooting methodology was followed
- [ ] Multiple solution approaches were considered
- [ ] Risk assessment and mitigation planning were conducted
- [ ] Stakeholder communication was maintained throughout
- [ ] Knowledge sharing and documentation were completed
### Outcome Quality Checks
- [ ] Problem resolution meets defined success criteria
- [ ] Solution implementation does not introduce new issues
- [ ] System performance and stability are maintained or improved
- [ ] User experience and business impact are positively affected
- [ ] Prevention strategies are implemented to avoid recurrence
## Integration Points
### BMAD Method Integration
- Seamless integration with BMAD orchestrator for task management
- Cross-persona collaboration for complex multi-domain issues
- Integration with quality validation frameworks and standards
- Support for automated workflow and documentation generation
### Tool and Platform Integration
- Integration with monitoring and observability platforms
- Support for log aggregation and analysis tools
- Compatibility with debugging and profiling tools
- Integration with incident management and ticketing systems
## Success Metrics
### Resolution Effectiveness
- Mean time to resolution (MTTR)
- First-call resolution rate
- Problem recurrence rate
- Solution effectiveness score
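
The first and third metrics above can be computed directly from incident records. This sketch assumes each incident carries an opened and a resolved timestamp plus a root cause label, and it defines recurrence as an incident whose root cause label has already appeared earlier in the list; both the record shape and that definition are assumptions for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Average of (resolved - opened) over an iterable of (opened, resolved) pairs."""
    durations = [resolved - opened for opened, resolved in incidents]
    if not durations:
        raise ValueError("at least one resolved incident is required")
    return sum(durations, timedelta()) / len(durations)


def recurrence_rate(root_causes):
    """Fraction of incidents whose root cause label was already seen earlier."""
    seen = set()
    repeats = 0
    for cause in root_causes:
        if cause in seen:
            repeats += 1
        seen.add(cause)
    return repeats / len(root_causes) if root_causes else 0.0


incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 10, 0)),   # 1 hour
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 17, 0)),  # 3 hours
]
print(mean_time_to_resolution(incidents))
print(recurrence_rate(["db-pool", "cert-expiry", "db-pool", "db-pool"]))
```
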
### Process Efficiency
- Troubleshooting methodology adherence
- Documentation completeness and quality
- Knowledge base contribution and utilization
- Team skill development and knowledge transfer
### System Improvement
- Incident reduction rate
- Proactive issue identification and prevention
- Monitoring and alerting coverage improvement
- Overall system reliability and performance enhancement
## Deliverables
### Primary Deliverables
- **Troubleshooting Analysis Report:** Comprehensive analysis of the problem, root cause, and solution
- **Solution Implementation Guide:** Step-by-step procedures for implementing the solution
- **Monitoring and Alerting Configuration:** Setup for ongoing detection and prevention
- **Troubleshooting Runbook:** Reusable procedures for similar issues
### Supporting Deliverables
- **Root Cause Analysis Documentation:** Detailed analysis of underlying causes
- **Risk Assessment and Mitigation Plan:** Comprehensive risk analysis and mitigation strategies
- **Knowledge Base Entries:** Searchable documentation for organizational learning
- **Process Improvement Recommendations:** Suggestions for preventing similar issues

Remember: This task ensures systematic, thorough troubleshooting that not only resolves immediate issues but also builds organizational knowledge and prevents future problems.