# Troubleshooting Analysis Template
## Document Information
- **Analysis ID:** [Unique identifier for this analysis]
- **Date Created:** [Creation date]
- **Analyst:** [Name of troubleshooting specialist]
- **Priority Level:** [Critical/High/Medium/Low]
- **Status:** [In Progress/Under Review/Complete]
## Executive Summary
[Provide a concise overview of the problem, analysis findings, and recommended solutions]
### Key Findings
- [Primary root cause identified]
- [Secondary contributing factors]
- [Impact assessment summary]
- [Recommended solution approach]
### Business Impact
- [Affected systems and users]
- [Service disruption duration]
- [Financial or operational impact]
- [Customer experience implications]
## Problem Description
### Issue Overview
**Problem Statement:** [Clear, concise description of the issue]

**Symptoms Observed:**
- [Specific symptoms and behaviors observed]
- [Error messages and codes encountered]
- [Performance degradation patterns]
- [User-reported issues and complaints]

**Affected Systems:**
- [List of affected applications and services]
- [Infrastructure components involved]
- [Integration points and dependencies]
- [Geographic or user segment impact]
### Timeline of Events
| Time | Event | System/Component | Impact |
|------|-------|------------------|---------|
| [Timestamp] | [Event description] | [System name] | [Impact level] |
| [Timestamp] | [Event description] | [System name] | [Impact level] |
### Environmental Context
**System Configuration:**
- [Relevant configuration details]
- [Version information and dependencies]
- [Infrastructure specifications]
- [Network and security settings]

**Recent Changes:**
- [Deployments and releases]
- [Configuration modifications]
- [Infrastructure changes]
- [Process or procedure updates]
## Analysis Methodology
### Troubleshooting Approach
**Primary Methods Used:**
- [ ] Log analysis and pattern recognition
- [ ] Performance metrics evaluation
- [ ] System health assessment
- [ ] Root cause analysis (5 Whys, Fishbone)
- [ ] Hypothesis testing and validation
- [ ] Component isolation and testing

**Tools and Techniques:**
- [Monitoring and observability tools used]
- [Debugging and profiling tools applied]
- [Testing and validation methods employed]
- [Analysis frameworks and methodologies]
### Data Sources
**Logs and Monitoring:**
- [Application logs and error messages]
- [System and infrastructure logs]
- [Performance metrics and dashboards]
- [Security and audit logs]

**Testing and Validation:**
- [Reproduction steps and test cases]
- [Performance benchmarks and baselines]
- [Component testing results]
- [Integration testing outcomes]
## Technical Analysis
### System Health Assessment
**Resource Utilization:**
- **CPU Usage:** [Analysis of CPU utilization patterns]
- **Memory Usage:** [Memory consumption and leak analysis]
- **Disk I/O:** [Storage performance and capacity analysis]
- **Network:** [Network connectivity and bandwidth analysis]

**Service Status:**
- [Application service health and availability]
- [Database connectivity and performance]
- [External service dependencies]
- [Load balancer and proxy status]
### Performance Analysis
**Response Time Analysis:**
```
[Include performance metrics, charts, or data]
- Average response time: [value]
- 95th percentile: [value]
- Peak response time: [value]
- Baseline comparison: [comparison data]
```
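
The figures in the block above can be derived from raw latency samples. The following is a minimal sketch, assuming response times are available as a list of millisecond values; the function name, the dictionary keys, and the nearest-rank percentile method are illustrative choices, not something this template prescribes.

```python
import statistics


def summarize_response_times(samples_ms, baseline_avg_ms=None):
    """Return average, 95th percentile, and peak for a list of latencies (ms).

    Hypothetical helper: names and the nearest-rank method are assumptions.
    """
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    # Nearest-rank percentile: rank = ceil(0.95 * n), done in integer math.
    rank = (95 * len(ordered) + 99) // 100
    summary = {
        "average_ms": statistics.fmean(ordered),
        "p95_ms": ordered[rank - 1],
        "peak_ms": ordered[-1],
    }
    if baseline_avg_ms:
        # A ratio above 1.0 means the average is slower than the baseline.
        summary["vs_baseline"] = summary["average_ms"] / baseline_avg_ms
    return summary
```

Monitoring tools often interpolate percentiles differently, so values computed this way may differ slightly from dashboard figures.
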
**Throughput Analysis:**
```
[Include throughput metrics and trends]
- Requests per second: [value]
- Transaction volume: [value]
- Error rate: [percentage]
- Success rate: [percentage]
```
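
The throughput figures above follow from three counts. This is a sketch under stated assumptions: the request totals and the observation window come from whatever monitoring source the analysis uses, and the function and key names are hypothetical.

```python
def summarize_throughput(total_requests, failed_requests, window_seconds):
    """Return requests per second plus error and success rates in percent.

    Hypothetical helper: the inputs are assumed to cover the same time window.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    # With no traffic there is nothing to fail, so report a 0% error rate.
    error_rate = 100.0 * failed_requests / total_requests if total_requests else 0.0
    return {
        "requests_per_second": total_requests / window_seconds,
        "error_rate_pct": error_rate,
        "success_rate_pct": 100.0 - error_rate,
    }
```
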
### Error Analysis
**Error Patterns:**
| Error Type | Frequency | First Occurrence | Last Occurrence | Affected Components |
|------------|-----------|------------------|-----------------|-------------------|
| [Error type] | [Count] | [Timestamp] | [Timestamp] | [Components] |
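
Rows for the table above can be produced by grouping parsed log records by error type. The sketch below assumes each record is a `(timestamp, error_type, component)` tuple and that timestamps are ISO 8601 strings in one timezone, so that string comparison matches chronological order; both the record shape and the function names are hypothetical.

```python
def summarize_errors(records):
    """Group (timestamp, error_type, component) tuples by error type."""
    groups = {}
    for timestamp, error_type, component in records:
        row = groups.setdefault(
            error_type,
            {"count": 0, "first": timestamp, "last": timestamp, "components": set()},
        )
        row["count"] += 1
        # Assumes uniformly formatted ISO 8601 strings, which sort chronologically.
        row["first"] = min(row["first"], timestamp)
        row["last"] = max(row["last"], timestamp)
        row["components"].add(component)
    return groups


def to_markdown_rows(groups):
    """Render one Markdown table row per error type, most frequent first."""
    ordered = sorted(groups.items(), key=lambda item: -item[1]["count"])
    return [
        f"| {error_type} | {row['count']} | {row['first']} | {row['last']} "
        f"| {', '.join(sorted(row['components']))} |"
        for error_type, row in ordered
    ]
```
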

**Error Correlation:**
- [Correlation with system events]
- [Relationship to user actions]
- [Connection to external factors]
- [Pattern analysis and trends]
## Root Cause Analysis
### Primary Root Cause
**Identified Cause:** [Clear statement of the primary root cause]

**Supporting Evidence:**
- [Log entries and error messages supporting this conclusion]
- [Performance data and metrics that validate the cause]
- [Test results and validation evidence]
- [Expert analysis and reasoning]

**Cause Category:**
- [ ] Application Code Defect
- [ ] Configuration Error
- [ ] Infrastructure Issue
- [ ] External Dependency
- [ ] Capacity/Scaling Issue
- [ ] Security Incident
- [ ] Process/Procedure Gap
- [ ] Human Error
### Contributing Factors
**Secondary Causes:**
1. **[Contributing factor 1]**
   - Description: [Detailed explanation]
   - Impact: [How this factor contributed]
   - Evidence: [Supporting data and analysis]
2. **[Contributing factor 2]**
   - Description: [Detailed explanation]
   - Impact: [How this factor contributed]
   - Evidence: [Supporting data and analysis]
### 5 Whys Analysis
1. **Why did [initial problem] occur?**
   - Answer: [First level cause]
   - Evidence: [Supporting evidence]
2. **Why did [first level cause] happen?**
   - Answer: [Second level cause]
   - Evidence: [Supporting evidence]
3. **Why did [second level cause] occur?**
   - Answer: [Third level cause]
   - Evidence: [Supporting evidence]
4. **Why did [third level cause] happen?**
   - Answer: [Fourth level cause]
   - Evidence: [Supporting evidence]
5. **Why did [fourth level cause] occur?**
   - Answer: [Root cause]
   - Evidence: [Supporting evidence]
## Solution Strategy
### Immediate Actions (Completed)
**Emergency Response:**
- [Actions taken to restore service]
- [Workarounds implemented]
- [System stabilization measures]
- [User communication and updates]

**Results:**
- [Effectiveness of immediate actions]
- [Service restoration timeline]
- [Remaining issues or limitations]
- [Monitoring and validation results]
### Short-term Solutions (0-30 days)
**Planned Actions:**
1. **[Solution 1]**
   - Description: [Detailed solution description]
   - Implementation steps: [Step-by-step procedure]
   - Timeline: [Expected completion date]
   - Owner: [Responsible person/team]
   - Success criteria: [How success will be measured]
2. **[Solution 2]**
   - Description: [Detailed solution description]
   - Implementation steps: [Step-by-step procedure]
   - Timeline: [Expected completion date]
   - Owner: [Responsible person/team]
   - Success criteria: [How success will be measured]
### Long-term Solutions (30+ days)
**Strategic Improvements:**
1. **[Improvement 1]**
   - Description: [Comprehensive improvement description]
   - Business justification: [Why this improvement is needed]
   - Implementation approach: [High-level implementation strategy]
   - Timeline: [Expected completion timeframe]
   - Resources required: [Personnel, budget, tools needed]
2. **[Improvement 2]**
   - Description: [Comprehensive improvement description]
   - Business justification: [Why this improvement is needed]
   - Implementation approach: [High-level implementation strategy]
   - Timeline: [Expected completion timeframe]
   - Resources required: [Personnel, budget, tools needed]
## Prevention Strategy
### Monitoring and Alerting
**Enhanced Monitoring:**
- [New metrics and thresholds to implement]
- [Alert configurations and escalation procedures]
- [Dashboard and visualization improvements]
- [Automated health checks and validations]

**Early Warning Systems:**
- [Predictive monitoring and anomaly detection]
- [Capacity planning and threshold management]
- [Dependency monitoring and health checks]
- [Performance baseline establishment]
### Process Improvements
**Development Process:**
- [Code review and quality assurance enhancements]
- [Testing strategy and coverage improvements]
- [Deployment and release procedure updates]
- [Documentation and knowledge sharing improvements]

**Operational Process:**
- [Incident response procedure updates]
- [Change management process improvements]
- [Capacity planning and resource management]
- [Training and skill development programs]
### Technical Improvements
**System Resilience:**
- [Error handling and recovery mechanisms]
- [Redundancy and failover capabilities]
- [Performance optimization and scaling]
- [Security hardening and protection]

**Architecture Enhancements:**
- [Design pattern improvements]
- [Integration and dependency management]
- [Data consistency and integrity measures]
- [Observability and debugging capabilities]
## Risk Assessment
### Implementation Risks
| Risk | Probability | Impact | Mitigation Strategy |
|------|-------------|---------|-------------------|
| [Risk description] | [High/Medium/Low] | [High/Medium/Low] | [Mitigation approach] |
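
To order the rows of the table above, the High/Medium/Low ratings can be turned into a numeric score. The sketch below assumes a 1 to 3 scale and a probability times impact product, which is a common convention rather than something this template specifies; the names are hypothetical.

```python
# Assumed mapping from the template's rating labels to numbers.
LEVELS = {"Low": 1, "Medium": 2, "High": 3}


def risk_score(probability, impact):
    """Score a risk as probability x impact on the assumed 1-3 scale (1 to 9)."""
    return LEVELS[probability] * LEVELS[impact]


def rank_risks(risks):
    """Sort risk dicts (keys: description, probability, impact) by score, highest first."""
    return sorted(
        risks,
        key=lambda risk: risk_score(risk["probability"], risk["impact"]),
        reverse=True,
    )
```

Ties keep their original order, so risks with equal scores stay in the sequence in which they were recorded.
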
### Rollback Plan
**Rollback Triggers:**
- [Conditions that would trigger rollback]
- [Monitoring criteria and thresholds]
- [Stakeholder decision points]
- [Emergency escalation procedures]
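
Monitoring criteria and thresholds like those listed above can be checked mechanically. This is a minimal sketch, assuming current metric values and rollback limits are both available as dictionaries keyed by metric name; the metric names in the usage note are hypothetical, and the decision to roll back would still rest with the stakeholders named in the plan.

```python
def breached_triggers(metrics, thresholds):
    """Return the sorted names of metrics whose value exceeds its rollback limit.

    Metrics without a reading are skipped rather than treated as breaches.
    """
    return sorted(
        name
        for name, limit in thresholds.items()
        if name in metrics and metrics[name] > limit
    )
```

For example, with thresholds `{"error_rate_pct": 5.0, "p95_latency_ms": 800}` and readings `{"error_rate_pct": 7.2, "p95_latency_ms": 640}`, only `error_rate_pct` is reported as breached.
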

**Rollback Procedures:**
1. [Step-by-step rollback procedure]
2. [Validation and verification steps]
3. [Communication and notification process]
4. [Post-rollback analysis and next steps]
## Testing and Validation
### Solution Testing
**Test Plan:**
- [Unit testing and component validation]
- [Integration testing and system validation]
- [Performance testing and load validation]
- [User acceptance testing and feedback]

**Success Criteria:**
- [Functional requirements and acceptance criteria]
- [Performance benchmarks and targets]
- [Reliability and availability metrics]
- [User experience and satisfaction measures]
### Monitoring Plan
**Key Metrics:**
- [Performance indicators to monitor]
- [Business metrics and KPIs]
- [Technical health and status metrics]
- [User experience and satisfaction metrics]

**Validation Period:**
- [Duration of monitoring and validation]
- [Review checkpoints and assessments]
- [Success criteria and go/no-go decisions]
- [Escalation procedures and contingencies]
## Documentation and Knowledge Sharing
### Lessons Learned
**Key Insights:**
- [Important discoveries and learnings]
- [Process improvements and recommendations]
- [Technical insights and best practices]
- [Communication and collaboration improvements]

**Knowledge Base Updates:**
- [Documentation updates and additions]
- [Procedure and runbook improvements]
- [Training material and resource updates]
- [Best practice and guideline enhancements]
### Communication Plan
**Stakeholder Updates:**
- [Executive summary and business impact]
- [Technical team briefings and knowledge transfer]
- [User communication and training]
- [Process and procedure updates]

**Documentation Distribution:**
- [Internal team and department sharing]
- [Cross-functional team collaboration]
- [External vendor and partner communication]
- [Compliance and audit documentation]
## Appendices
### Appendix A: Technical Details
[Detailed technical information, logs, configurations, etc.]
### Appendix B: Supporting Data
[Charts, graphs, metrics, and analytical data]
### Appendix C: Communication Records
[Stakeholder communications, decisions, and approvals]
### Appendix D: References
[Related documentation, procedures, and external resources]

---
**Document Control:**
- **Version:** [Version number]
- **Last Updated:** [Update date]
- **Next Review:** [Scheduled review date]
- **Approval:** [Approver name and date]

**Distribution:**
- [List of recipients and stakeholders]

Remember: This template ensures comprehensive troubleshooting analysis while maintaining consistency and thoroughness across all investigations.