BMAD-METHOD/bmad-agent/templates/troubleshooting-analysis-te...

11 KiB

Troubleshooting Analysis Template

Document Information

  • Analysis ID: [Unique identifier for this analysis]
  • Date Created: [Creation date]
  • Analyst: [Name of troubleshooting specialist]
  • Priority Level: [Critical/High/Medium/Low]
  • Status: [In Progress/Under Review/Complete]

Executive Summary

[Provide a concise overview of the problem, analysis findings, and recommended solutions]

Key Findings

  • [Primary root cause identified]
  • [Secondary contributing factors]
  • [Impact assessment summary]
  • [Recommended solution approach]

Business Impact

  • [Affected systems and users]
  • [Service disruption duration]
  • [Financial or operational impact]
  • [Customer experience implications]

Problem Description

Issue Overview

Problem Statement: [Clear, concise description of the issue]

Symptoms Observed:

  • [Specific symptoms and behaviors observed]
  • [Error messages and codes encountered]
  • [Performance degradation patterns]
  • [User-reported issues and complaints]

Affected Systems:

  • [List of affected applications and services]
  • [Infrastructure components involved]
  • [Integration points and dependencies]
  • [Geographic or user segment impact]

Timeline of Events

Time Event System/Component Impact
[Timestamp] [Event description] [System name] [Impact level]
[Timestamp] [Event description] [System name] [Impact level]

Environmental Context

System Configuration:

  • [Relevant configuration details]
  • [Version information and dependencies]
  • [Infrastructure specifications]
  • [Network and security settings]

Recent Changes:

  • [Deployments and releases]
  • [Configuration modifications]
  • [Infrastructure changes]
  • [Process or procedure updates]

Analysis Methodology

Troubleshooting Approach

Primary Methods Used:

  • Log analysis and pattern recognition
  • Performance metrics evaluation
  • System health assessment
  • Root cause analysis (5 Whys, Fishbone)
  • Hypothesis testing and validation
  • Component isolation and testing

Tools and Techniques:

  • [Monitoring and observability tools used]
  • [Debugging and profiling tools applied]
  • [Testing and validation methods employed]
  • [Analysis frameworks and methodologies]

Data Sources

Logs and Monitoring:

  • [Application logs and error messages]
  • [System and infrastructure logs]
  • [Performance metrics and dashboards]
  • [Security and audit logs]

Testing and Validation:

  • [Reproduction steps and test cases]
  • [Performance benchmarks and baselines]
  • [Component testing results]
  • [Integration testing outcomes]

Technical Analysis

System Health Assessment

Resource Utilization:

  • CPU Usage: [Analysis of CPU utilization patterns]
  • Memory Usage: [Memory consumption and leak analysis]
  • Disk I/O: [Storage performance and capacity analysis]
  • Network: [Network connectivity and bandwidth analysis]

Service Status:

  • [Application service health and availability]
  • [Database connectivity and performance]
  • [External service dependencies]
  • [Load balancer and proxy status]

Performance Analysis

Response Time Analysis:

[Include performance metrics, charts, or data]
- Average response time: [value]
- 95th percentile: [value]
- Peak response time: [value]
- Baseline comparison: [comparison data]

Throughput Analysis:

[Include throughput metrics and trends]
- Requests per second: [value]
- Transaction volume: [value]
- Error rate: [percentage]
- Success rate: [percentage]

Error Analysis

Error Patterns:

Error Type Frequency First Occurrence Last Occurrence Affected Components
[Error type] [Count] [Timestamp] [Timestamp] [Components]

Error Correlation:

  • [Correlation with system events]
  • [Relationship to user actions]
  • [Connection to external factors]
  • [Pattern analysis and trends]

Root Cause Analysis

Primary Root Cause

Identified Cause: [Clear statement of the primary root cause]

Supporting Evidence:

  • [Log entries and error messages supporting this conclusion]
  • [Performance data and metrics that validate the cause]
  • [Test results and validation evidence]
  • [Expert analysis and reasoning]

Cause Category:

  • Application Code Defect
  • Configuration Error
  • Infrastructure Issue
  • External Dependency
  • Capacity/Scaling Issue
  • Security Incident
  • Process/Procedure Gap
  • Human Error

Contributing Factors

Secondary Causes:

  1. [Contributing factor 1]

    • Description: [Detailed explanation]
    • Impact: [How this factor contributed]
    • Evidence: [Supporting data and analysis]
  2. [Contributing factor 2]

    • Description: [Detailed explanation]
    • Impact: [How this factor contributed]
    • Evidence: [Supporting data and analysis]

5 Whys Analysis

  1. Why did [initial problem] occur?

    • Answer: [First level cause]
    • Evidence: [Supporting evidence]
  2. Why did [first level cause] happen?

    • Answer: [Second level cause]
    • Evidence: [Supporting evidence]
  3. Why did [second level cause] occur?

    • Answer: [Third level cause]
    • Evidence: [Supporting evidence]
  4. Why did [third level cause] happen?

    • Answer: [Fourth level cause]
    • Evidence: [Supporting evidence]
  5. Why did [fourth level cause] occur?

    • Answer: [Root cause]
    • Evidence: [Supporting evidence]

Solution Strategy

Immediate Actions (Completed)

Emergency Response:

  • [Actions taken to restore service]
  • [Workarounds implemented]
  • [System stabilization measures]
  • [User communication and updates]

Results:

  • [Effectiveness of immediate actions]
  • [Service restoration timeline]
  • [Remaining issues or limitations]
  • [Monitoring and validation results]

Short-term Solutions (0-30 days)

Planned Actions:

  1. [Solution 1]

    • Description: [Detailed solution description]
    • Implementation steps: [Step-by-step procedure]
    • Timeline: [Expected completion date]
    • Owner: [Responsible person/team]
    • Success criteria: [How success will be measured]
  2. [Solution 2]

    • Description: [Detailed solution description]
    • Implementation steps: [Step-by-step procedure]
    • Timeline: [Expected completion date]
    • Owner: [Responsible person/team]
    • Success criteria: [How success will be measured]

Long-term Solutions (30+ days)

Strategic Improvements:

  1. [Improvement 1]

    • Description: [Comprehensive improvement description]
    • Business justification: [Why this improvement is needed]
    • Implementation approach: [High-level implementation strategy]
    • Timeline: [Expected completion timeframe]
    • Resources required: [Personnel, budget, tools needed]
  2. [Improvement 2]

    • Description: [Comprehensive improvement description]
    • Business justification: [Why this improvement is needed]
    • Implementation approach: [High-level implementation strategy]
    • Timeline: [Expected completion timeframe]
    • Resources required: [Personnel, budget, tools needed]

Prevention Strategy

Monitoring and Alerting

Enhanced Monitoring:

  • [New metrics and thresholds to implement]
  • [Alert configurations and escalation procedures]
  • [Dashboard and visualization improvements]
  • [Automated health checks and validations]

Early Warning Systems:

  • [Predictive monitoring and anomaly detection]
  • [Capacity planning and threshold management]
  • [Dependency monitoring and health checks]
  • [Performance baseline establishment]

Process Improvements

Development Process:

  • [Code review and quality assurance enhancements]
  • [Testing strategy and coverage improvements]
  • [Deployment and release procedure updates]
  • [Documentation and knowledge sharing improvements]

Operational Process:

  • [Incident response procedure updates]
  • [Change management process improvements]
  • [Capacity planning and resource management]
  • [Training and skill development programs]

Technical Improvements

System Resilience:

  • [Error handling and recovery mechanisms]
  • [Redundancy and failover capabilities]
  • [Performance optimization and scaling]
  • [Security hardening and protection]

Architecture Enhancements:

  • [Design pattern improvements]
  • [Integration and dependency management]
  • [Data consistency and integrity measures]
  • [Observability and debugging capabilities]

Risk Assessment

Implementation Risks

Risk Probability Impact Mitigation Strategy
[Risk description] [High/Medium/Low] [High/Medium/Low] [Mitigation approach]

Rollback Plan

Rollback Triggers:

  • [Conditions that would trigger rollback]
  • [Monitoring criteria and thresholds]
  • [Stakeholder decision points]
  • [Emergency escalation procedures]

Rollback Procedures:

  1. [Step-by-step rollback procedure]
  2. [Validation and verification steps]
  3. [Communication and notification process]
  4. [Post-rollback analysis and next steps]

Testing and Validation

Solution Testing

Test Plan:

  • [Unit testing and component validation]
  • [Integration testing and system validation]
  • [Performance testing and load validation]
  • [User acceptance testing and feedback]

Success Criteria:

  • [Functional requirements and acceptance criteria]
  • [Performance benchmarks and targets]
  • [Reliability and availability metrics]
  • [User experience and satisfaction measures]

Monitoring Plan

Key Metrics:

  • [Performance indicators to monitor]
  • [Business metrics and KPIs]
  • [Technical health and status metrics]
  • [User experience and satisfaction metrics]

Validation Period:

  • [Duration of monitoring and validation]
  • [Review checkpoints and assessments]
  • [Success criteria and go/no-go decisions]
  • [Escalation procedures and contingencies]

Documentation and Knowledge Sharing

Lessons Learned

Key Insights:

  • [Important discoveries and learnings]
  • [Process improvements and recommendations]
  • [Technical insights and best practices]
  • [Communication and collaboration improvements]

Knowledge Base Updates:

  • [Documentation updates and additions]
  • [Procedure and runbook improvements]
  • [Training material and resource updates]
  • [Best practice and guideline enhancements]

Communication Plan

Stakeholder Updates:

  • [Executive summary and business impact]
  • [Technical team briefings and knowledge transfer]
  • [User communication and training]
  • [Process and procedure updates]

Documentation Distribution:

  • [Internal team and department sharing]
  • [Cross-functional team collaboration]
  • [External vendor and partner communication]
  • [Compliance and audit documentation]

Appendices

Appendix A: Technical Details

[Detailed technical information, logs, configurations, etc.]

Appendix B: Supporting Data

[Charts, graphs, metrics, and analytical data]

Appendix C: Communication Records

[Stakeholder communications, decisions, and approvals]

Appendix D: References

[Related documentation, procedures, and external resources]


Document Control:

  • Version: [Version number]
  • Last Updated: [Update date]
  • Next Review: [Scheduled review date]
  • Approval: [Approver name and date]

Distribution:

  • [List of recipients and stakeholders]

Remember: This template ensures comprehensive troubleshooting analysis while maintaining consistency and thoroughness across all investigations.