Advanced Troubleshooting Analysis Task

Purpose

To provide comprehensive troubleshooting analysis for complex technical issues across React, TypeScript, Node.js, ASP.NET, and Python technology stacks, using systematic debugging methods and root cause analysis techniques.

Task Overview

This task guides the Advanced Troubleshooting Specialist through a structured approach to diagnosing and resolving sophisticated technical problems, ensuring thorough analysis, effective solutions, and comprehensive documentation.

Inputs for this Task

  • Problem description and symptoms
  • System logs and error messages
  • Performance metrics and monitoring data
  • Environment configuration details
  • Reproduction steps and conditions
  • Impact assessment and urgency level

Task Execution Instructions

Phase 1: Problem Assessment and Information Gathering

1.1 Initial Problem Analysis

  • Problem Definition:

    • Clearly define the issue, symptoms, and observable behaviors
    • Identify affected systems, components, and user groups
    • Assess business impact and urgency level
    • Determine problem scope and boundaries
  • Information Collection:

    • Gather system logs, error messages, and stack traces
    • Collect performance metrics and monitoring data
    • Document environment configuration and recent changes
    • Obtain reproduction steps and conditions
    • Interview stakeholders and affected users

1.2 Environmental Assessment

  • System Health Check:

    • Verify system resource utilization (CPU, memory, disk, network)
    • Check service status and connectivity
    • Validate configuration settings and dependencies
    • Review recent deployments and changes
  • Technology Stack Analysis:

    • Identify all components in the technology stack
    • Verify version compatibility and dependencies
    • Check for known issues or vulnerabilities
    • Assess integration points and data flows
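
Part of the system health check above can be scripted. The following is a minimal sketch, assuming a Python environment on the affected host and using only the standard library. The threshold values and the names `THRESHOLDS`, `collect_snapshot`, and `flag_resources` are illustrative assumptions, not part of this task definition. Memory and network checks are left out because the standard library has no portable call for them.

```python
import os
import shutil

# Hypothetical alert thresholds in percent; tune them for the system under investigation.
THRESHOLDS = {"disk_used_pct": 90.0, "load_per_cpu_pct": 150.0}


def collect_snapshot(path="."):
    """Collect a small resource snapshot using only the standard library."""
    usage = shutil.disk_usage(path)
    snapshot = {"disk_used_pct": 100.0 * usage.used / usage.total}
    # os.getloadavg() is not available on every platform (for example Windows).
    if hasattr(os, "getloadavg"):
        try:
            load_1min = os.getloadavg()[0]
            snapshot["load_per_cpu_pct"] = 100.0 * load_1min / (os.cpu_count() or 1)
        except OSError:
            pass  # load average could not be obtained; leave the metric out
    return snapshot


def flag_resources(snapshot, thresholds=THRESHOLDS):
    """Return the sorted names of metrics that exceed their threshold."""
    return sorted(
        name for name, value in snapshot.items()
        if name in thresholds and value > thresholds[name]
    )


if __name__ == "__main__":
    snap = collect_snapshot()
    print(snap, "over threshold:", flag_resources(snap))
```

Recording the snapshot alongside the problem description gives a baseline to compare against after a fix is deployed.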

Phase 2: Systematic Analysis and Root Cause Investigation

2.1 Log Analysis and Pattern Recognition

  • Log Examination:

    • Analyze application logs for error patterns and anomalies
    • Examine system logs for infrastructure issues
    • Review security logs for potential security incidents
    • Correlate logs across multiple systems and timeframes
  • Error Pattern Analysis:

    • Identify recurring error patterns and frequencies
    • Analyze error correlation with system events
    • Map errors to specific components or operations
    • Determine error propagation paths
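
Identifying recurring error patterns and mapping them to components can be sketched as follows. This is a rough illustration, assuming a made-up log format of `<timestamp> <LEVEL> <component>: <message>`. Real logs will need a different regular expression, and the sample lines are invented for the example.

```python
import re
from collections import Counter

# Assumed (hypothetical) log format: "<timestamp> <LEVEL> <component>: <message>"
LINE_RE = re.compile(r"^(\S+) (ERROR|WARN|INFO) (\w+): (.*)$")


def normalize(message):
    """Replace numbers and hex ids with a placeholder so similar errors group together."""
    return re.sub(r"\b(0x[0-9a-fA-F]+|\d+)\b", "<N>", message)


def error_patterns(lines):
    """Count ERROR lines per (component, normalized message), most frequent first."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group(2) == "ERROR":
            counts[(match.group(3), normalize(match.group(4)))] += 1
    return counts.most_common()


if __name__ == "__main__":
    sample = [  # invented sample data
        "2024-01-01T10:00:00 ERROR db: timeout after 30 ms",
        "2024-01-01T10:00:05 ERROR db: timeout after 45 ms",
        "2024-01-01T10:00:06 INFO api: request ok",
        "2024-01-01T10:00:07 ERROR api: user 17 not found",
    ]
    for (component, pattern), count in error_patterns(sample):
        print(count, component, pattern)
```

Normalizing variable parts such as ids and durations is what lets two differently worded instances of the same failure count as one pattern.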

2.2 Performance Analysis

  • Metrics Evaluation:

    • Analyze response times, throughput, and latency metrics
    • Examine resource utilization patterns and trends
    • Identify performance bottlenecks and constraints
    • Assess scalability and capacity issues
  • Profiling and Tracing:

    • Conduct application profiling for performance hotspots
    • Implement distributed tracing for request flows
    • Analyze database query performance and identify optimization opportunities
    • Examine memory usage patterns and garbage collection
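
When evaluating response times, percentiles usually say more than averages because a few slow requests can hide behind a healthy mean. The sketch below assumes latency samples are already available as a list of milliseconds. It uses the nearest-rank percentile definition, which is one of several common definitions, and the function names are illustrative.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize latency samples; compare the mean against the tail percentiles."""
    return {
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "max": max(samples_ms),
    }


if __name__ == "__main__":
    # Invented sample: nine fast requests and one slow outlier.
    print(latency_summary([10, 12, 11, 13, 12, 11, 10, 14, 12, 250]))
```

In the invented sample the median stays at 12 ms while the mean is pulled up to 35.5 ms by a single outlier, which is the kind of gap worth investigating as a bottleneck.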

2.3 Root Cause Analysis

  • Hypothesis Formation:

    • Develop multiple hypotheses for potential root causes
    • Prioritize hypotheses based on evidence and probability
    • Design tests to validate or eliminate hypotheses
    • Consider both technical and process-related causes
  • Systematic Investigation:

    • Apply the 5 Whys method for deep analysis
    • Use fishbone diagrams for comprehensive cause mapping
    • Implement fault tree analysis for complex systems
    • Conduct timeline reconstruction for incident analysis
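
Timeline reconstruction lends itself to a small script. The sketch below assumes events have been collected by hand as `(iso_timestamp, kind, description)` tuples, where `kind` is either `"change"` or `"error"`. That tuple layout, the function names, and the sample events are assumptions made for illustration.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge events from several sources into one chronologically ordered list."""
    events = [event for source in sources for event in source]
    return sorted(events, key=lambda event: datetime.fromisoformat(event[0]))


def last_change_before_first_error(timeline):
    """Return the most recent 'change' event that precedes the first 'error', or None."""
    last_change = None
    for event in timeline:
        if event[1] == "change":
            last_change = event
        elif event[1] == "error":
            return last_change
    return None


if __name__ == "__main__":
    # Invented sample events from a deployment log and an application log.
    deploys = [
        ("2024-05-01T09:00:00", "change", "config update"),
        ("2024-05-01T11:30:00", "change", "release deployed"),
    ]
    app_log = [("2024-05-01T11:42:00", "error", "connection pool exhausted")]
    timeline = build_timeline(deploys, app_log)
    print(last_change_before_first_error(timeline))
```

The change closest to the first error is only a candidate hypothesis. It still needs to be validated or eliminated with a designed test, as described under Hypothesis Formation.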

Phase 3: Solution Development and Strategy Planning

3.1 Solution Strategy Development

  • Multiple Approach Development:

    • Design immediate workarounds for urgent issues
    • Develop short-term fixes for quick resolution
    • Plan long-term solutions for permanent resolution
    • Consider preventive measures and improvements
  • Risk Assessment:

    • Evaluate risks associated with each solution approach
    • Assess potential side effects and system impacts
    • Determine rollback procedures and contingency plans
    • Consider resource requirements and timelines

3.2 Implementation Planning

  • Solution Prioritization:

    • Rank solutions by effectiveness and feasibility
    • Consider implementation complexity and resource requirements
    • Assess business impact and user experience implications
    • Plan phased implementation for complex solutions
  • Testing Strategy:

    • Design comprehensive testing procedures
    • Plan validation criteria and success metrics
    • Implement monitoring and alerting for solution effectiveness
    • Prepare rollback procedures and emergency responses
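
Ranking solutions by effectiveness, feasibility, and risk can be made explicit with a simple weighted score. This is one possible sketch rather than a prescribed formula. The 1 to 5 rating scale, the weights, and the candidate solutions are all hypothetical and should be agreed with stakeholders.

```python
from dataclasses import dataclass

# Hypothetical weights: effectiveness counts most, risk is subtracted.
WEIGHTS = {"effectiveness": 3, "feasibility": 2, "risk": 2}


@dataclass
class Solution:
    name: str
    effectiveness: int  # 1 (low) to 5 (high)
    feasibility: int    # 1 (hard) to 5 (easy)
    risk: int           # 1 (low) to 5 (high)

    def score(self):
        """Weighted score: higher is better; risk lowers the score."""
        return (
            WEIGHTS["effectiveness"] * self.effectiveness
            + WEIGHTS["feasibility"] * self.feasibility
            - WEIGHTS["risk"] * self.risk
        )


def rank(solutions):
    """Return solutions ordered from highest to lowest score."""
    return sorted(solutions, key=lambda s: s.score(), reverse=True)


if __name__ == "__main__":
    candidates = [  # invented examples of a workaround, a short-term fix, and a long-term fix
        Solution("restart service nightly", effectiveness=2, feasibility=5, risk=1),
        Solution("increase pool size", effectiveness=3, feasibility=4, risk=2),
        Solution("refactor connection handling", effectiveness=5, feasibility=2, risk=4),
    ]
    for s in rank(candidates):
        print(s.score(), s.name)
```

A score like this is an aid to discussion, not a decision. A low-scoring long-term fix may still be required for permanent resolution and can be scheduled as a later phase.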

Phase 4: Implementation, Validation, and Documentation

4.1 Solution Implementation

  • Controlled Deployment:

    • Implement solutions in controlled environments first
    • Monitor system behavior and performance during implementation
    • Validate solution effectiveness against defined criteria
    • Ensure proper backup and rollback capabilities
  • Monitoring and Validation:

    • Implement comprehensive monitoring for solution effectiveness
    • Track key performance indicators and success metrics
    • Monitor for side effects or unintended consequences
    • Validate user experience and business impact improvements
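
Validating solution effectiveness against defined criteria is easier when the criteria are written down as data before deployment. The sketch below assumes each criterion is an upper or lower bound on a named metric. The metric names and target values are invented for the example.

```python
import operator

# Supported comparison operators for a criterion.
OPS = {"<=": operator.le, ">=": operator.ge}


def validate(metrics, criteria):
    """Check each criterion against observed metrics; a missing metric counts as a failure."""
    results = {}
    for name, (op, target) in criteria.items():
        value = metrics.get(name)
        results[name] = value is not None and OPS[op](value, target)
    return results


if __name__ == "__main__":
    # Hypothetical success criteria agreed before deployment.
    criteria = {
        "p95_latency_ms": ("<=", 200),
        "error_rate_pct": ("<=", 0.5),
        "throughput_rps": (">=", 100),
    }
    observed = {"p95_latency_ms": 180, "error_rate_pct": 0.4}  # throughput not yet measured
    results = validate(observed, criteria)
    print(results, "all passed:", all(results.values()))
```

Treating an unmeasured metric as a failure keeps a fix from being declared successful before all the agreed evidence is in.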

4.2 Documentation and Knowledge Sharing

  • Comprehensive Documentation:

    • Document problem description, analysis, and root cause
    • Record solution implementation steps and procedures
    • Create troubleshooting runbooks for similar issues
    • Document lessons learned and improvement recommendations
  • Knowledge Base Integration:

    • Add findings to organizational knowledge base
    • Create searchable documentation for future reference
    • Share insights with relevant teams and stakeholders
    • Update procedures and best practices based on learnings

Quality Validation

Technical Quality Checks

  • Root cause analysis is thorough and evidence-based
  • Solutions address underlying causes, not just symptoms
  • Implementation includes proper testing and validation
  • Monitoring and alerting are implemented for ongoing detection
  • Documentation is comprehensive and actionable

Process Quality Checks

  • Systematic troubleshooting methodology was followed
  • Multiple solution approaches were considered
  • Risk assessment and mitigation planning were conducted
  • Stakeholder communication was maintained throughout
  • Knowledge sharing and documentation were completed

Outcome Quality Checks

  • Problem resolution meets defined success criteria
  • Solution implementation does not introduce new issues
  • System performance and stability are maintained or improved
  • User experience and business impact are positively affected
  • Prevention strategies are implemented to avoid recurrence

Integration Points

BMAD Method Integration

  • Seamless integration with BMAD orchestrator for task management
  • Cross-persona collaboration for complex multi-domain issues
  • Integration with quality validation frameworks and standards
  • Support for automated workflow and documentation generation

Tool and Platform Integration

  • Integration with monitoring and observability platforms
  • Support for log aggregation and analysis tools
  • Compatibility with debugging and profiling tools
  • Integration with incident management and ticketing systems

Success Metrics

Resolution Effectiveness

  • Mean time to resolution (MTTR)
  • First-call resolution rate
  • Problem recurrence rate
  • Solution effectiveness score
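
Two of these metrics can be computed directly from incident records. The sketch below assumes each incident is an `(opened, resolved, root_cause)` tuple and treats an incident as a recurrence when its root cause label has been seen before. Both the record layout and that definition of recurrence are assumptions, and the sample incidents are invented.

```python
from datetime import datetime, timedelta


def mttr(incidents):
    """Mean time to resolution: average of (resolved - opened) over all incidents."""
    durations = [resolved - opened for opened, resolved, _ in incidents]
    return sum(durations, timedelta()) / len(durations)


def recurrence_rate(incidents):
    """Fraction of incidents whose root cause label already appeared in an earlier incident."""
    seen = set()
    recurrences = 0
    for opened, _, root_cause in sorted(incidents, key=lambda i: i[0]):
        if root_cause in seen:
            recurrences += 1
        seen.add(root_cause)
    return recurrences / len(incidents)


if __name__ == "__main__":
    incidents = [  # invented sample data
        (datetime(2024, 3, 1, 9, 0), datetime(2024, 3, 1, 10, 0), "pool exhaustion"),
        (datetime(2024, 3, 5, 14, 0), datetime(2024, 3, 5, 16, 0), "expired certificate"),
        (datetime(2024, 3, 9, 8, 0), datetime(2024, 3, 9, 11, 0), "pool exhaustion"),
    ]
    print("MTTR:", mttr(incidents), "recurrence rate:", recurrence_rate(incidents))
```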

Process Efficiency

  • Troubleshooting methodology adherence
  • Documentation completeness and quality
  • Knowledge base contribution and utilization
  • Team skill development and knowledge transfer

System Improvement

  • Incident reduction rate
  • Proactive issue identification and prevention
  • Monitoring and alerting coverage improvement
  • Overall system reliability and performance enhancement

Deliverables

Primary Deliverables

  • Troubleshooting Analysis Report: Comprehensive analysis of the problem, root cause, and solution
  • Solution Implementation Guide: Step-by-step procedures for implementing the solution
  • Monitoring and Alerting Configuration: Setup for ongoing detection and prevention
  • Troubleshooting Runbook: Reusable procedures for similar issues

Supporting Deliverables

  • Root Cause Analysis Documentation: Detailed analysis of underlying causes
  • Risk Assessment and Mitigation Plan: Comprehensive risk analysis and mitigation strategies
  • Knowledge Base Entries: Searchable documentation for organizational learning
  • Process Improvement Recommendations: Suggestions for preventing similar issues

Remember: This task ensures systematic, thorough troubleshooting that not only resolves immediate issues but also builds organizational knowledge and prevents future problems.