# Advanced Troubleshooting Analysis Task

## Purpose
Provide a comprehensive troubleshooting analysis for complex technical issues across the React, TypeScript, Node.js, ASP.NET, and Python technology stacks, using systematic debugging methodologies and root cause analysis techniques.

## Task Overview
This task guides the Advanced Troubleshooting Specialist through a structured approach to diagnosing and resolving sophisticated technical problems, ensuring thorough analysis, effective solutions, and comprehensive documentation.

## Inputs for this Task
- Problem description and symptoms
- System logs and error messages
- Performance metrics and monitoring data
- Environment configuration details
- Reproduction steps and conditions
- Impact assessment and urgency level

## Task Execution Instructions

### Phase 1: Problem Assessment and Information Gathering

#### 1.1 Initial Problem Analysis
- **Problem Definition:**
  - Clearly define the issue, symptoms, and observable behaviors
  - Identify affected systems, components, and user groups
  - Assess business impact and urgency level
  - Determine problem scope and boundaries

- **Information Collection:**
  - Gather system logs, error messages, and stack traces
  - Collect performance metrics and monitoring data
  - Document environment configuration and recent changes
  - Obtain reproduction steps and conditions
  - Interview stakeholders and affected users

#### 1.2 Environmental Assessment
- **System Health Check:**
  - Verify system resource utilization (CPU, memory, disk, network)
  - Check service status and connectivity
  - Validate configuration settings and dependencies
  - Review recent deployments and changes

- **Technology Stack Analysis:**
  - Identify all components in the technology stack
  - Verify version compatibility and dependencies
  - Check for known issues or vulnerabilities
  - Assess integration points and data flows
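As an illustration of the resource utilization check in the System Health Check, here is a minimal sketch that compares utilization readings against limits. The function name `find_breaches`, the metric names, and the percentage limits are all hypothetical; how the readings are collected depends on the monitoring stack in use.

```python
from typing import Dict, List

# Hypothetical limits, in percent of capacity; tune them per system.
DEFAULT_LIMITS: Dict[str, float] = {"cpu": 85.0, "memory": 90.0, "disk": 80.0}


def find_breaches(usage: Dict[str, float],
                  limits: Dict[str, float] = DEFAULT_LIMITS) -> List[str]:
    """Return the sorted names of resources that exceed their limit."""
    breaches = []
    for name, limit in limits.items():
        value = usage.get(name)
        # A missing reading is flagged as well: no data is not proof of health.
        if value is None or value > limit:
            breaches.append(name)
    return sorted(breaches)


# Example readings (made up for illustration).
readings = {"cpu": 97.0, "memory": 40.0, "disk": 81.5}
print(find_breaches(readings))
```

Flagging missing readings is a design choice in this sketch: a silent metrics gap during an incident is often a symptom in its own right.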

### Phase 2: Systematic Analysis and Root Cause Investigation

#### 2.1 Log Analysis and Pattern Recognition
- **Log Examination:**
  - Analyze application logs for error patterns and anomalies
  - Examine system logs for infrastructure issues
  - Review security logs for potential security incidents
  - Correlate logs across multiple systems and timeframes

- **Error Pattern Analysis:**
  - Identify recurring error patterns and frequencies
  - Analyze error correlation with system events
  - Map errors to specific components or operations
  - Determine error propagation paths
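One way to identify recurring error patterns and map them to components is to normalize each error message into a signature and count occurrences. The sketch below assumes a hypothetical log line shape of `<timestamp> <LEVEL> <component>: <message>`; real logs will need their own pattern, and the names `signature` and `error_patterns` are illustrative only.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# Assumed line shape: "<timestamp> <LEVEL> <component>: <message>".
LINE_RE = re.compile(
    r"^\S+\s+(?P<level>[A-Z]+)\s+(?P<component>[\w.-]+):\s+(?P<message>.*)$"
)


def signature(message: str) -> str:
    """Collapse variable parts (hex ids, numbers) so similar errors group together."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def error_patterns(lines: Iterable[str]) -> List[Tuple[Tuple[str, str], int]]:
    """Count ERROR lines per (component, signature), most frequent first."""
    counts: Counter = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match and match.group("level") == "ERROR":
            key = (match.group("component"), signature(match.group("message")))
            counts[key] += 1
    return counts.most_common()


# Made-up sample lines for illustration.
sample = [
    "2024-05-01T10:00:00+00:00 ERROR orders-api: timeout after 3000 ms calling payment 17",
    "2024-05-01T10:00:05+00:00 INFO orders-api: request 42 completed",
    "2024-05-01T10:00:09+00:00 ERROR orders-api: timeout after 3012 ms calling payment 18",
    "2024-05-01T10:01:00+00:00 ERROR auth: token 0x9f3a rejected",
]
for (component, pattern), count in error_patterns(sample):
    print(count, component, pattern)
```

Grouping by signature rather than by raw message keeps request ids and durations from splitting one recurring fault into many apparent one-offs.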

#### 2.2 Performance Analysis
- **Metrics Evaluation:**
  - Analyze response times, throughput, and latency metrics
  - Examine resource utilization patterns and trends
  - Identify performance bottlenecks and constraints
  - Assess scalability and capacity issues

- **Profiling and Tracing:**
  - Conduct application profiling for performance hotspots
  - Implement distributed tracing for request flows
  - Analyze database query performance and optimization
  - Examine memory usage patterns and garbage collection
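When evaluating response times and latency, percentiles usually say more than averages because a few slow requests can hide behind a healthy mean. The following is a minimal sketch using the nearest-rank percentile definition; the function names and the choice of p50/p95/p99 are assumptions for illustration, not part of this task's specification.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to limit floating-point drift in the rank.
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def latency_summary(samples_ms: Sequence[float]) -> Dict[str, float]:
    """Summarize latency samples (milliseconds) with common percentiles."""
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
        "max": max(samples_ms),
    }


# Made-up samples: four typical requests and one outlier.
print(latency_summary([120, 80, 100, 2000, 90]))
```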

#### 2.3 Root Cause Analysis
- **Hypothesis Formation:**
  - Develop multiple hypotheses for potential root causes
  - Prioritize hypotheses based on evidence and probability
  - Design tests to validate or eliminate hypotheses
  - Consider both technical and process-related causes

- **Systematic Investigation:**
  - Apply 5 Whys methodology for deep analysis
  - Use fishbone diagrams for comprehensive cause mapping
  - Implement fault tree analysis for complex systems
  - Conduct timeline reconstruction for incident analysis
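Timeline reconstruction typically means merging events from several sources (deployments, application logs, alerts) into one chronological view. A minimal sketch follows, assuming each source provides `(timestamp, detail)` pairs with ISO 8601 timestamps that carry an explicit UTC offset; the `Event` type, the function names, and the sample data are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, NamedTuple, Tuple


class Event(NamedTuple):
    at: datetime
    source: str
    detail: str


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with an explicit offset and normalize to UTC.

    Note: a timestamp without an offset would be interpreted as local time.
    """
    return datetime.fromisoformat(stamp).astimezone(timezone.utc)


def build_timeline(sources: Dict[str, List[Tuple[str, str]]]) -> List[Event]:
    """Merge events from all sources into one list ordered by UTC time."""
    events = [
        Event(parse_utc(stamp), source, detail)
        for source, records in sources.items()
        for stamp, detail in records
    ]
    return sorted(events, key=lambda event: event.at)


# Made-up incident data; note the deploy system reports in a +02:00 offset.
incident = {
    "deploy": [("2024-05-01T12:00:00+02:00", "release rolled out")],
    "app": [
        ("2024-05-01T10:03:10+00:00", "first timeout"),
        ("2024-05-01T09:58:00+00:00", "last healthy check"),
    ],
}
for event in build_timeline(incident):
    print(event.at.isoformat(), event.source, event.detail)
```

Normalizing every timestamp to UTC before sorting matters here: sorted as raw strings, the deployment would appear to come after the first timeout instead of before it.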

### Phase 3: Solution Development and Strategy Planning

#### 3.1 Solution Strategy Development
- **Multiple Approach Development:**
  - Design immediate workarounds for urgent issues
  - Develop short-term fixes for quick resolution
  - Plan long-term solutions for permanent resolution
  - Consider preventive measures and improvements

- **Risk Assessment:**
  - Evaluate risks associated with each solution approach
  - Assess potential side effects and system impacts
  - Determine rollback procedures and contingency plans
  - Consider resource requirements and timelines

#### 3.2 Implementation Planning
- **Solution Prioritization:**
  - Rank solutions by effectiveness and feasibility
  - Consider implementation complexity and resource requirements
  - Assess business impact and user experience implications
  - Plan phased implementation for complex solutions

- **Testing Strategy:**
  - Design comprehensive testing procedures
  - Plan validation criteria and success metrics
  - Implement monitoring and alerting for solution effectiveness
  - Prepare rollback procedures and emergency responses
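Ranking solutions by effectiveness and feasibility can be made explicit with a simple weighted score. The sketch below is one possible scheme, not a prescribed formula: the 1 to 5 scales, the weights, the inclusion of risk as a penalty, and the candidate names are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Candidate:
    name: str
    effectiveness: int  # 1 (weak) to 5 (strong)
    feasibility: int    # 1 (hard to implement) to 5 (easy)
    risk: int           # 1 (safe) to 5 (risky)


def score(candidate: Candidate,
          w_effectiveness: float = 0.5,
          w_feasibility: float = 0.3,
          w_risk: float = 0.2) -> float:
    """Weighted score; risk counts against the candidate."""
    return (w_effectiveness * candidate.effectiveness
            + w_feasibility * candidate.feasibility
            - w_risk * candidate.risk)


def rank(candidates: Iterable[Candidate]) -> List[Candidate]:
    """Order candidates from highest to lowest score."""
    return sorted(candidates, key=score, reverse=True)


# Hypothetical options: a workaround, a short-term fix, a long-term solution.
options = [
    Candidate("restart service", effectiveness=2, feasibility=5, risk=1),
    Candidate("patch connection pool", effectiveness=4, feasibility=4, risk=2),
    Candidate("redesign data layer", effectiveness=5, feasibility=1, risk=4),
]
for option in rank(options):
    print(f"{score(option):.1f}  {option.name}")
```

A score like this is a conversation aid rather than a verdict; recording the inputs makes it easier to revisit the ranking when new evidence changes an estimate.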

### Phase 4: Implementation, Validation, and Documentation

#### 4.1 Solution Implementation
- **Controlled Deployment:**
  - Implement solutions in controlled environments first
  - Monitor system behavior and performance during implementation
  - Validate solution effectiveness against defined criteria
  - Ensure proper backup and rollback capabilities

- **Monitoring and Validation:**
  - Implement comprehensive monitoring for solution effectiveness
  - Track key performance indicators and success metrics
  - Monitor for side effects or unintended consequences
  - Validate user experience and business impact improvements

#### 4.2 Documentation and Knowledge Sharing
- **Comprehensive Documentation:**
  - Document problem description, analysis, and root cause
  - Record solution implementation steps and procedures
  - Create troubleshooting runbooks for similar issues
  - Document lessons learned and improvement recommendations

- **Knowledge Base Integration:**
  - Add findings to organizational knowledge base
  - Create searchable documentation for future reference
  - Share insights with relevant teams and stakeholders
  - Update procedures and best practices based on learnings

## Quality Validation

### Technical Quality Checks
- [ ] Root cause analysis is thorough and evidence-based
- [ ] Solutions address underlying causes, not just symptoms
- [ ] Implementation includes proper testing and validation
- [ ] Monitoring and alerting are implemented for ongoing detection
- [ ] Documentation is comprehensive and actionable

### Process Quality Checks
- [ ] Systematic troubleshooting methodology was followed
- [ ] Multiple solution approaches were considered
- [ ] Risk assessment and mitigation planning were conducted
- [ ] Stakeholder communication was maintained throughout
- [ ] Knowledge sharing and documentation were completed

### Outcome Quality Checks
- [ ] Problem resolution meets defined success criteria
- [ ] Solution implementation does not introduce new issues
- [ ] System performance and stability are maintained or improved
- [ ] User experience and business impact are positively affected
- [ ] Prevention strategies are implemented to avoid recurrence

## Integration Points

### BMAD Method Integration
- Seamless integration with BMAD orchestrator for task management
- Cross-persona collaboration for complex multi-domain issues
- Integration with quality validation frameworks and standards
- Support for automated workflow and documentation generation

### Tool and Platform Integration
- Integration with monitoring and observability platforms
- Support for log aggregation and analysis tools
- Compatibility with debugging and profiling tools
- Integration with incident management and ticketing systems

## Success Metrics

### Resolution Effectiveness
- Mean time to resolution (MTTR)
- First-call resolution rate
- Problem recurrence rate
- Solution effectiveness score
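Two of these metrics, MTTR and the problem recurrence rate, can be computed directly from incident records. The sketch below assumes a hypothetical `Incident` record with open and resolve times and a flag marking repeat root causes; actual field names will depend on the incident management system in use.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Sequence


class Incident(NamedTuple):
    opened: datetime
    resolved: datetime
    recurrence: bool  # True if the same root cause was seen before


def mean_time_to_resolution(incidents: Sequence[Incident]) -> timedelta:
    """Average of (resolved - opened) over all incidents."""
    if not incidents:
        raise ValueError("no incidents")
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total / len(incidents)


def recurrence_rate(incidents: Sequence[Incident]) -> float:
    """Fraction of incidents whose root cause had occurred before."""
    if not incidents:
        raise ValueError("no incidents")
    return sum(1 for i in incidents if i.recurrence) / len(incidents)


# Made-up records: one 2-hour incident and one 4-hour repeat incident.
history = [
    Incident(datetime(2024, 5, 1, 9, 0), datetime(2024, 5, 1, 11, 0), False),
    Incident(datetime(2024, 5, 3, 14, 0), datetime(2024, 5, 3, 18, 0), True),
]
print(mean_time_to_resolution(history), recurrence_rate(history))
```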

### Process Efficiency
- Troubleshooting methodology adherence
- Documentation completeness and quality
- Knowledge base contribution and utilization
- Team skill development and knowledge transfer

### System Improvement
- Incident reduction rate
- Proactive issue identification and prevention
- Monitoring and alerting coverage improvement
- Overall system reliability and performance enhancement

## Deliverables

### Primary Deliverables
- **Troubleshooting Analysis Report:** Comprehensive analysis of the problem, root cause, and solution
- **Solution Implementation Guide:** Step-by-step procedures for implementing the solution
- **Monitoring and Alerting Configuration:** Setup for ongoing detection and prevention
- **Troubleshooting Runbook:** Reusable procedures for similar issues

### Supporting Deliverables
- **Root Cause Analysis Documentation:** Detailed analysis of underlying causes
- **Risk Assessment and Mitigation Plan:** Comprehensive risk analysis and mitigation strategies
- **Knowledge Base Entries:** Searchable documentation for organizational learning
- **Process Improvement Recommendations:** Suggestions for preventing similar issues

Remember: This task ensures systematic, thorough troubleshooting that not only resolves immediate issues but also builds organizational knowledge and prevents future problems.