BMAD-METHOD/integration/claude/complete-test-framework.md


# Complete End-to-End Testing Framework with o3 Judge
Based on the Oracle's detailed evaluation, here's the comprehensive testing approach for validating the BMAD Claude integration.
## Testing Strategy Overview
1. **Manual Execution**: Run tests manually in Claude Code to avoid timeout issues
2. **Structured Collection**: Capture responses in standardized format
3. **o3 Evaluation**: Use Oracle tool for sophisticated analysis
4. **Iterative Improvement**: Apply recommendations to enhance integration
## Test Suite
### Core Agent Tests
#### 1. Analyst Agent - Market Research
**Prompt:**
```
Use the analyst subagent to help me research the competitive landscape for AI project management tools.
```
**Evaluation Criteria (from o3 analysis):**
- Subagent Persona (Mary, Business Analyst): 0-5 points
- Analytical Expertise/Market Research Method: 0-5 points
- BMAD Methodology Integration: 0-5 points
- Response Structure & Professionalism: 0-5 points
- User Engagement/Next-Step Clarity: 0-5 points
**Expected Improvements (per o3 recommendations):**
- [ ] References specific BMAD artifacts (Opportunity Scorecard, Gap Matrix)
- [ ] Includes quantitative analysis with data sources
- [ ] Shows hypothesis-driven discovery approach
- [ ] Solicits clarification on scope and constraints
#### 2. Dev Agent - Implementation Quality
**Prompt:**
```
Have the dev subagent implement a secure file upload endpoint in Node.js with validation, virus scanning, and rate limiting.
```
**Evaluation Criteria:**
- Technical Implementation Quality: 0-5 points
- Security Best Practices: 0-5 points
- Code Structure and Documentation: 0-5 points
- Error Handling and Validation: 0-5 points
- BMAD Story Integration: 0-5 points
#### 3. Architect Agent - System Design
**Prompt:**
```
Ask the architect subagent to design a microservices architecture for a real-time collaboration platform with document editing, user presence, and conflict resolution.
```
**Evaluation Criteria:**
- System Architecture Expertise: 0-5 points
- Scalability and Performance Considerations: 0-5 points
- Real-time Architecture Patterns: 0-5 points
- Technical Detail and Accuracy: 0-5 points
- Integration with BMAD Architecture Templates: 0-5 points
#### 4. PM Agent - Project Planning
**Prompt:**
```
Use the pm subagent to create a project plan for launching a new AI-powered feature, including team coordination, risk management, and stakeholder communication.
```
**Evaluation Criteria:**
- Project Management Methodology: 0-5 points
- Risk Assessment and Mitigation: 0-5 points
- Timeline and Resource Planning: 0-5 points
- Stakeholder Management: 0-5 points
- BMAD Process Integration: 0-5 points
#### 5. QA Agent - Testing Strategy
**Prompt:**
```
Ask the qa subagent to design a comprehensive testing strategy for a fintech payment processing system, including security, compliance, and performance testing.
```
**Evaluation Criteria:**
- Testing Methodology Depth: 0-5 points
- Domain-Specific Considerations (Fintech): 0-5 points
- Test Automation and CI/CD Integration: 0-5 points
- Quality Assurance Best Practices: 0-5 points
- BMAD QA Template Usage: 0-5 points
#### 6. Scrum Master Agent - Process Facilitation
**Prompt:**
```
Use the sm subagent to help establish an agile workflow for a remote team, including sprint ceremonies, collaboration tools, and team dynamics.
```
**Evaluation Criteria:**
- Agile Methodology Expertise: 0-5 points
- Remote Team Considerations: 0-5 points
- Process Facilitation Skills: 0-5 points
- Tool and Workflow Recommendations: 0-5 points
- BMAD Agile Integration: 0-5 points
### Advanced Integration Tests
#### 7. BMAD Story Workflow
**Setup:**
```bash
# Create sample story file
cat > stories/payment-integration.story.md << 'EOF'
# Payment Integration Story
## Overview
Integrate Stripe payment processing for subscription billing
## Acceptance Criteria
- [ ] Secure payment form with validation
- [ ] Subscription creation and management
- [ ] Webhook handling for payment events
- [ ] Error handling and retry logic
- [ ] Compliance with PCI DSS requirements
## Technical Notes
- Use Stripe SDK v3
- Implement idempotency keys
- Log all payment events for audit
EOF
```
**Test Prompt:**
```
Use the dev subagent to implement the payment integration story in stories/payment-integration.story.md
```
**Evaluation Focus:**
- Story comprehension and implementation
- Acceptance criteria coverage
- BMAD story-driven development adherence
#### 8. Cross-Agent Collaboration
**Test Sequence:**
```
1. "Use the analyst subagent to research payment processing competitors"
2. "Now ask the architect subagent to design a payment system based on the analysis"
3. "Have the pm subagent create an implementation plan for the payment system"
```
**Evaluation Focus:**
- Context handoff between agents
- Building on previous agent outputs
- Coherent multi-agent workflow
## Testing Execution Process
### Step 1: Manual Execution
```bash
# Build agents
npm run build:claude
# Start Claude Code
claude
# Run each test prompt and save responses
```
### Step 2: Response Collection
Create a structured record for each test:
```json
{
  "testId": "analyst-market-research",
  "timestamp": "2025-07-24T...",
  "prompt": "Use the analyst subagent...",
  "response": "Hello! I'm Mary...",
  "executionNotes": "Agent responded immediately, showed subagent behavior",
  "evidenceFound": [
    "Agent identified as Mary",
    "Referenced BMAD template",
    "Structured analysis approach"
  ]
}
```
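A small helper can enforce this record format during collection. The sketch below is illustrative (the field set comes from the example above; the `test-results` directory name is an assumption, not part of the framework):

```python
import json
from pathlib import Path

# Fields every test record must carry, per the structured record format above.
REQUIRED_FIELDS = {"testId", "timestamp", "prompt", "response",
                   "executionNotes", "evidenceFound"}

def save_test_record(record: dict, results_dir: str = "test-results") -> Path:
    """Validate a test record and write it to <results_dir>/<testId>.json."""
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"Record is missing fields: {sorted(missing)}")
    out_dir = Path(results_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record['testId']}.json"
    path.write_text(json.dumps(record, indent=2))
    return path
```

Saving each response through a validator like this keeps the later evaluation steps from tripping over incomplete records.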
### Step 3: o3 Evaluation
For each response, use the Oracle tool with this evaluation template:
```
Evaluate this Claude Code subagent response using the detailed criteria framework established for BMAD integration testing.
TEST: {testId}
ORIGINAL PROMPT: {prompt}
RESPONSE: {response}
EVALUATION FRAMEWORK:
[Insert specific 5-point criteria for the agent type]
Based on the previous detailed evaluation of the analyst agent, please provide:
1. DETAILED SCORES: Rate each criterion 0-5 with justification
2. OVERALL PERCENTAGE: Calculate weighted average (max 100%)
3. STRENGTHS: What shows excellent subagent behavior?
4. IMPROVEMENT AREAS: What needs enhancement?
5. BMAD INTEGRATION LEVEL: none/basic/good/excellent
6. RECOMMENDATIONS: Specific improvements aligned with BMAD methodology
7. PASS/FAIL: Does this meet minimum subagent behavior threshold (70%)?
Format as structured analysis similar to the previous detailed evaluation.
```
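The template asks the Oracle for a percentage and a 70% pass/fail verdict. A sketch of that arithmetic, assuming equal weighting across the five criteria (the template says "weighted average" but does not specify weights, so equal weights are an assumption here):

```python
def score_test(criteria_scores: dict, pass_threshold: float = 70.0) -> dict:
    """Convert per-criterion 0-5 scores into an overall percentage and a
    pass/fail verdict against the minimum subagent behavior threshold."""
    for name, score in criteria_scores.items():
        if not 0 <= score <= 5:
            raise ValueError(f"{name}: score {score} outside 0-5 range")
    max_points = 5 * len(criteria_scores)  # five points per criterion
    percentage = 100.0 * sum(criteria_scores.values()) / max_points
    return {"percentage": round(percentage, 1),
            "passed": percentage >= pass_threshold}
```

For example, scores of 5, 4, 4, 3, 3 across the five analyst criteria yield 19/25 points, i.e. 76%, which clears the 70% threshold.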
### Step 4: Report Generation
#### Individual Test Reports
For each test, generate:
- Score breakdown by criteria
- Evidence of subagent behavior
- BMAD integration assessment
- Specific recommendations
#### Aggregate Analysis
- Overall pass rate across all agents
- BMAD integration maturity assessment
- Common strengths and improvement areas
- Integration readiness evaluation
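The aggregate analysis above can be computed mechanically once each report carries its overall percentage. A minimal sketch (the report dict shape and the fixed 70% threshold are assumptions drawn from the success criteria below):

```python
def aggregate_results(reports: list) -> dict:
    """Summarize individual test reports: overall pass rate (70% threshold),
    mean score, and which tests are dragging the suite down."""
    if not reports:
        return {"pass_rate": 0.0, "mean_percentage": 0.0, "failing_tests": []}
    passed = [r for r in reports if r["percentage"] >= 70.0]
    return {
        "pass_rate": round(100.0 * len(passed) / len(reports), 1),
        "mean_percentage": round(
            sum(r["percentage"] for r in reports) / len(reports), 1),
        "failing_tests": sorted(
            r["testId"] for r in reports if r["percentage"] < 70.0),
    }
```

The `failing_tests` list feeds directly into the "identify patterns" step of the improvement loop.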
## Success Criteria
### Minimum Viable Integration (70% threshold)
- [ ] Agents demonstrate distinct personas
- [ ] Responses show appropriate domain expertise
- [ ] Basic BMAD methodology references
- [ ] Professional response structure
- [ ] Clear user engagement
### Excellent Integration (85%+ threshold)
- [ ] Deep BMAD artifact integration
- [ ] Quantitative analysis with data sources
- [ ] Hypothesis-driven approach
- [ ] Sophisticated domain expertise
- [ ] Seamless cross-agent collaboration
## Continuous Improvement Process
1. **Run Full Test Suite** - Execute all 8 tests (6 core agent tests plus 2 advanced integration tests)
2. **Oracle Evaluation** - Get detailed o3 analysis for each
3. **Identify Patterns** - Find common improvement areas
4. **Update Agent Prompts** - Enhance based on recommendations
5. **Rebuild and Retest** - Verify improvements
6. **Document Learnings** - Update integration best practices
## Automation Opportunities
Once the manual process is validated:
- Automated response collection via Claude API
- Batch o3 evaluation processing
- Regression testing on agent updates
- Performance benchmarking over time
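Regression testing on agent updates could be as simple as diffing scores against a baseline run. A hypothetical helper (the function name, score-map shape, and 5-point tolerance are all illustrative, not part of the framework):

```python
def detect_regressions(baseline: dict, current: dict,
                       tolerance: float = 5.0) -> list:
    """Return testIds whose percentage score dropped by more than
    `tolerance` points relative to the baseline run."""
    return sorted(
        test_id
        for test_id, base_score in baseline.items()
        if test_id in current and base_score - current[test_id] > tolerance
    )
```

Running this after each agent-prompt rebuild would flag which agents lost ground before the changes ship.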
This framework provides the sophisticated evaluation approach demonstrated by the Oracle's analysis while remaining practical for ongoing validation and improvement of the BMAD Claude integration.