# Complete End-to-End Testing Framework with o3 Judge

Based on the Oracle's detailed evaluation, here is the comprehensive testing approach for validating the BMAD Claude integration.

## Testing Strategy Overview

1. **Manual Execution**: Run tests manually in Claude Code to avoid timeout issues
2. **Structured Collection**: Capture responses in a standardized format
3. **o3 Evaluation**: Use the Oracle tool for sophisticated analysis
4. **Iterative Improvement**: Apply recommendations to enhance the integration

## Test Suite

### Core Agent Tests

#### 1. Analyst Agent - Market Research

**Prompt:**

```
Use the analyst subagent to help me research the competitive landscape for AI project management tools.
```

**Evaluation Criteria (from o3 analysis):**

- Subagent Persona (Mary, Business Analyst): 0-5 points
- Analytical Expertise/Market Research Method: 0-5 points
- BMAD Methodology Integration: 0-5 points
- Response Structure & Professionalism: 0-5 points
- User Engagement/Next-Step Clarity: 0-5 points

**Expected Improvements (per o3 recommendations):**

- [ ] References specific BMAD artefacts (Opportunity Scorecard, Gap Matrix)
- [ ] Includes quantitative analysis with data sources
- [ ] Shows a hypothesis-driven discovery approach
- [ ] Solicits clarification on scope and constraints

#### 2. Dev Agent - Implementation Quality

**Prompt:**

```
Have the dev subagent implement a secure file upload endpoint in Node.js with validation, virus scanning, and rate limiting.
```

**Evaluation Criteria:**

- Technical Implementation Quality: 0-5 points
- Security Best Practices: 0-5 points
- Code Structure and Documentation: 0-5 points
- Error Handling and Validation: 0-5 points
- BMAD Story Integration: 0-5 points

#### 3. Architect Agent - System Design

**Prompt:**

```
Ask the architect subagent to design a microservices architecture for a real-time collaboration platform with document editing, user presence, and conflict resolution.
```

**Evaluation Criteria:**

- System Architecture Expertise: 0-5 points
- Scalability and Performance Considerations: 0-5 points
- Real-time Architecture Patterns: 0-5 points
- Technical Detail and Accuracy: 0-5 points
- Integration with BMAD Architecture Templates: 0-5 points

#### 4. PM Agent - Project Planning

**Prompt:**

```
Use the pm subagent to create a project plan for launching a new AI-powered feature, including team coordination, risk management, and stakeholder communication.
```

**Evaluation Criteria:**

- Project Management Methodology: 0-5 points
- Risk Assessment and Mitigation: 0-5 points
- Timeline and Resource Planning: 0-5 points
- Stakeholder Management: 0-5 points
- BMAD Process Integration: 0-5 points

#### 5. QA Agent - Testing Strategy

**Prompt:**

```
Ask the qa subagent to design a comprehensive testing strategy for a fintech payment processing system, including security, compliance, and performance testing.
```

**Evaluation Criteria:**

- Testing Methodology Depth: 0-5 points
- Domain-Specific Considerations (Fintech): 0-5 points
- Test Automation and CI/CD Integration: 0-5 points
- Quality Assurance Best Practices: 0-5 points
- BMAD QA Template Usage: 0-5 points

#### 6. Scrum Master Agent - Process Facilitation

**Prompt:**

```
Use the sm subagent to help establish an agile workflow for a remote team, including sprint ceremonies, collaboration tools, and team dynamics.
```

**Evaluation Criteria:**

- Agile Methodology Expertise: 0-5 points
- Remote Team Considerations: 0-5 points
- Process Facilitation Skills: 0-5 points
- Tool and Workflow Recommendations: 0-5 points
- BMAD Agile Integration: 0-5 points

### Advanced Integration Tests

#### 7. BMAD Story Workflow

**Setup:**

```bash
# Create sample story file
cat > stories/payment-integration.story.md << 'EOF'
# Payment Integration Story

## Overview
Integrate Stripe payment processing for subscription billing

## Acceptance Criteria
- [ ] Secure payment form with validation
- [ ] Subscription creation and management
- [ ] Webhook handling for payment events
- [ ] Error handling and retry logic
- [ ] Compliance with PCI DSS requirements

## Technical Notes
- Use Stripe SDK v3
- Implement idempotency keys
- Log all payment events for audit
EOF
```

**Test Prompt:**

```
Use the dev subagent to implement the payment integration story in stories/payment-integration.story.md
```

**Evaluation Focus:**

- Story comprehension and implementation
- Acceptance criteria coverage
- BMAD story-driven development adherence

#### 8. Cross-Agent Collaboration

**Test Sequence:**

```
1. "Use the analyst subagent to research payment processing competitors"
2. "Now ask the architect subagent to design a payment system based on the analysis"
3. "Have the pm subagent create an implementation plan for the payment system"
```

**Evaluation Focus:**

- Context handoff between agents
- Building on previous agent outputs
- Coherent multi-agent workflow

## Testing Execution Process

### Step 1: Manual Execution

```bash
# Build agents
npm run build:claude

# Start Claude Code
claude

# Run each test prompt and save responses
```

### Step 2: Response Collection

Create a structured record for each test:

```json
{
  "testId": "analyst-market-research",
  "timestamp": "2025-07-24T...",
  "prompt": "Use the analyst subagent...",
  "response": "Hello! I'm Mary...",
  "executionNotes": "Agent responded immediately, showed subagent behavior",
  "evidenceFound": [
    "Agent identified as Mary",
    "Referenced BMAD template",
    "Structured analysis approach"
  ]
}
```

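To keep collection consistent across the suite, the record can be captured with a small helper. A minimal TypeScript sketch, assuming records are saved as JSON files under a hypothetical `test-results/` directory; the field names mirror the example above, everything else is illustrative:

```typescript
import { mkdirSync, writeFileSync } from "node:fs";
import { join } from "node:path";

// Shape of a collected test record (mirrors the JSON example above).
interface TestRecord {
  testId: string;
  timestamp: string; // ISO-8601, e.g. new Date().toISOString()
  prompt: string;
  response: string;
  executionNotes: string;
  evidenceFound: string[];
}

// Write one record to test-results/<testId>.json (directory name is an assumption).
function saveRecord(record: TestRecord, dir = "test-results"): string {
  mkdirSync(dir, { recursive: true });
  const file = join(dir, `${record.testId}.json`);
  writeFileSync(file, JSON.stringify(record, null, 2));
  return file;
}

// Example usage while running the analyst test manually.
saveRecord({
  testId: "analyst-market-research",
  timestamp: new Date().toISOString(),
  prompt: "Use the analyst subagent to help me research the competitive landscape...",
  response: "<paste the full agent response here>",
  executionNotes: "Agent responded immediately, showed subagent behavior",
  evidenceFound: ["Agent identified as Mary", "Referenced BMAD template"],
});
```

One file per test keeps later batch evaluation and regression comparisons straightforward.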
### Step 3: o3 Evaluation

For each response, use the Oracle tool with this evaluation template:

```
Evaluate this Claude Code subagent response using the detailed criteria framework established for BMAD integration testing.

TEST: {testId}
ORIGINAL PROMPT: {prompt}
RESPONSE: {response}

EVALUATION FRAMEWORK:
[Insert specific 5-point criteria for the agent type]

Based on the previous detailed evaluation of the analyst agent, please provide:

1. DETAILED SCORES: Rate each criterion 0-5 with justification
2. OVERALL PERCENTAGE: Calculate weighted average (max 100%)
3. STRENGTHS: What shows excellent subagent behavior?
4. IMPROVEMENT AREAS: What needs enhancement?
5. BMAD INTEGRATION LEVEL: none/basic/good/excellent
6. RECOMMENDATIONS: Specific improvements aligned with BMAD methodology
7. PASS/FAIL: Does this meet minimum subagent behavior threshold (70%)?

Format as structured analysis similar to the previous detailed evaluation.
```

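Filling the template by hand is error-prone, so it can also be generated from a collected record. A minimal sketch, reusing the record shape from Step 2; `criteriaByAgent` is a hypothetical lookup that would hold the full per-agent rubrics (only the analyst entry is shown):

```typescript
// Per-agent criteria text; this entry is abbreviated, not the full rubric set.
const criteriaByAgent: Record<string, string> = {
  "analyst-market-research": [
    "- Subagent Persona (Mary, Business Analyst): 0-5 points",
    "- Analytical Expertise/Market Research Method: 0-5 points",
    "- BMAD Methodology Integration: 0-5 points",
    "- Response Structure & Professionalism: 0-5 points",
    "- User Engagement/Next-Step Clarity: 0-5 points",
  ].join("\n"),
};

// Build the Oracle evaluation prompt from a collected record.
export function buildEvaluationPrompt(record: { testId: string; prompt: string; response: string }): string {
  return [
    "Evaluate this Claude Code subagent response using the detailed criteria framework established for BMAD integration testing.",
    "",
    `TEST: ${record.testId}`,
    `ORIGINAL PROMPT: ${record.prompt}`,
    `RESPONSE: ${record.response}`,
    "",
    "EVALUATION FRAMEWORK:",
    criteriaByAgent[record.testId] ?? "[Insert specific 5-point criteria for the agent type]",
    "",
    "1. DETAILED SCORES: Rate each criterion 0-5 with justification",
    "2. OVERALL PERCENTAGE: Calculate weighted average (max 100%)",
    "3. STRENGTHS: What shows excellent subagent behavior?",
    "4. IMPROVEMENT AREAS: What needs enhancement?",
    "5. BMAD INTEGRATION LEVEL: none/basic/good/excellent",
    "6. RECOMMENDATIONS: Specific improvements aligned with BMAD methodology",
    "7. PASS/FAIL: Does this meet minimum subagent behavior threshold (70%)?",
  ].join("\n");
}
```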
### Step 4: Report Generation

#### Individual Test Reports

For each test, generate:

- Score breakdown by criteria
- Evidence of subagent behavior
- BMAD integration assessment
- Specific recommendations

#### Aggregate Analysis

- Overall pass rate across all agents (see the sketch below)
- BMAD integration maturity assessment
- Common strengths and improvement areas
- Integration readiness evaluation

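The per-test percentage and the aggregate pass rate follow directly from the 0-5 criterion scores. A minimal sketch, assuming the five criteria are weighted equally and using the 70% threshold from the Success Criteria section:

```typescript
// Scores returned by the o3 evaluation: one 0-5 value per criterion.
interface TestScores {
  testId: string;
  criteria: number[]; // e.g. [4, 3, 2, 5, 4]
}

// Equal-weight percentage: sum of scores / (5 points * number of criteria).
function overallPercentage(scores: TestScores): number {
  const max = scores.criteria.length * 5;
  const total = scores.criteria.reduce((sum, s) => sum + s, 0);
  return (total / max) * 100;
}

const PASS_THRESHOLD = 70;

// Aggregate pass rate across the whole suite.
function passRate(suite: TestScores[]): number {
  const passed = suite.filter((s) => overallPercentage(s) >= PASS_THRESHOLD).length;
  return (passed / suite.length) * 100;
}

// Example: [4, 3, 2, 5, 4] => 18/25 = 72%, which clears the 70% threshold.
console.log(overallPercentage({ testId: "analyst-market-research", criteria: [4, 3, 2, 5, 4] }));
```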
## Success Criteria

### Minimum Viable Integration (70% threshold)

- [ ] Agents demonstrate distinct personas
- [ ] Responses show appropriate domain expertise
- [ ] Basic BMAD methodology references
- [ ] Professional response structure
- [ ] Clear user engagement

### Excellent Integration (85%+ threshold)

- [ ] Deep BMAD artifact integration
- [ ] Quantitative analysis with data sources
- [ ] Hypothesis-driven approach
- [ ] Sophisticated domain expertise
- [ ] Seamless cross-agent collaboration

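Once a test's overall percentage is computed (see the Step 4 sketch), these thresholds can be applied mechanically. A small sketch, with tier names taken from the headings above:

```typescript
// Map an overall percentage to the integration tiers defined above.
function integrationTier(percentage: number): string {
  if (percentage >= 85) return "Excellent Integration";
  if (percentage >= 70) return "Minimum Viable Integration";
  return "Below threshold - apply the improvement process below";
}

console.log(integrationTier(72)); // "Minimum Viable Integration"
```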
## Continuous Improvement Process

1. **Run Full Test Suite** - Execute all eight tests (six core agent tests and two advanced integration tests)
2. **Oracle Evaluation** - Get detailed o3 analysis for each
3. **Identify Patterns** - Find common improvement areas
4. **Update Agent Prompts** - Enhance based on recommendations
5. **Rebuild and Retest** - Verify improvements
6. **Document Learnings** - Update integration best practices

## Automation Opportunities

Once the manual process is validated, consider:

- Automated response collection via the Claude API
- Batch o3 evaluation processing (see the sketch below)
- Regression testing on agent updates
- Performance benchmarking over time

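A minimal sketch of the batch o3 evaluation step, assuming collected records live in the hypothetical `test-results/` directory from Step 2; `evaluateWithOracle` is a placeholder rather than a real Oracle or o3 client API, and the module path for `buildEvaluationPrompt` (the Step 3 sketch) is illustrative:

```typescript
import { mkdirSync, readdirSync, readFileSync, writeFileSync } from "node:fs";
import { join } from "node:path";
import { buildEvaluationPrompt } from "./build-evaluation-prompt"; // Step 3 sketch (hypothetical module path)

// Placeholder for the actual Oracle/o3 invocation; swap in the real client call.
async function evaluateWithOracle(evaluationPrompt: string): Promise<string> {
  throw new Error("Wire this up to the Oracle tool or an o3 API client");
}

// Evaluate every collected record and write one report per test.
async function evaluateAll(resultsDir = "test-results", reportsDir = "test-reports"): Promise<void> {
  mkdirSync(reportsDir, { recursive: true });
  for (const file of readdirSync(resultsDir).filter((f) => f.endsWith(".json"))) {
    const record = JSON.parse(readFileSync(join(resultsDir, file), "utf8"));
    const report = await evaluateWithOracle(buildEvaluationPrompt(record));
    writeFileSync(join(reportsDir, file.replace(/\.json$/, ".report.md")), report);
  }
}

evaluateAll().catch(console.error);
```
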
This framework provides the sophisticated evaluation approach demonstrated by the Oracle's analysis while remaining practical for ongoing validation and improvement of the BMAD Claude integration.