# Complete End-to-End Testing Framework with o3 Judge Based on the Oracle's detailed evaluation, here's the comprehensive testing approach for validating the BMAD Claude integration. ## Testing Strategy Overview 1. **Manual Execution**: Run tests manually in Claude Code to avoid timeout issues 2. **Structured Collection**: Capture responses in standardized format 3. **o3 Evaluation**: Use Oracle tool for sophisticated analysis 4. **Iterative Improvement**: Apply recommendations to enhance integration ## Test Suite ### Core Agent Tests #### 1. Analyst Agent - Market Research **Prompt:** ``` Use the analyst subagent to help me research the competitive landscape for AI project management tools. ``` **Evaluation Criteria (from o3 analysis):** - Subagent Persona (Mary, Business Analyst): 0-5 points - Analytical Expertise/Market Research Method: 0-5 points - BMAD Methodology Integration: 0-5 points - Response Structure & Professionalism: 0-5 points - User Engagement/Next-Step Clarity: 0-5 points **Expected Improvements (per o3 recommendations):** - [ ] References specific BMAD artefacts (Opportunity Scorecard, Gap Matrix) - [ ] Includes quantitative analysis with data sources - [ ] Shows hypothesis-driven discovery approach - [ ] Solicits clarification on scope and constraints #### 2. Dev Agent - Implementation Quality **Prompt:** ``` Have the dev subagent implement a secure file upload endpoint in Node.js with validation, virus scanning, and rate limiting. ``` **Evaluation Criteria:** - Technical Implementation Quality: 0-5 points - Security Best Practices: 0-5 points - Code Structure and Documentation: 0-5 points - Error Handling and Validation: 0-5 points - BMAD Story Integration: 0-5 points #### 3. Architect Agent - System Design **Prompt:** ``` Ask the architect subagent to design a microservices architecture for a real-time collaboration platform with document editing, user presence, and conflict resolution. ``` **Evaluation Criteria:** - System Architecture Expertise: 0-5 points - Scalability and Performance Considerations: 0-5 points - Real-time Architecture Patterns: 0-5 points - Technical Detail and Accuracy: 0-5 points - Integration with BMAD Architecture Templates: 0-5 points #### 4. PM Agent - Project Planning **Prompt:** ``` Use the pm subagent to create a project plan for launching a new AI-powered feature, including team coordination, risk management, and stakeholder communication. ``` **Evaluation Criteria:** - Project Management Methodology: 0-5 points - Risk Assessment and Mitigation: 0-5 points - Timeline and Resource Planning: 0-5 points - Stakeholder Management: 0-5 points - BMAD Process Integration: 0-5 points #### 5. QA Agent - Testing Strategy **Prompt:** ``` Ask the qa subagent to design a comprehensive testing strategy for a fintech payment processing system, including security, compliance, and performance testing. ``` **Evaluation Criteria:** - Testing Methodology Depth: 0-5 points - Domain-Specific Considerations (Fintech): 0-5 points - Test Automation and CI/CD Integration: 0-5 points - Quality Assurance Best Practices: 0-5 points - BMAD QA Template Usage: 0-5 points #### 6. Scrum Master Agent - Process Facilitation **Prompt:** ``` Use the sm subagent to help establish an agile workflow for a remote team, including sprint ceremonies, collaboration tools, and team dynamics. ``` **Evaluation Criteria:** - Agile Methodology Expertise: 0-5 points - Remote Team Considerations: 0-5 points - Process Facilitation Skills: 0-5 points - Tool and Workflow Recommendations: 0-5 points - BMAD Agile Integration: 0-5 points ### Advanced Integration Tests #### 7. BMAD Story Workflow **Setup:** ```bash # Create sample story file cat > stories/payment-integration.story.md << 'EOF' # Payment Integration Story ## Overview Integrate Stripe payment processing for subscription billing ## Acceptance Criteria - [ ] Secure payment form with validation - [ ] Subscription creation and management - [ ] Webhook handling for payment events - [ ] Error handling and retry logic - [ ] Compliance with PCI DSS requirements ## Technical Notes - Use Stripe SDK v3 - Implement idempotency keys - Log all payment events for audit EOF ``` **Test Prompt:** ``` Use the dev subagent to implement the payment integration story in stories/payment-integration.story.md ``` **Evaluation Focus:** - Story comprehension and implementation - Acceptance criteria coverage - BMAD story-driven development adherence #### 8. Cross-Agent Collaboration **Test Sequence:** ``` 1. "Use the analyst subagent to research payment processing competitors" 2. "Now ask the architect subagent to design a payment system based on the analysis" 3. "Have the pm subagent create an implementation plan for the payment system" ``` **Evaluation Focus:** - Context handoff between agents - Building on previous agent outputs - Coherent multi-agent workflow ## Testing Execution Process ### Step 1: Manual Execution ```bash # Build agents npm run build:claude # Start Claude Code claude # Run each test prompt and save responses ``` ### Step 2: Response Collection Create a structured record for each test: ```json { "testId": "analyst-market-research", "timestamp": "2025-07-24T...", "prompt": "Use the analyst subagent...", "response": "Hello! I'm Mary...", "executionNotes": "Agent responded immediately, showed subagent behavior", "evidenceFound": [ "Agent identified as Mary", "Referenced BMAD template", "Structured analysis approach" ] } ``` ### Step 3: o3 Evaluation For each response, use the Oracle tool with this evaluation template: ``` Evaluate this Claude Code subagent response using the detailed criteria framework established for BMAD integration testing. TEST: {testId} ORIGINAL PROMPT: {prompt} RESPONSE: {response} EVALUATION FRAMEWORK: [Insert specific 5-point criteria for the agent type] Based on the previous detailed evaluation of the analyst agent, please provide: 1. DETAILED SCORES: Rate each criterion 0-5 with justification 2. OVERALL PERCENTAGE: Calculate weighted average (max 100%) 3. STRENGTHS: What shows excellent subagent behavior? 4. IMPROVEMENT AREAS: What needs enhancement? 5. BMAD INTEGRATION LEVEL: none/basic/good/excellent 6. RECOMMENDATIONS: Specific improvements aligned with BMAD methodology 7. PASS/FAIL: Does this meet minimum subagent behavior threshold (70%)? Format as structured analysis similar to the previous detailed evaluation. ``` ### Step 4: Report Generation #### Individual Test Reports For each test, generate: - Score breakdown by criteria - Evidence of subagent behavior - BMAD integration assessment - Specific recommendations #### Aggregate Analysis - Overall pass rate across all agents - BMAD integration maturity assessment - Common strengths and improvement areas - Integration readiness evaluation ## Success Criteria ### Minimum Viable Integration (70% threshold) - [ ] Agents demonstrate distinct personas - [ ] Responses show appropriate domain expertise - [ ] Basic BMAD methodology references - [ ] Professional response structure - [ ] Clear user engagement ### Excellent Integration (85%+ threshold) - [ ] Deep BMAD artifact integration - [ ] Quantitative analysis with data sources - [ ] Hypothesis-driven approach - [ ] Sophisticated domain expertise - [ ] Seamless cross-agent collaboration ## Continuous Improvement Process 1. **Run Full Test Suite** - Execute all 8 core tests 2. **Oracle Evaluation** - Get detailed o3 analysis for each 3. **Identify Patterns** - Find common improvement areas 4. **Update Agent Prompts** - Enhance based on recommendations 5. **Rebuild and Retest** - Verify improvements 6. **Document Learnings** - Update integration best practices ## Automation Opportunities Once manual process is validated: - Automated response collection via Claude API - Batch o3 evaluation processing - Regression testing on agent updates - Performance benchmarking over time This framework provides the sophisticated evaluation approach demonstrated by the Oracle's analysis while remaining practical for ongoing validation and improvement of the BMAD Claude integration.