Complete End-to-End Testing Framework with o3 Judge

Based on the Oracle's detailed evaluation, here's the comprehensive testing approach for validating the BMAD Claude integration.

Testing Strategy Overview

  1. Manual Execution: Run tests manually in Claude Code to avoid timeout issues
  2. Structured Collection: Capture responses in standardized format
  3. o3 Evaluation: Use Oracle tool for sophisticated analysis
  4. Iterative Improvement: Apply recommendations to enhance integration

Test Suite

Core Agent Tests

1. Analyst Agent - Market Research

Prompt:

Use the analyst subagent to help me research the competitive landscape for AI project management tools.

Evaluation Criteria (from o3 analysis):

  • Subagent Persona (Mary, Business Analyst): 0-5 points
  • Analytical Expertise/Market Research Method: 0-5 points
  • BMAD Methodology Integration: 0-5 points
  • Response Structure & Professionalism: 0-5 points
  • User Engagement/Next-Step Clarity: 0-5 points

Expected Improvements (per o3 recommendations):

  • References specific BMAD artifacts (Opportunity Scorecard, Gap Matrix)
  • Includes quantitative analysis with data sources
  • Shows hypothesis-driven discovery approach
  • Solicits clarification on scope and constraints

2. Dev Agent - Implementation Quality

Prompt:

Have the dev subagent implement a secure file upload endpoint in Node.js with validation, virus scanning, and rate limiting.

Evaluation Criteria:

  • Technical Implementation Quality: 0-5 points
  • Security Best Practices: 0-5 points
  • Code Structure and Documentation: 0-5 points
  • Error Handling and Validation: 0-5 points
  • BMAD Story Integration: 0-5 points

3. Architect Agent - System Design

Prompt:

Ask the architect subagent to design a microservices architecture for a real-time collaboration platform with document editing, user presence, and conflict resolution.

Evaluation Criteria:

  • System Architecture Expertise: 0-5 points
  • Scalability and Performance Considerations: 0-5 points
  • Real-time Architecture Patterns: 0-5 points
  • Technical Detail and Accuracy: 0-5 points
  • Integration with BMAD Architecture Templates: 0-5 points

4. PM Agent - Project Planning

Prompt:

Use the pm subagent to create a project plan for launching a new AI-powered feature, including team coordination, risk management, and stakeholder communication.

Evaluation Criteria:

  • Project Management Methodology: 0-5 points
  • Risk Assessment and Mitigation: 0-5 points
  • Timeline and Resource Planning: 0-5 points
  • Stakeholder Management: 0-5 points
  • BMAD Process Integration: 0-5 points

5. QA Agent - Testing Strategy

Prompt:

Ask the qa subagent to design a comprehensive testing strategy for a fintech payment processing system, including security, compliance, and performance testing.

Evaluation Criteria:

  • Testing Methodology Depth: 0-5 points
  • Domain-Specific Considerations (Fintech): 0-5 points
  • Test Automation and CI/CD Integration: 0-5 points
  • Quality Assurance Best Practices: 0-5 points
  • BMAD QA Template Usage: 0-5 points

6. Scrum Master Agent - Process Facilitation

Prompt:

Use the sm subagent to help establish an agile workflow for a remote team, including sprint ceremonies, collaboration tools, and team dynamics.

Evaluation Criteria:

  • Agile Methodology Expertise: 0-5 points
  • Remote Team Considerations: 0-5 points
  • Process Facilitation Skills: 0-5 points
  • Tool and Workflow Recommendations: 0-5 points
  • BMAD Agile Integration: 0-5 points
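
The six rubrics above share one shape: five named criteria, each scored 0-5. As a rough sketch, they could be held as data so that scores returned by the judge can be checked before they are recorded. The `RUBRICS` dictionary and `validate_scores` function are hypothetical names for illustration, not part of BMAD, and only two agents are filled in.

```python
# Hypothetical representation of the rubrics listed above; not part of BMAD.
RUBRICS = {
    "analyst": [
        "Subagent Persona (Mary, Business Analyst)",
        "Analytical Expertise/Market Research Method",
        "BMAD Methodology Integration",
        "Response Structure & Professionalism",
        "User Engagement/Next-Step Clarity",
    ],
    "dev": [
        "Technical Implementation Quality",
        "Security Best Practices",
        "Code Structure and Documentation",
        "Error Handling and Validation",
        "BMAD Story Integration",
    ],
}

MAX_POINTS = 5  # each criterion is scored 0-5


def validate_scores(agent: str, scores: dict[str, int]) -> list[str]:
    """Return a list of problems; an empty list means the scores fit the rubric."""
    problems = []
    expected = RUBRICS[agent]
    for criterion in expected:
        if criterion not in scores:
            problems.append(f"missing score: {criterion}")
        elif not 0 <= scores[criterion] <= MAX_POINTS:
            problems.append(f"out of range (0-{MAX_POINTS}): {criterion}")
    for criterion in scores:
        if criterion not in expected:
            problems.append(f"unknown criterion: {criterion}")
    return problems
```

Running this check before scores are stored helps catch a judge response that skips a criterion or invents a new one.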

Advanced Integration Tests

7. BMAD Story Workflow

Setup:

# Create sample story file (create the stories/ directory first if it is missing)
mkdir -p stories
cat > stories/payment-integration.story.md << 'EOF'
# Payment Integration Story

## Overview
Integrate Stripe payment processing for subscription billing

## Acceptance Criteria
- [ ] Secure payment form with validation
- [ ] Subscription creation and management
- [ ] Webhook handling for payment events
- [ ] Error handling and retry logic
- [ ] Compliance with PCI DSS requirements

## Technical Notes
- Use Stripe SDK v3
- Implement idempotency keys
- Log all payment events for audit
EOF

Test Prompt:

Use the dev subagent to implement the payment integration story in stories/payment-integration.story.md

Evaluation Focus:

  • Story comprehension and implementation
  • Acceptance criteria coverage
  • BMAD story-driven development adherence

8. Cross-Agent Collaboration

Test Sequence:

1. "Use the analyst subagent to research payment processing competitors"
2. "Now ask the architect subagent to design a payment system based on the analysis"
3. "Have the pm subagent create an implementation plan for the payment system"

Evaluation Focus:

  • Context handoff between agents
  • Building on previous agent outputs
  • Coherent multi-agent workflow

Testing Execution Process

Step 1: Manual Execution

# Build agents
npm run build:claude

# Start Claude Code
claude

# Run each test prompt and save responses

Step 2: Response Collection

Create a structured record for each test:

{
  "testId": "analyst-market-research",
  "timestamp": "2025-07-24T...",
  "prompt": "Use the analyst subagent...",
  "response": "Hello! I'm Mary...",
  "executionNotes": "Agent responded immediately, showed subagent behavior",
  "evidenceFound": [
    "Agent identified as Mary",
    "Referenced BMAD template",
    "Structured analysis approach"
  ]
}
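
A small helper can keep hand-collected records consistent with the structure above. The following is a minimal sketch only: the `make_record` and `save_record` names and the `test-results` output directory are assumptions for illustration, not part of the BMAD tooling.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Field names taken from the example record above.
REQUIRED_FIELDS = ("testId", "timestamp", "prompt", "response",
                   "executionNotes", "evidenceFound")


def make_record(test_id, prompt, response, notes="", evidence=None):
    """Build a record with the same fields as the example above."""
    return {
        "testId": test_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "prompt": prompt,
        "response": response,
        "executionNotes": notes,
        "evidenceFound": list(evidence or []),
    }


def save_record(record, out_dir="test-results"):
    """Write one record to <out_dir>/<testId>.json and return the path."""
    missing = [f for f in REQUIRED_FIELDS if f not in record]
    if missing:
        raise ValueError(f"record is missing fields: {missing}")
    out = Path(out_dir)  # hypothetical output location
    out.mkdir(parents=True, exist_ok=True)
    path = out / f"{record['testId']}.json"
    path.write_text(json.dumps(record, indent=2), encoding="utf-8")
    return path
```

Writing one file per `testId` keeps each response easy to paste into the evaluation step and to compare between runs.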

Step 3: o3 Evaluation

For each response, use the Oracle tool with this evaluation template:

Evaluate this Claude Code subagent response using the detailed criteria framework established for BMAD integration testing.

TEST: {testId}
ORIGINAL PROMPT: {prompt}
RESPONSE: {response}

EVALUATION FRAMEWORK:
[Insert specific 5-point criteria for the agent type]

Based on the previous detailed evaluation of the analyst agent, please provide:

1. DETAILED SCORES: Rate each criterion 0-5 with justification
2. OVERALL PERCENTAGE: Calculate weighted average (max 100%)
3. STRENGTHS: What shows excellent subagent behavior?
4. IMPROVEMENT AREAS: What needs enhancement?
5. BMAD INTEGRATION LEVEL: none/basic/good/excellent
6. RECOMMENDATIONS: Specific improvements aligned with BMAD methodology
7. PASS/FAIL: Does this meet minimum subagent behavior threshold (70%)?

Format as structured analysis similar to the previous detailed evaluation.
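
Filling the template by hand for every test invites copy-paste mistakes, so the placeholders can be substituted programmatically. This sketch assumes a shortened copy of the template: it keeps the header and placeholders, omits the numbered request list for brevity, and adds a `{criteria}` placeholder in place of the bracketed "Insert specific 5-point criteria" line. The function name is hypothetical.

```python
# Shortened copy of the evaluation template; {criteria} is an added placeholder.
EVALUATION_TEMPLATE = """\
Evaluate this Claude Code subagent response using the detailed criteria \
framework established for BMAD integration testing.

TEST: {testId}
ORIGINAL PROMPT: {prompt}
RESPONSE: {response}

EVALUATION FRAMEWORK:
{criteria}
"""


def build_evaluation_prompt(record: dict, criteria: list[str]) -> str:
    """Fill the template from a collected record and a list of criterion names."""
    criteria_block = "\n".join(f"- {name}: 0-5 points" for name in criteria)
    # str.format only interprets braces in the template itself, so braces
    # inside the response text (for example, code) pass through unchanged.
    return EVALUATION_TEMPLATE.format(
        testId=record["testId"],
        prompt=record["prompt"],
        response=record["response"],
        criteria=criteria_block,
    )
```

The resulting string is what would be handed to the Oracle tool; how that tool is invoked is outside this sketch.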

Step 4: Report Generation

Individual Test Reports

For each test, generate:

  • Score breakdown by criteria
  • Evidence of subagent behavior
  • BMAD integration assessment
  • Specific recommendations

Aggregate Analysis

  • Overall pass rate across all agents
  • BMAD integration maturity assessment
  • Common strengths and improvement areas
  • Integration readiness evaluation
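
The aggregate figures can be derived mechanically once each test has an overall percentage and a BMAD integration level. A minimal sketch follows, assuming each per-test result is a dictionary with hypothetical `testId`, `percentage`, and `bmadLevel` keys; the 70% pass mark comes from the evaluation template.

```python
from collections import Counter

PASS_THRESHOLD = 70.0  # minimum subagent behavior threshold from the template


def aggregate(results: list[dict]) -> dict:
    """Summarize per-test results.

    Each result is assumed to look like:
    {"testId": str, "percentage": float, "bmadLevel": str}
    """
    if not results:
        return {"passRate": 0.0, "failed": [], "bmadLevels": {}}
    failed = [r["testId"] for r in results if r["percentage"] < PASS_THRESHOLD]
    pass_rate = 100.0 * (len(results) - len(failed)) / len(results)
    # Count how many tests landed at each level (none/basic/good/excellent).
    levels = Counter(r["bmadLevel"] for r in results)
    return {"passRate": pass_rate, "failed": failed, "bmadLevels": dict(levels)}
```

The list of failed test IDs points directly at the agents whose prompts need attention first.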

Success Criteria

Minimum Viable Integration (70% threshold)

  • Agents demonstrate distinct personas
  • Responses show appropriate domain expertise
  • Basic BMAD methodology references
  • Professional response structure
  • Clear user engagement

Excellent Integration (85%+ threshold)

  • Deep BMAD artifact integration
  • Quantitative analysis with data sources
  • Hypothesis-driven approach
  • Sophisticated domain expertise
  • Seamless cross-agent collaboration
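
To show how the two thresholds relate to the 0-5 criterion scores, here is a small sketch that converts a set of scores into a percentage and a tier. It assumes all criteria carry equal weight, since this document does not define weights, and the function names and the "below threshold" label are illustrative.

```python
MAX_POINTS_PER_CRITERION = 5
MINIMUM_VIABLE = 70.0  # Minimum Viable Integration threshold
EXCELLENT = 85.0       # Excellent Integration threshold


def overall_percentage(scores: list[int]) -> float:
    """Unweighted percentage: total points over the maximum possible."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    return 100.0 * sum(scores) / (MAX_POINTS_PER_CRITERION * len(scores))


def integration_tier(percentage: float) -> str:
    """Map a percentage onto the tiers described above."""
    if percentage >= EXCELLENT:
        return "excellent"
    if percentage >= MINIMUM_VIABLE:
        return "minimum viable"
    return "below threshold"
```

With five criteria, scores of 4, 4, 3, 4, 3 total 18 of 25 points (72%), which clears the minimum threshold; reaching the excellent tier requires at least 22 of 25 points (88%).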

Continuous Improvement Process

  1. Run Full Test Suite - Execute all 8 tests (6 core agent tests plus 2 advanced integration tests)
  2. Oracle Evaluation - Get detailed o3 analysis for each
  3. Identify Patterns - Find common improvement areas
  4. Update Agent Prompts - Enhance based on recommendations
  5. Rebuild and Retest - Verify improvements
  6. Document Learnings - Update integration best practices

Automation Opportunities

Once the manual process is validated:

  • Automated response collection via Claude API
  • Batch o3 evaluation processing
  • Regression testing on agent updates
  • Performance benchmarking over time
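
Of these, regression testing is the simplest to prototype, because it only needs the overall percentages from two runs. The sketch below is one possible shape, not an existing BMAD feature: the `find_regressions` name and the 5-point default tolerance are assumptions.

```python
def find_regressions(baseline: dict[str, float], current: dict[str, float],
                     tolerance: float = 5.0) -> dict[str, float]:
    """Return tests whose score dropped by more than `tolerance` points.

    Both arguments map testId -> overall percentage. Tests missing from
    `current` are ignored here; a real harness might treat them as failures.
    """
    drops = {}
    for test_id, old in baseline.items():
        new = current.get(test_id)
        if new is not None and old - new > tolerance:
            drops[test_id] = old - new
    return drops
```

A tolerance is useful because judge scores can vary slightly between runs even when the agent prompt has not changed.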

This framework keeps the depth of the Oracle's analysis while remaining practical for ongoing validation and improvement of the BMAD Claude integration.