Complete End-to-End Testing Framework with o3 Judge

Based on the Oracle's detailed evaluation, here's the comprehensive testing approach for validating the BMAD Claude integration.

Testing Strategy Overview

Manual Execution: Run tests manually in Claude Code to avoid timeout issues
Structured Collection: Capture responses in a standardized format
o3 Evaluation: Use the Oracle tool (o3) to score each response against its criteria
Iterative Improvement: Apply recommendations to enhance integration

Test Suite

Core Agent Tests

1. Analyst Agent - Market Research

Prompt:

Use the analyst subagent to help me research the competitive landscape for AI project management tools.

Evaluation Criteria (from o3 analysis):

Subagent Persona (Mary, Business Analyst): 0-5 points
Analytical Expertise/Market Research Method: 0-5 points
BMAD Methodology Integration: 0-5 points
Response Structure & Professionalism: 0-5 points
User Engagement/Next-Step Clarity: 0-5 points

Expected Improvements (per o3 recommendations):

References specific BMAD artifacts (Opportunity Scorecard, Gap Matrix)
Includes quantitative analysis with data sources
Shows hypothesis-driven discovery approach
Solicits clarification on scope and constraints

2. Dev Agent - Implementation Quality

Prompt:

Have the dev subagent implement a secure file upload endpoint in Node.js with validation, virus scanning, and rate limiting.

Evaluation Criteria:

Technical Implementation Quality: 0-5 points
Security Best Practices: 0-5 points
Code Structure and Documentation: 0-5 points
Error Handling and Validation: 0-5 points
BMAD Story Integration: 0-5 points

3. Architect Agent - System Design

Prompt:

Ask the architect subagent to design a microservices architecture for a real-time collaboration platform with document editing, user presence, and conflict resolution.

Evaluation Criteria:

System Architecture Expertise: 0-5 points
Scalability and Performance Considerations: 0-5 points
Real-time Architecture Patterns: 0-5 points
Technical Detail and Accuracy: 0-5 points
Integration with BMAD Architecture Templates: 0-5 points

4. PM Agent - Project Planning

Prompt:

Use the pm subagent to create a project plan for launching a new AI-powered feature, including team coordination, risk management, and stakeholder communication.

Evaluation Criteria:

Project Management Methodology: 0-5 points
Risk Assessment and Mitigation: 0-5 points
Timeline and Resource Planning: 0-5 points
Stakeholder Management: 0-5 points
BMAD Process Integration: 0-5 points

5. QA Agent - Testing Strategy

Prompt:

Ask the qa subagent to design a comprehensive testing strategy for a fintech payment processing system, including security, compliance, and performance testing.

Evaluation Criteria:

Testing Methodology Depth: 0-5 points
Domain-Specific Considerations (Fintech): 0-5 points
Test Automation and CI/CD Integration: 0-5 points
Quality Assurance Best Practices: 0-5 points
BMAD QA Template Usage: 0-5 points

6. Scrum Master Agent - Process Facilitation

Prompt:

Use the sm subagent to help establish an agile workflow for a remote team, including sprint ceremonies, collaboration tools, and team dynamics.

Evaluation Criteria:

Agile Methodology Expertise: 0-5 points
Remote Team Considerations: 0-5 points
Process Facilitation Skills: 0-5 points
Tool and Workflow Recommendations: 0-5 points
BMAD Agile Integration: 0-5 points
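
The six rubrics above can also be kept as data, so the same criteria feed both the evaluation prompt and the score calculation. The following is a minimal sketch, not part of the BMAD tooling: `CORE_RUBRICS` and `max_score` are illustrative names, and every test ID except `analyst-market-research` is a hypothetical identifier chosen for this example.

```python
# Criteria transcribed from the six core tests above.
# Test IDs other than "analyst-market-research" are hypothetical.
CORE_RUBRICS: dict[str, list[str]] = {
    "analyst-market-research": [
        "Subagent Persona (Mary, Business Analyst)",
        "Analytical Expertise/Market Research Method",
        "BMAD Methodology Integration",
        "Response Structure & Professionalism",
        "User Engagement/Next-Step Clarity",
    ],
    "dev-implementation-quality": [
        "Technical Implementation Quality",
        "Security Best Practices",
        "Code Structure and Documentation",
        "Error Handling and Validation",
        "BMAD Story Integration",
    ],
    "architect-system-design": [
        "System Architecture Expertise",
        "Scalability and Performance Considerations",
        "Real-time Architecture Patterns",
        "Technical Detail and Accuracy",
        "Integration with BMAD Architecture Templates",
    ],
    "pm-project-planning": [
        "Project Management Methodology",
        "Risk Assessment and Mitigation",
        "Timeline and Resource Planning",
        "Stakeholder Management",
        "BMAD Process Integration",
    ],
    "qa-testing-strategy": [
        "Testing Methodology Depth",
        "Domain-Specific Considerations (Fintech)",
        "Test Automation and CI/CD Integration",
        "Quality Assurance Best Practices",
        "BMAD QA Template Usage",
    ],
    "sm-process-facilitation": [
        "Agile Methodology Expertise",
        "Remote Team Considerations",
        "Process Facilitation Skills",
        "Tool and Workflow Recommendations",
        "BMAD Agile Integration",
    ],
}

MAX_POINTS_PER_CRITERION = 5  # each criterion is scored 0-5


def max_score(test_id: str) -> int:
    """Highest total a response can earn for the given test."""
    return MAX_POINTS_PER_CRITERION * len(CORE_RUBRICS[test_id])
```

With five criteria per agent, `max_score` is 25 for every core test, which is the denominator for the overall percentage.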

Advanced Integration Tests

7. BMAD Story Workflow

Setup:

# Create sample story file (the stories/ directory must exist first)
mkdir -p stories
cat > stories/payment-integration.story.md << 'EOF'
# Payment Integration Story

## Overview
Integrate Stripe payment processing for subscription billing

## Acceptance Criteria
- [ ] Secure payment form with validation
- [ ] Subscription creation and management
- [ ] Webhook handling for payment events
- [ ] Error handling and retry logic
- [ ] Compliance with PCI DSS requirements

## Technical Notes
- Use Stripe SDK v3
- Implement idempotency keys
- Log all payment events for audit
EOF

Test Prompt:

Use the dev subagent to implement the payment integration story in stories/payment-integration.story.md

Evaluation Focus:

Story comprehension and implementation
Acceptance criteria coverage
BMAD story-driven development adherence

8. Cross-Agent Collaboration

Test Sequence:

1. "Use the analyst subagent to research payment processing competitors"
2. "Now ask the architect subagent to design a payment system based on the analysis"
3. "Have the pm subagent create an implementation plan for the payment system"

Evaluation Focus:

Context handoff between agents
Building on previous agent outputs
Coherent multi-agent workflow

Testing Execution Process

Step 1: Manual Execution

# Build agents
npm run build:claude

# Start Claude Code
claude

# Run each test prompt and save responses

Step 2: Response Collection

Create a structured record for each test:

{
  "testId": "analyst-market-research",
  "timestamp": "2025-07-24T...",
  "prompt": "Use the analyst subagent...",
  "response": "Hello! I'm Mary...",
  "executionNotes": "Agent responded immediately, showed subagent behavior",
  "evidenceFound": [
    "Agent identified as Mary",
    "Referenced BMAD template",
    "Structured analysis approach"
  ]
}
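
To keep manually captured records consistent, they can be written by a small helper rather than by hand. This is a sketch only, assuming one JSON file per test; `TestRecord`, `save_record`, and the output directory layout are hypothetical names, not part of Claude Code or BMAD.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from pathlib import Path


@dataclass
class TestRecord:
    """One captured response, mirroring the fields of the JSON record above."""

    testId: str
    prompt: str
    response: str
    executionNotes: str = ""
    evidenceFound: list[str] = field(default_factory=list)
    # ISO 8601 UTC timestamp, filled in when the record is created.
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


def save_record(record: TestRecord, out_dir: Path) -> Path:
    """Write the record to <out_dir>/<testId>.json and return the file path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    path = out_dir / f"{record.testId}.json"
    path.write_text(json.dumps(asdict(record), indent=2), encoding="utf-8")
    return path
```

A call such as `save_record(TestRecord(testId="analyst-market-research", prompt=..., response=...), Path("test-results"))` would produce one file per test that later steps can load.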

Step 3: o3 Evaluation

For each response, use the Oracle tool with this evaluation template:

Evaluate this Claude Code subagent response using the detailed criteria framework established for BMAD integration testing.

TEST: {testId}
ORIGINAL PROMPT: {prompt}
RESPONSE: {response}

EVALUATION FRAMEWORK:
[Insert specific 5-point criteria for the agent type]

Based on the previous detailed evaluation of the analyst agent, please provide:

1. DETAILED SCORES: Rate each criterion 0-5 with justification
2. OVERALL PERCENTAGE: Sum the criterion scores and divide by the maximum possible (equal weights unless stated otherwise; max 100%)
3. STRENGTHS: What shows excellent subagent behavior?
4. IMPROVEMENT AREAS: What needs enhancement?
5. BMAD INTEGRATION LEVEL: none/basic/good/excellent
6. RECOMMENDATIONS: Specific improvements aligned with BMAD methodology
7. PASS/FAIL: Does this meet minimum subagent behavior threshold (70%)?

Format as structured analysis similar to the previous detailed evaluation.
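
Filling the template by hand for every test is error-prone, so the substitution can be scripted. The sketch below is an assumption about how that might look: `EVAL_TEMPLATE` reproduces only the opening part of the template above (the numbered output requests would be appended unchanged), and `build_eval_prompt` is a hypothetical helper. It uses `string.Template` so that braces in a captured response are left untouched.

```python
from string import Template

# Opening portion of the evaluation template above; $-placeholders stand in
# for the {testId}, {prompt}, and {response} slots.
EVAL_TEMPLATE = Template(
    "Evaluate this Claude Code subagent response using the detailed criteria "
    "framework established for BMAD integration testing.\n\n"
    "TEST: $testId\n"
    "ORIGINAL PROMPT: $prompt\n"
    "RESPONSE: $response\n\n"
    "EVALUATION FRAMEWORK:\n"
    "$criteria\n"
)


def build_eval_prompt(record: dict, criteria: list[str]) -> str:
    """Render the evaluation prompt for one captured record."""
    criteria_block = "\n".join(f"- {name}: 0-5 points" for name in criteria)
    return EVAL_TEMPLATE.substitute(
        testId=record["testId"],
        prompt=record["prompt"],
        response=record["response"],
        criteria=criteria_block,
    )
```

The rendered string is what would be pasted into, or sent to, the Oracle tool for a given test.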

Step 4: Report Generation

Individual Test Reports

For each test, generate:

Score breakdown by criteria
Evidence of subagent behavior
BMAD integration assessment
Specific recommendations

Aggregate Analysis

Overall pass rate across all agents
BMAD integration maturity assessment
Common strengths and improvement areas
Integration readiness evaluation
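
The aggregate figures can be computed directly from the per-test evaluations. The following sketch assumes each evaluation has been reduced to a dict with `testId`, `percentage`, and an optional `improvementAreas` list; those field names and the `aggregate` function are illustrative, and "common" is taken here to mean an area reported by at least two tests.

```python
from collections import Counter


def aggregate(results: list[dict], threshold: float = 70.0) -> dict:
    """Summarize per-test evaluations into pass rate and recurring issues."""
    failed = [r["testId"] for r in results if r["percentage"] < threshold]
    # Count how many tests mention each improvement area.
    areas = Counter(
        area for r in results for area in r.get("improvementAreas", [])
    )
    pass_rate = (
        100.0 * (len(results) - len(failed)) / len(results) if results else 0.0
    )
    return {
        "passRate": pass_rate,
        "failedTests": failed,
        "commonImprovementAreas": [a for a, n in areas.most_common() if n >= 2],
    }
```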

Success Criteria

Minimum Viable Integration (70% threshold)

Agents demonstrate distinct personas
Responses show appropriate domain expertise
Basic BMAD methodology references
Professional response structure
Clear user engagement

Excellent Integration (85%+ threshold)

Deep BMAD artifact integration
Quantitative analysis with data sources
Hypothesis-driven approach
Sophisticated domain expertise
Seamless cross-agent collaboration
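
The two thresholds translate into a simple calculation. This sketch assumes equal weighting of the criteria, since no weights are defined in this document; `overall_percentage` and `integration_level` are hypothetical helper names.

```python
def overall_percentage(scores: dict[str, int]) -> float:
    """Equal-weight percentage: sum of 0-5 scores over the maximum possible."""
    if not scores:
        raise ValueError("at least one criterion score is required")
    for name, value in scores.items():
        if not 0 <= value <= 5:
            raise ValueError(f"{name}: score {value} is outside the 0-5 range")
    return 100.0 * sum(scores.values()) / (5 * len(scores))


def integration_level(percentage: float) -> str:
    """Map a percentage onto the two thresholds defined above."""
    if percentage >= 85:
        return "excellent"
    if percentage >= 70:
        return "minimum viable"
    return "below threshold"
```

For example, scores of 4, 4, 3, 4, and 3 total 18 of 25 points, or 72%, which clears the 70% minimum but not the 85% bar.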

Continuous Improvement Process

Run Full Test Suite - Execute all 8 tests (6 core agent tests and 2 advanced integration tests)
Oracle Evaluation - Get detailed o3 analysis for each
Identify Patterns - Find common improvement areas
Update Agent Prompts - Enhance based on recommendations
Rebuild and Retest - Verify improvements
Document Learnings - Update integration best practices

Automation Opportunities

Once the manual process is validated:

Automated response collection via Claude API
Batch o3 evaluation processing
Regression testing on agent updates
Performance benchmarking over time
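
As one illustration of regression testing on agent updates, scores from a new run can be compared against a stored baseline. This is a sketch under assumptions: `find_regressions` is a hypothetical helper, and the 5-point tolerance is an arbitrary allowance for run-to-run variation in judge scores, not a value taken from this document.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 5.0,
) -> dict[str, float]:
    """Return testId -> drop in percentage points for tests that regressed.

    A test counts as regressed when its score fell by more than `tolerance`
    points. Tests missing from the current run are skipped.
    """
    regressions: dict[str, float] = {}
    for test_id, old_score in baseline.items():
        new_score = current.get(test_id)
        if new_score is not None and old_score - new_score > tolerance:
            regressions[test_id] = old_score - new_score
    return regressions
```

An empty result means no test dropped by more than the tolerance since the baseline was recorded.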

This framework applies the criteria-based evaluation from the Oracle's analysis to every agent while remaining practical for ongoing validation and improvement of the BMAD Claude integration.
