| description | argument-hint | allowed-tools |
|---|---|---|
| Orchestrate test failure analysis and coordinate parallel specialist test fixers with strategic analysis mode | [test_scope] [--run-first] [--coverage] [--fast] [--strategic] [--research] [--force-escalate] [--no-chain] [--api-only] [--database-only] [--vitest-only] [--pytest-only] [--playwright-only] [--only-category=<unit|integration|e2e|acceptance>] | |
Test Orchestration Command (v2.0)
Execute this test orchestration procedure for: "$ARGUMENTS"
ORCHESTRATOR GUARD RAILS
PROHIBITED (NEVER do directly):
- Direct edits to test files
- Direct edits to source files
- pytest --fix or similar
- git add / git commit
- pip install / uv add
- Modifying test configuration
ALLOWED (delegation only):
- Task(subagent_type="unit-test-fixer", ...)
- Task(subagent_type="api-test-fixer", ...)
- Task(subagent_type="database-test-fixer", ...)
- Task(subagent_type="e2e-test-fixer", ...)
- Task(subagent_type="type-error-fixer", ...)
- Task(subagent_type="import-error-fixer", ...)
- Read-only bash commands for analysis
- Grep/Glob/Read for investigation
WHY: Ensures expert handling by specialists, prevents conflicts, maintains audit trail.
STEP 0: MODE DETECTION + AUTO-ESCALATION + DEPTH PROTECTION
0a. Depth Protection (prevent infinite loops)
echo "SLASH_DEPTH=${SLASH_DEPTH:-0}"
If SLASH_DEPTH >= 3:
- Report: "Maximum orchestration depth (3) reached. Exiting to prevent loop."
- EXIT immediately
Otherwise, set for any chained commands:
export SLASH_DEPTH=$((${SLASH_DEPTH:-0} + 1))
0b. Parse Strategic Flags
Check "$ARGUMENTS" for strategic triggers:
- `--strategic` = Force strategic mode
- `--research` = Research best practices only (no fixes)
- `--force-escalate` = Force strategic mode regardless of history
If ANY strategic flag present → Set STRATEGIC_MODE=true
0c. Auto-Escalation Detection
Check git history for recurring test fix attempts:
TEST_FIX_COUNT=$(git log --oneline -20 | grep -iE "fix.*(test|spec|jest|pytest|vitest)" | wc -l | tr -d ' ')
echo "TEST_FIX_COUNT=$TEST_FIX_COUNT"
If TEST_FIX_COUNT >= 3:
- Report: "Detected $TEST_FIX_COUNT test fix attempts in recent history. Auto-escalating to strategic mode."
- Set STRATEGIC_MODE=true
0d. Mode Decision
| Condition | Mode |
|---|---|
| --strategic OR --research OR --force-escalate | STRATEGIC |
| TEST_FIX_COUNT >= 3 | STRATEGIC (auto-escalated) |
| Otherwise | TACTICAL (default) |
Report the mode: "Operating in [TACTICAL/STRATEGIC] mode."
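A minimal sketch of steps 0b-0d combined, assuming the STRATEGIC_MODE and TEST_FIX_COUNT variable names used above (illustrative, not prescriptive):

```bash
# Hedged sketch: flag detection (0b) + auto-escalation (0c) -> mode decision (0d)
STRATEGIC_MODE=false
if [[ "$ARGUMENTS" =~ (--strategic|--research|--force-escalate) ]]; then
  STRATEGIC_MODE=true
fi
TEST_FIX_COUNT=$(git log --oneline -20 | grep -icE "fix.*(test|spec|jest|pytest|vitest)" || true)
if (( TEST_FIX_COUNT >= 3 )); then
  STRATEGIC_MODE=true
  echo "Auto-escalating: $TEST_FIX_COUNT recent test fix attempts"
fi
[[ "$STRATEGIC_MODE" == true ]] && echo "Operating in STRATEGIC mode." || echo "Operating in TACTICAL mode."
```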
STEP 1: Parse Arguments
Check "$ARGUMENTS" for these flags:
- `--run-first` = Ignore cached results, run fresh tests
- `--pytest-only` = Focus on pytest (backend) only
- `--vitest-only` = Focus on Vitest (frontend) only
- `--playwright-only` = Focus on Playwright (E2E) only
- `--coverage` = Include coverage analysis
- `--fast` = Skip slow tests
- `--no-chain` = Disable chain invocation after fixes
- `--only-category=<category>` = Target specific test category for faster iteration
Parse --only-category for targeted test execution:
# Parse --only-category for finer control
if [[ "$ARGUMENTS" =~ "--only-category="([a-zA-Z]+) ]]; then
TARGET_CATEGORY="${BASH_REMATCH[1]}"
echo "🎯 Targeting only '$TARGET_CATEGORY' tests"
# Used in STEP 4 to filter pytest: -k $TARGET_CATEGORY
fi
Valid categories: unit, integration, e2e, acceptance, api, database
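Hedged usage sketch for applying the parsed category in STEP 4 (assumes categories map either to pytest markers or to name substrings; prefer `-m` when a matching marker exists, otherwise fall back to `-k`):

```bash
# Build a pytest filter from TARGET_CATEGORY, if one was parsed above
PYTEST_FILTER=""
if [[ -n "${TARGET_CATEGORY:-}" ]]; then
  if grep -rq "@pytest.mark.${TARGET_CATEGORY}" tests/ 2>/dev/null; then
    PYTEST_FILTER="-m ${TARGET_CATEGORY}"   # marker-based selection
  else
    PYTEST_FILTER="-k ${TARGET_CATEGORY}"   # name-based selection
  fi
fi
echo "Pytest filter: ${PYTEST_FILTER:-<none>}"
```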
STEP 2: Discover Cached Test Results
Run these commands ONE AT A TIME:
2a. Project info:
echo "Project: $(basename $PWD) | Branch: $(git branch --show-current) | Root: $PWD"
2b. Check if pytest results exist:
test -f "test-results/pytest/junit.xml" && echo "PYTEST_EXISTS=yes" || echo "PYTEST_EXISTS=no"
2c. If pytest results exist, get stats:
echo "PYTEST_AGE=$(($(date +%s) - $(stat -f %m test-results/pytest/junit.xml 2>/dev/null || stat -c %Y test-results/pytest/junit.xml 2>/dev/null)))s"
echo "PYTEST_TESTS=$(grep -o 'tests="[0-9]*"' test-results/pytest/junit.xml | head -1 | grep -o '[0-9]*')"
echo "PYTEST_FAILURES=$(grep -o 'failures="[0-9]*"' test-results/pytest/junit.xml | head -1 | grep -o '[0-9]*')"
2d. Check Vitest results:
test -f "test-results/vitest/results.json" && echo "VITEST_EXISTS=yes" || echo "VITEST_EXISTS=no"
2e. Check Playwright results:
test -f "test-results/playwright/results.json" && echo "PLAYWRIGHT_EXISTS=yes" || echo "PLAYWRIGHT_EXISTS=no"
STEP 2.5: Test Framework Intelligence
Detect test framework configuration:
2.5a. Pytest configuration:
grep -A 20 "\[tool.pytest" pyproject.toml 2>/dev/null | head -25 || echo "No pytest config in pyproject.toml"
2.5b. Available pytest markers:
grep -rh "pytest.mark\." tests/ 2>/dev/null | sed 's/.*@pytest.mark.\([a-zA-Z_]*\).*/\1/' | sort -u | head -10
2.5c. Check for slow tests:
grep -l "@pytest.mark.slow" tests/**/*.py 2>/dev/null | wc -l | xargs echo "Slow tests:"
Save detected markers and configuration for agent context.
STEP 2.6: Discover Project Context (SHARED CACHE - Token Efficient)
Token Savings: Using shared discovery cache saves ~14K tokens (2K per agent x 7 agents).
# 📊 SHARED DISCOVERY - Use cached context, refresh if stale (>15 min)
echo "=== Loading Shared Project Context ==="
# Source shared discovery helper (creates/uses cache)
if [[ -f "$HOME/.claude/scripts/shared-discovery.sh" ]]; then
source "$HOME/.claude/scripts/shared-discovery.sh"
discover_project_context
# SHARED_CONTEXT now contains pre-built context for agents
# Variables available: PROJECT_TYPE, VALIDATION_CMD, TEST_FRAMEWORK, RULES_SUMMARY
else
# Fallback: inline discovery (less efficient)
echo "⚠️ Shared discovery not found, using inline discovery"
PROJECT_CONTEXT=""
[ -f "CLAUDE.md" ] && PROJECT_CONTEXT="Read CLAUDE.md for project conventions. "
[ -d ".claude/rules" ] && PROJECT_CONTEXT+="Check .claude/rules/ for patterns. "
PROJECT_TYPE=""
[ -f "pyproject.toml" ] && PROJECT_TYPE="python"
[ -f "package.json" ] && PROJECT_TYPE="${PROJECT_TYPE:+$PROJECT_TYPE+}node"
SHARED_CONTEXT="$PROJECT_CONTEXT"
fi
# Display cached context summary
echo "PROJECT_TYPE=$PROJECT_TYPE"
echo "VALIDATION_CMD=${VALIDATION_CMD:-pnpm prepush}"
echo "TEST_FRAMEWORK=${TEST_FRAMEWORK:-pytest}"
CRITICAL: Pass $SHARED_CONTEXT to ALL agent prompts instead of asking each agent to discover.
This prevents 7 agents from each running discovery independently.
STEP 3: Decision Logic + Early Exit
Based on discovery, decide:
| Condition | Action |
|---|---|
| `--run-first` flag present | Go to STEP 4 (run fresh tests) |
| PYTEST_EXISTS=yes AND AGE < 900s AND FAILURES > 0 | Go to STEP 5 (read results) |
| PYTEST_EXISTS=yes AND AGE < 900s AND FAILURES = 0 | EARLY EXIT (see below) |
| PYTEST_EXISTS=no OR AGE >= 900s | Go to STEP 4 (run fresh tests) |
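A minimal sketch of this routing, assuming the PYTEST_* values from STEP 2; PYTEST_AGE_SECONDS here stands for the numeric age from 2c with the trailing "s" stripped (illustrative variable name):

```bash
# Hedged sketch of the STEP 3 decision table
if [[ "$ARGUMENTS" == *"--run-first"* ]]; then
  NEXT_STEP="STEP 4"                                     # forced fresh run
elif [[ "$PYTEST_EXISTS" == "yes" && "${PYTEST_AGE_SECONDS:-999999}" -lt 900 ]]; then
  if [[ "${PYTEST_FAILURES:-0}" -gt 0 ]]; then
    NEXT_STEP="STEP 5"                                   # fresh cache with failures -> read results
  else
    NEXT_STEP="EARLY_EXIT"                               # fresh cache, all passing
  fi
else
  NEXT_STEP="STEP 4"                                     # missing or stale cache -> rerun
fi
echo "Decision: $NEXT_STEP"
```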
EARLY EXIT OPTIMIZATION (Token Savings: ~80%)
If ALL tests are passing from cached results:
✅ All tests passing (PYTEST_FAILURES=0, VITEST_FAILURES=0)
📊 No failures to fix. Skipping agent dispatch.
💰 Token savings: ~80K tokens (avoided 7 agent dispatches)
Output JSON summary:
{
"status": "all_passing",
"tests_run": $PYTEST_TESTS,
"failures": 0,
"agents_dispatched": 0,
"action": "none_required"
}
→ Go to STEP 10 (chain invocation) or EXIT if --no-chain
DO NOT:
- Run discovery phase (STEP 2.6) if no failures
- Dispatch any agents
- Run strategic analysis
- Generate documentation
This avoids full pipeline when unnecessary.
STEP 4: Run Fresh Tests (if needed)
4a. Run pytest:
mkdir -p test-results/pytest && cd apps/api && uv run pytest -v --tb=short --junitxml=../../test-results/pytest/junit.xml 2>&1 | tail -40
4b. Run Vitest (if config exists):
test -f "apps/web/vitest.config.ts" && mkdir -p test-results/vitest && cd apps/web && npx vitest run --reporter=json --outputFile=../../test-results/vitest/results.json 2>&1 | tail -25
4c. Run Playwright (if config exists):
test -f "playwright.config.ts" && mkdir -p test-results/playwright && npx playwright test --reporter=json 2>&1 | tee test-results/playwright/results.json | tail -25
4d. If --coverage flag present:
mkdir -p test-results/pytest && cd apps/api && uv run pytest --cov=app --cov-report=xml:../../test-results/pytest/coverage.xml --cov-report=term-missing 2>&1 | tail -30
STEP 5: Read Test Result Files
Use the Read tool:
For pytest: Read(file_path="test-results/pytest/junit.xml")
- Look for `<testcase>` elements with `<failure>` or `<error>` children
- Extract: test name, classname (file path), failure message, full stack trace
For Vitest: Read(file_path="test-results/vitest/results.json")
- Look for `"status": "failed"` entries
- Extract: test name, file path, failure messages
For Playwright: Read(file_path="test-results/playwright/results.json")
- Look for specs where `"ok": false`
- Extract: test title, browser, error message
STEP 5.5: ANALYSIS PHASE
5.5a. Test Isolation Analysis
Check for potential isolation issues:
echo "=== Shared State Detection ===" && grep -rn "global\|class.*:$" tests/ 2>/dev/null | grep -v "conftest\|__pycache__" | head -10
echo "=== Fixture Scope Analysis ===" && grep -rn "@pytest.fixture.*scope=" tests/ 2>/dev/null | head -10
echo "=== Order Dependency Markers ===" && grep -rn "pytest.mark.order\|pytest.mark.serial" tests/ 2>/dev/null | head -5
If isolation issues detected:
- Add to agent context: "WARNING: Potential test isolation issues detected"
- List affected files
5.5b. Flakiness Detection
Check for flaky test indicators:
echo "=== Timing Dependencies ===" && grep -rn "sleep\|time.sleep\|setTimeout" tests/ 2>/dev/null | grep -v "__pycache__" | head -5
echo "=== Async Race Conditions ===" && grep -rn "asyncio.gather\|Promise.all" tests/ 2>/dev/null | head -5
If flakiness indicators found:
- Add to agent context: "Known flaky patterns detected"
- Recommend: pytest-rerunfailures or vitest retry
5.5c. Coverage Analysis (if --coverage)
test -f "test-results/pytest/coverage.xml" && grep -o 'line-rate="[0-9.]*"' test-results/pytest/coverage.xml | head -1
Coverage gates:
- < 60%: WARN "Critical: Coverage below 60%"
- 60-80%: INFO "Coverage could be improved"
- >= 80%: OK
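A hedged sketch of the gate check, assuming a Cobertura-style coverage.xml where line-rate is a 0-1 fraction:

```bash
# Extract the overall line-rate and map it onto the gates above
RATE=$(grep -o 'line-rate="[0-9.]*"' test-results/pytest/coverage.xml | head -1 | grep -o '[0-9.]*' | head -1)
PCT=$(awk -v r="${RATE:-0}" 'BEGIN { printf "%.0f", r * 100 }')
if   (( PCT < 60 )); then echo "WARN: Critical: Coverage below 60% ($PCT%)"
elif (( PCT < 80 )); then echo "INFO: Coverage could be improved ($PCT%)"
else                      echo "OK: Coverage $PCT%"
fi
```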
STEP 6: Enhanced Failure Categorization (Regex-Based)
Use regex pattern matching for precise categorization:
Unit Test Patterns → unit-test-fixer
- `/AssertionError:.*expected.*got/` → Assertion mismatch
- `/Mock.*call_count.*expected/` → Mock verification failure
- `/fixture.*not found/` → Fixture missing
- Business logic failures
API Test Patterns → api-test-fixer
- `/status.*(4\d\d|5\d\d)/` → HTTP error response
- `/validation.*failed|ValidationError/` → Schema validation
- `/timeout.*\d+\s*(s|ms)/` → Request timeout
- FastAPI/Flask/Django endpoint failures
Database Test Patterns → database-test-fixer
- `/connection.*refused|ConnectionError/` → Connection failure
- `/relation.*does not exist|table.*not found/` → Schema mismatch
- `/deadlock.*detected/` → Concurrency issue
- `/IntegrityError|UniqueViolation/` → Constraint violation
- Fixture/mock database issues
E2E Test Patterns → e2e-test-fixer
- `/locator.*timeout|element.*not found/` → Selector failure
- `/navigation.*failed|page.*crashed/` → Page load issue
- `/screenshot.*captured/` → Visual regression
- Playwright/Cypress failures
Type Error Patterns → type-error-fixer
- `/TypeError:.*expected.*got/` → Type mismatch
- `/mypy.*error/` → Static type check failure
- `/TypeScript.*error TS/` → TS compilation error
Import Error Patterns → import-error-fixer
- `/ModuleNotFoundError|ImportError/` → Missing module
- `/circular import/` → Circular dependency
- `/cannot import name/` → Named import failure
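A hedged routing sketch, assuming $FAILURE_TEXT holds one failure's message plus stack trace (the patterns mirror the lists above; first match wins, and import/type errors are checked first because they mask other categories, matching the PHASE 1 ordering in STEP 7.5):

```bash
# Route a single failure to a specialist agent based on the regexes above
route_failure() {
  local FAILURE_TEXT="$1"
  if   grep -qiE "ModuleNotFoundError|ImportError|cannot import name" <<< "$FAILURE_TEXT"; then echo "import-error-fixer"
  elif grep -qiE "mypy.*error|error TS[0-9]+"                         <<< "$FAILURE_TEXT"; then echo "type-error-fixer"
  elif grep -qiE "connection.*refused|IntegrityError|deadlock"        <<< "$FAILURE_TEXT"; then echo "database-test-fixer"
  elif grep -qiE "status.*(4[0-9][0-9]|5[0-9][0-9])|ValidationError"  <<< "$FAILURE_TEXT"; then echo "api-test-fixer"
  elif grep -qiE "locator.*timeout|element.*not found"                <<< "$FAILURE_TEXT"; then echo "e2e-test-fixer"
  else echo "unit-test-fixer"   # assertions, mocks, fixtures, business logic
  fi
}
# Usage: route_failure "$ONE_FAILURE"   # -> e.g. "api-test-fixer"
```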
STEP 6.5: FAILURE PRIORITIZATION
Assign priority based on test type:
| Priority | Criteria | Detection |
|---|---|---|
| P0 Critical | Security/auth tests | test_auth_*, test_security_*, test_permission_* |
| P1 High | Core business logic | test_*_service, test_*_handler, most unit tests |
| P2 Medium | Integration tests | test_*_integration, API tests |
| P3 Low | Edge cases, performance | test_*_edge_*, test_*_perf_*, test_*_slow |
Pass priority information to agents:
- "Priority: P0 - Fix these FIRST (security critical)"
- "Priority: P1 - High importance (core logic)"
STEP 7: STRATEGIC MODE (if triggered)
If STRATEGIC_MODE=true:
7a. Launch Test Strategy Analyst
Task(subagent_type="test-strategy-analyst",
model="opus",
description="Analyze recurring test failures",
prompt="Analyze test failures in this project using Five Whys methodology.
Git history shows $TEST_FIX_COUNT recent test fix attempts.
Current failures: [FAILURE SUMMARY]
Research:
1. Best practices for the detected failure patterns
2. Common pitfalls in pytest/vitest testing
3. Root cause analysis for recurring issues
Provide strategic recommendations for systemic fixes.
MANDATORY OUTPUT FORMAT - Return ONLY JSON:
{
\"root_causes\": [{\"issue\": \"...\", \"five_whys\": [...], \"recommendation\": \"...\"}],
\"infrastructure_changes\": [\"...\"],
\"prevention_mechanisms\": [\"...\"],
\"priority\": \"P0|P1|P2\",
\"summary\": \"Brief strategic overview\"
}
DO NOT include verbose analysis or full code examples.")
7b. After Strategy Analyst Completes
If fixes are recommended, proceed to STEP 8.
7c. Launch Documentation Generator (optional)
If significant insights were found:
Task(subagent_type="test-documentation-generator",
model="haiku",
description="Generate test knowledge documentation",
prompt="Based on the strategic analysis results, generate:
1. Test failure runbook (docs/test-failure-runbook.md)
2. Test strategy summary (docs/test-strategy.md)
3. Pattern-specific knowledge (docs/test-knowledge/)
MANDATORY OUTPUT FORMAT - Return ONLY JSON:
{
\"files_created\": [\"docs/test-failure-runbook.md\"],
\"patterns_documented\": 3,
\"summary\": \"Created runbook with 5 failure patterns\"
}
DO NOT include file contents in response.")
STEP 7.5: Conflict Detection for Parallel Agents
Before launching agents, detect overlapping file scopes to prevent conflicts:
SAFE TO PARALLELIZE (different test domains):
- unit-test-fixer + e2e-test-fixer → ✅ Different test directories
- api-test-fixer + database-test-fixer → ✅ Different concerns
- vitest tests + pytest tests → ✅ Different frameworks
MUST SERIALIZE (overlapping files):
- unit-test-fixer + import-error-fixer → ⚠️ Both may modify conftest.py → SEQUENTIAL
- type-error-fixer + any test fixer → ⚠️ Type fixes affect test expectations → RUN FIRST
- Multiple fixers for same test file → ⚠️ RUN SEQUENTIALLY
Execution Phases:
PHASE 1 (First): type-error-fixer, import-error-fixer
└── These fix foundational issues that other agents depend on
PHASE 2 (Parallel): unit-test-fixer, api-test-fixer, database-test-fixer
└── These target different test categories, safe to run together
PHASE 3 (Last): e2e-test-fixer
└── E2E depends on backend fixes being complete
PHASE 4 (Validation): Run full test suite to verify all fixes
Conflict Detection Algorithm:
# Check if multiple agents target same file patterns
# If conftest.py in scope of multiple agents → serialize them
# If same test file reported → assign to single agent only
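A minimal sketch of the overlap check, assuming each agent's planned scope is a whitespace-separated file list (file names and variable names here are illustrative):

```bash
# Serialize agents whose file scopes overlap; conftest.py always forces serialization
AGENT_A_FILES="tests/unit/test_user.py tests/conftest.py"
AGENT_B_FILES="tests/unit/test_auth.py tests/conftest.py"

OVERLAP=$(comm -12 <(tr ' ' '\n' <<< "$AGENT_A_FILES" | sort -u) \
                   <(tr ' ' '\n' <<< "$AGENT_B_FILES" | sort -u))
if [[ -n "$OVERLAP" ]]; then
  echo "Overlapping scope detected ($OVERLAP) -> run these agents SEQUENTIALLY"
else
  echo "No overlap -> safe to parallelize"
fi
```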
STEP 7.6: Test File Modification Safety (NEW)
CRITICAL: When multiple test files need modification, apply dependency-aware batching similar to source file refactoring.
Analyze Test File Dependencies
Before spawning test fixers, identify shared fixtures and conftest dependencies:
echo "=== Test Dependency Analysis ==="
# Find all conftest.py files
CONFTEST_FILES=$(find tests/ -name "conftest.py" 2>/dev/null)
echo "Shared fixture files: $CONFTEST_FILES"
# For each failing test file, find its fixture dependencies
for TEST_FILE in $FAILING_TEST_FILES; do
# Find imports from conftest
FIXTURE_IMPORTS=$(grep -E "^from.*conftest|@pytest.fixture" "$TEST_FILE" 2>/dev/null | head -10)
# Find shared fixtures used
FIXTURES_USED=$(grep -oE "[a-z_]+_fixture|@pytest.fixture" "$TEST_FILE" 2>/dev/null | sort -u)
echo " $TEST_FILE -> fixtures: [$FIXTURES_USED]"
done
Group Test Files by Shared Fixtures
# Files sharing conftest.py fixtures MUST serialize
# Files with independent fixtures CAN parallelize
# Example output:
echo "
Test Cluster A (SERIAL - shared fixtures in tests/conftest.py):
- tests/unit/test_user.py
- tests/unit/test_auth.py
Test Cluster B (PARALLEL - independent fixtures):
- tests/integration/test_api.py
- tests/integration/test_database.py
Test Cluster C (SPECIAL - conftest modification needed):
- tests/conftest.py (SERIALIZE - blocks all others)
"
Execution Rules for Test Modifications
| Scenario | Execution Mode | Reason |
|---|---|---|
| Multiple test files, no shared fixtures | PARALLEL | Safe, independent |
| Multiple test files, shared fixtures | SERIAL within fixture scope | Fixture state conflicts |
| conftest.py needs modification | SERIAL (blocks all) | Critical shared state |
| Same test file reported by multiple fixers | Single agent only | Avoid merge conflicts |
conftest.py Special Handling
If conftest.py needs modification:
- Run conftest fixer FIRST (before any other test fixers)
- Wait for completion before proceeding
- Re-run baseline tests to verify fixture changes don't break existing tests
- Then parallelize remaining independent test fixes
PHASE 1 (First, blocking): conftest.py modification
└── WAIT for completion
PHASE 2 (Sequential): Test files sharing modified fixtures
└── Run one at a time, verify after each
PHASE 3 (Parallel): Independent test files
└── Safe to parallelize
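A hedged gating sketch for the phases above, assuming $FAILING_TEST_FILES from the dependency analysis earlier in this step:

```bash
# If conftest.py itself needs changes, it blocks everything else (PHASE 1 above)
if grep -q "conftest.py" <<< "$FAILING_TEST_FILES"; then
  echo "conftest.py in scope -> dispatch conftest fixer first, wait, then re-run baseline tests"
else
  echo "No conftest changes -> proceed with fixture-cluster serialization / parallel dispatch"
fi
```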
Failure Handling for Test Modifications
When a test fixer fails:
AskUserQuestion(
questions=[{
"question": "Test fixer for {test_file} failed: {error}. {N} test files remain. What would you like to do?",
"header": "Test Fix Failure",
"options": [
{"label": "Continue", "description": "Skip this test file, proceed with remaining"},
{"label": "Abort", "description": "Stop test fixing, preserve current state"},
{"label": "Retry", "description": "Attempt to fix {test_file} again"}
],
"multiSelect": false
}]
)
Test Fixer Dispatch with Scope
Include scope information when dispatching test fixers:
Task(
subagent_type="unit-test-fixer",
description="Fix unit tests in {test_file}",
prompt="Fix failing tests in this file:
TEST FILE CONTEXT:
- file: {test_file}
- shared_fixtures: {list of conftest fixtures used}
- parallel_peers: {other test files being fixed simultaneously}
- conftest_modified: {true|false - was conftest changed this session?}
SCOPE CONSTRAINTS:
- ONLY modify: {test_file}
- DO NOT modify: conftest.py (unless explicitly assigned)
- DO NOT modify: {parallel_peer_files}
MANDATORY OUTPUT FORMAT - Return ONLY JSON:
{
\"status\": \"fixed|partial|failed\",
\"test_file\": \"{test_file}\",
\"tests_fixed\": N,
\"fixtures_modified\": [],
\"remaining_failures\": N,
\"summary\": \"...\"
}"
)
STEP 8: PARALLEL AGENT DISPATCH
CRITICAL: Launch ALL agents in ONE response with multiple Task calls.
ENHANCED AGENT CONTEXT TEMPLATE
For each agent, provide this comprehensive context:
Test Specialist Task: [Agent Type] - Test Failure Fix
## Context
- Project: [detected from git remote]
- Branch: [from git branch --show-current]
- Framework: pytest [version] / vitest [version]
- Python/Node version: [detected]
## Project Patterns (DISCOVER DYNAMICALLY - Do This First!)
**CRITICAL - Project Context Discovery:**
Before making any fixes, you MUST:
1. Read CLAUDE.md at project root (if exists) for project conventions
2. Check .claude/rules/ directory for domain-specific rule files:
- If editing Python test files → read python*.md rules
- If editing TypeScript tests → read typescript*.md rules
- If graphiti/temporal patterns exist → read graphiti.md rules
3. Detect test patterns from config files (pytest.ini, vitest.config.ts)
4. Apply discovered patterns to ALL your fixes
This ensures fixes follow project conventions, not generic patterns.
[Include PROJECT_CONTEXT from STEP 2.6 here]
## Recent Test Changes
[git diff HEAD~3 --name-only | grep -E "(test|spec)\.(py|ts|tsx)$"]
## Failures to Fix
[FAILURE LIST with full stack traces]
## Test Isolation Status
[From STEP 5.5a - any warnings]
## Flakiness Report
[From STEP 5.5b - any detected patterns]
## Priority
[From STEP 6.5 - P0/P1/P2/P3 with reasoning]
## Framework Configuration
[From STEP 2.5 - markers, config]
## Constraints
- Follow project's test method length limits (check CLAUDE.md or file-size-guidelines.md)
- Pre-flight: Verify baseline tests pass
- Post-flight: Ensure no broken existing tests
- Cannot modify implementation code (test expectations only unless bug found)
- Apply project-specific patterns discovered from CLAUDE.md/.claude/rules/
## Expected Output
- Summary of fixes made
- Files modified with line numbers
- Verification commands run
- Remaining issues (if any)
Dispatch Example (with Model Strategy + JSON Output)
Task(subagent_type="unit-test-fixer",
model="sonnet",
description="Fix unit test failures (P1)",
prompt="[FULL ENHANCED CONTEXT TEMPLATE]
MANDATORY OUTPUT FORMAT - Return ONLY JSON:
{
\"status\": \"fixed|partial|failed\",
\"tests_fixed\": N,
\"files_modified\": [\"path/to/file.py\"],
\"remaining_failures\": N,
\"summary\": \"Brief description of fixes\"
}
DO NOT include full file content or verbose logs.")
Task(subagent_type="api-test-fixer",
model="sonnet",
description="Fix API test failures (P2)",
prompt="[FULL ENHANCED CONTEXT TEMPLATE]
MANDATORY OUTPUT FORMAT - Return ONLY JSON:
{...same format...}
DO NOT include full file content or verbose logs.")
Task(subagent_type="import-error-fixer",
model="haiku",
description="Fix import errors (P1)",
prompt="[CONTEXT]
MANDATORY OUTPUT FORMAT - Return ONLY JSON:
{...same format...}")
Model Strategy
| Agent Type | Model | Rationale |
|---|---|---|
| test-strategy-analyst | opus | Complex research + Five Whys |
| unit/api/database/e2e-test-fixer | sonnet | Balanced speed + quality |
| type-error-fixer | sonnet | Type inference complexity |
| import-error-fixer | haiku | Simple pattern matching |
| linting-fixer | haiku | Rule-based fixes |
| test-documentation-generator | haiku | Template-based docs |
STEP 9: Validate Fixes
After agents complete:
cd apps/api && uv run pytest -v --tb=short --junitxml=../../test-results/pytest/junit.xml 2>&1 | tail -40
Check results:
- If ALL tests pass → Go to STEP 10
- If SOME tests still fail → Report remaining failures, suggest --strategic
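A hedged sketch of this check, reading the refreshed junit.xml the same way STEP 2 does:

```bash
# Compare remaining failures/errors against zero to choose the next step
REMAINING=$(grep -o 'failures="[0-9]*"' test-results/pytest/junit.xml | head -1 | grep -o '[0-9]*')
ERRORS=$(grep -o 'errors="[0-9]*"' test-results/pytest/junit.xml | head -1 | grep -o '[0-9]*')
if (( ${REMAINING:-0} + ${ERRORS:-0} == 0 )); then
  echo "All tests passing -> STEP 10 (chain invocation)"
else
  echo "${REMAINING:-0} failures / ${ERRORS:-0} errors remain -> report and suggest --strategic"
fi
```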
STEP 10: INTELLIGENT CHAIN INVOCATION
10a. Check Depth
If SLASH_DEPTH >= 3:
- Report: "Maximum depth reached, skipping chain invocation"
- Go to STEP 11
10b. Check --no-chain Flag
If --no-chain present:
- Report: "Chain invocation disabled by flag"
- Go to STEP 11
10c. Determine Chain Action
If ALL tests passing AND changes were made:
SlashCommand(skill="/commit_orchestrate",
args="--message 'fix(tests): resolve test failures'")
If ALL tests passing AND NO changes made:
- Report: "All tests passing, no changes needed"
- Go to STEP 11
If SOME tests still failing:
- Report remaining failure count
- If TACTICAL mode: Suggest "Run with --strategic for root cause analysis"
- Go to STEP 11
STEP 11: Report Summary
Report:
- Mode: TACTICAL or STRATEGIC
- Initial failure count by type
- Agents dispatched with priorities
- Strategic insights (if applicable)
- Current pass/fail status
- Coverage status (if --coverage)
- Chain invocation result
- Remaining issues and recommendations
Quick Reference
| Command | Effect |
|---|---|
| /test_orchestrate | Use cached results if fresh (<15 min) |
| /test_orchestrate --run-first | Run tests fresh, ignore cache |
| /test_orchestrate --pytest-only | Only pytest failures |
| /test_orchestrate --strategic | Force strategic mode (research + analysis) |
| /test_orchestrate --coverage | Include coverage analysis |
| /test_orchestrate --no-chain | Don't auto-invoke /commit_orchestrate |
VS Code Integration
pytest.ini must have: addopts = --junitxml=test-results/pytest/junit.xml
Then: Run tests in VS Code -> /test_orchestrate reads cached results -> Fixes applied
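Reference sketch of the expected config, printed from shell for illustration (if this project configures pytest in pyproject.toml instead, the same addopts line goes under [tool.pytest.ini_options]):

```bash
cat <<'EOF'
# pytest.ini (or the [pytest] section of setup.cfg)
[pytest]
addopts = --junitxml=test-results/pytest/junit.xml
EOF
```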
Agent Quick Reference
| Failure Pattern | Agent | Model | JSON Output |
|---|---|---|---|
| Assertions, mocks, fixtures | unit-test-fixer | sonnet | Required |
| HTTP, API contracts, endpoints | api-test-fixer | sonnet | Required |
| Database, SQL, connections | database-test-fixer | sonnet | Required |
| Selectors, timeouts, E2E | e2e-test-fixer | sonnet | Required |
| Type annotations, mypy | type-error-fixer | sonnet | Required |
| Imports, modules, paths | import-error-fixer | haiku | Required |
| Strategic analysis | test-strategy-analyst | opus | Required |
| Documentation | test-documentation-generator | haiku | Required |
Token Efficiency: JSON Output Format
ALL agents MUST return distilled JSON summaries only.
{
"status": "fixed|partial|failed",
"tests_fixed": 3,
"files_modified": ["tests/test_auth.py", "tests/conftest.py"],
"remaining_failures": 0,
"summary": "Fixed mock configuration and assertion order"
}
DO NOT return:
- Full file contents
- Verbose explanations
- Step-by-step execution logs
This reduces token usage by 80-90% per agent response.
EXECUTE NOW. Start with Step 0a (depth check).