Task 18: End-to-End Evaluation

Prerequisite

All step-level test/eval cycles complete.

Intent

Run the fully tightened QD2 on a real task end-to-end, then compare the result holistically against the combined QS + QD baseline.

Method

  1. Pick a representative task: neither trivial nor huge, in the sweet spot QD2 is designed for.
  2. Run QD2 start to finish.
  3. If possible, run the same task through QS → QD for comparison.

Metrics

  • Total human turns (north star)
  • Total time
  • Total tokens
  • Output quality (code correctness, spec quality, review thoroughness)
  • Number of unnecessary human interactions
  • Would you ship this?
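One way to keep the two runs comparable is to record the quantitative metrics above in the same shape for each run and diff them. The following is a minimal sketch, not part of the existing tooling: the `RunMetrics` class, its field names, and the `compare` helper are all hypothetical, and output quality is left out because it needs a written judgment rather than a number.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Quantitative metrics for one end-to-end run (hypothetical field names)."""
    label: str                      # e.g. "QD2" or "QS+QD"
    human_turns: int                # total human turns (north star)
    minutes: float                  # total wall-clock time
    tokens: int                     # total tokens consumed
    unnecessary_interactions: int   # human interactions judged avoidable
    would_ship: bool                # reviewer's answer to "Would you ship this?"


def compare(candidate: RunMetrics, baseline: RunMetrics) -> dict[str, float]:
    """Return candidate minus baseline per metric; negative means the candidate used less."""
    return {
        "human_turns": candidate.human_turns - baseline.human_turns,
        "minutes": candidate.minutes - baseline.minutes,
        "tokens": candidate.tokens - baseline.tokens,
        "unnecessary_interactions": (
            candidate.unnecessary_interactions - baseline.unnecessary_interactions
        ),
    }
```

Calling `compare(qd2_run, baseline_run)` with one record per run yields the per-metric deltas, which can sit alongside the qualitative notes on output quality and the ship/no-ship answer.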

Output

Final eval report: _experiment/results/end-to-end-eval.md, with a go/no-go recommendation on promoting QD2 to replace QS+QD.