Task 18: End-to-End Evaluation
Prerequisite
All step-level test/eval cycles complete.
Intent
Run the fully tightened QD2 on a real task end-to-end. Compare the result holistically against the combined QS + QD baseline.
Method
- Pick a representative task: neither trivial nor huge, in the mid-sized range QD2 is designed for.
- Run QD2 start to finish.
- If possible, run the same task through QS → QD for comparison.
Metrics
- Total human turns (north star)
- Total time
- Total tokens
- Output quality (code correctness, spec quality, review thoroughness)
- Number of unnecessary human interactions
- Would you ship this?
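One way to keep the two runs comparable is to record the metrics above in the same structure for each run. The sketch below is illustrative only: the class name RunMetrics, its field names, and the compare helper are assumptions, not part of QD2, QS, or QD. Output quality stays a free-text note because it is judged by a reviewer rather than computed.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    """Metrics for one end-to-end run. All field names are hypothetical."""
    pipeline: str                  # e.g. "QD2" or "QS+QD"
    human_turns: int               # north-star metric
    total_time_s: float            # wall-clock time in seconds
    total_tokens: int
    unnecessary_interactions: int  # human interactions judged avoidable
    quality_notes: str = ""        # code correctness, spec quality, review thoroughness
    would_ship: bool = False


def compare(candidate: RunMetrics, baseline: RunMetrics) -> dict:
    """Return candidate-minus-baseline deltas for the quantitative metrics.

    A negative delta means the candidate used less of that resource.
    """
    return {
        "human_turns": candidate.human_turns - baseline.human_turns,
        "total_time_s": candidate.total_time_s - baseline.total_time_s,
        "total_tokens": candidate.total_tokens - baseline.total_tokens,
        "unnecessary_interactions": (
            candidate.unnecessary_interactions - baseline.unnecessary_interactions
        ),
    }
```

The deltas only summarize the countable metrics; the ship decision and the quality assessment still need a human judgment recorded alongside them.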
Output
Final eval report at _experiment/results/end-to-end-eval.md, with a go/no-go recommendation on promoting QD2 to replace QS + QD.
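A minimal sketch of how the report file could be assembled, assuming a simple layout with a side-by-side metrics table and a recommendation line. The function names, the table layout, and the argument shapes are assumptions; only the output path comes from this task. The recommendation is passed in by the evaluator and is not derived automatically.

```python
from pathlib import Path

# Path taken from the task description.
REPORT_PATH = Path("_experiment/results/end-to-end-eval.md")


def render_report(task: str, metrics: dict, recommendation: str, rationale: str) -> str:
    """Build the report text.

    metrics maps a metric name to a (qd2_value, baseline_value) pair.
    recommendation must be "go" or "no-go" and is decided by the evaluator.
    """
    if recommendation not in ("go", "no-go"):
        raise ValueError("recommendation must be 'go' or 'no-go'")
    lines = [
        "# End-to-End Evaluation",
        "",
        f"Task: {task}",
        "",
        "| Metric | QD2 | QS+QD |",
        "| --- | --- | --- |",
    ]
    for name, (qd2_value, baseline_value) in metrics.items():
        lines.append(f"| {name} | {qd2_value} | {baseline_value} |")
    lines += ["", f"Recommendation: {recommendation}", "", rationale, ""]
    return "\n".join(lines)


def write_report(text: str, path: Path = REPORT_PATH) -> None:
    """Write the report, creating the results directory if it is missing."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

If the baseline run was not possible, the baseline column can carry a placeholder such as "n/a" so the report still records the QD2 numbers.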