# Task 02: Eval Bare Skeleton

## Prerequisite

Task 01 test cycle is clean (all gaps and plumbing issues resolved).
## Intent

Evaluate the efficiency of the bare skeleton (QD2), using the existing QD workflow as the baseline.
## Method

1. Run the same task through both QD (old) and QD2 (skeleton) if possible; otherwise, compare the QD2 run against a recent QD session log.
2. Measure:
   - Total human turns (the north star metric)
   - Total agent turns / API round-trips
   - Approximate token usage (context window utilization)
   - Time to completion
   - Quality of output (subjective: did it produce what was asked for?)
3. Note where QD2 is better, where it is worse, and where it is equivalent.
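The comparison in steps 2 and 3 could be recorded along the following lines. This is a minimal sketch, not part of the QD or QD2 tooling: the `RunMetrics` class, its field names, and the `compare` helper are hypothetical, and the numbers are placeholders rather than measured results. It assumes lower is better for every quantitative metric and leaves the subjective quality judgment to the reviewer.

```python
from dataclasses import dataclass, fields


@dataclass
class RunMetrics:
    """Quantitative metrics for one workflow run (hypothetical field names)."""
    human_turns: int   # the north star metric
    agent_turns: int   # agent turns / API round-trips
    tokens: int        # approximate token usage
    seconds: float     # time to completion


def compare(baseline: RunMetrics, candidate: RunMetrics) -> dict[str, str]:
    """Label each metric 'better', 'worse', or 'equivalent' for the candidate.

    Assumes lower values are better for every metric.
    """
    verdicts = {}
    for f in fields(RunMetrics):
        old, new = getattr(baseline, f.name), getattr(candidate, f.name)
        if new < old:
            verdicts[f.name] = "better"
        elif new > old:
            verdicts[f.name] = "worse"
        else:
            verdicts[f.name] = "equivalent"
    return verdicts


# Placeholder numbers for illustration only, not measured results.
qd = RunMetrics(human_turns=12, agent_turns=40, tokens=90_000, seconds=1800.0)
qd2 = RunMetrics(human_turns=7, agent_turns=40, tokens=110_000, seconds=1500.0)
verdicts = compare(qd, qd2)
```

Keeping the verdict per metric, rather than collapsing to a single score, matches step 3: the report needs to say where QD2 wins, loses, or ties.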
## Output

An eval report, `_experiment/results/skeleton-eval-report.md`, containing a metrics comparison and recommendations for which steps need tightening first, prioritized by impact on the north star metric.
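One way to derive the prioritized recommendations is to rank steps by how many human turns each one consumed. The sketch below assumes human turns can be attributed to individual steps; the step names and counts are hypothetical placeholders, and `prioritize` is an illustrative helper, not an existing function.

```python
def prioritize(human_turns_by_step: dict[str, int]) -> list[str]:
    """Return step names ordered by human turns consumed, highest first."""
    return sorted(
        human_turns_by_step,
        key=lambda step: human_turns_by_step[step],
        reverse=True,
    )


# Hypothetical step names and placeholder counts, not measured results.
steps = {"plan": 1, "implement": 4, "review": 2}
ranking = prioritize(steps)

# Render the ranking as a numbered Markdown list for the report.
lines = [
    f"{i}. {name} ({steps[name]} human turns)"
    for i, name in enumerate(ranking, start=1)
]
```

The resulting `lines` can be pasted into the recommendations section of the report, with the step that costs the most human turns listed first.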