docs: add roadmap task files for QD2 experiment
18 tasks: skeleton test/eval, then per-step tighten/test/eval cycles, plus end-to-end eval. Each task file is a self-contained intent expression that can be fed to QD2.
This commit is contained in:
parent c2da172988, commit 4dd5a0c871
# Quick-Flow Redesign — Implementation Roadmap

## Strategy: Skeleton-First with Real Plumbing

Build the full BMM sharded workflow infrastructure with minimal step prompts. Run it. See where training falls short. Tighten only where it demonstrably fails.

## Phase 1: Working Skeleton (Current)

Full BMM plumbing, thin prompts.

### Create

```
src/bmm/workflows/bmad-quick-flow/
  workflow.md                      # Real entry point — config loading, path resolution, step-file architecture rules
  tech-spec-template.md            # Real template — frozen sections, change log, golden examples
  steps/
    step-01-clarify-and-route.md   # Slightly more flesh (routing criteria matter)
    step-02-plan.md                # Thin prompt, real plumbing
    step-03-implement.md           # Thin prompt, real plumbing (task sharding structure)
    step-04-review.md              # Thin prompt, real plumbing
    step-05-present.md             # Thin prompt, real plumbing
```
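
For orientation, the "real plumbing" can be pictured as each step file carrying its paths and transitions in frontmatter. The sketch below is purely illustrative: these key names are assumptions, not the actual BMM step-file schema.

```yaml
# Hypothetical frontmatter for steps/step-03-implement.md.
# Key names are illustrative only; the real BMM schema may differ.
---
step: 3
name: implement
next: steps/step-04-review.md        # NEXT directive target
spec: '{output_dir}/tech-spec.md'    # frozen after checkpoint 1
tasks_dir: '{output_dir}/tasks/'     # task sharding structure
sequence_file: '{output_dir}/task-sequence.md'
---
```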

### Modify

- `src/bmm/agents/quick-flow-solo-dev.agent.yaml` — single QF trigger
- `src/bmm/module-help.csv` — replace QS/QD with QF

### Delete

- `src/bmm/workflows/bmad-quick-flow/quick-spec/` (entire directory)
- `src/bmm/workflows/bmad-quick-flow/quick-dev/` (entire directory)

### Test

Run the skeleton on a real task. Observe where it works, where it breaks.

## Phase 2+: Iterative Tightening

Add specificity to step prompts only where Phase 1 testing reveals gaps. The detailed plan (`quick-flow-redesign-plan.md`) is the reference spec — pull from it as needed, don't front-load it all.

Candidates for tightening (in likely priority order based on complexity):

- Step 1 routing criteria (one-shot vs plan-code-review vs full-BMM)
- Step 4 review layers and classification cascade
- Step 3 crash recovery and resume logic
- Step 4 spec loop oscillation mitigations (frozen sections, guardrails ratchet, positive preservation)
- Step 5 findings presentation and PR creation

# Task 01: Test Bare Skeleton

## Intent

Run QD2 as-is (bare one-liner prompts + BMM plumbing) on a real small task. Document what works and what breaks.

## Method

1. Pick a small real task in the BMAD-METHOD repo (or a test project).
2. Invoke QD2 via the QD2 trigger.
3. Let it run through all 5 steps without intervention (except at the two checkpoints).
4. Record observations per step:
   - Did it follow the plumbing? (config loading, step transitions, NEXT directives)
   - Did it produce reasonable output from training alone?
   - Where did it go off the rails or get stuck?
   - What questions did it ask that it shouldn't have?
   - What did it fail to do that it should have?

## Output

A findings document: `_experiment/results/skeleton-test-findings.md` with per-step observations classified as:

- **Works** — training handled it fine, no tightening needed
- **Gap** — specific behavior missing or wrong, needs prompt tightening
- **Plumbing** — structural issue with the BMM infrastructure itself

# Task 02: Eval Bare Skeleton

## Prerequisite

Task 01 test cycle is clean (all gaps and plumbing issues resolved).

## Intent

Evaluate the bare skeleton's efficiency against the existing QD workflow as a baseline.

## Method

1. Run the same task through both QD (old) and QD2 (skeleton) if possible, or compare against a recent QD session log.
2. Measure:
   - Total human turns (the north star metric)
   - Total agent turns / API round-trips
   - Approximate token usage (context window utilization)
   - Time to completion
   - Quality of output (subjective: did it produce what was asked for?)
3. Note where QD2 is better, where it's worse, and where it's equivalent.
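
The turn and token counts in step 2 can be tallied mechanically once a session log exists. A minimal sketch, assuming a hypothetical log format of `{"role", "tokens"}` turn records; the real QD/QD2 transcript format will differ:

```python
def tally(turns):
    """Count eval metrics over a parsed session log.

    `turns` is a list of {"role": "human" | "agent", "tokens": int} dicts;
    this record shape is an assumption, not the real transcript format.
    """
    return {
        "human_turns": sum(1 for t in turns if t["role"] == "human"),
        "agent_turns": sum(1 for t in turns if t["role"] == "agent"),
        "approx_tokens": sum(t.get("tokens", 0) for t in turns),
    }

# Toy log: two human turns, two agent turns.
log = [{"role": "human", "tokens": 50}, {"role": "agent", "tokens": 900},
       {"role": "human", "tokens": 20}, {"role": "agent", "tokens": 700}]
print(tally(log))
```

Running the same tally over both the QD and QD2 logs gives directly comparable numbers for the table in the eval report.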

## Output

An eval report: `_experiment/results/skeleton-eval-report.md` with metrics comparison and recommendations for which steps need tightening first (prioritized by impact on the north star metric).

# Task 03: Tighten Step 1 — Routing

## Prerequisite

Task 01/02 findings indicate step 1 needs tightening.

## Intent

Add specificity to step-01-clarify-and-route.md only where skeleton testing revealed gaps. Reference the detailed plan in `_experiment/planning/redesign-plan.md` for the full spec of what step 1 should do — pull only what's needed.

## Likely areas (from the plan)

- Routing criteria precision (one-shot vs plan-code-review vs full-BMM boundaries)
- Intent exit criteria enforcement
- WIP/ready-for-dev artifact detection
- Project context backfill behavior

## Constraint

Keep prompts thin. Add the minimum words needed to close the gap. If the LLM was already doing it right from training, don't add instructions for it.

# Task 04: Test Step 1 — Routing

## Prerequisite

Task 03 tightening applied.

## Intent

Verify step 1 routing works correctly across different request types.

## Test cases

- Trivial request (e.g., "fix a typo in README") → should route one-shot
- Normal feature request → should route plan-code-review
- Large/ambiguous request → should route full BMM
- Ambiguous between one-shot and plan → should default to plan-code-review
- Existing ready-for-dev spec → should skip to step 3
- Existing WIP file → should offer resume or archive
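
These cases can be kept as a small regression table so that any re-tightening in task 03 is cheap to re-check. A sketch, where `route` stands in for whatever invokes step 1 and returns the chosen route; the prompts and route labels here are illustrative:

```python
# Expected route per prompt. Labels mirror the test cases above;
# the prompts themselves are made-up examples.
CASES = [
    ("fix a typo in README", "one-shot"),
    ("add pagination to the user list endpoint", "plan-code-review"),
    ("rebuild auth across every service", "full-bmm"),
    ("small-ish change, scope unclear", "plan-code-review"),  # ambiguity defaults to plan
]

def failed_cases(route):
    """Run each prompt through `route`; collect (prompt, wanted, got) mismatches."""
    return [(prompt, want, got)
            for prompt, want in CASES
            if (got := route(prompt)) != want]
```

An empty `failed_cases` result means the routing table passes; each mismatch feeds straight back into task 03.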

## Output

Per-test-case pass/fail with observations. Any failures feed back into task 03 (re-tighten).

# Task 05: Eval Step 1 — Routing Efficiency

## Prerequisite

Task 04 test cycle is clean.

## Intent

Evaluate step 1 efficiency. How many turns does it take to capture intent and route?

## Metrics

- Human turns to reach a routing decision
- Unnecessary questions asked (things it could have figured out from codebase investigation)
- Time spent in step 1 vs total flow time
- Compare against QD step-01-mode-detection baseline

## Output

Eval report appended to `_experiment/results/step-01-eval.md`.

# Task 06: Tighten Step 2 — Plan

## Prerequisite

Step 1 test/eval cycle complete. Task 01/02 findings indicate step 2 needs tightening.

## Intent

Add specificity to step-02-plan.md where skeleton testing revealed gaps. Reference `_experiment/planning/redesign-plan.md` Step 2 section.

## Likely areas

- Spec quality (does it meet the Ready for Development standard without being told how?)
- Checkpoint 1 presentation format
- Frozen section enforcement after approval
- Investigation depth (does it use subagents? does it ask the human things it should figure out itself?)

## Constraint

Minimum words to close the gap. Don't over-specify what training already handles.

# Task 07: Test Step 2 — Plan

## Prerequisite

Task 06 tightening applied.

## Intent

Verify step 2 produces a usable spec and the checkpoint works correctly.

## Test cases

- Does investigation happen autonomously without unnecessary human questions?
- Does the spec use the tech-spec-template correctly?
- Does it meet the Ready for Development standard (actionable, logical, testable, complete, self-contained)?
- Does checkpoint 1 present a useful summary?
- Do approve/edit/full-BMM options work?
- After approve, are frozen sections actually respected downstream?
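
The frozen-section case in particular is mechanically verifiable: hash each frozen block before and after downstream steps and require identity. A sketch, assuming hypothetical `<!-- frozen:name -->` markers; the real tech-spec-template's marker syntax may differ:

```python
import hashlib
import re

def frozen_sections(spec_text):
    """Map frozen-section name -> content hash.

    Assumes hypothetical `<!-- frozen:NAME -->...<!-- /frozen -->` markers;
    the real template's syntax may differ.
    """
    pattern = r"<!-- frozen:(\S+) -->(.*?)<!-- /frozen -->"
    return {name: hashlib.sha256(body.encode()).hexdigest()
            for name, body in re.findall(pattern, spec_text, re.S)}
```

Comparing `frozen_sections` of the approved spec against every later revision turns "respected downstream" into a pass/fail check rather than a judgment call.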

## Output

Pass/fail per test case. Failures feed back into task 06.

# Task 08: Eval Step 2 — Plan Efficiency

## Prerequisite

Task 07 test cycle clean.

## Intent

Evaluate spec generation quality and efficiency.

## Metrics

- Spec quality score (subjective: would you trust a fresh agent to implement from this alone?)
- Investigation depth vs time spent
- Tokens consumed in step 2
- Compare against QS (old quick-spec) output quality

## Output

Eval report: `_experiment/results/step-02-eval.md`.

# Task 09: Tighten Step 3 — Implement

## Prerequisite

Step 2 test/eval cycle complete. Findings indicate step 3 needs tightening.

## Intent

Add specificity to step-03-implement.md. Reference `_experiment/planning/redesign-plan.md` Step 3 section.

## Likely areas

- Task sharding mechanics (does it create task files and the sequence file correctly?)
- Baseline commit capture
- Branch creation and idempotent reuse
- Clean tree assertion and resume policy
- Sequential execution discipline
- Commit message quality

## Constraint

Minimum words. The sharding structure is in the plumbing (frontmatter paths). Only add instructions where the agent demonstrably fails.

# Task 10: Test Step 3 — Implement

## Prerequisite

Task 09 tightening applied.

## Intent

Verify implementation mechanics work correctly.

## Test cases

- Task files created in correct location with correct format
- Sequence file tracks status on disk (not just in memory)
- Tasks execute sequentially (no parallel)
- Each task file read fresh before execution
- Feature branch created with correct naming
- Commit produced with conventional message
- No push or remote operations
- Resume: kill mid-task, restart, verify completed tasks skipped
- Dirty tree: verify halt on fresh start, resume policy on restart
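
The resume cases above hinge on status living on disk, not in the agent's context. A minimal sketch of that discipline, assuming a hypothetical one-line-per-task `<file> <status>` sequence format; the real QD2 plumbing may store this differently:

```python
from pathlib import Path

def pending_tasks(seq_path):
    """Tasks not yet marked done; a restart executes only these, skipping completed work."""
    rows = [line.split() for line in Path(seq_path).read_text().splitlines() if line.strip()]
    return [name for name, status in rows if status != "done"]

def mark_done(seq_path, task):
    """Persist completion to disk immediately, so a mid-task crash never loses progress."""
    path = Path(seq_path)
    rows = [line.split() for line in path.read_text().splitlines() if line.strip()]
    path.write_text("\n".join(
        f"{name} {'done' if name == task else status}" for name, status in rows) + "\n")
```

The kill/restart test case then reduces to: after restart, `pending_tasks` must exclude everything marked done before the kill.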

## Output

Pass/fail per test case. Failures feed back into task 09.

# Task 11: Eval Step 3 — Implementation Efficiency

## Prerequisite

Task 10 test cycle clean.

## Intent

Evaluate implementation quality and efficiency.

## Metrics

- Code quality (does it follow project patterns?)
- AC coverage (are acceptance criteria actually verified?)
- Tokens consumed in step 3
- Time to implement vs QD baseline
- Task sharding overhead (does it add value or just ceremony?)

## Output

Eval report: `_experiment/results/step-03-eval.md`.

# Task 12: Tighten Step 4 — Review

## Prerequisite

Step 3 test/eval cycle complete. Findings indicate step 4 needs tightening.

## Intent

Add specificity to step-04-review.md. This is the highest-risk step. Reference `_experiment/planning/redesign-plan.md` Step 4 section.

## Likely areas

- Context isolation for review subagents (does it actually strip context?)
- Layer 1 vs Layer 2 separation
- Classification cascade (intent > spec > patch)
- INTENT_GAP two-question test
- Spec loop mechanics (frozen sections, change log ratchet, positive preservation)
- Iteration cap enforcement
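
The cascade itself is just a priority order; what tightening and testing must establish is whether the agent applies it consistently. For reference, a sketch of the intended semantics, with class labels assumed from the plan's `intent > spec > patch` ordering:

```python
# Highest-priority class wins; anything unmatched falls through to defer.
CASCADE = ["intent", "spec", "patch", "defer"]

def classify(finding_classes):
    """Pick the single class for a finding flagged with one or more candidate classes."""
    for cls in CASCADE:
        if cls in finding_classes:
            return cls
    return "defer"
```

So a finding flagged both spec-class and patch-class must be treated as spec-class, which is exactly the behavior task 13 checks.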

## Constraint

This step has the most detailed plan material. Resist the urge to dump it all in. Add only what testing proves is needed.

# Task 13: Test Step 4 — Review

## Prerequisite

Task 12 tightening applied.

## Intent

Verify the review and classification system works.

## Test cases

- Diff constructed correctly from baseline
- Layer 1 runs in a context-free subagent (plan-code-review route)
- Layer 1 skipped for one-shot route
- Layer 2 (adversarial review) runs for all routes, context-free
- One-shot Layer 2 receives user prompt alongside diff
- Findings classified using priority cascade
- Spec-class findings trigger spec amendment and re-derive
- Spec loop respects iteration cap
- Change log populated correctly
- Positive preservation: KEEP instructions extracted and carried forward
- Patch-class findings auto-fixed and committed
- Defer-class findings written to wip.md

## Output

Pass/fail per test case. Failures feed back into task 12.

# Task 14: Eval Step 4 — Review Efficiency

## Prerequisite

Task 13 test cycle clean.

## Intent

Evaluate review quality and efficiency.

## Metrics

- Finding quality (real issues vs noise)
- Classification accuracy (spec vs patch vs defer — are they right?)
- Spec loop iterations used vs cap
- Tokens consumed in review (especially spec loop cost)
- Compare review quality against QD step-05 adversarial review baseline
- False positive rate (reject-class findings as % of total)

## Output

Eval report: `_experiment/results/step-04-eval.md`.

# Task 15: Tighten Step 5 — Present

## Prerequisite

Step 4 test/eval cycle complete. Findings indicate step 5 needs tightening.

## Intent

Add specificity to step-05-present.md. Reference `_experiment/planning/redesign-plan.md` Step 5 section.

## Likely areas

- Findings presentation order and format
- Low-confidence bucket handling
- Push prohibition enforcement
- PR creation quality (summary derived from spec + findings)
- Approve/edit/reject option handling

## Constraint

Minimum words. This is the simplest step — mostly presentation and plumbing.

# Task 16: Test Step 5 — Present

## Prerequisite

Task 15 tightening applied.

## Intent

Verify the final checkpoint and PR creation work.

## Test cases

- Findings presented in correct priority order
- Low-confidence findings surfaced as primary action item
- Approve/edit/reject options work
- Push command printed, never auto-executed
- PR created with meaningful summary after human confirms push
- Deferred items documented in wip.md

## Output

Pass/fail per test case. Failures feed back into task 15.

# Task 17: Eval Step 5 — Present Efficiency

## Prerequisite

Task 16 test cycle clean.

## Intent

Evaluate the final presentation quality and efficiency.

## Metrics

- Human effort at checkpoint 2 (how many items need manual classification?)
- PR quality (is the summary useful? would you merge from this?)
- Time in step 5 vs total flow
- Compare against QD step-06 resolve-findings baseline

## Output

Eval report: `_experiment/results/step-05-eval.md`.

# Task 18: End-to-End Evaluation

## Prerequisite

All step-level test/eval cycles complete.

## Intent

Run the fully tightened QD2 on a real task end-to-end. Compare holistically against the QD + QS combined baseline.

## Method

1. Pick a representative task (not trivial, not huge — the sweet spot QD2 is designed for).
2. Run QD2 start to finish.
3. If possible, run the same task through QS → QD for comparison.

## Metrics

- Total human turns (north star)
- Total time
- Total tokens
- Output quality (code correctness, spec quality, review thoroughness)
- Number of unnecessary human interactions
- Would you ship this?
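
Once both runs are logged, the quantitative half of this comparison reduces to per-metric deltas against the QS+QD baseline. A trivial sketch, with placeholder metric names:

```python
def compare(qd2, baseline):
    """Per-metric delta over shared keys; negative means QD2 used less,
    which is better for turns, time, and tokens."""
    return {k: qd2[k] - baseline[k] for k in sorted(qd2.keys() & baseline.keys())}
```

The qualitative metrics (output quality, "would you ship this?") still need human judgment; the deltas just anchor the go/no-go discussion in numbers.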

## Output

Final eval report: `_experiment/results/end-to-end-eval.md` with go/no-go recommendation for promoting QD2 to replace QS+QD.