From 4dd5a0c871ea868b3bb118b2114f5a702d465a98 Mon Sep 17 00:00:00 2001
From: Alex Verkhovsky
Date: Sun, 22 Feb 2026 13:02:06 -0700
Subject: [PATCH] docs: add roadmap task files for QD2 experiment

18 tasks: skeleton test/eval, then per-step tighten/test/eval cycles, plus end-to-end eval.
Each task file is a self-contained intent expression that can be fed to QD2.
---
 _experiment/planning/roadmap.md | 44 -------------------
 .../planning/roadmap/task-01-test-skeleton.md | 24 ++++++++++
 .../planning/roadmap/task-02-eval-skeleton.md | 24 ++++++++++
 .../task-03-tighten-step-01-routing.md | 20 +++++++++
 .../roadmap/task-04-test-step-01-routing.md | 22 ++++++++++
 .../roadmap/task-05-eval-step-01-routing.md | 20 +++++++++
 .../roadmap/task-06-tighten-step-02-plan.md | 20 +++++++++
 .../roadmap/task-07-test-step-02-plan.md | 22 ++++++++++
 .../roadmap/task-08-eval-step-02-plan.md | 20 +++++++++
 .../task-09-tighten-step-03-implement.md | 22 ++++++++++
 .../roadmap/task-10-test-step-03-implement.md | 25 +++++++++++
 .../roadmap/task-11-eval-step-03-implement.md | 21 +++++++++
 .../roadmap/task-12-tighten-step-04-review.md | 22 ++++++++++
 .../roadmap/task-13-test-step-04-review.md | 28 ++++++++++++
 .../roadmap/task-14-eval-step-04-review.md | 22 ++++++++++
 .../task-15-tighten-step-05-present.md | 21 +++++++++
 .../roadmap/task-16-test-step-05-present.md | 22 ++++++++++
 .../roadmap/task-17-eval-step-05-present.md | 20 +++++++++
 .../roadmap/task-18-eval-end-to-end.md | 28 ++++++++++++
 19 files changed, 403 insertions(+), 44 deletions(-)
 delete mode 100644 _experiment/planning/roadmap.md
 create mode 100644 _experiment/planning/roadmap/task-01-test-skeleton.md
 create mode 100644 _experiment/planning/roadmap/task-02-eval-skeleton.md
 create mode 100644 _experiment/planning/roadmap/task-03-tighten-step-01-routing.md
 create mode 100644 _experiment/planning/roadmap/task-04-test-step-01-routing.md
 create mode 100644 _experiment/planning/roadmap/task-05-eval-step-01-routing.md
 create mode 100644 _experiment/planning/roadmap/task-06-tighten-step-02-plan.md
 create mode 100644 _experiment/planning/roadmap/task-07-test-step-02-plan.md
 create mode 100644 _experiment/planning/roadmap/task-08-eval-step-02-plan.md
 create mode 100644 _experiment/planning/roadmap/task-09-tighten-step-03-implement.md
 create mode 100644 _experiment/planning/roadmap/task-10-test-step-03-implement.md
 create mode 100644 _experiment/planning/roadmap/task-11-eval-step-03-implement.md
 create mode 100644 _experiment/planning/roadmap/task-12-tighten-step-04-review.md
 create mode 100644 _experiment/planning/roadmap/task-13-test-step-04-review.md
 create mode 100644 _experiment/planning/roadmap/task-14-eval-step-04-review.md
 create mode 100644 _experiment/planning/roadmap/task-15-tighten-step-05-present.md
 create mode 100644 _experiment/planning/roadmap/task-16-test-step-05-present.md
 create mode 100644 _experiment/planning/roadmap/task-17-eval-step-05-present.md
 create mode 100644 _experiment/planning/roadmap/task-18-eval-end-to-end.md

diff --git a/_experiment/planning/roadmap.md b/_experiment/planning/roadmap.md
deleted file mode 100644
index af076b06b..000000000
--- a/_experiment/planning/roadmap.md
+++ /dev/null
@@ -1,44 +0,0 @@
-# Quick-Flow Redesign — Implementation Roadmap
-
-## Strategy: Skeleton-First with Real Plumbing
-
-Build the full BMM sharded workflow infrastructure with minimal step prompts. Run it. See where training falls short. Tighten only where it demonstrably fails.
-
-## Phase 1: Working Skeleton (Current)
-
-Full BMM plumbing, thin prompts.
-
-### Create
-```
-src/bmm/workflows/bmad-quick-flow/
-  workflow.md                      # Real entry point — config loading, path resolution, step-file architecture rules
-  tech-spec-template.md            # Real template — frozen sections, change log, golden examples
-  steps/
-    step-01-clarify-and-route.md   # Slightly more flesh (routing criteria matter)
-    step-02-plan.md                # Thin prompt, real plumbing
-    step-03-implement.md           # Thin prompt, real plumbing (task sharding structure)
-    step-04-review.md              # Thin prompt, real plumbing
-    step-05-present.md             # Thin prompt, real plumbing
-```
-
-### Modify
-- `src/bmm/agents/quick-flow-solo-dev.agent.yaml` — single QF trigger
-- `src/bmm/module-help.csv` — replace QS/QD with QF
-
-### Delete
-- `src/bmm/workflows/bmad-quick-flow/quick-spec/` (entire directory)
-- `src/bmm/workflows/bmad-quick-flow/quick-dev/` (entire directory)
-
-### Test
-Run the skeleton on a real task. Observe where it works, where it breaks.
-
-## Phase 2+: Iterative Tightening
-
-Add specificity to step prompts only where Phase 1 testing reveals gaps. The detailed plan (`quick-flow-redesign-plan.md`) is the reference spec — pull from it as needed, don't front-load it all.
-
-Candidates for tightening (in likely priority order based on complexity):
-- Step 1 routing criteria (one-shot vs plan-code-review vs full-BMM)
-- Step 4 review layers and classification cascade
-- Step 3 crash recovery and resume logic
-- Step 4 spec loop oscillation mitigations (frozen sections, guardrails ratchet, positive preservation)
-- Step 5 findings presentation and PR creation
diff --git a/_experiment/planning/roadmap/task-01-test-skeleton.md b/_experiment/planning/roadmap/task-01-test-skeleton.md
new file mode 100644
index 000000000..ba7347565
--- /dev/null
+++ b/_experiment/planning/roadmap/task-01-test-skeleton.md
@@ -0,0 +1,24 @@
+# Task 01: Test Bare Skeleton
+
+## Intent
+
+Run QD2 as-is (bare one-liner prompts + BMM plumbing) on a real small task. Document what works and what breaks.
+
+## Method
+
+1. Pick a small real task in the BMAD-METHOD repo (or a test project).
+2. Invoke QD2 via the QD2 trigger.
+3. Let it run through all 5 steps without intervention (except at the two checkpoints).
+4. Record observations per step:
+   - Did it follow the plumbing? (config loading, step transitions, NEXT directives)
+   - Did it produce reasonable output from training alone?
+   - Where did it go off the rails or get stuck?
+   - What questions did it ask that it shouldn't have?
+   - What did it fail to do that it should have?
+
+## Output
+
+A findings document: `_experiment/results/skeleton-test-findings.md` with per-step observations classified as:
+- **Works** — training handled it fine, no tightening needed
+- **Gap** — specific behavior missing or wrong, needs prompt tightening
+- **Plumbing** — structural issue with the BMM infrastructure itself
diff --git a/_experiment/planning/roadmap/task-02-eval-skeleton.md b/_experiment/planning/roadmap/task-02-eval-skeleton.md
new file mode 100644
index 000000000..a7a4d185e
--- /dev/null
+++ b/_experiment/planning/roadmap/task-02-eval-skeleton.md
@@ -0,0 +1,24 @@
+# Task 02: Eval Bare Skeleton
+
+## Prerequisite
+
+Task 01 test cycle is clean (all gaps and plumbing issues resolved).
+
+## Intent
+
+Evaluate the bare skeleton's efficiency against the existing QD workflow as baseline.
+
+## Method
+
+1. Run the same task through both QD (old) and QD2 (skeleton) if possible, or compare against a recent QD session log.
+2. Measure:
+   - Total human turns (the north star metric)
+   - Total agent turns / API round-trips
+   - Approximate token usage (context window utilization)
+   - Time to completion
+   - Quality of output (subjective: did it produce what was asked for?)
+3. Note where QD2 is better, where it's worse, where it's equivalent.
+
+## Output
+
+An eval report: `_experiment/results/skeleton-eval-report.md` with metrics comparison and recommendations for which steps need tightening first (prioritized by impact on the north star metric).
diff --git a/_experiment/planning/roadmap/task-03-tighten-step-01-routing.md b/_experiment/planning/roadmap/task-03-tighten-step-01-routing.md
new file mode 100644
index 000000000..81b462f45
--- /dev/null
+++ b/_experiment/planning/roadmap/task-03-tighten-step-01-routing.md
@@ -0,0 +1,20 @@
+# Task 03: Tighten Step 1 — Routing
+
+## Prerequisite
+
+Task 01/02 findings indicate step 1 needs tightening.
+
+## Intent
+
+Add specificity to step-01-clarify-and-route.md only where skeleton testing revealed gaps. Reference the detailed plan in `_experiment/planning/redesign-plan.md` for the full spec of what step 1 should do — pull only what's needed.
+
+## Likely areas (from the plan)
+
+- Routing criteria precision (one-shot vs plan-code-review vs full-BMM boundaries)
+- Intent exit criteria enforcement
+- WIP/ready-for-dev artifact detection
+- Project context backfill behavior
+
+## Constraint
+
+Keep prompts thin. Add the minimum words needed to close the gap. If the LLM was already doing it right from training, don't add instructions for it.
diff --git a/_experiment/planning/roadmap/task-04-test-step-01-routing.md b/_experiment/planning/roadmap/task-04-test-step-01-routing.md
new file mode 100644
index 000000000..e075da18a
--- /dev/null
+++ b/_experiment/planning/roadmap/task-04-test-step-01-routing.md
@@ -0,0 +1,22 @@
+# Task 04: Test Step 1 — Routing
+
+## Prerequisite
+
+Task 03 tightening applied.
+
+## Intent
+
+Verify step 1 routing works correctly across different request types.
+
+## Test cases
+
+- Trivial request (e.g., "fix a typo in README") → should route one-shot
+- Normal feature request → should route plan-code-review
+- Large/ambiguous request → should route full BMM
+- Ambiguous between one-shot and plan → should default to plan-code-review
+- Existing ready-for-dev spec → should skip to step 3
+- Existing WIP file → should offer resume or archive
+
+## Output
+
+Per-test-case pass/fail with observations. Any failures feed back into task 03 (re-tighten).
diff --git a/_experiment/planning/roadmap/task-05-eval-step-01-routing.md b/_experiment/planning/roadmap/task-05-eval-step-01-routing.md
new file mode 100644
index 000000000..44db2282c
--- /dev/null
+++ b/_experiment/planning/roadmap/task-05-eval-step-01-routing.md
@@ -0,0 +1,20 @@
+# Task 05: Eval Step 1 — Routing Efficiency
+
+## Prerequisite
+
+Task 04 test cycle is clean.
+
+## Intent
+
+Evaluate step 1 efficiency. How many turns does it take to capture intent and route?
+
+## Metrics
+
+- Human turns to reach a routing decision
+- Unnecessary questions asked (things it could have figured out from codebase investigation)
+- Time spent in step 1 vs total flow time
+- Compare against QD step-01-mode-detection baseline
+
+## Output
+
+Eval report appended to `_experiment/results/step-01-eval.md`.
diff --git a/_experiment/planning/roadmap/task-06-tighten-step-02-plan.md b/_experiment/planning/roadmap/task-06-tighten-step-02-plan.md
new file mode 100644
index 000000000..8d303c7da
--- /dev/null
+++ b/_experiment/planning/roadmap/task-06-tighten-step-02-plan.md
@@ -0,0 +1,20 @@
+# Task 06: Tighten Step 2 — Plan
+
+## Prerequisite
+
+Step 1 test/eval cycle complete. Task 01/02 findings indicate step 2 needs tightening.
+
+## Intent
+
+Add specificity to step-02-plan.md where skeleton testing revealed gaps. Reference `_experiment/planning/redesign-plan.md` Step 2 section.
+
+## Likely areas
+
+- Spec quality (does it meet Ready for Development standard without being told how?)
+- Checkpoint 1 presentation format
+- Frozen section enforcement after approval
+- Investigation depth (does it use subagents? does it ask the human things it should figure out itself?)
+
+## Constraint
+
+Minimum words to close the gap. Don't over-specify what training already handles.
diff --git a/_experiment/planning/roadmap/task-07-test-step-02-plan.md b/_experiment/planning/roadmap/task-07-test-step-02-plan.md
new file mode 100644
index 000000000..0eb36cc07
--- /dev/null
+++ b/_experiment/planning/roadmap/task-07-test-step-02-plan.md
@@ -0,0 +1,22 @@
+# Task 07: Test Step 2 — Plan
+
+## Prerequisite
+
+Task 06 tightening applied.
+
+## Intent
+
+Verify step 2 produces a usable spec and the checkpoint works correctly.
+
+## Test cases
+
+- Does investigation happen autonomously without unnecessary human questions?
+- Does the spec use the tech-spec-template correctly?
+- Does it meet Ready for Development standard (actionable, logical, testable, complete, self-contained)?
+- Does checkpoint 1 present a useful summary?
+- Do approve/edit/full-BMM options work?
+- After approve, are frozen sections actually respected downstream?
+
+## Output
+
+Pass/fail per test case. Failures feed back into task 06.
diff --git a/_experiment/planning/roadmap/task-08-eval-step-02-plan.md b/_experiment/planning/roadmap/task-08-eval-step-02-plan.md
new file mode 100644
index 000000000..138b5da81
--- /dev/null
+++ b/_experiment/planning/roadmap/task-08-eval-step-02-plan.md
@@ -0,0 +1,20 @@
+# Task 08: Eval Step 2 — Plan Efficiency
+
+## Prerequisite
+
+Task 07 test cycle clean.
+
+## Intent
+
+Evaluate spec generation quality and efficiency.
+
+## Metrics
+
+- Spec quality score (subjective: would you trust a fresh agent to implement from this alone?)
+- Investigation depth vs time spent
+- Tokens consumed in step 2
+- Compare against QS (old quick-spec) output quality
+
+## Output
+
+Eval report: `_experiment/results/step-02-eval.md`.
diff --git a/_experiment/planning/roadmap/task-09-tighten-step-03-implement.md b/_experiment/planning/roadmap/task-09-tighten-step-03-implement.md
new file mode 100644
index 000000000..d04fe66f3
--- /dev/null
+++ b/_experiment/planning/roadmap/task-09-tighten-step-03-implement.md
@@ -0,0 +1,22 @@
+# Task 09: Tighten Step 3 — Implement
+
+## Prerequisite
+
+Step 2 test/eval cycle complete. Findings indicate step 3 needs tightening.
+
+## Intent
+
+Add specificity to step-03-implement.md. Reference `_experiment/planning/redesign-plan.md` Step 3 section.
+
+## Likely areas
+
+- Task sharding mechanics (does it create task files and sequence file correctly?)
+- Baseline commit capture
+- Branch creation and idempotent reuse
+- Clean tree assertion and resume policy
+- Sequential execution discipline
+- Commit message quality
+
+## Constraint
+
+Minimum words. The sharding structure is in the plumbing (frontmatter paths). Only add instructions where the agent demonstrably fails.
diff --git a/_experiment/planning/roadmap/task-10-test-step-03-implement.md b/_experiment/planning/roadmap/task-10-test-step-03-implement.md
new file mode 100644
index 000000000..2e5c34dbd
--- /dev/null
+++ b/_experiment/planning/roadmap/task-10-test-step-03-implement.md
@@ -0,0 +1,25 @@
+# Task 10: Test Step 3 — Implement
+
+## Prerequisite
+
+Task 09 tightening applied.
+
+## Intent
+
+Verify implementation mechanics work correctly.
+
+## Test cases
+
+- Task files created in correct location with correct format
+- Sequence file tracks status on disk (not just in memory)
+- Tasks execute sequentially (no parallel)
+- Each task file read fresh before execution
+- Feature branch created with correct naming
+- Commit produced with conventional message
+- No push or remote operations
+- Resume: kill mid-task, restart, verify completed tasks skipped
+- Dirty tree: verify halt on fresh start, resume policy on restart
+
+## Output
+
+Pass/fail per test case. Failures feed back into task 09.
diff --git a/_experiment/planning/roadmap/task-11-eval-step-03-implement.md b/_experiment/planning/roadmap/task-11-eval-step-03-implement.md
new file mode 100644
index 000000000..f96c08feb
--- /dev/null
+++ b/_experiment/planning/roadmap/task-11-eval-step-03-implement.md
@@ -0,0 +1,21 @@
+# Task 11: Eval Step 3 — Implementation Efficiency
+
+## Prerequisite
+
+Task 10 test cycle clean.
+
+## Intent
+
+Evaluate implementation quality and efficiency.
+
+## Metrics
+
+- Code quality (does it follow project patterns?)
+- AC coverage (are acceptance criteria actually verified?)
+- Tokens consumed in step 3
+- Time to implement vs QD baseline
+- Task sharding overhead (does it add value or just ceremony?)
+
+## Output
+
+Eval report: `_experiment/results/step-03-eval.md`.
diff --git a/_experiment/planning/roadmap/task-12-tighten-step-04-review.md b/_experiment/planning/roadmap/task-12-tighten-step-04-review.md
new file mode 100644
index 000000000..de35bb98e
--- /dev/null
+++ b/_experiment/planning/roadmap/task-12-tighten-step-04-review.md
@@ -0,0 +1,22 @@
+# Task 12: Tighten Step 4 — Review
+
+## Prerequisite
+
+Step 3 test/eval cycle complete. Findings indicate step 4 needs tightening.
+
+## Intent
+
+Add specificity to step-04-review.md. This is the highest-risk step. Reference `_experiment/planning/redesign-plan.md` Step 4 section.
+
+## Likely areas
+
+- Context isolation for review subagents (does it actually strip context?)
+- Layer 1 vs Layer 2 separation
+- Classification cascade (intent > spec > patch)
+- INTENT_GAP two-question test
+- Spec loop mechanics (frozen sections, change log ratchet, positive preservation)
+- Iteration cap enforcement
+
+## Constraint
+
+This step has the most detailed plan material. Resist the urge to dump it all in. Add only what testing proves is needed.
diff --git a/_experiment/planning/roadmap/task-13-test-step-04-review.md b/_experiment/planning/roadmap/task-13-test-step-04-review.md
new file mode 100644
index 000000000..0e63cc802
--- /dev/null
+++ b/_experiment/planning/roadmap/task-13-test-step-04-review.md
@@ -0,0 +1,28 @@
+# Task 13: Test Step 4 — Review
+
+## Prerequisite
+
+Task 12 tightening applied.
+
+## Intent
+
+Verify the review and classification system works.
+
+## Test cases
+
+- Diff constructed correctly from baseline
+- Layer 1 runs in context-free subagent (plan-code-review route)
+- Layer 1 skipped for one-shot route
+- Layer 2 (adversarial review) runs for all routes, context-free
+- One-shot Layer 2 receives user prompt alongside diff
+- Findings classified using priority cascade
+- Spec-class findings trigger spec amendment and re-derive
+- Spec loop respects iteration cap
+- Change log populated correctly
+- Positive preservation: KEEP instructions extracted and carried forward
+- Patch-class findings auto-fixed and committed
+- Defer-class findings written to wip.md
+
+## Output
+
+Pass/fail per test case. Failures feed back into task 12.
diff --git a/_experiment/planning/roadmap/task-14-eval-step-04-review.md b/_experiment/planning/roadmap/task-14-eval-step-04-review.md
new file mode 100644
index 000000000..29d0a3ae8
--- /dev/null
+++ b/_experiment/planning/roadmap/task-14-eval-step-04-review.md
@@ -0,0 +1,22 @@
+# Task 14: Eval Step 4 — Review Efficiency
+
+## Prerequisite
+
+Task 13 test cycle clean.
+
+## Intent
+
+Evaluate review quality and efficiency.
+
+## Metrics
+
+- Finding quality (real issues vs noise)
+- Classification accuracy (spec vs patch vs defer — are they right?)
+- Spec loop iterations used vs cap
+- Tokens consumed in review (especially spec loop cost)
+- Compare review quality against QD step-05 adversarial review baseline
+- False positive rate (reject-class findings as % of total)
+
+## Output
+
+Eval report: `_experiment/results/step-04-eval.md`.
diff --git a/_experiment/planning/roadmap/task-15-tighten-step-05-present.md b/_experiment/planning/roadmap/task-15-tighten-step-05-present.md
new file mode 100644
index 000000000..b88d5a039
--- /dev/null
+++ b/_experiment/planning/roadmap/task-15-tighten-step-05-present.md
@@ -0,0 +1,21 @@
+# Task 15: Tighten Step 5 — Present
+
+## Prerequisite
+
+Step 4 test/eval cycle complete. Findings indicate step 5 needs tightening.
+
+## Intent
+
+Add specificity to step-05-present.md. Reference `_experiment/planning/redesign-plan.md` Step 5 section.
+
+## Likely areas
+
+- Findings presentation order and format
+- Low-confidence bucket handling
+- Push prohibition enforcement
+- PR creation quality (summary derived from spec + findings)
+- Approve/edit/reject option handling
+
+## Constraint
+
+Minimum words. This is the simplest step — mostly presentation and plumbing.
diff --git a/_experiment/planning/roadmap/task-16-test-step-05-present.md b/_experiment/planning/roadmap/task-16-test-step-05-present.md
new file mode 100644
index 000000000..fd4431cf4
--- /dev/null
+++ b/_experiment/planning/roadmap/task-16-test-step-05-present.md
@@ -0,0 +1,22 @@
+# Task 16: Test Step 5 — Present
+
+## Prerequisite
+
+Task 15 tightening applied.
+
+## Intent
+
+Verify the final checkpoint and PR creation work.
+
+## Test cases
+
+- Findings presented in correct priority order
+- Low-confidence findings surfaced as primary action item
+- Approve/edit/reject options work
+- Push command printed, never auto-executed
+- PR created with meaningful summary after human confirms push
+- Deferred items documented in wip.md
+
+## Output
+
+Pass/fail per test case. Failures feed back into task 15.
diff --git a/_experiment/planning/roadmap/task-17-eval-step-05-present.md b/_experiment/planning/roadmap/task-17-eval-step-05-present.md
new file mode 100644
index 000000000..880746810
--- /dev/null
+++ b/_experiment/planning/roadmap/task-17-eval-step-05-present.md
@@ -0,0 +1,20 @@
+# Task 17: Eval Step 5 — Present Efficiency
+
+## Prerequisite
+
+Task 16 test cycle clean.
+
+## Intent
+
+Evaluate the final presentation quality and efficiency.
+
+## Metrics
+
+- Human effort at checkpoint 2 (how many items need manual classification?)
+- PR quality (is the summary useful? would you merge from this?)
+- Time in step 5 vs total flow
+- Compare against QD step-06 resolve-findings baseline
+
+## Output
+
+Eval report: `_experiment/results/step-05-eval.md`.
diff --git a/_experiment/planning/roadmap/task-18-eval-end-to-end.md b/_experiment/planning/roadmap/task-18-eval-end-to-end.md
new file mode 100644
index 000000000..4b8571753
--- /dev/null
+++ b/_experiment/planning/roadmap/task-18-eval-end-to-end.md
@@ -0,0 +1,28 @@
+# Task 18: End-to-End Evaluation
+
+## Prerequisite
+
+All step-level test/eval cycles complete.
+
+## Intent
+
+Run the fully tightened QD2 on a real task end-to-end. Compare holistically against QD + QS combined baseline.
+
+## Method
+
+1. Pick a representative task (not trivial, not huge — the sweet spot QD2 is designed for).
+2. Run QD2 start to finish.
+3. If possible, run the same task through QS → QD for comparison.
+
+## Metrics
+
+- Total human turns (north star)
+- Total time
+- Total tokens
+- Output quality (code correctness, spec quality, review thoroughness)
+- Number of unnecessary human interactions
+- Would you ship this?
+
+## Output
+
+Final eval report: `_experiment/results/end-to-end-eval.md` with go/no-go recommendation for promoting QD2 to replace QS+QD.