docs: add roadmap task files for QD2 experiment

18 tasks: skeleton test/eval, then per-step tighten/test/eval
cycles, plus end-to-end eval. Each task file is a self-contained
intent expression that can be fed to QD2.
Alex Verkhovsky 2026-02-22 13:02:06 -07:00
parent c2da172988
commit 4dd5a0c871
19 changed files with 403 additions and 44 deletions

View File

@ -1,44 +0,0 @@
# Quick-Flow Redesign — Implementation Roadmap
## Strategy: Skeleton-First with Real Plumbing
Build the full BMM sharded workflow infrastructure with minimal step prompts. Run it. See where training falls short. Tighten only where it demonstrably fails.
## Phase 1: Working Skeleton (Current)
Full BMM plumbing, thin prompts.
### Create
```
src/bmm/workflows/bmad-quick-flow/
workflow.md # Real entry point — config loading, path resolution, step-file architecture rules
tech-spec-template.md # Real template — frozen sections, change log, golden examples
steps/
step-01-clarify-and-route.md # Slightly more flesh (routing criteria matter)
step-02-plan.md # Thin prompt, real plumbing
step-03-implement.md # Thin prompt, real plumbing (task sharding structure)
step-04-review.md # Thin prompt, real plumbing
step-05-present.md # Thin prompt, real plumbing
```
### Modify
- `src/bmm/agents/quick-flow-solo-dev.agent.yaml` — single QF trigger
- `src/bmm/module-help.csv` — replace QS/QD with QF
### Delete
- `src/bmm/workflows/bmad-quick-flow/quick-spec/` (entire directory)
- `src/bmm/workflows/bmad-quick-flow/quick-dev/` (entire directory)
### Test
Run the skeleton on a real task. Observe where it works, where it breaks.
## Phase 2+: Iterative Tightening
Add specificity to step prompts only where Phase 1 testing reveals gaps. The detailed plan (`quick-flow-redesign-plan.md`) is the reference spec — pull from it as needed, don't front-load it all.
Candidates for tightening, in likely priority order (weighted by complexity):
- Step 1 routing criteria (one-shot vs plan-code-review vs full-BMM)
- Step 4 review layers and classification cascade
- Step 3 crash recovery and resume logic
- Step 4 spec loop oscillation mitigations (frozen sections, guardrails ratchet, positive preservation)
- Step 5 findings presentation and PR creation

View File

@ -0,0 +1,24 @@
# Task 01: Test Bare Skeleton
## Intent
Run QD2 as-is (bare one-liner prompts + BMM plumbing) on a real small task. Document what works and what breaks.
## Method
1. Pick a small real task in the BMAD-METHOD repo (or a test project).
2. Invoke QD2 via the QD2 trigger.
3. Let it run through all 5 steps without intervention (except at the two checkpoints).
4. Record observations per step:
- Did it follow the plumbing? (config loading, step transitions, NEXT directives)
- Did it produce reasonable output from training alone?
- Where did it go off the rails or get stuck?
- What questions did it ask that it shouldn't have?
- What did it fail to do that it should have?
## Output
A findings document: `_experiment/results/skeleton-test-findings.md` with per-step observations classified as:
- **Works** — training handled it fine, no tightening needed
- **Gap** — specific behavior missing or wrong, needs prompt tightening
- **Plumbing** — structural issue with the BMM infrastructure itself
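Assuming a structured record helps keep the findings document honest across steps, here is one minimal sketch; the class names mirror the three buckets above, but nothing here is QD2's own format:

```python
from dataclasses import dataclass

# Hypothetical finding record for skeleton-test-findings.md.
# The three class names mirror the buckets defined above.
CLASSES = ("works", "gap", "plumbing")

@dataclass
class Finding:
    step: int   # 1..5, which QD2 step the observation is about
    cls: str    # one of CLASSES
    note: str   # free-text observation

    def __post_init__(self):
        if self.cls not in CLASSES:
            raise ValueError(f"unknown class: {self.cls}")
        if not 1 <= self.step <= 5:
            raise ValueError(f"step out of range: {self.step}")

def needs_tightening(findings):
    """Steps with at least one 'gap' finding, in step order."""
    return sorted({f.step for f in findings if f.cls == "gap"})
```

Only gap-class findings feed the per-step tightening tasks; plumbing-class findings are fixed in the infrastructure itself before Task 02.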

View File

@ -0,0 +1,24 @@
# Task 02: Eval Bare Skeleton
## Prerequisite
Task 01 test cycle is clean (all gaps and plumbing issues resolved).
## Intent
Evaluate the bare skeleton's efficiency against the existing QD workflow as baseline.
## Method
1. Run the same task through both QD (old) and QD2 (skeleton) if possible, or compare against a recent QD session log.
2. Measure:
- Total human turns (the north star metric)
- Total agent turns / API round-trips
- Approximate token usage (context window utilization)
- Time to completion
- Quality of output (subjective: did it produce what was asked for?)
3. Note where QD2 is better, where it's worse, where it's equivalent.
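The comparison in step 3 can be sketched as a per-metric delta. The field names below (`human_turns`, `agent_turns`, `tokens`) are assumptions for illustration, not an existing QD/QD2 session-log format:

```python
# Hypothetical metrics-comparison sketch; field names are assumed.
def compare(baseline: dict, candidate: dict) -> dict:
    """Per-metric deltas (candidate - baseline); negative is better
    for cost metrics such as turns and tokens."""
    return {k: candidate[k] - baseline[k] for k in baseline if k in candidate}

qd  = {"human_turns": 14, "agent_turns": 92,  "tokens": 410_000}
qd2 = {"human_turns": 9,  "agent_turns": 101, "tokens": 455_000}
delta = compare(qd, qd2)
# In this made-up example human_turns drops by 5 even as agent turns
# and tokens rise, which is exactly the trade-off the north star
# metric is meant to surface.
```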
## Output
An eval report: `_experiment/results/skeleton-eval-report.md` with metrics comparison and recommendations for which steps need tightening first (prioritized by impact on the north star metric).

View File

@ -0,0 +1,20 @@
# Task 03: Tighten Step 1 — Routing
## Prerequisite
Task 01/02 findings indicate step 1 needs tightening.
## Intent
Add specificity to step-01-clarify-and-route.md only where skeleton testing revealed gaps. Reference the detailed plan in `_experiment/planning/redesign-plan.md` for the full spec of what step 1 should do — pull only what's needed.
## Likely areas (from the plan)
- Routing criteria precision (one-shot vs plan-code-review vs full-BMM boundaries)
- Intent exit criteria enforcement
- WIP/ready-for-dev artifact detection
- Project context backfill behavior
## Constraint
Keep prompts thin. Add the minimum words needed to close the gap. If the LLM was already doing it right from training, don't add instructions for it.

View File

@ -0,0 +1,22 @@
# Task 04: Test Step 1 — Routing
## Prerequisite
Task 03 tightening applied.
## Intent
Verify step 1 routing works correctly across different request types.
## Test cases
- Trivial request (e.g., "fix a typo in README") → should route one-shot
- Normal feature request → should route plan-code-review
- Large/ambiguous request → should route full BMM
- Ambiguous between one-shot and plan → should default to plan-code-review
- Existing ready-for-dev spec → should skip to step 3
- Existing WIP file → should offer resume or archive
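A table-driven harness is one way to keep these cases repeatable across re-tightening rounds. The sketch below is an assumption, with `route_fn` standing in for actually invoking QD2 step 1; the expected values reuse the route names from the test cases:

```python
# Hypothetical routing test table; request strings are illustrative.
CASES = [
    ("fix a typo in README",              "one-shot"),
    ("add CSV export to the report",      "plan-code-review"),
    ("rearchitect auth across services",  "full-bmm"),
    ("small-ish change, unclear scope",   "plan-code-review"),  # ambiguity defaults to plan
]

def check(route_fn):
    """Return (request, expected, got) for every case route_fn misroutes."""
    return [(req, want, route_fn(req))
            for req, want in CASES
            if route_fn(req) != want]
```

An empty return means the routing cases pass; any entries feed straight back into task 03.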
## Output
Per-test-case pass/fail with observations. Any failures feed back into task 03 (re-tighten).

View File

@ -0,0 +1,20 @@
# Task 05: Eval Step 1 — Routing Efficiency
## Prerequisite
Task 04 test cycle is clean.
## Intent
Evaluate step 1 efficiency. How many turns does it take to capture intent and route?
## Metrics
- Human turns to reach a routing decision
- Unnecessary questions asked (things it could have figured out from codebase investigation)
- Time spent in step 1 vs total flow time
- Compare against QD step-01-mode-detection baseline
## Output
Eval report appended to `_experiment/results/step-01-eval.md`.

View File

@ -0,0 +1,20 @@
# Task 06: Tighten Step 2 — Plan
## Prerequisite
Step 1 test/eval cycle complete. Task 01/02 findings indicate step 2 needs tightening.
## Intent
Add specificity to step-02-plan.md where skeleton testing revealed gaps. Reference `_experiment/planning/redesign-plan.md` Step 2 section.
## Likely areas
- Spec quality (does it meet Ready for Development standard without being told how?)
- Checkpoint 1 presentation format
- Frozen section enforcement after approval
- Investigation depth (does it use subagents? does it ask the human things it should figure out itself?)
## Constraint
Minimum words to close the gap. Don't over-specify what training already handles.

View File

@ -0,0 +1,22 @@
# Task 07: Test Step 2 — Plan
## Prerequisite
Task 06 tightening applied.
## Intent
Verify step 2 produces a usable spec and the checkpoint works correctly.
## Test cases
- Does investigation happen autonomously without unnecessary human questions?
- Does the spec use the tech-spec-template correctly?
- Does it meet Ready for Development standard (actionable, logical, testable, complete, self-contained)?
- Does checkpoint 1 present a useful summary?
- Do approve/edit/full-BMM options work?
- After approve, are frozen sections actually respected downstream?
## Output
Pass/fail per test case. Failures feed back into task 06.

View File

@ -0,0 +1,20 @@
# Task 08: Eval Step 2 — Plan Efficiency
## Prerequisite
Task 07 test cycle clean.
## Intent
Evaluate spec generation quality and efficiency.
## Metrics
- Spec quality score (subjective: would you trust a fresh agent to implement from this alone?)
- Investigation depth vs time spent
- Tokens consumed in step 2
- Compare against QS (old quick-spec) output quality
## Output
Eval report: `_experiment/results/step-02-eval.md`.

View File

@ -0,0 +1,22 @@
# Task 09: Tighten Step 3 — Implement
## Prerequisite
Step 2 test/eval cycle complete. Findings indicate step 3 needs tightening.
## Intent
Add specificity to step-03-implement.md. Reference `_experiment/planning/redesign-plan.md` Step 3 section.
## Likely areas
- Task sharding mechanics (does it create task files and sequence file correctly?)
- Baseline commit capture
- Branch creation and idempotent reuse
- Clean tree assertion and resume policy
- Sequential execution discipline
- Commit message quality
## Constraint
Minimum words. The sharding structure is in the plumbing (frontmatter paths). Only add instructions where the agent demonstrably fails.

View File

@ -0,0 +1,25 @@
# Task 10: Test Step 3 — Implement
## Prerequisite
Task 09 tightening applied.
## Intent
Verify implementation mechanics work correctly.
## Test cases
- Task files created in correct location with correct format
- Sequence file tracks status on disk (not just in memory)
- Tasks execute sequentially (no parallel)
- Each task file read fresh before execution
- Feature branch created with correct naming
- Commit produced with conventional message
- No push or remote operations
- Resume: kill mid-task, restart, verify completed tasks skipped
- Dirty tree: verify halt on fresh start, resume policy on restart
## Output
Pass/fail per test case. Failures feed back into task 09.

View File

@ -0,0 +1,21 @@
# Task 11: Eval Step 3 — Implementation Efficiency
## Prerequisite
Task 10 test cycle clean.
## Intent
Evaluate implementation quality and efficiency.
## Metrics
- Code quality (does it follow project patterns?)
- AC coverage (are acceptance criteria actually verified?)
- Tokens consumed in step 3
- Time to implement vs QD baseline
- Task sharding overhead (does it add value or just ceremony?)
## Output
Eval report: `_experiment/results/step-03-eval.md`.

View File

@ -0,0 +1,22 @@
# Task 12: Tighten Step 4 — Review
## Prerequisite
Step 3 test/eval cycle complete. Findings indicate step 4 needs tightening.
## Intent
Add specificity to step-04-review.md. This is the highest-risk step. Reference `_experiment/planning/redesign-plan.md` Step 4 section.
## Likely areas
- Context isolation for review subagents (does it actually strip context?)
- Layer 1 vs Layer 2 separation
- Classification cascade (intent > spec > patch)
- INTENT_GAP two-question test
- Spec loop mechanics (frozen sections, change log ratchet, positive preservation)
- Iteration cap enforcement
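The cascade itself is simple enough to pin down as a sketch: a finding is classified at the highest level it touches. The level names come from the bullet above; the signal-set input is an assumption about how findings are represented:

```python
# Hypothetical priority cascade: intent > spec > patch.
CASCADE = ("intent", "spec", "patch")

def classify(signals):
    """signals: the set of levels a finding implicates; returns the winner."""
    for level in CASCADE:
        if level in signals:
            return level
    return "defer"  # nothing actionable now: record in wip.md
```

A finding that implicates both the spec and the patch is spec-class, which is what routes it into the spec amendment loop instead of a quick auto-fix.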
## Constraint
This step has the most detailed plan material. Resist the urge to dump it all in. Add only what testing proves is needed.

View File

@ -0,0 +1,28 @@
# Task 13: Test Step 4 — Review
## Prerequisite
Task 12 tightening applied.
## Intent
Verify the review and classification system works.
## Test cases
- Diff constructed correctly from baseline
- Layer 1 runs in context-free subagent (plan-code-review route)
- Layer 1 skipped for one-shot route
- Layer 2 (adversarial review) runs for all routes, context-free
- One-shot Layer 2 receives user prompt alongside diff
- Findings classified using priority cascade
- Spec-class findings trigger spec amendment and re-derive
- Spec loop respects iteration cap
- Change log populated correctly
- Positive preservation: KEEP instructions extracted and carried forward
- Patch-class findings auto-fixed and committed
- Defer-class findings written to wip.md
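The positive-preservation case above can be sketched as a marker extraction pass. The `KEEP:` line convention here is an assumption for illustration, not a documented review-output format:

```python
# Hypothetical KEEP extraction: pull preservation instructions out of a
# review pass so later spec-loop iterations cannot silently regress
# behavior the review explicitly approved.
def extract_keeps(review_text):
    return [line.removeprefix("KEEP:").strip()
            for line in review_text.splitlines()
            if line.startswith("KEEP:")]
```

Carrying the extracted list forward into each spec-loop iteration is what makes the preservation "positive": the next review sees what must not change, not just what must.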
## Output
Pass/fail per test case. Failures feed back into task 12.

View File

@ -0,0 +1,22 @@
# Task 14: Eval Step 4 — Review Efficiency
## Prerequisite
Task 13 test cycle clean.
## Intent
Evaluate review quality and efficiency.
## Metrics
- Finding quality (real issues vs noise)
- Classification accuracy (spec vs patch vs defer — are they right?)
- Spec loop iterations used vs cap
- Tokens consumed in review (especially spec loop cost)
- Compare review quality against QD step-05 adversarial review baseline
- False positive rate (reject-class findings as % of total)
## Output
Eval report: `_experiment/results/step-04-eval.md`.

View File

@ -0,0 +1,21 @@
# Task 15: Tighten Step 5 — Present
## Prerequisite
Step 4 test/eval cycle complete. Findings indicate step 5 needs tightening.
## Intent
Add specificity to step-05-present.md. Reference `_experiment/planning/redesign-plan.md` Step 5 section.
## Likely areas
- Findings presentation order and format
- Low-confidence bucket handling
- Push prohibition enforcement
- PR creation quality (summary derived from spec + findings)
- Approve/edit/reject option handling
## Constraint
Minimum words. This is the simplest step — mostly presentation and plumbing.

View File

@ -0,0 +1,22 @@
# Task 16: Test Step 5 — Present
## Prerequisite
Task 15 tightening applied.
## Intent
Verify the final checkpoint and PR creation work.
## Test cases
- Findings presented in correct priority order
- Low-confidence findings surfaced as primary action item
- Approve/edit/reject options work
- Push command printed, never auto-executed
- PR created with meaningful summary after human confirms push
- Deferred items documented in wip.md
## Output
Pass/fail per test case. Failures feed back into task 15.

View File

@ -0,0 +1,20 @@
# Task 17: Eval Step 5 — Present Efficiency
## Prerequisite
Task 16 test cycle clean.
## Intent
Evaluate the final presentation quality and efficiency.
## Metrics
- Human effort at checkpoint 2 (how many items need manual classification?)
- PR quality (is the summary useful? would you merge from this?)
- Time in step 5 vs total flow
- Compare against QD step-06 resolve-findings baseline
## Output
Eval report: `_experiment/results/step-05-eval.md`.

View File

@ -0,0 +1,28 @@
# Task 18: End-to-End Evaluation
## Prerequisite
All step-level test/eval cycles complete.
## Intent
Run the fully tightened QD2 on a real task end-to-end. Compare holistically against the combined QS + QD baseline.
## Method
1. Pick a representative task (not trivial, not huge — the sweet spot QD2 is designed for).
2. Run QD2 start to finish.
3. If possible, run the same task through QS → QD for comparison.
## Metrics
- Total human turns (north star)
- Total time
- Total tokens
- Output quality (code correctness, spec quality, review thoroughness)
- Number of unnecessary human interactions
- Would you ship this?
## Output
Final eval report: `_experiment/results/end-to-end-eval.md` with go/no-go recommendation for promoting QD2 to replace QS+QD.