docs: add roadmap task files for QD2 experiment
18 tasks: skeleton test/eval, then per-step tighten/test/eval cycles, plus end-to-end eval. Each task file is a self-contained intent expression that can be fed to QD2.
This commit is contained in:
parent c2da172988, commit 4dd5a0c871
# Quick-Flow Redesign — Implementation Roadmap

## Strategy: Skeleton-First with Real Plumbing

Build the full BMM sharded workflow infrastructure with minimal step prompts. Run it. See where training falls short. Tighten only where it demonstrably fails.

## Phase 1: Working Skeleton (Current)

Full BMM plumbing, thin prompts.

### Create

```
src/bmm/workflows/bmad-quick-flow/
  workflow.md                      # Real entry point — config loading, path resolution, step-file architecture rules
  tech-spec-template.md            # Real template — frozen sections, change log, golden examples
  steps/
    step-01-clarify-and-route.md   # Slightly more flesh (routing criteria matter)
    step-02-plan.md                # Thin prompt, real plumbing
    step-03-implement.md           # Thin prompt, real plumbing (task sharding structure)
    step-04-review.md              # Thin prompt, real plumbing
    step-05-present.md             # Thin prompt, real plumbing
```
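
For orientation, the "real plumbing" can be pictured as each step file carrying its paths and transitions in frontmatter. The sketch below is purely illustrative: these key names are assumptions, not the actual BMM step-file schema.

```yaml
# Hypothetical frontmatter for steps/step-03-implement.md.
# Key names are illustrative only; the real BMM schema may differ.
---
step: 3
name: implement
next: steps/step-04-review.md        # NEXT directive target
spec: '{output_dir}/tech-spec.md'    # frozen after checkpoint 1
tasks_dir: '{output_dir}/tasks/'     # task sharding structure
sequence_file: '{output_dir}/task-sequence.md'
---
```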

### Modify

- `src/bmm/agents/quick-flow-solo-dev.agent.yaml` — single QF trigger
- `src/bmm/module-help.csv` — replace QS/QD with QF

### Delete

- `src/bmm/workflows/bmad-quick-flow/quick-spec/` (entire directory)
- `src/bmm/workflows/bmad-quick-flow/quick-dev/` (entire directory)

### Test

Run the skeleton on a real task. Observe where it works, where it breaks.

## Phase 2+: Iterative Tightening

Add specificity to step prompts only where Phase 1 testing reveals gaps. The detailed plan (`quick-flow-redesign-plan.md`) is the reference spec — pull from it as needed, don't front-load it all.

Candidates for tightening (in likely priority order based on complexity):

- Step 1 routing criteria (one-shot vs plan-code-review vs full-BMM)
- Step 4 review layers and classification cascade
- Step 3 crash recovery and resume logic
- Step 4 spec loop oscillation mitigations (frozen sections, guardrails ratchet, positive preservation)
- Step 5 findings presentation and PR creation

# Task 01: Test Bare Skeleton

## Intent

Run QD2 as-is (bare one-liner prompts + BMM plumbing) on a real small task. Document what works and what breaks.

## Method

1. Pick a small real task in the BMAD-METHOD repo (or a test project).
2. Invoke QD2 via the QD2 trigger.
3. Let it run through all 5 steps without intervention (except at the two checkpoints).
4. Record observations per step:
   - Did it follow the plumbing? (config loading, step transitions, NEXT directives)
   - Did it produce reasonable output from training alone?
   - Where did it go off the rails or get stuck?
   - What questions did it ask that it shouldn't have?
   - What did it fail to do that it should have?

## Output

A findings document: `_experiment/results/skeleton-test-findings.md` with per-step observations classified as:

- **Works** — training handled it fine, no tightening needed
- **Gap** — specific behavior missing or wrong, needs prompt tightening
- **Plumbing** — structural issue with the BMM infrastructure itself

# Task 02: Eval Bare Skeleton

## Prerequisite

Task 01 test cycle is clean (all gaps and plumbing issues resolved).

## Intent

Evaluate the bare skeleton's efficiency against the existing QD workflow as a baseline.

## Method

1. Run the same task through both QD (old) and QD2 (skeleton) if possible, or compare against a recent QD session log.
2. Measure:
   - Total human turns (the north star metric)
   - Total agent turns / API round-trips
   - Approximate token usage (context window utilization)
   - Time to completion
   - Quality of output (subjective: did it produce what was asked for?)
3. Note where QD2 is better, where it's worse, and where it's equivalent.
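
The turn and token counts in step 2 can be tallied mechanically once a session log exists. A minimal sketch, assuming a hypothetical log format of `{"role", "tokens"}` turn records; the real QD/QD2 transcript format will differ:

```python
def tally(turns):
    """Count eval metrics over a parsed session log.

    `turns` is a list of {"role": "human" | "agent", "tokens": int} dicts;
    this record shape is an assumption, not the real transcript format.
    """
    return {
        "human_turns": sum(1 for t in turns if t["role"] == "human"),
        "agent_turns": sum(1 for t in turns if t["role"] == "agent"),
        "approx_tokens": sum(t.get("tokens", 0) for t in turns),
    }

# Toy log: two human turns, two agent turns.
log = [{"role": "human", "tokens": 50}, {"role": "agent", "tokens": 900},
       {"role": "human", "tokens": 20}, {"role": "agent", "tokens": 700}]
print(tally(log))
```

Running the same tally over both the QD and QD2 logs gives directly comparable numbers for the table in the eval report.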

## Output

An eval report: `_experiment/results/skeleton-eval-report.md` with metrics comparison and recommendations for which steps need tightening first (prioritized by impact on the north star metric).

# Task 03: Tighten Step 1 — Routing

## Prerequisite

Task 01/02 findings indicate step 1 needs tightening.

## Intent

Add specificity to step-01-clarify-and-route.md only where skeleton testing revealed gaps. Reference the detailed plan in `_experiment/planning/redesign-plan.md` for the full spec of what step 1 should do — pull only what's needed.

## Likely areas (from the plan)

- Routing criteria precision (one-shot vs plan-code-review vs full-BMM boundaries)
- Intent exit criteria enforcement
- WIP/ready-for-dev artifact detection
- Project context backfill behavior

## Constraint

Keep prompts thin. Add the minimum words needed to close the gap. If the LLM was already doing it right from training, don't add instructions for it.

# Task 04: Test Step 1 — Routing

## Prerequisite

Task 03 tightening applied.

## Intent

Verify step 1 routing works correctly across different request types.

## Test cases

- Trivial request (e.g., "fix a typo in README") → should route one-shot
- Normal feature request → should route plan-code-review
- Large/ambiguous request → should route full BMM
- Ambiguous between one-shot and plan → should default to plan-code-review
- Existing ready-for-dev spec → should skip to step 3
- Existing WIP file → should offer resume or archive
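
These cases can be kept as a small regression table so that any re-tightening in task 03 is cheap to re-check. A sketch, where `route` stands in for whatever invokes step 1 and returns the chosen route; the prompts and route labels here are illustrative:

```python
# Expected route per prompt. Labels mirror the test cases above;
# the prompts themselves are made-up examples.
CASES = [
    ("fix a typo in README", "one-shot"),
    ("add pagination to the user list endpoint", "plan-code-review"),
    ("rebuild auth across every service", "full-bmm"),
    ("small-ish change, scope unclear", "plan-code-review"),  # ambiguity defaults to plan
]

def failed_cases(route):
    """Run each prompt through `route`; collect (prompt, wanted, got) mismatches."""
    return [(prompt, want, got)
            for prompt, want in CASES
            if (got := route(prompt)) != want]
```

An empty `failed_cases` result means the routing table passes; each mismatch feeds straight back into task 03.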

## Output

Per-test-case pass/fail with observations. Any failures feed back into task 03 (re-tighten).

# Task 05: Eval Step 1 — Routing Efficiency

## Prerequisite

Task 04 test cycle is clean.

## Intent

Evaluate step 1 efficiency. How many turns does it take to capture intent and route?

## Metrics

- Human turns to reach a routing decision
- Unnecessary questions asked (things it could have figured out from codebase investigation)
- Time spent in step 1 vs total flow time
- Compare against QD step-01-mode-detection baseline

## Output

Eval report appended to `_experiment/results/step-01-eval.md`.

# Task 06: Tighten Step 2 — Plan

## Prerequisite

Step 1 test/eval cycle complete. Task 01/02 findings indicate step 2 needs tightening.

## Intent

Add specificity to step-02-plan.md where skeleton testing revealed gaps. Reference `_experiment/planning/redesign-plan.md` Step 2 section.

## Likely areas

- Spec quality (does it meet the Ready for Development standard without being told how?)
- Checkpoint 1 presentation format
- Frozen section enforcement after approval
- Investigation depth (does it use subagents? does it ask the human things it should figure out itself?)

## Constraint

Minimum words to close the gap. Don't over-specify what training already handles.

# Task 07: Test Step 2 — Plan

## Prerequisite

Task 06 tightening applied.

## Intent

Verify step 2 produces a usable spec and the checkpoint works correctly.

## Test cases

- Does investigation happen autonomously without unnecessary human questions?
- Does the spec use the tech-spec-template correctly?
- Does it meet the Ready for Development standard (actionable, logical, testable, complete, self-contained)?
- Does checkpoint 1 present a useful summary?
- Do approve/edit/full-BMM options work?
- After approve, are frozen sections actually respected downstream?
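
The frozen-section case in particular is mechanically verifiable: hash each frozen block before and after downstream steps and require identity. A sketch, assuming hypothetical `<!-- frozen:name -->` markers; the real tech-spec-template's marker syntax may differ:

```python
import hashlib
import re

def frozen_sections(spec_text):
    """Map frozen-section name -> content hash.

    Assumes hypothetical `<!-- frozen:NAME -->...<!-- /frozen -->` markers;
    the real template's syntax may differ.
    """
    pattern = r"<!-- frozen:(\S+) -->(.*?)<!-- /frozen -->"
    return {name: hashlib.sha256(body.encode()).hexdigest()
            for name, body in re.findall(pattern, spec_text, re.S)}
```

Comparing `frozen_sections` of the approved spec against every later revision turns "respected downstream" into a pass/fail check rather than a judgment call.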

## Output

Pass/fail per test case. Failures feed back into task 06.

# Task 08: Eval Step 2 — Plan Efficiency

## Prerequisite

Task 07 test cycle clean.

## Intent

Evaluate spec generation quality and efficiency.

## Metrics

- Spec quality score (subjective: would you trust a fresh agent to implement from this alone?)
- Investigation depth vs time spent
- Tokens consumed in step 2
- Compare against QS (old quick-spec) output quality

## Output

Eval report: `_experiment/results/step-02-eval.md`.

# Task 09: Tighten Step 3 — Implement

## Prerequisite

Step 2 test/eval cycle complete. Findings indicate step 3 needs tightening.

## Intent

Add specificity to step-03-implement.md. Reference `_experiment/planning/redesign-plan.md` Step 3 section.

## Likely areas

- Task sharding mechanics (does it create task files and the sequence file correctly?)
- Baseline commit capture
- Branch creation and idempotent reuse
- Clean tree assertion and resume policy
- Sequential execution discipline
- Commit message quality

## Constraint

Minimum words. The sharding structure is in the plumbing (frontmatter paths). Only add instructions where the agent demonstrably fails.

# Task 10: Test Step 3 — Implement

## Prerequisite

Task 09 tightening applied.

## Intent

Verify implementation mechanics work correctly.

## Test cases

- Task files created in correct location with correct format
- Sequence file tracks status on disk (not just in memory)
- Tasks execute sequentially (no parallel)
- Each task file read fresh before execution
- Feature branch created with correct naming
- Commit produced with conventional message
- No push or remote operations
- Resume: kill mid-task, restart, verify completed tasks skipped
- Dirty tree: verify halt on fresh start, resume policy on restart
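
The resume cases above hinge on status living on disk, not in the agent's context. A minimal sketch of that discipline, assuming a hypothetical one-line-per-task `<file> <status>` sequence format; the real QD2 plumbing may store this differently:

```python
from pathlib import Path

def pending_tasks(seq_path):
    """Tasks not yet marked done; a restart executes only these, skipping completed work."""
    rows = [line.split() for line in Path(seq_path).read_text().splitlines() if line.strip()]
    return [name for name, status in rows if status != "done"]

def mark_done(seq_path, task):
    """Persist completion to disk immediately, so a mid-task crash never loses progress."""
    path = Path(seq_path)
    rows = [line.split() for line in path.read_text().splitlines() if line.strip()]
    path.write_text("\n".join(
        f"{name} {'done' if name == task else status}" for name, status in rows) + "\n")
```

The kill/restart test case then reduces to: after restart, `pending_tasks` must exclude everything marked done before the kill.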

## Output

Pass/fail per test case. Failures feed back into task 09.

# Task 11: Eval Step 3 — Implementation Efficiency

## Prerequisite

Task 10 test cycle clean.

## Intent

Evaluate implementation quality and efficiency.

## Metrics

- Code quality (does it follow project patterns?)
- AC coverage (are acceptance criteria actually verified?)
- Tokens consumed in step 3
- Time to implement vs QD baseline
- Task sharding overhead (does it add value or just ceremony?)

## Output

Eval report: `_experiment/results/step-03-eval.md`.

# Task 12: Tighten Step 4 — Review

## Prerequisite

Step 3 test/eval cycle complete. Findings indicate step 4 needs tightening.

## Intent

Add specificity to step-04-review.md. This is the highest-risk step. Reference `_experiment/planning/redesign-plan.md` Step 4 section.

## Likely areas

- Context isolation for review subagents (does it actually strip context?)
- Layer 1 vs Layer 2 separation
- Classification cascade (intent > spec > patch)
- INTENT_GAP two-question test
- Spec loop mechanics (frozen sections, change log ratchet, positive preservation)
- Iteration cap enforcement
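
The cascade itself is just a priority order; what tightening and testing must establish is whether the agent applies it consistently. For reference, a sketch of the intended semantics, with class labels assumed from the plan's `intent > spec > patch` ordering:

```python
# Highest-priority class wins; anything unmatched falls through to defer.
CASCADE = ["intent", "spec", "patch", "defer"]

def classify(finding_classes):
    """Pick the single class for a finding flagged with one or more candidate classes."""
    for cls in CASCADE:
        if cls in finding_classes:
            return cls
    return "defer"
```

So a finding flagged both spec-class and patch-class must be treated as spec-class, which is exactly the behavior task 13 checks.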

## Constraint

This step has the most detailed plan material. Resist the urge to dump it all in. Add only what testing proves is needed.

# Task 13: Test Step 4 — Review

## Prerequisite

Task 12 tightening applied.

## Intent

Verify the review and classification system works.

## Test cases

- Diff constructed correctly from baseline
- Layer 1 runs in a context-free subagent (plan-code-review route)
- Layer 1 skipped for one-shot route
- Layer 2 (adversarial review) runs for all routes, context-free
- One-shot Layer 2 receives user prompt alongside diff
- Findings classified using priority cascade
- Spec-class findings trigger spec amendment and re-derive
- Spec loop respects iteration cap
- Change log populated correctly
- Positive preservation: KEEP instructions extracted and carried forward
- Patch-class findings auto-fixed and committed
- Defer-class findings written to wip.md

## Output

Pass/fail per test case. Failures feed back into task 12.

# Task 14: Eval Step 4 — Review Efficiency

## Prerequisite

Task 13 test cycle clean.

## Intent

Evaluate review quality and efficiency.

## Metrics

- Finding quality (real issues vs noise)
- Classification accuracy (spec vs patch vs defer — are they right?)
- Spec loop iterations used vs cap
- Tokens consumed in review (especially spec loop cost)
- Compare review quality against QD step-05 adversarial review baseline
- False positive rate (reject-class findings as % of total)

## Output

Eval report: `_experiment/results/step-04-eval.md`.

# Task 15: Tighten Step 5 — Present

## Prerequisite

Step 4 test/eval cycle complete. Findings indicate step 5 needs tightening.

## Intent

Add specificity to step-05-present.md. Reference `_experiment/planning/redesign-plan.md` Step 5 section.

## Likely areas

- Findings presentation order and format
- Low-confidence bucket handling
- Push prohibition enforcement
- PR creation quality (summary derived from spec + findings)
- Approve/edit/reject option handling

## Constraint

Minimum words. This is the simplest step — mostly presentation and plumbing.

# Task 16: Test Step 5 — Present

## Prerequisite

Task 15 tightening applied.

## Intent

Verify the final checkpoint and PR creation work.

## Test cases

- Findings presented in correct priority order
- Low-confidence findings surfaced as primary action item
- Approve/edit/reject options work
- Push command printed, never auto-executed
- PR created with meaningful summary after human confirms push
- Deferred items documented in wip.md

## Output

Pass/fail per test case. Failures feed back into task 15.

# Task 17: Eval Step 5 — Present Efficiency

## Prerequisite

Task 16 test cycle clean.

## Intent

Evaluate the final presentation quality and efficiency.

## Metrics

- Human effort at checkpoint 2 (how many items need manual classification?)
- PR quality (is the summary useful? would you merge from this?)
- Time in step 5 vs total flow
- Compare against QD step-06 resolve-findings baseline

## Output

Eval report: `_experiment/results/step-05-eval.md`.

# Task 18: End-to-End Evaluation

## Prerequisite

All step-level test/eval cycles complete.

## Intent

Run the fully tightened QD2 on a real task end-to-end. Compare holistically against the QD + QS combined baseline.

## Method

1. Pick a representative task (not trivial, not huge — the sweet spot QD2 is designed for).
2. Run QD2 start to finish.
3. If possible, run the same task through QS → QD for comparison.

## Metrics

- Total human turns (north star)
- Total time
- Total tokens
- Output quality (code correctness, spec quality, review thoroughness)
- Number of unnecessary human interactions
- Would you ship this?
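
Once both runs are logged, the quantitative half of this comparison reduces to per-metric deltas against the QS+QD baseline. A trivial sketch, with placeholder metric names:

```python
def compare(qd2, baseline):
    """Per-metric delta over shared keys; negative means QD2 used less,
    which is better for turns, time, and tokens."""
    return {k: qd2[k] - baseline[k] for k in sorted(qd2.keys() & baseline.keys())}
```

The qualitative metrics (output quality, "would you ship this?") still need human judgment; the deltas just anchor the go/no-go discussion in numbers.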

## Output

Final eval report: `_experiment/results/end-to-end-eval.md` with go/no-go recommendation for promoting QD2 to replace QS+QD.