12 KiB

Raw Blame History

Skeleton Test Findings

Date: 2026-02-22 Runs analyzed: 3 Skeleton version: QD2 bare step files (post 3126b9c4) Plan reference: _experiment/planning/redesign-plan.md

Summary

Three QD2 runs were executed against real BMAD-METHOD tasks: two one-shots and one full plan-code-review. The step-file plumbing (config loading, step transitions, checkpoint halts) worked reliably. The core failures cluster around two themes: (1) one-shot route lacks discipline — agents bypass step architecture, skip review, and fantasize scope; (2) the skeleton step files are too terse to enforce plan-specified behaviors that the LLM doesn't do from training alone.

Per-Step Observations

Step 1 — Clarify and Route

#	Observation	Class	Run
1	Config loading failed — `config.yaml` doesn't exist in the BMAD source repo (it's the source, not an installed project). Had to resolve values manually.	Plumbing	Run 1
2	Step transitions and NEXT directives followed correctly across all runs that used the step architecture.	Works	All
3	Agent fantasized scope expansion: asked to remove 1 human turn, removed 2 without asking. Mentioned but didn't ask the clarifying question. Step file says "scope clear" in exit criteria but doesn't enforce ask-before-expanding.	Execution Gap	Run 1
4	Agent captured intent directionally but mangled specifics: conflated adversarial review with conformance review. No clarifying questions asked. Step file's terse "Clarify intent" wasn't sufficient.	Execution Gap	Run 2
5	Agent bypassed step architecture entirely for one-shot — acted immediately without routing through step files. The implicit one-shot path provides no structural enforcement.	Execution Gap	Run 2
6	Agent initially routed as one-shot when the task required design decisions (investigating sub-bugs, choosing an approach). User challenged; agent re-routed to plan-code-review without friction. The "Ambiguous? Default plan-code-review" guidance is in the step file — agent misjudged but the correction mechanism worked.	Works	Run 3
7	VC convention backfill didn't happen in any run. Both plan and step file specify it. Agent ignored.	Execution Gap	All

Step 2 — Plan

#	Observation	Class	Run
8	Spec generated, self-reviewed, presented at checkpoint. Approved first try. Workflow plumbing correct.	Works	Run 3
9	Agent over-confidently asserted "no consequences" when user asked about provenance of `with argument`. Dismissed skepticism. User had to insist on git history investigation, which then revealed important context (legacy input channel from `whats-after.md`). Neither plan nor step file tells the agent to investigate provenance before claiming safety.	Plan Gap	Run 3
10	After user pushed back, agent did the investigation, corrected its framing, and produced a stronger spec. The plan's "ask the human about intent and constraints" principle was retroactively satisfied, but only because the human forced it.	Gap	Run 3

Step 3 — Implement

#	Observation	Class	Run
11	Implementation of 7 single-line edits was clean and correct. Validation grep confirmed zero remaining instances.	Works	Run 3
12	Task sharding (plan: "each task to its own file, sequence file tracks execution") likely did not happen. Run log doesn't mention it. Agent probably executed tasks inline from the spec. Step file specifies sharding but for 7 identical edits, agent optimized.	Execution Gap	Run 3
13	One-shot implementations (runs 1 & 2) skipped step-03 entirely — changes were made inline during step 1. No baseline commit, no branch, no clean-tree assertion.	Execution Gap	Runs 1, 2
14	Conventional commit message produced correctly.	Works	Run 3
15	No push or remote operations during implementation — correctly local-only.	Works	Run 3

Step 4 — Review

#	Observation	Class	Run
16	One-shot runs had NO review at all. Plan says one-shot "Still does auto VC and review" and Layer 2 is "non-negotiable." Step files route step-03 → step-04, but one-shot agents bypassed the step architecture entirely.	Execution Gap	Runs 1, 2
17	Plan-code-review run did adversarial review in a context-free subagent. 10 findings returned, classified correctly (0 intent_gap, 0 bad_spec, 0 patch, 3 defer, 5 reject). Classification cascade worked.	Works	Run 3
18	Layer 1 (intent audit) unclear whether it was executed separately from Layer 2. Plan specifies two distinct layers in isolated subagents. Step file mentions both but doesn't enforce separate execution.	Gap	Run 3
19	Adversarial review consumed ~4 minutes and tens of thousands of tokens for a trivial 7-line change. Plan has no mechanism to scale review effort proportionally to change size.	Plan Gap	Run 3
20	Adversarial review found real pre-existing issues (duplicate next-steps sections, missing step numbers) correctly classified as defer (out of scope). Review quality was good.	Works	Run 3
21	Spec loop not triggered (no spec-class findings). Frozen sections, change log, positive preservation — all untested.	Not tested	Run 3

Step 5 — Present

#	Observation	Class	Run
22	Classified findings presented. Checkpoint halt worked. User approved.	Works	Run 3
23	Push command printed, user pushed manually. Correctly never auto-pushed.	Works	Run 3
24	PR #1739 created via `gh pr create`.	Works	Run 3

Cross-Cutting Findings

Finding A: One-shot route is structurally unenforceable

Class: Execution Gap (systemic)

The plan treats one-shot as a lighter version of the same flow — same steps, collapsed review. The skeleton routes one-shot to step-03. But in practice, agents treated one-shot as permission to bypass the step architecture entirely. Both one-shot runs (1 and 2) completed the work inline during step 1, never entering steps 3-5. This means:

No baseline commit captured
No clean-tree assertion
No review of any kind (Layer 2 is "non-negotiable" per plan)
No conventional commit
No PR creation

The step files have the routing (→ Step 3), but one-shot's "skip sharding, work from mental plan" instruction combined with the terse step files gives the agent implicit permission to collapse everything. The plan's one-shot design assumes the agent will still follow the step architecture — the skeleton doesn't enforce that assumption.

Recommendation: Step 1 needs explicit instruction that one-shot still proceeds through steps 3-5. The "skip sharding" instruction in step-03 should be the only concession, not a wholesale bypass of the remaining flow.

Finding B: Terse step files rely on training for behaviors the plan explicitly doesn't trust to training

Class: Execution Gap (design tension)

The plan contains detailed rationale for why specific behaviors need explicit prompting — e.g., "Do NOT prescribe how to capture intent" (trust training) vs. "Define exit criteria for captured intent" (don't trust training for this). The skeleton step files are so terse they push almost everything to training:

Step-01: "Clarify intent until: problem unambiguous, scope clear" — trusts training to know how to clarify. But runs 1 and 2 show training doesn't reliably enforce "ask before expanding scope."
Step-02: "Investigate codebase" — trusts training to investigate deeply. Run 3 shows the agent didn't investigate provenance until pushed.
Step-03: "Shard spec tasks → task files" — trusts training to do file-based sharding. Agent likely skipped it.
Step-04: "intent audit + adversarial code review" — trusts training to run two separate review layers. Unclear if this happened.

The plan's own philosophy distinguishes between what training handles (how to ask clarifying questions) and what needs explicit prompting (exit criteria, anti-fantasizing, investigation depth). The skeleton doesn't make this distinction — it's uniformly terse.

Recommendation: Identify the specific behaviors where training fails (from these runs) and add explicit enforcement to those step files. Keep everything else terse.

Finding C: Ask-don't-fantasize is the highest-signal gap

Class: Execution Gap

Two of three runs showed the same failure pattern: the agent understood the gist of the request but silently expanded or reframed the scope without asking. Run 1 removed 2 turns instead of 1. Run 2 conflated review types. Both could have been caught by a single clarifying question.

The plan addresses this: "Ask the human about intent and constraints, not implementation details." The step file has "scope clear" in exit criteria. But neither formulation was strong enough to prevent fantasizing. The commit 3126b9c4 added "ask-dont-fantasize" enforcement — but based on these runs, the current skeleton text still isn't forceful enough.

Recommendation: The ask-dont-fantasize principle needs to be the loudest instruction in step 1, not a bullet in exit criteria. Something like: "When in doubt about scope, ASK. Never silently expand, reframe, or interpret scope beyond what the human stated."

Finding D: No project-context backfill occurred