145 lines
12 KiB
Markdown
145 lines
12 KiB
Markdown
# Skeleton Test Findings
|
|
|
|
**Date:** 2026-02-22
|
|
**Runs analyzed:** 3
|
|
**Skeleton version:** QD2 bare step files (post `3126b9c4`)
|
|
**Plan reference:** `_experiment/planning/redesign-plan.md`
|
|
|
|
---
|
|
|
|
## Summary
|
|
|
|
Three QD2 runs were executed against real BMAD-METHOD tasks: two one-shots and one full plan-code-review. The step-file plumbing (config loading, step transitions, checkpoint halts) worked reliably. The core failures cluster around two themes: (1) one-shot route lacks discipline — agents bypass step architecture, skip review, and fantasize scope; (2) the skeleton step files are too terse to enforce plan-specified behaviors that the LLM doesn't do from training alone.
|
|
|
|
---
|
|
|
|
## Per-Step Observations
|
|
|
|
### Step 1 — Clarify and Route
|
|
|
|
| # | Observation | Class | Run |
|
|
|---|------------|-------|-----|
|
|
| 1 | Config loading failed — `config.yaml` doesn't exist in the BMAD source repo (it's the source, not an installed project). Had to resolve values manually. | **Plumbing** | Run 1 |
|
|
| 2 | Step transitions and NEXT directives followed correctly across all runs that used the step architecture. | **Works** | All |
|
|
| 3 | Agent fantasized scope expansion: asked to remove 1 human turn, removed 2 without asking. Mentioned but didn't ask the clarifying question. Step file says "scope clear" in exit criteria but doesn't enforce ask-before-expanding. | **Execution Gap** | Run 1 |
|
|
| 4 | Agent captured intent directionally but mangled specifics: conflated adversarial review with conformance review. No clarifying questions asked. Step file's terse "Clarify intent" wasn't sufficient. | **Execution Gap** | Run 2 |
|
|
| 5 | Agent bypassed step architecture entirely for one-shot — acted immediately without routing through step files. The implicit one-shot path provides no structural enforcement. | **Execution Gap** | Run 2 |
|
|
| 6 | Agent initially routed as one-shot when the task required design decisions (investigating sub-bugs, choosing an approach). User challenged; agent re-routed to plan-code-review without friction. The "Ambiguous? Default plan-code-review" guidance is in the step file — agent misjudged but the correction mechanism worked. | **Works** | Run 3 |
|
|
| 7 | VC convention backfill didn't happen in any run. Both plan and step file specify it. Agent ignored. | **Execution Gap** | All |
|
|
|
|
### Step 2 — Plan
|
|
|
|
| # | Observation | Class | Run |
|
|
|---|------------|-------|-----|
|
|
| 8 | Spec generated, self-reviewed, presented at checkpoint. Approved first try. Workflow plumbing correct. | **Works** | Run 3 |
|
|
| 9 | Agent over-confidently asserted "no consequences" when user asked about provenance of `with argument`. Dismissed skepticism. User had to insist on git history investigation, which then revealed important context (legacy input channel from `whats-after.md`). Neither plan nor step file tells the agent to investigate provenance before claiming safety. | **Plan Gap** | Run 3 |
|
|
| 10 | After user pushed back, agent did the investigation, corrected its framing, and produced a stronger spec. The plan's "ask the human about intent and constraints" principle was retroactively satisfied, but only because the human forced it. | **Gap** | Run 3 |
|
|
|
|
### Step 3 — Implement
|
|
|
|
| # | Observation | Class | Run |
|
|
|---|------------|-------|-----|
|
|
| 11 | Implementation of 7 single-line edits was clean and correct. Validation grep confirmed zero remaining instances. | **Works** | Run 3 |
|
|
| 12 | Task sharding (plan: "each task to its own file, sequence file tracks execution") likely did not happen. Run log doesn't mention it. Agent probably executed tasks inline from the spec. Step file specifies sharding but for 7 identical edits, agent optimized. | **Execution Gap** | Run 3 |
|
|
| 13 | One-shot implementations (runs 1 & 2) skipped step-03 entirely — changes were made inline during step 1. No baseline commit, no branch, no clean-tree assertion. | **Execution Gap** | Runs 1, 2 |
|
|
| 14 | Conventional commit message produced correctly. | **Works** | Run 3 |
|
|
| 15 | No push or remote operations during implementation — correctly local-only. | **Works** | Run 3 |
|
|
|
|
### Step 4 — Review
|
|
|
|
| # | Observation | Class | Run |
|
|
|---|------------|-------|-----|
|
|
| 16 | One-shot runs had NO review at all. Plan says one-shot "Still does auto VC and review" and Layer 2 is "non-negotiable." Step files route step-03 → step-04, but one-shot agents bypassed the step architecture entirely. | **Execution Gap** | Runs 1, 2 |
|
|
| 17 | Plan-code-review run did adversarial review in a context-free subagent. 10 findings returned, classified correctly (0 intent_gap, 0 bad_spec, 0 patch, 3 defer, 5 reject). Classification cascade worked. | **Works** | Run 3 |
|
|
| 18 | Layer 1 (intent audit) unclear whether it was executed separately from Layer 2. Plan specifies two distinct layers in isolated subagents. Step file mentions both but doesn't enforce separate execution. | **Gap** | Run 3 |
|
|
| 19 | Adversarial review consumed ~4 minutes and tens of thousands of tokens for a trivial 7-line change. Plan has no mechanism to scale review effort proportionally to change size. | **Plan Gap** | Run 3 |
|
|
| 20 | Adversarial review found real pre-existing issues (duplicate next-steps sections, missing step numbers) correctly classified as defer (out of scope). Review quality was good. | **Works** | Run 3 |
|
|
| 21 | Spec loop not triggered (no spec-class findings). Frozen sections, change log, positive preservation — all untested. | **Not tested** | Run 3 |
|
|
|
|
### Step 5 — Present
|
|
|
|
| # | Observation | Class | Run |
|
|
|---|------------|-------|-----|
|
|
| 22 | Classified findings presented. Checkpoint halt worked. User approved. | **Works** | Run 3 |
|
|
| 23 | Push command printed, user pushed manually. Correctly never auto-pushed. | **Works** | Run 3 |
|
|
| 24 | PR #1739 created via `gh pr create`. | **Works** | Run 3 |
|
|
|
|
---
|
|
|
|
## Cross-Cutting Findings
|
|
|
|
### Finding A: One-shot route is structurally unenforceable
|
|
|
|
**Class: Execution Gap (systemic)**
|
|
|
|
The plan treats one-shot as a lighter version of the same flow — same steps, collapsed review. The skeleton routes one-shot to step-03. But in practice, agents treated one-shot as permission to bypass the step architecture entirely. Both one-shot runs (1 and 2) completed the work inline during step 1, never entering steps 3-5. This means:
|
|
- No baseline commit captured
|
|
- No clean-tree assertion
|
|
- No review of any kind (Layer 2 is "non-negotiable" per plan)
|
|
- No conventional commit
|
|
- No PR creation
|
|
|
|
The step files have the routing (`→ Step 3`), but one-shot's "skip sharding, work from mental plan" instruction combined with the terse step files gives the agent implicit permission to collapse everything. The plan's one-shot design assumes the agent will still follow the step architecture — the skeleton doesn't enforce that assumption.
|
|
|
|
**Recommendation:** Step 1 needs explicit instruction that one-shot still proceeds through steps 3-5. The "skip sharding" instruction in step-03 should be the only concession, not a wholesale bypass of the remaining flow.
|
|
|
|
### Finding B: Terse step files rely on training for behaviors the plan explicitly doesn't trust to training
|
|
|
|
**Class: Execution Gap (design tension)**
|
|
|
|
The plan contains detailed rationale for why specific behaviors need explicit prompting — e.g., "Do NOT prescribe how to capture intent" (trust training) vs. "Define exit criteria for captured intent" (don't trust training for this). The skeleton step files are so terse they push almost everything to training:
|
|
|
|
- Step-01: "Clarify intent until: problem unambiguous, scope clear" — trusts training to know how to clarify. But runs 1 and 2 show training doesn't reliably enforce "ask before expanding scope."
|
|
- Step-02: "Investigate codebase" — trusts training to investigate deeply. Run 3 shows the agent didn't investigate provenance until pushed.
|
|
- Step-03: "Shard spec tasks → task files" — trusts training to do file-based sharding. Agent likely skipped it.
|
|
- Step-04: "intent audit + adversarial code review" — trusts training to run two separate review layers. Unclear if this happened.
|
|
|
|
The plan's own philosophy distinguishes between what training handles (how to ask clarifying questions) and what needs explicit prompting (exit criteria, anti-fantasizing, investigation depth). The skeleton doesn't make this distinction — it's uniformly terse.
|
|
|
|
**Recommendation:** Identify the specific behaviors where training fails (from these runs) and add explicit enforcement to those step files. Keep everything else terse.
|
|
|
|
### Finding C: Ask-don't-fantasize is the highest-signal gap
|
|
|
|
**Class: Execution Gap**
|
|
|
|
Two of three runs showed the same failure pattern: the agent understood the gist of the request but silently expanded or reframed the scope without asking. Run 1 removed 2 turns instead of 1. Run 2 conflated review types. Both could have been caught by a single clarifying question.
|
|
|
|
The plan addresses this: "Ask the human about intent and constraints, not implementation details." The step file has "scope clear" in exit criteria. But neither formulation was strong enough to prevent fantasizing. The commit `3126b9c4` added "ask-dont-fantasize" enforcement — but based on these runs, the current skeleton text still isn't forceful enough.
|
|
|
|
**Recommendation:** The ask-dont-fantasize principle needs to be the loudest instruction in step 1, not a bullet in exit criteria. Something like: "When in doubt about scope, ASK. Never silently expand, reframe, or interpret scope beyond what the human stated."
|
|
|
|
### Finding D: No project-context backfill occurred
|
|
|
|
**Class: Execution Gap**
|
|
|
|
Plan specifies VC convention backfill on first run. Step file says "Backfill VC conventions to project-context if unknown." Zero runs did it. This is a case where terse instruction + optional behavior = skipped every time. The agent has higher-priority tasks (the actual request) and treats backfill as optional housekeeping.
|
|
|
|
**Recommendation:** Make backfill a blocking gate with a specific check: "Does project-context.md exist with a VC section? If no → ask and write before proceeding."
|
|
|
|
---
|
|
|
|
## Plan Gap vs Execution Gap Tally
|
|
|
|
| Classification | Count | Key examples |
|
|
|---------------|-------|-------------|
|
|
| **Works** | 10 | Step transitions, checkpoint halts, spec generation, classification, commit, push discipline, PR creation |
|
|
| **Gap** (behavior missing, unclassified root) | 2 | Intent audit unclear (#18), provenance investigation after correction (#10) |
|
|
| **Execution Gap** (plan covers it, step file doesn't deliver) | 8 | One-shot bypass (#5, #13, #16), scope fantasizing (#3, #4), no backfill (#7), task sharding skipped (#12), step architecture ignored (#5) |
|
|
| **Plan Gap** (plan doesn't specify) | 2 | Provenance investigation (#9), review cost proportionality (#19) |
|
|
| **Plumbing** | 1 | Config.yaml missing in source repo (#1) |
|
|
| **Not tested** | 1 | Spec loop (#21) |
|
|
|
|
---
|
|
|
|
## Conclusions
|
|
|
|
1. **The step-file plumbing works.** Config loading (once config exists), step transitions, checkpoint halts, NEXT directives — all reliable.
|
|
|
|
2. **The plan-code-review path works end-to-end.** Run 3 completed all 5 steps, produced a clean spec, correct implementation, meaningful adversarial review, and a merged PR.
|
|
|
|
3. **The one-shot path is broken.** Agents bypass the step architecture and skip review. This is the highest-priority fix — either enforce step progression for one-shots or redesign the one-shot path to be self-contained.
|
|
|
|
4. **Terse step files cause execution gaps.** The skeleton is too terse for behaviors the plan explicitly identified as needing enforcement. The fix is surgical: add explicit instructions only where training demonstrably fails (ask-dont-fantasize, backfill gates, one-shot step enforcement), keep everything else terse.
|
|
|
|
5. **Two genuine plan gaps found:** (a) no guidance on investigating provenance before asserting safety of removals; (b) no mechanism to scale review effort to change size (trivial changes get the same review cost as complex ones).
|