Commit Graph

2 Commits

Author SHA1 Message Date
Brian Madison b410f2c436 test(bmm): overhaul product-brief evals into A/B/C pattern split
Refactor evals from 8 numbered single-shot tests into 16 typed evals:
- Pattern A (A1-A8): artifact-correctness tests with headless prompts and
  precise, falsifiable expectations (no invented facts, right-sized output,
  file boundary enforcement)
- Pattern B (B1-B8): process-discipline tests verifying decision-log fidelity,
  polish phase ordering, contradiction detection, and distillate generation
- Pattern C (C1): config-compliance test for custom output paths and
  document_output_language

Also tighten SKILL.md: add dependencies frontmatter, clarify headless override
autonomy in Update (proceed when intent is clear; block when ambiguous), and
make distillate skip explicit when bmad-distillator is not installed.
2026-05-09 17:10:59 -05:00
Brian Madison 9ef0b7f574 test(bmm): scaffold evals for bmad-product-brief
Adds an eval suite for bmad-product-brief following the Anthropic skill-creator
schema, plus a new "Extract, don't ingest" constraint on the skill itself.

Skill change:
- New constraint: source artifacts (user-provided or run-discovered) enter the
  parent conversation as relevance-filtered extracts via subagents, not loaded
  wholesale. Keeps the parent context lean against transcripts, brainstorms,
  research reports, code, web results, and prior briefs.

Evals (evals/bmm-skills/bmad-product-brief/):
- evals.json: 8 artifact/behavioral evals covering Create one-shot,
  source-memo ingest, Update with contradiction surfacing, Validate inline,
  Headless mode, brainstorm filtering, research-report filtering, persona
  filtering. All scenarios use fictional entities (InsuLens, Branfield CC,
  Forkbird Kitchen, Mossridge Library, Sproutkeeper, Hatchet & Loop Studio,
  Brightway, Pantry Bridge).
- triggers.json: 15 description-firing checks (7 should-fire, 8 should-not-fire)
  to catch under-triggering and adjacent-skill poaching.
- files/: realistic fixtures including a brainstorm with the relevant idea
  buried at the end, a 3000-word market research report with the relevant
  section in the middle, and customer interviews with the target persona in
  position 3 of 4 — each shaped to test that filtering happens against the
  user's stated focus regardless of where the relevant material sits.

Eval directory placement: top-level evals/ outside src/, matching the convention
in anthropics/skills (zero of 17 production skills include an evals/ subdir;
their skill-creator places dev workspaces as siblings to skill folders). Keeps
evals out of any installer or marketplace.json distribution path.
2026-05-09 10:03:56 -05:00