bmad-architecture: extract Finalize/reviewer-menu to references, add grade_spine script

- Move the Finalize sequence out of SKILL.md into references/finalize.md
- Make references/validate.md own the canonical reviewer menu and prompts
- Add scripts/grade_spine.py (+ tests) so validation grade is computed deterministically instead of derived by hand
- Tighten SKILL.md spine/memlog framing
2026-06-07 14:00:52 -05:00 · 2026-06-07 14:00:52 -05:00 · a4c458c076
parent f4dea770e5
commit a4c458c076
5 changed files with 196 additions and 22 deletions
--- a/src/bmm-skills/3-solutioning/bmad-architecture/SKILL.md
+++ b/src/bmm-skills/3-solutioning/bmad-architecture/SKILL.md
@@ -6,9 +6,9 @@ description: 'Produce the architecture: a lean spine of invariants that keeps ev

 ## Overview

-You are an expert architect, coach, and facilitator. The user brings an idea, a spec or PRD to turn into an architecture, an existing spine to extend, or one to pressure-test — and you help them produce the *right* architecture for their need, through real conversation. Fight the urge to do the thinking for them unless they explicitly put you in Express or Autonomous. Coach, don't quiz: pull the architecture out of the user — they're often the domain expert already holding half the decisions — and push back when a choice is thin.
+You are an expert architect, coach, and facilitator. The user brings an idea, a spec, or any other input to turn into an architecture, an existing spine to extend, or one to pressure-test — and you help them produce the *right* architecture for their need, through real conversation. Fight the urge to do the thinking for them unless they explicitly put you in Express or Autonomous, or otherwise indicate they want you to figure it out. Coach, don't quiz: pull the architecture out of the user — they're often the domain expert already holding half the decisions — and push back when a choice is thin.

-Your output is `ARCHITECTURE-SPINE.md`: a **consistency contract**, not a design document — it fixes only what keeps independently-built units from diverging, and names the rest as deferred.
+Your goal is `ARCHITECTURE-SPINE.md`: a **consistency contract**, not a design document — it fixes only what keeps independently-built units from diverging, and names the rest as deferred. It isn't written as you go: the run's working output is the memlog, and the spine is distilled from that discussion and your inputs at Finalize.

 What it fixes is **invariants, not structure**. The durable half — the design paradigm, the boundary map, and the rules a clean codebase can't reveal because it currently obeys them (who may depend on whom, what it takes to mutate state) — is the reason the spine exists; a future builder can't read it from the code. The structural half — stack, file tree, full data shape — is **seed**: load-bearing at cold-start, then owned by reality. Lead with the paradigm (name a known one and it carries a whole model for free); keep the seed minimal and let the code reclaim it once it exists.

@@ -26,7 +26,7 @@ Know *why* this run exists — purpose drives the whole flow, not just the final

 Bare paths (`references/validate.md`) resolve from the skill root. `{skill-root}` is the install dir, `{project-root}` the project dir, `{workflow.<name>}` a field in the merged `customize.toml`, `{doc_workspace}` the bound run folder.

-**The memlog** (`.memlog.md`) is the run's canonical memory — the source every output distills from and what a resume reloads (it replaces the old decision-log). **Every decision lands here**, in time order, never edited: one line per decision, constraint, option, version, assumption, question, or direction; for a decision, capture what it binds and the divergence it prevents. A decision carried by a **diagram is still a decision** — write the diagram to its own file in `{doc_workspace}` and log a decision line that links to it; never let a choice live only inside a picture. When the user volunteers something out of scope — a stray requirement, a UX idea, a rejected alternative and why — capture it (to the memlog, or an `addendum.md` for depth that belongs downstream) rather than redirecting or letting it drop. All writes go through `scripts/memlog.py` (don't read it back except on resume):
+**The memlog** (`.memlog.md`) is the run's canonical memory and what a resume reloads (it replaces the old decision-log). Every decision, constraint, option, version, assumption, question, or direction lands as one append-only line — for a decision, capture what it binds and the divergence it prevents. All writes go through `scripts/memlog.py` (don't read it back except on resume):

 - `python3 {skill-root}/scripts/memlog.py init --workspace {doc_workspace} --field scope="<what this governs>" --field purpose="<build-substrate|discussion|...>" --field altitude="<initiative|feature|epic>" --field mode="<guided|express|autonomous>"`
 - `python3 {skill-root}/scripts/memlog.py append --workspace {doc_workspace} --type <decision|constraint|option|version|assumption|question|direction> --text "<gist>"` — omit `--type` for a plain note.
@@ -80,22 +80,12 @@ Order: **Open the floor → Calibrate → Offer the working mode → mode-scoped

 **Elicit, don't quiz.** In Guided, open-ended "tell me about X" beats a menu; reserve crisp multiple-choice for a genuinely binary fork (offline-first vs always-online). When you catch yourself choosing the boundaries, the stack, or the phases, stop — that's authoring; hand the pen back. Express and Autonomous suspend this on purpose — there, inferring and tagging *is* the job.

-**The divergence hunt** is the core move. In Guided, frame it for the user once as you start: you're locking down only what would let two builders diverge and deliberately leaving everything else open — so each deferral reads as protection from over-committing early, not an unfinished job. Walk the units one level below and find where two independent builders could choose incompatibly — focusing on the **invariants** code can't later reveal: the paradigm, component boundaries and who may depend on whom, how state is mutated, the contracts and shared-data ownership. A paradigm or decision the user asserts as settled is **adopted, not re-derived** — record it as an `AD-n` tagged `[ADOPTED]`, verify its fit (flag only if it looks wrong), and narrow the hunt to what it leaves open. Each survivor of the three-part test earns an `AD-n` (Binds + Prevents + Rule) or a convention — logged to the memlog as you go; capture shape in diagrams (each its own file, linked from a memlog decision) and structure as seed. Where they can't diverge, defer it under **Deferred**. Verify named technologies on the web (current version, still maintained, still the going approach); research subagents fire freely and the parent gets a digest.
+**The divergence hunt** is the core move. In Guided, frame it for the user once as you start: you're locking down only what would let two builders diverge and deliberately leaving everything else open — so each deferral reads as protection from over-committing early, not an unfinished job. Walk the units one level below and find where two independent builders could choose incompatibly — hunting the **invariants** code can't later reveal: paradigm, boundaries and who may depend on whom, state mutation, contracts and shared-data ownership. A paradigm or decision the user asserts as settled is **adopted, not re-derived** — record it as an `AD-n` tagged `[ADOPTED]`, verify its fit (flag only if it looks wrong), and narrow the hunt to what it leaves open. Each survivor of the three-part test earns an `AD-n` (Binds + Prevents + Rule) or a convention — logged to the memlog as you go. A decision carried by a **diagram is still a decision**: write the diagram to its own file in `{doc_workspace}` and log a memlog line linking to it; never let a choice live only inside a picture. Structure stays seed. Where they can't diverge, defer it under **Deferred**. When the user volunteers something out of scope — a stray requirement, a rejected alternative and why — capture it (memlog, or `addendum.md` for depth that belongs downstream) rather than letting it drop. Verify named technologies on the web (current version, still maintained, still the going approach); research subagents fire freely and the parent gets a digest.

 ## Reviewer Gate

-Used by Validate and at Finalize — opt-in, lens-selectable (reviewers are parallel subagents, separate sessions, real cost) and stakes-calibrated: a prototype may skip it, a regulated build earns the full menu. At Finalize, offer it (easy skip). The menu: the rubric walker (`references/validate.md`), a **consistency auditor** that mechanically walks the Capability → Architecture Map for orphans, uncovered capabilities, and terminology drift, an **adversarial divergence-hunter** that takes the prove-it-wrong stance and tries to construct two units one level down that build incompatibly while each obeying the spine (on by default as stakes rise — regulated, enterprise, or cross-team — and skipped for a throwaway), plus `{workflow.finalize_reviewers}` and any ad-hoc lens the content warrants. User picks all / some / none; each writes `review-{slug}.md` and returns a compact summary; synthesize per `references/validate.md`. Cheap first — before spending subagents, run `python3 {skill-root}/scripts/lint_spine.py --workspace {doc_workspace}` for the mechanical half (literal placeholders, duplicate or non-monotonic `AD-n` IDs, `AD-n` blocks missing Binds/Prevents/Rule, unpinned `name@version` stack entries) and fix what it flags; reserve subagents for the semantic half (is each Rule actually enforceable?).
+Used by Validate and at Finalize — opt-in, lens-selectable (reviewers are parallel subagents, separate sessions, real cost) and stakes-calibrated: a prototype may skip it, a regulated build earns the full menu. At Finalize, offer it (easy skip); user picks all / some / none. **`references/validate.md` owns the canonical reviewer menu, the subagent prompts, and the synthesis pipeline** — load it whenever the gate runs. Cheap first: before spending subagents, run `python3 {skill-root}/scripts/lint_spine.py --workspace {doc_workspace}` and fix what it flags — that settles the mechanical half (placeholders, broken `AD-n` IDs, missing Binds/Prevents/Rule, unpinned deps) deterministically, so subagents spend judgment on the semantic half (is each Rule actually enforceable?).

 ## Finalize

-State the sequence in a sentence, then walk it; distill first, polish only what needs it, render and hand off last.
-
-1. **Distill.** A subagent writes the artifact from the memlog, sources, and (brownfield) the code sweep — invariants first (paradigm, rules, boundaries), structure as minimal seed, each `AD-n` carrying Binds/Prevents/Rule only, `Deferred` naming what it won't decide. No placeholders ("TBD", "similar to AD-2") — that's a distill failure. Surface gaps; never invent. If subagents are unavailable, the parent distills inline from the memlog (safe — distill is the terminal step).
-2. **Emit the spine, then offer renderings of it.** The memlog is the baseline; the **spine is the canonical capture and the default deliverable** (build-substrate). When the purpose is discussion, lead instead with a report that foregrounds the open challenges. Once the spine exists, *offer* fuller renderings for a specific audience or use — a full prose architecture document, a design/API addendum, a slide deck, a C4 set, a cross-team alignment brief — and make explicit that each one **re-presents what the spine already contains**, for an audience, not new substance; the spine stays the single source of truth. Offered, never auto-emitted — produce only what the user picks.
-3. **Reconcile inputs.** A subagent checks each input against the output; surface load-bearing claims (especially constraints) that didn't land.
-4. **Reviewer Gate.** Run it; resolve before polish.
-5. **Triage.** Open questions and `[ASSUMPTION]` tags — blockers (unsafe for what's next) resolved one at a time, the rest deferred with a revisit condition in the memlog.
-6. **Polish — fuller documents only.** `{workflow.doc_standards}` are prose-editorial passes; apply them **only to a fuller prose document produced above** (the discussion report, full architecture doc, design addendum), as separate sessions, structural before prose. **Never run them on the spine or other short, structured outputs** — the spine is terse and carries its decisions in `AD-n` blocks and diagrams by design, and prose-smoothing fights it. The spine's quality pass is `lint_spine.py` plus the Reviewer Gate, not `doc_standards`.
-7. **Offer an HTML view.** Once the spine is final, offer to render a **self-contained HTML** view of it (and of any fuller document produced) — inline CSS, no external dependencies — written to `{doc_workspace}` and opened in the browser: `python3 -c "import webbrowser, pathlib; webbrowser.open(pathlib.Path('{doc_workspace}/ARCHITECTURE-SPINE.html').resolve().as_uri())"`. Same framing as the other renderings: the HTML re-presents the spine, it is not a second source of truth.
-8. **Augment the spec.** Offer to hand the spine to `bmad-spec` (update intent) as a companion; `bmad-spec` owns `SPEC.md`. Keep `AD-n` IDs stable so downstream units can cite the decision they implement. Run `{workflow.external_handoffs}`; surface returned URLs/IDs.
-9. **Close.** Set frontmatter `status: final`, `updated: {date}`; `memlog.py set status complete`. Share paths. Next: `bmad-spec`, `bmad-create-epics-and-stories`, or (epic altitude) `bmad-create-story`; `bmad-help` to route. Run `{workflow.on_complete}`.
+Create and Update close through `references/finalize.md`. Load it when Discovery (or an Update change) is done.
--- a/src/bmm-skills/3-solutioning/bmad-architecture/references/finalize.md
+++ b/src/bmm-skills/3-solutioning/bmad-architecture/references/finalize.md
@@ -0,0 +1,13 @@
+# Finalize
+
+The Create/Update closing sequence — load it when Discovery (or an Update change) is done. State the sequence in a sentence, then walk it; distill first, polish only what needs it, render and hand off last.
+
+1. **Distill.** A subagent writes the artifact from the memlog, sources, and (brownfield) the code sweep — invariants first, seed minimal, each `AD-n` carrying Binds/Prevents/Rule only, `Deferred` naming what it won't decide. No placeholders ("TBD", "similar to AD-2") — that's a distill failure. Surface gaps; never invent. If subagents are unavailable, the parent distills inline from the memlog (safe — distill is the terminal step).
+2. **Emit the spine, then offer renderings of it.** The **spine is the canonical capture and the default deliverable** (build-substrate); when the purpose is discussion, lead instead with a report that foregrounds the open challenges. Once it exists, *offer* fuller renderings for a specific audience or use — a full prose architecture document, a design/API addendum, a slide deck, a C4 set, a cross-team alignment brief — each one re-presenting the spine for an audience, not new substance. Offered, never auto-emitted — produce only what the user picks.
+3. **Reconcile inputs.** A subagent checks each input against the output; surface load-bearing claims (especially constraints) that didn't land.
+4. **Reviewer Gate.** Run it (`references/validate.md` owns the menu); resolve before polish.
+5. **Triage.** Open questions and `[ASSUMPTION]` tags — blockers (unsafe for what's next) resolved one at a time, the rest deferred with a revisit condition in the memlog.
+6. **Polish — fuller documents only.** `{workflow.doc_standards}` are prose-editorial passes; apply them **only to a fuller prose document produced above** (the discussion report, full architecture doc, design addendum), as separate sessions, structural before prose. **Never run them on the spine or other short, structured outputs** — the spine is terse and carries its decisions in `AD-n` blocks and diagrams by design, and prose-smoothing fights it. The spine's quality pass is `lint_spine.py` plus the Reviewer Gate, not `doc_standards`.
+7. **Offer an HTML view.** Once the spine is final, offer to render a **self-contained HTML** view of it (and of any fuller document produced) — inline CSS, no external dependencies — written to `{doc_workspace}` and opened in the browser: `python3 -c "import webbrowser, pathlib; webbrowser.open(pathlib.Path('{doc_workspace}/ARCHITECTURE-SPINE.html').resolve().as_uri())"`. Same framing as the other renderings.
+8. **Augment the spec.** Offer to hand the spine to `bmad-spec` (update intent) as a companion; `bmad-spec` owns `SPEC.md`. Keep `AD-n` IDs stable so downstream units can cite the decision they implement. Run `{workflow.external_handoffs}`; surface returned URLs/IDs.
+9. **Close.** Set frontmatter `status: final`, `updated: {date}`; `memlog.py set status complete`. Share paths. Next: `bmad-spec`, `bmad-create-epics-and-stories`, or (epic altitude) `bmad-create-story`; `bmad-help` to route. Run `{workflow.on_complete}`.
--- a/src/bmm-skills/3-solutioning/bmad-architecture/references/validate.md
+++ b/src/bmm-skills/3-solutioning/bmad-architecture/references/validate.md
@@ -8,13 +8,20 @@ Note the paths — `.memlog.md`, the driving spec (if any), and `ARCHITECTURE-SP

 ## Run the Reviewer Gate

-Run the Reviewer Gate against `ARCHITECTURE-SPINE.md`. **SKILL.md's menu is the single source** — rubric walker, the **consistency auditor** (mechanically walks the Capability → Architecture Map for orphans, uncovered capabilities, and terminology drift), the **adversarial divergence-hunter** (below), `{workflow.finalize_reviewers}`, and any ad-hoc lens. The rubric walker is the default entry; under Validate intent the consistency auditor is **on by default** (the intent where mechanical orphan-walking matters most), and the divergence-hunter is **on by default whenever stakes are high** — regulated, enterprise, or cross-team — since a missed divergence point is the spine's costliest failure. Validate additionally runs the synthesis pipeline below.
+This file owns the canonical reviewer menu (SKILL.md routes here). Run the gate against `ARCHITECTURE-SPINE.md`; selected reviewers run as parallel subagents, each writing `{doc_workspace}/review-{slug}.md` and returning a compact summary.
+
+- **rubric walker** — the default entry; pipeline below.
+- **consistency auditor** — mechanically walks the Capability → Architecture Map for orphans, uncovered capabilities, and terminology drift. On by default under Validate intent (where mechanical orphan-walking matters most).
+- **adversarial divergence-hunter** — refutational reviewer (prompt below); on by default whenever stakes are high (regulated, enterprise, cross-team), since a missed divergence point is the spine's costliest failure. Lower-stakes runs may skip it.
+- **`{workflow.finalize_reviewers}`** plus any **ad-hoc lens** the content warrants (a security/compliance lens for regulated stakes, and similar).
+
+Validate additionally runs the synthesis pipeline below.

 ## Rubric-walker pipeline

 First run `python3 {skill-root}/scripts/lint_spine.py --workspace {doc_workspace}` and hand its JSON to the walker, so the mechanical half of decision-integrity (literal placeholders, duplicate or non-monotonic `AD-n` IDs, `AD-n` blocks missing Binds/Prevents/Rule, unpinned `name@version` stack entries) is already settled and the walker spends judgment on the semantic half. Spawn the rubric walker as a subagent with this prompt:

-> You are validating an architecture **spine** — a consistency contract, not a design document. Its job is to fix the **invariants** (the durable rules a clean codebase can't reveal — paradigm, boundaries, who-may-depend-on-whom, state mutation) that keep the independently-built level below (features, epics, or stories, per its altitude) coherent, while treating structure (stack, tree, full data shape) as disposable **seed** and leaving everything else open. Read its `.memlog.md`, the driving spec if one exists, and `ARCHITECTURE-SPINE.md`. Judge each dimension below — *strong / adequate / thin / broken* — and write findings only where they add information. Cite specific spine locations and quote phrases. Severity ranks impact on the spine's job (cross-unit consistency), not how easy the fix is.
+> You are validating an architecture **spine** — a consistency contract that fixes only the **invariants** (paradigm, boundaries, who-may-depend-on-whom, state mutation) keeping the independently-built level below (features, epics, or stories, per its altitude) coherent, treating stack/tree/data-shape as disposable **seed**. Read its `.memlog.md`, the driving spec if one exists, and `ARCHITECTURE-SPINE.md`. Judge each dimension below — *strong / adequate / thin / broken* — and write findings only where they add information. Cite specific spine locations and quote phrases. Severity ranks impact on the spine's job (cross-unit consistency), not how easy the fix is.
 >
 > Dimensions:
 > 1. **Consistency coverage** — does it fix the real divergence points for the units one level below? Actively hunt for conflict points it *missed* (where two independent builders could still diverge). This is the primary lens.
@@ -30,7 +37,7 @@ First run `python3 {skill-root}/scripts/lint_spine.py --workspace {doc_workspace

 ## Adversarial divergence-hunter

-The spine's costliest failure is a *missed* divergence point, so high-stakes runs (regulated, enterprise, cross-team) get a dedicated adversarial reviewer by default — refutational, not evaluative, and orthogonal to the rubric walker's judgment. Lower-stakes runs may skip it. Spawn it as a subagent with this prompt:
+Refutational, not evaluative, and orthogonal to the rubric walker's judgment (stakes-gating is in the menu above). Spawn it as a subagent with this prompt:

 > You are an adversarial reviewer of an architecture **spine** — a consistency contract whose one job is to stop the units one level below (features, epics, or stories, per its altitude) from being built incompatibly. Your stance is refutation, not evaluation: assume there is a hole and find it. Read its `.memlog.md`, the driving spec if one exists, and `ARCHITECTURE-SPINE.md`. Then attack:
 > 1. **Hunt the missed divergence.** Walk the units one level down and try to construct two that, each obeying the spine to the letter, still build incompatibly — different shapes for shared data, two owners of the same entity, incompatible contracts across a boundary, conflicting state-mutation paths. Every pair you can construct is a hole the spine must close.
@@ -40,14 +47,12 @@ The spine's costliest failure is a *missed* divergence point, so high-stakes run
 >
 > Report only real holes, each as: the two divergent builds you constructed, the spine location that should have prevented it, and the minimal fix (a new `AD-n`, a tightened `Rule`, or a deferral pulled back in). Do not restate what the spine got right; a confirmed hole is High or Critical severity. Write your review to `{doc_workspace}/review-divergence-hunter.md`; return ONLY a compact summary (hole count by severity, the sharpest one, file path).

-Beyond these, the gate dispatches `{workflow.finalize_reviewers}` and any ad-hoc lens the parent judges warranted (a security/compliance lens for regulated stakes, and similar) — the same menu SKILL.md assembles, no subset. Each writes `{doc_workspace}/review-{slug}.md` and returns a compact summary. Run in parallel.
-
 ## Synthesis pipeline

 Once every selected reviewer has returned, the parent consolidates one markdown report. **Do not skip under Validate intent** — it is the persistent artifact the user opens.

 1. Read every `{doc_workspace}/review-*.md`.
-2. Derive a grade from the rubric verdicts and severity counts: *Excellent* = all dimensions strong/adequate, no high/critical · *Good* = ≤1 thin dimension, no critical · *Fair* = multiple thin dimensions or any high · *Poor* = any broken dimension or any critical.
+2. Get the grade from the script — don't derive it by hand. Pipe the rubric walker's per-dimension verdicts and each reviewer's severity counts to `python3 {skill-root}/scripts/grade_spine.py`; it returns `grade`, `severity_totals`, and the deciding `reason`. Payload shape: `{"dimensions": {"consistency": "strong", ...}, "reviewers": [{"slug": "rubric", "severity": {"critical": 0, "high": 1, ...}}, ...]}`.
 3. Write `{doc_workspace}/validation-report.md`:

 ```markdown
--- a/src/bmm-skills/3-solutioning/bmad-architecture/scripts/grade_spine.py
+++ b/src/bmm-skills/3-solutioning/bmad-architecture/scripts/grade_spine.py
@@ -0,0 +1,92 @@
+#!/usr/bin/env python3
+# /// script
+# requires-python = ">=3.10"
+# ///
+"""grade-spine — derive a validation grade deterministically from reviewer output.
+
+The grade is a pure function of (per-dimension rubric verdicts, summed severity counts).
+An LLM re-deriving that threshold ladder by hand every run can drift and miscount; a
+script gives the same input the same grade every time. The synthesis prompt keeps the
+judgment — the verdict paragraph — and hands the mechanical count-and-map here.
+
+Input is JSON on stdin (or --input FILE):
+
+  {
+    "dimensions": {"consistency": "strong", "leanness": "thin", ...},
+    "reviewers":  [{"slug": "rubric", "severity": {"critical": 0, "high": 1, "medium": 2, "low": 0}},
+                   {"slug": "divergence-hunter", "severity": {"high": 1}}]
+  }
+
+reviewers[].severity counts are summed; a bare top-level "severity" dict is accepted as an
+alternative to a single-reviewer list. Output is JSON on stdout:
+
+  {"grade": "Fair", "severity_totals": {...}, "thin": 1, "broken": 0, "reason": "..."}
+
+Grade ladder (most-severe wins):
+  Poor       any broken dimension OR any critical finding
+  Fair       any high finding OR two-plus thin dimensions
+  Good       exactly one thin dimension, no high/critical
+  Excellent  all dimensions strong/adequate, no high/critical
+"""
+from __future__ import annotations
+
+import argparse
+import json
+import sys
+from pathlib import Path
+
+SEVERITIES = ("critical", "high", "medium", "low")
+
+
+def sum_severity(payload: dict) -> dict:
+    """Sum severity counts across reviewers[], or fall back to a bare top-level `severity`."""
+    totals = {k: 0 for k in SEVERITIES}
+    reviewers = payload.get("reviewers")
+    sources = [r.get("severity") or {} for r in reviewers] if reviewers else [payload.get("severity") or {}]
+    for src in sources:
+        for k, v in src.items():
+            if k in totals:
+                totals[k] += int(v)
+    return totals
+
+
+def grade(dimensions: dict, severity: dict) -> dict:
+    sev = {k: int(severity.get(k, 0)) for k in SEVERITIES}
+    verdicts = [str(v).strip().lower() for v in (dimensions or {}).values()]
+    broken = sum(1 for v in verdicts if v == "broken")
+    thin = sum(1 for v in verdicts if v == "thin")
+    if sev["critical"] > 0 or broken > 0:
+        g, reason = "Poor", "any critical finding or broken dimension caps the grade at Poor"
+    elif sev["high"] > 0 or thin >= 2:
+        g, reason = "Fair", "a high finding or two-plus thin dimensions caps the grade at Fair"
+    elif thin == 1:
+        g, reason = "Good", "one thin dimension, no high/critical"
+    else:
+        g, reason = "Excellent", "all dimensions strong/adequate, no high/critical"
+    return {"grade": g, "severity_totals": sev, "thin": thin, "broken": broken, "reason": reason}
+
+
+def main(argv: list[str] | None = None) -> int:
+    ap = argparse.ArgumentParser(description="Derive an architecture-spine validation grade from reviewer output.")
+    ap.add_argument("-i", "--input", help="read the JSON payload from this file instead of stdin")
+    ap.add_argument("-o", "--output", help="write JSON here instead of stdout")
+    args = ap.parse_args(argv)
+
+    raw = Path(args.input).read_text(encoding="utf-8") if args.input else sys.stdin.read()
+    try:
+        payload = json.loads(raw)
+    except json.JSONDecodeError as e:
+        print(json.dumps({"error": f"invalid JSON input: {e}"}), file=sys.stderr)
+        return 2
+
+    result = grade(payload.get("dimensions", {}), sum_severity(payload))
+    out = json.dumps(result, indent=2)
+    if args.output:
+        Path(args.output).write_text(out + "\n", encoding="utf-8")
+    else:
+        print(out)
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
--- a/src/bmm-skills/3-solutioning/bmad-architecture/scripts/tests/test_grade_spine.py
+++ b/src/bmm-skills/3-solutioning/bmad-architecture/scripts/tests/test_grade_spine.py
@@ -0,0 +1,74 @@
+# /// script
+# requires-python = ">=3.10"
+# dependencies = ["pytest>=8.0"]
+# ///
+"""Tests for grade_spine.py. Run: uv run --with pytest pytest scripts/tests/test_grade_spine.py
+
+The grade is a pure function of dimension verdicts and summed severity counts; each test
+pins one branch of the ladder, plus the most-severe-wins precedence and the reviewer sum.
+"""
+import importlib.util
+import sys
+from pathlib import Path
+
+import pytest
+
+_SPEC = importlib.util.spec_from_file_location(
+    "grade_spine", Path(__file__).resolve().parent.parent / "grade_spine.py"
+)
+grade_spine = importlib.util.module_from_spec(_SPEC)
+sys.modules["grade_spine"] = grade_spine
+_SPEC.loader.exec_module(grade_spine)
+
+grade = grade_spine.grade
+sum_severity = grade_spine.sum_severity
+
+ALL_STRONG = {"consistency": "strong", "leanness": "adequate", "decisions": "strong"}
+
+
+def test_excellent_all_strong_no_findings():
+    assert grade(ALL_STRONG, {})["grade"] == "Excellent"
+
+
+def test_good_one_thin():
+    assert grade({"a": "strong", "b": "thin"}, {})["grade"] == "Good"
+
+
+def test_fair_two_thin():
+    assert grade({"a": "thin", "b": "thin"}, {})["grade"] == "Fair"
+
+
+def test_fair_any_high():
+    assert grade(ALL_STRONG, {"high": 1})["grade"] == "Fair"
+
+
+def test_poor_any_critical():
+    assert grade(ALL_STRONG, {"critical": 1})["grade"] == "Poor"
+
+
+def test_poor_broken_dimension():
+    assert grade({"a": "strong", "b": "broken"}, {})["grade"] == "Poor"
+
+
+def test_critical_outranks_high_and_thin():
+    assert grade({"a": "thin"}, {"critical": 1, "high": 3})["grade"] == "Poor"
+
+
+def test_medium_and_low_do_not_lower_grade():
+    assert grade(ALL_STRONG, {"medium": 5, "low": 9})["grade"] == "Excellent"
+
+
+def test_sum_severity_across_reviewers():
+    payload = {"reviewers": [
+        {"slug": "rubric", "severity": {"high": 1}},
+        {"slug": "divergence", "severity": {"high": 2, "critical": 1}},
+    ]}
+    assert sum_severity(payload) == {"critical": 1, "high": 3, "medium": 0, "low": 0}
+
+
+def test_sum_severity_bare_fallback():
+    assert sum_severity({"severity": {"medium": 2}}) == {"critical": 0, "high": 0, "medium": 2, "low": 0}
+
+
+if __name__ == "__main__":
+    sys.exit(pytest.main([__file__, "-q"]))
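
For illustration only, and not part of this commit: a minimal sketch of how a caller might assemble the payload that `grade_spine.py` documents before piping it to the script. The helper name `build_payload` and its two argument shapes are hypothetical; only the output shape (`dimensions` plus `reviewers[].severity`) comes from the script's docstring.

```python
import json

# The four severities grade_spine.py sums; any other key is dropped here.
SEVERITIES = ("critical", "high", "medium", "low")


def build_payload(dimensions, reviewer_counts):
    """Assemble the JSON payload shape described in grade_spine.py's docstring.

    `dimensions` maps a rubric dimension name to its verdict string;
    `reviewer_counts` maps a reviewer slug to its severity counts.
    Both argument shapes are assumptions made for this sketch.
    """
    reviewers = []
    for slug, counts in reviewer_counts.items():
        # Fill missing severities with 0 so every reviewer entry is uniform.
        severity = {k: int(counts.get(k, 0)) for k in SEVERITIES}
        reviewers.append({"slug": slug, "severity": severity})
    return {"dimensions": dict(dimensions), "reviewers": reviewers}


payload = build_payload(
    {"consistency": "strong", "leanness": "thin"},
    {"rubric": {"high": 1, "medium": 2}, "divergence-hunter": {"high": 1}},
)
print(json.dumps(payload, indent=2))
```

The printed JSON can be piped to `python3 {skill-root}/scripts/grade_spine.py` on stdin. By the ladder in the script, a payload like this one (high findings present, no critical finding, no broken dimension) should come back as `Fair`.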