From c8c05c1695a15c5052829416e1582a213d1cc714 Mon Sep 17 00:00:00 2001
From: Alex Verkhovsky
Date: Sun, 22 Feb 2026 21:40:48 -0700
Subject: [PATCH] docs: add reusable workflow eval prompt with JSONL mining methodology

Captures lessons from the skeleton eval: workflow boundary scoping, human turn
filtering, token deduplication, context normalization, and five common pitfalls
to avoid.
---
 _experiment/runs/EVAL-WORKFLOW.md | 206 ++++++++++++++++++++++++++++++
 1 file changed, 206 insertions(+)
 create mode 100644 _experiment/runs/EVAL-WORKFLOW.md

diff --git a/_experiment/runs/EVAL-WORKFLOW.md b/_experiment/runs/EVAL-WORKFLOW.md
new file mode 100644
index 000000000..abe84098f
--- /dev/null
+++ b/_experiment/runs/EVAL-WORKFLOW.md
@@ -0,0 +1,206 @@
+# Evaluate Workflow Efficiency from JSONL Logs
+
+**Read and follow these instructions to produce an eval report comparing two workflow variants.**
+
+---
+
+## Prerequisites
+
+You need JSONL session logs from both the baseline workflow and the variant being evaluated. These are Claude Code conversation logs, typically found at:
+
+```
+~/.claude/projects//.jsonl
+```
+
+Or pre-captured in `_experiment/runs/YYYY-MM-DD-.jsonl`.
+
+---
+
+## Step 1: Identify Session Logs
+
+For each workflow variant, find the JSONL files that contain actual workflow runs:
+
+```bash
+# List recent sessions for this project
+ls -lt ~/.claude/projects//*.jsonl | head -10
+```
+
+To confirm a session used a specific workflow, grep for the skill invocation:
+
+```bash
+# Example: find sessions that ran quick-dev2
+grep -l "bmad-bmm-quick-dev2" ~/.claude/projects//*.jsonl
+```
+
+**Critical:** A single JSONL file may contain MULTIPLE workflow runs, housekeeping, or unrelated activities. You must identify the exact line boundaries of each workflow run (see Step 2).
+
+---
+
+## Step 2: Establish Workflow Boundaries
+
+A workflow run starts and ends at specific JSONL lines. Everything outside those boundaries is NOT part of the workflow.
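Start-boundary detection can be scripted; a minimal Python sketch (the file path and invocation marker in the usage comment are placeholders, and the end boundary still requires human judgment per the rules in this step):

```python
def find_start_boundary(jsonl_path, invocation_marker):
    """Return the 1-based line number of the first JSONL line containing
    the slash-command invocation, or None if it never appears."""
    with open(jsonl_path) as f:
        for lineno, raw in enumerate(f, start=1):
            if invocation_marker in raw:
                return lineno
    return None

# Usage (placeholder values):
# start_line = find_start_boundary("session.jsonl", "/bmad-bmm-quick-dev2")
```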
**Start boundary:** The slash-command invocation line (e.g., `/bmad-bmm-quick-dev2`). Look for `` entries.

**End boundary:** The last workflow-produced action BEFORE the user moves on to housekeeping (saving logs, writing prompts, starting a new task). Typical end markers:
- PR creation
- Final commit
- The user's next non-workflow request ("save the log", "write a prompt", etc.)

**Record boundaries as line numbers:** `start_line` and `end_line` for each run. ALL subsequent analysis must be scoped to these boundaries.

### Multi-Workflow Pipelines

If the baseline requires multiple workflows to complete one task (e.g., quick-spec then quick-dev), each workflow run is a separate JSONL file (or a separate bounded range within one file). Combine their metrics for the total baseline.

---

## Step 3: Count Human Turns

Count every action the user typed during the workflow. The JSONL `type: "user"` entries include many things that are NOT human input. Filter them carefully.

### What IS a human turn (count these)

| Category | Example | Notes |
|----------|---------|-------|
| Slash-command invocations | `/bmad-bmm-quick-dev2`, `/clear` | The user typed these |
| Task input | Issue URL, task description | The actual request |
| Feedback / challenges | "One shot, really?" | Substantive interaction |
| Approvals | "Okay, approve", "Yes" | Checkpoint responses |
| AskUserQuestion answers | (appears as `tool_result` with `"answers"`) | Real human choice, hidden in tool results |
| Mechanical continues | `c`, `f`, `z` | Single-char checkpoint presses |

### What is NOT a human turn (exclude these)

| Category | How to detect |
|----------|--------------|
| Tool results | `tool_result` or `tool_use_id` in first 200 chars of message |
| Skill auto-loads | `IT IS CRITICAL THAT YOU FOLLOW THIS COMMAND` or `Base directory for this skill` |
| System messages | ``, ``, ``, `` |
| Interrupts | `[Request interrupted by user]` |
| Resume signals | `Continue from where you left off` |
| Post-workflow housekeeping | Log saving, prompt writing, analysis requests |

### Classify turns into categories

- **Invocation**: Slash commands (`/bmad-bmm-*`, `/clear`, `/output-style`)
- **Substantive**: Real human input with intent or feedback
- **Mechanical**: Single-character checkpoint continues (`c`, `z`, `f`)
- **Quality-improving**: User challenges that caught real issues (a subset of substantive)
- **Context re-establishment**: Turns spent re-explaining intent after a context reset (only in multi-session workflows)
- **Non-workflow**: User manually requesting something the workflow should have done (e.g., "Commit and push. Make PR." when version control (VC) isn't automated)

Report total count AND breakdown by category.

---

## Step 4: Count API Turns and Token Usage

### Deduplication is critical

Multiple JSONL lines can share the same `requestId` (thinking, text, tool_use blocks from one API call). Deduplicate by `requestId` before summing.

### Finding the usage field

Usage data can be at two locations in a JSONL entry:
- Top level: `entry["usage"]`
- Nested: `entry["message"]["usage"]`

Check both.
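The two usage locations and the `requestId` deduplication can be handled in a single pass. A minimal Python sketch, assuming only the field names described above (real logs may vary):

```python
import json

def count_api_usage(jsonl_lines):
    """Sum token usage per unique requestId, checking both usage locations.
    Returns (api_turn_count, totals_by_category)."""
    counted = set()  # requestIds whose usage has already been summed
    totals = {"input_tokens": 0, "output_tokens": 0,
              "cache_creation_input_tokens": 0, "cache_read_input_tokens": 0}
    for raw in jsonl_lines:
        entry = json.loads(raw)
        req_id = entry.get("requestId")
        if req_id is None or req_id in counted:
            continue  # no API call on this line, or a repeat block of one call
        message = entry.get("message")
        usage = entry.get("usage") or \
            (message.get("usage") if isinstance(message, dict) else None)
        if not usage:
            continue  # this block of the request carries no usage; keep looking
        counted.add(req_id)
        for key in totals:
            totals[key] += usage.get(key, 0)
    return len(counted), totals
```

Summing only the first usage-bearing line per `requestId` is what prevents the 2-3x inflation that raw line-by-line sums produce.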
### Token categories to extract

| Field | What it means |
|-------|---------------|
| `input_tokens` | Fresh (non-cached) input tokens |
| `output_tokens` | Model output tokens |
| `cache_creation_input_tokens` | New content added to prompt cache |
| `cache_read_input_tokens` | Cached content re-read this turn |

The per-turn context size is approximately: `input_tokens + cache_creation_input_tokens + cache_read_input_tokens`.

### What to report

- Total API turns (deduplicated by requestId)
- Token totals by category
- Context size at first turn and last turn (to show growth)
- Per-turn average tokens

---

## Step 5: Normalize the Comparison

Before comparing metrics, account for these asymmetries:

### Starting context

If a run started mid-session (carrying prior conversation), its context begins inflated. Note the starting context size. Estimate what a fresh-session run would cost:

```
fresh_estimate = actual_total - (turns × (actual_start_context - fresh_start_context))
```

Where `fresh_start_context` ≈ the baseline's starting context.

### Multi-workflow overhead

If the baseline uses multiple workflows (e.g., quick-spec + quick-dev):
- Count invocation turns for EACH workflow (`/clear`, `/bmad-bmm-*`)
- Note context reset between sessions (fresh initialization cost paid twice)
- Note turns spent re-establishing context in the second session

### Task complexity

Different tasks have different complexity. A 7-file identical edit is not comparable to a multi-file structural refactor. Note this caveat explicitly. The comparison is still valid for structural/overhead differences, but substantive turn counts reflect task complexity, not just workflow efficiency.

### VC operations

Check whether commit/push/PR is part of the workflow or an ad-hoc user request. If one workflow automates VC and the other doesn't, the one without automated VC has an artificially lower turn count.
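The starting-context adjustment in this step is plain arithmetic; a sketch with hypothetical numbers (the 62K/24K context sizes echo figures used elsewhere in this document, but the total and turn count are invented for illustration):

```python
def fresh_session_estimate(actual_total, turns,
                           actual_start_context, fresh_start_context):
    """Estimate what the run would have cost from a fresh session by removing
    the per-turn context inflation carried in from the prior conversation."""
    return actual_total - turns * (actual_start_context - fresh_start_context)

# Hypothetical run: 40 API turns, 5.0M total tokens, started mid-session
# at 62K context versus a 24K fresh start.
adjusted = fresh_session_estimate(5_000_000, 40, 62_000, 24_000)
# 40 turns x 38K inflation = 1.52M tokens attributable to session placement
```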
---

## Step 6: Write the Eval Report

Output to `_experiment/results/.md` with this structure:

```markdown
# Eval Report: vs

**Date:** YYYY-MM-DD
**Data sources:** (list JSONL files with line boundaries)

## Methodology
- How logs were identified
- Workflow boundaries used
- Comparability caveats

## North Star: Human Turns
- Per-workflow turn table with category breakdown
- Total comparison with delta

## API Turns and Token Usage
- Deduplicated API turn counts
- Token totals by category
- Context growth curves
- Fresh-session estimate if applicable

## Where Is Better / Worse / Equivalent

## Recommendations (prioritized by north star impact)

## Summary Scorecard
```

---

## Pitfalls to Avoid

1. **Never use the full JSONL file without boundaries.** A session file may contain multiple runs, housekeeping, and unrelated activities.

2. **Never count tool results as human turns.** They are the most common source of inflation: a 50-turn session might show 200+ "user" entries because every tool call generates a tool_result.

3. **Never compare tokens without deduplication.** Raw JSONL token sums can be 2-3x inflated due to multiple lines per API call.

4. **Never ignore starting context.** A mid-session run at 62K context costs more per turn than a fresh run at 24K. This is session placement, not workflow design.

5. **Never assume workflow = session.** One session can contain multiple workflows; one workflow can span multiple sessions.
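As a concrete guard against pitfall 2, the exclusion heuristics from Step 3 can be folded into one predicate. A minimal sketch; the marker strings are the ones quoted in Step 3, and real logs may need additional ones:

```python
import json

NON_HUMAN_MARKERS = (
    "tool_result", "tool_use_id",                   # tool results
    "IT IS CRITICAL THAT YOU FOLLOW THIS COMMAND",  # skill auto-loads
    "Base directory for this skill",
    "[Request interrupted by user]",                # interrupts
    "Continue from where you left off",             # resume signals
)

def is_human_turn(entry):
    """True if a type:user entry looks like real human input: no known
    non-human marker within the first 200 characters of the message."""
    if entry.get("type") != "user":
        return False
    head = json.dumps(entry.get("message", ""))[:200]
    return not any(marker in head for marker in NON_HUMAN_MARKERS)
```

Note that single-character checkpoint presses (`c`, `f`, `z`) pass this filter, as they should: per Step 3 they are human turns, just mechanical ones.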