BMAD-METHOD/src/psm/workflows/bmad-psm-incident-response/workflow.md


workflow_id: "W-INCIDENT-001"
workflow_name: "Production Incident Response"
version: "6.2.0"
lead_agent: "SRE Minh"
supporting_agents:
  - "Architect Khang"
  - "Mary Analyst"
phase: "3-Run: Emergency Response & Recovery"
created_date: "2026-03-17"
last_modified: "2026-03-17"
config_file: "_config/config.yaml"
estimated_duration: "15 minutes to 2 hours (depending on severity)"
outputFile: "{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md"

Production Incident Response Workflow — BMAD Pattern

Metadata & Context

Goal: Triage, diagnose, and resolve production incidents through systematic diagnosis, then apply fixes and verify them. This is the most critical workflow: minimize MTTR (Mean Time To Recovery) while maintaining system stability.

Lead Team:

  • SRE Minh (Incident Command, Recovery Orchestration)
  • Architect Khang (Root Cause Analysis, System-wide Impact)
  • Mary Analyst (Impact Assessment, Post-Incident Review)

Success Criteria:

  • ✓ Incident severity classified within 5 minutes
  • ✓ Root cause identified within first triage pass
  • ✓ Fix applied and verified
  • ✓ System metrics returned to baseline
  • ✓ Incident postmortem documented with action items
  • ✓ Prevention measures identified

Workflow Overview

This workflow proceeds through 4 atomic steps, each focused on a different phase:

  1. Step-01-Triage → Gather initial info, assess severity, classify impact
  2. Step-02-Diagnose → Systematic diagnosis using observability data (logs, metrics, traces)
  3. Step-03-Fix → Apply fix, verify resolution, validate recovery
  4. Step-04-Postmortem → Document incident, identify action items, prevent recurrence

Configuration Loading

Loaded automatically from _config/config.yaml:

project_context:
  organization: "[loaded from config]"
  environment: "production"
  incident_channel: "slack:#incidents"

workflow_defaults:
  communication_language: "Vietnamese-English"
  severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
  escalation_contacts: "[loaded from config]"
  on_call_engineer: "[loaded from config]"
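
Several values above are placeholders that should be filled in from _config/config.yaml before the workflow runs. The following is a minimal sketch of how a loader might check for unresolved placeholders. The function name `unresolved_keys` and the idea of validating after load are assumptions for illustration, not part of the workflow itself; the config is written as a plain dict so the example needs no YAML library.

```python
PLACEHOLDER = "[loaded from config]"  # placeholder text used in the listing above


def unresolved_keys(config: dict, prefix: str = "") -> list:
    """Return dotted paths of values that still hold the placeholder."""
    missing = []
    for key, value in config.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested sections such as project_context.
            missing.extend(unresolved_keys(value, path))
        elif value == PLACEHOLDER:
            missing.append(path)
    return missing


# Same structure as the configuration shown above.
config = {
    "project_context": {
        "organization": "[loaded from config]",
        "environment": "production",
        "incident_channel": "slack:#incidents",
    },
    "workflow_defaults": {
        "communication_language": "Vietnamese-English",
        "severity_levels": ["SEV1", "SEV2", "SEV3", "SEV4"],
        "escalation_contacts": "[loaded from config]",
        "on_call_engineer": "[loaded from config]",
    },
}

for path in unresolved_keys(config):
    print(f"unresolved: {path}")
```

Running a check like this before triage would surface a missing on-call contact early rather than in the middle of a SEV1.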

Workflow Architecture - Micro-File Design

BMAD pattern: each step is a separate file, loaded just-in-time. Workflow chain:

workflow.md (entry point)
    ↓
step-01-triage.md (classify severity, initial assessment)
    ↓
step-02-diagnose.md (root cause analysis)
    ↓
step-03-fix.md (apply fix, verify)
    ↓
step-04-postmortem.md (document, prevent)
    ↓
incident-response-summary.md (final output)

Key Benefits:

  • Single-step focus — engineer concentrates on one phase
  • Knowledge isolation — load only relevant SKILL docs per step
  • State tracking — save progress after each step
  • Easy resumption — if interrupted, restart from exact step
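
The chain and the resumption behaviour can be sketched as an ordered list plus a lookup. This is only an illustration: `STEP_CHAIN` and `next_step` are hypothetical names, and the real workflow engine may track steps differently.

```python
from typing import Optional

# Step files in execution order, as listed in the workflow chain above.
STEP_CHAIN = [
    "step-01-triage.md",
    "step-02-diagnose.md",
    "step-03-fix.md",
    "step-04-postmortem.md",
]


def next_step(last_completed: Optional[str]) -> Optional[str]:
    """Return the step file to load next, or None once the chain is finished."""
    if last_completed is None:
        return STEP_CHAIN[0]  # nothing completed yet: start at triage
    index = STEP_CHAIN.index(last_completed)  # raises ValueError for unknown steps
    if index + 1 < len(STEP_CHAIN):
        return STEP_CHAIN[index + 1]
    return None  # postmortem was the last step


print(next_step(None))
print(next_step("step-02-diagnose.md"))
```

Because only the name of the last completed step is needed to resume, an interrupted session can pick up at the exact step where it stopped.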

Skill References

This workflow loads knowledge from:

  • 5.07 Reliability & Resilience → Circuit breaker patterns, fallback strategies, timeout management
  • 5.08 Observability & Monitoring → Structured logging, metrics queries, distributed tracing
  • 5.09 Error Handling & Recovery → Error classification, graceful degradation patterns
  • 5.10 Production Readiness → Incident prevention checklist, alerting setup
  • 5.14 Documentation & Runbooks → Postmortem templates, incident reports

Execution Model

Entry Point Logic

1. Check if incident session exists
   → If NEW incident: Start from step-01-triage.md
   → If ONGOING: Load incident-session.yaml → continue from last completed step
   → If RESOLVED: Load postmortem template

2. For each step:
   a) Load step-{N}-{name}.md
   b) Load referenced SKILL files (auto-parse "Load:" directives)
   c) Execute MENU [A][C] options
   d) Save step output to step-{N}-output.md + incident-context.yaml
   e) Move to next step or conclude

3. Final: Generate incident report + postmortem in outputs folder
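
Item 1 of the entry point logic could be expressed roughly as follows. This is a sketch under stated assumptions: the session is taken to be a dict already parsed from incident-session.yaml (or `None` when no session exists), the `current_step` field is used as the resume pointer, and the function name and return labels (`"start"`, `"resume"`, `"postmortem"`) are hypothetical.

```python
from typing import Optional, Tuple


def resolve_entry_point(session: Optional[dict]) -> Tuple[str, Optional[str]]:
    """Decide where the workflow should begin for a given incident session."""
    if session is None:
        # NEW incident: no session on record, start with triage.
        return ("start", "step-01-triage.md")

    context = session.get("incident_context", {})
    if context.get("status") == "resolved":
        # RESOLVED: hand off to the postmortem template.
        return ("postmortem", None)

    # ONGOING: continue from the step recorded in the session.
    return ("resume", context.get("current_step"))


ongoing = {"incident_context": {"status": "diagnosing", "current_step": "step-02-diagnose"}}
print(resolve_entry_point(None))
print(resolve_entry_point(ongoing))
```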

State Tracking

Incident session frontmatter tracks progress:

incident_context:
  incident_id: "INC-2026-03-17-001"
  severity: "SEV1"  # one of: SEV1 | SEV2 | SEV3 | SEV4
  status: "diagnosing"  # lifecycle: triage → diagnosing → recovering → resolved → postmortem
  affected_services: ["service-1", "service-2"]
  started_at: "2026-03-17T14:30:00Z"
  timeline:
    detected_at: "2026-03-17T14:30:00Z"
    triage_completed_at: "2026-03-17T14:35:00Z"
    root_cause_identified_at: "2026-03-17T14:50:00Z"
    fix_applied_at: "2026-03-17T15:10:00Z"
    resolved_at: "2026-03-17T15:15:00Z"
  current_step: "step-02-diagnose"
  last_updated: "2026-03-17T14:50:00Z"
  incident_commander: "SRE Minh"
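
The timeline timestamps make it straightforward to derive durations such as time to triage and time to recovery. A minimal sketch, using the sample values above; the helper names `parse_ts` and `minutes_between` are assumptions for illustration.

```python
from datetime import datetime


def parse_ts(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as 2026-03-17T14:30:00Z."""
    # Replace the trailing Z so fromisoformat also works on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def minutes_between(timeline: dict, start_key: str, end_key: str) -> float:
    """Elapsed minutes between two timeline entries."""
    delta = parse_ts(timeline[end_key]) - parse_ts(timeline[start_key])
    return delta.total_seconds() / 60


# Sample timeline copied from the incident context above.
timeline = {
    "detected_at": "2026-03-17T14:30:00Z",
    "triage_completed_at": "2026-03-17T14:35:00Z",
    "root_cause_identified_at": "2026-03-17T14:50:00Z",
    "fix_applied_at": "2026-03-17T15:10:00Z",
    "resolved_at": "2026-03-17T15:15:00Z",
}

print(minutes_between(timeline, "detected_at", "triage_completed_at"))  # 5.0
print(minutes_between(timeline, "detected_at", "resolved_at"))          # 45.0
```

For this sample incident, triage took 5 minutes and detection to resolution took 45 minutes.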

Mandatory Workflow Rules

  1. Speed first — Triage must complete in < 5 minutes
  2. Root cause identification — Must identify root cause before fix attempt
  3. Verify before declaring resolved — Check metrics + user reports
  4. Document everything — Every action logged for postmortem
  5. Escalation protocol — SEV1 → Page on-call architect immediately
  6. Communication — Update stakeholders every 5-10 minutes
  7. No flying blind — All fixes must reference observability data
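
Rules 2 and 7 lend themselves to a simple gate before step-03-fix. The sketch below is illustrative only: `fix_blockers` is a hypothetical helper, and representing observability evidence as a list of strings (dashboard links, log queries, trace IDs) is an assumption, since the workflow does not specify a format.

```python
def fix_blockers(timeline: dict, evidence: list) -> list:
    """Return the reasons a fix attempt should not start yet (empty list = clear)."""
    blockers = []
    if not timeline.get("root_cause_identified_at"):
        # Rule 2: root cause must be identified before a fix attempt.
        blockers.append("rule 2: root cause not identified")
    if not evidence:
        # Rule 7: every fix must reference observability data.
        blockers.append("rule 7: no observability data referenced")
    return blockers


# Hypothetical incident that has been triaged but not yet diagnosed.
print(fix_blockers({"detected_at": "2026-03-17T14:30:00Z"}, []))

# Same incident once a root cause and supporting evidence are recorded.
print(fix_blockers(
    {"root_cause_identified_at": "2026-03-17T14:50:00Z"},
    ["error-rate dashboard", "trace of failing request"],
))
```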

Severity Scale

  • SEV1 — Service completely down, revenue impact, > 1% users affected → Page all on-call
  • SEV2 — Major degradation, significant users affected, partial functionality down
  • SEV3 — Moderate impact, some users affected, workaround possible
  • SEV4 — Minor issue, limited users, can defer to business hours
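
As a rough illustration, the scale can be turned into a first-pass classifier. The scale is mostly qualitative, so this sketch makes several assumptions that should be checked against team policy: any single SEV1 condition is treated as sufficient, the split between SEV3 and SEV4 relies on a hypothetical `can_defer_to_business_hours` flag, and the `ImpactSignals` fields are invented for the example. Revenue impact is not modelled.

```python
from dataclasses import dataclass


@dataclass
class ImpactSignals:
    service_down: bool                  # service completely down
    major_degradation: bool             # partial functionality down
    pct_users_affected: float           # percentage of users affected, e.g. 2.5
    can_defer_to_business_hours: bool   # hypothetical flag for minor issues


def classify_severity(signals: ImpactSignals) -> str:
    """Map impact signals to SEV1..SEV4 (assumed interpretation of the scale)."""
    if signals.service_down or signals.pct_users_affected > 1.0:
        return "SEV1"
    if signals.major_degradation:
        return "SEV2"
    if signals.can_defer_to_business_hours:
        return "SEV4"
    return "SEV3"


print(classify_severity(ImpactSignals(True, False, 0.2, False)))   # SEV1
print(classify_severity(ImpactSignals(False, False, 0.1, True)))   # SEV4
```

A classifier like this is only a starting point for the incident commander; borderline cases still need human judgement.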

Navigation

Choose how to begin:

  • [NEW-INC] — Report new incident → Load step-01-triage
  • [RESUME-INC] — Continue existing incident (detect progress from incident-session.yaml)
  • [ESCALATE] — Escalate to on-call architect

Report the incident status, or choose [NEW-INC] to begin triage.