
workflow_id: "W-INCIDENT-001"
workflow_name: "Production Incident Response"
version: "6.2.0"
lead_agent: "SRE Minh"
supporting_agents: ["Architect Khang", "Mary Analyst"]
phase: "3-Run: Emergency Response & Recovery"
created_date: "2026-03-17"
last_modified:
config_file: "_config/config.yaml"
estimated_duration: "15 minutes to 2 hours (depending on severity)"
outputFile: "{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md"

Production Incident Response Workflow — BMAD Pattern

Metadata & Context

Goal: Triage, diagnose, and resolve production incidents systematically, then verify every fix. This is the most critical workflow: minimize MTTR (Mean Time To Recovery) while maintaining system stability.

Lead Team:

SRE Minh (Incident Command, Recovery Orchestration)
Architect Khang (Root Cause Analysis, System-wide Impact)
Mary Analyst (Impact Assessment, Post-Incident Review)

Success Criteria:

✓ Incident severity classified within 5 minutes
✓ Root cause identified in the first diagnosis pass
✓ Fix applied and verified
✓ System metrics returned to baseline
✓ Incident postmortem documented with action items
✓ Prevention measures identified

Workflow Overview

This workflow runs through 4 atomic steps, each focused on a different phase:

Step-01-Triage → Gather initial info, assess severity, classify impact
Step-02-Diagnose → Systematic diagnosis using observability data (logs, metrics, traces)
Step-03-Fix → Apply fix, verify resolution, validate recovery
Step-04-Postmortem → Document incident, identify action items, prevent recurrence

Configuration Loading

Loaded automatically from _config/config.yaml:

project_context:
  organization: "[loaded from config]"
  environment: "production"
  incident_channel: "slack:#incidents"

workflow_defaults:
  communication_language: "Vietnamese-English"
  severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
  escalation_contacts: "[loaded from config]"
  on_call_engineer: "[loaded from config]"
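
One way to picture the loading step is as a merge of built-in defaults with whatever the parsed config file provides. The sketch below is a minimal illustration under that assumption; `merge_config`, `missing_keys`, and the sample `loaded` values are hypothetical names and data, not part of the workflow itself, and parsing the YAML file is left out.

```python
from copy import deepcopy

# Defaults mirroring the keys shown above. Values marked "[loaded from config]"
# stay None until the real config supplies them.
DEFAULTS = {
    "project_context": {
        "organization": None,
        "environment": "production",
        "incident_channel": "slack:#incidents",
    },
    "workflow_defaults": {
        "communication_language": "Vietnamese-English",
        "severity_levels": ["SEV1", "SEV2", "SEV3", "SEV4"],
        "escalation_contacts": None,
        "on_call_engineer": None,
    },
}


def merge_config(defaults: dict, loaded: dict) -> dict:
    """Return defaults overlaid with loaded values, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in loaded.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def missing_keys(config: dict, prefix: str = "") -> list[str]:
    """List dotted paths whose value is still None after merging."""
    missing = []
    for key, value in config.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            missing.extend(missing_keys(value, prefix=f"{path}."))
        elif value is None:
            missing.append(path)
    return missing


# Hypothetical content of _config/config.yaml after parsing into a dict.
loaded = {
    "project_context": {"organization": "example-org"},
    "workflow_defaults": {"on_call_engineer": "SRE Minh"},
}
config = merge_config(DEFAULTS, loaded)
unresolved = missing_keys(config)
```

Checking `unresolved` before triage starts would surface gaps such as a missing escalation contact while there is still time to fill them in.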

Workflow Architecture - Micro-File Design

BMAD pattern: each step is a separate file, loaded just-in-time. Workflow chain:

workflow.md (entry point)
    ↓
step-01-triage.md (classify severity, initial assessment)
    ↓
step-02-diagnose.md (root cause analysis)
    ↓
step-03-fix.md (apply fix, verify)
    ↓
step-04-postmortem.md (document, prevent)
    ↓
incident-response-summary.md (final output)
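
The just-in-time loading in the chain above could be sketched as a generator that reads each step file only when the workflow reaches it. This is an illustrative sketch, assuming all step files sit in one directory; `iter_steps` and `workflow_dir` are hypothetical names, and the temporary directory stands in for the real workflow folder.

```python
import tempfile
from pathlib import Path
from typing import Iterator

# Step files in chain order, as listed in the workflow chain.
STEP_CHAIN = [
    "step-01-triage.md",
    "step-02-diagnose.md",
    "step-03-fix.md",
    "step-04-postmortem.md",
]


def iter_steps(workflow_dir: Path, start: str = STEP_CHAIN[0]) -> Iterator[tuple[str, str]]:
    """Yield (file name, content) pairs, reading each file only when it is reached."""
    for name in STEP_CHAIN[STEP_CHAIN.index(start):]:
        yield name, (workflow_dir / name).read_text(encoding="utf-8")


# Demo with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    workflow_dir = Path(tmp)
    for name in STEP_CHAIN:
        (workflow_dir / name).write_text(f"# {name}\n", encoding="utf-8")
    # Resuming at the fix step skips the earlier files entirely.
    resumed = [name for name, _ in iter_steps(workflow_dir, start="step-03-fix.md")]
```

Because the generator is lazy, a step file that is never reached is never read, which matches the single-step focus described here.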

Key Benefits:

Single-step focus — engineer concentrates on one phase
Knowledge isolation — load only relevant SKILL docs per step
State tracking — save progress after each step
Easy resumption — if interrupted, restart from exact step

Skill References

This workflow loads knowledge from:

5.07 Reliability & Resilience → Circuit breaker patterns, fallback strategies, timeout management
5.08 Observability & Monitoring → Structured logging, metrics queries, distributed tracing
5.09 Error Handling & Recovery → Error classification, graceful degradation patterns
5.10 Production Readiness → Incident prevention checklist, alerting setup
5.14 Documentation & Runbooks → Postmortem templates, incident reports

Execution Model

Entry Point Logic

1. Check if incident session exists
   → If NEW incident: Start from step-01-triage.md
   → If ONGOING: Load incident-session.yaml → continue from last completed step
   → If RESOLVED: Load postmortem template

2. For each step:
   a) Load step-{N}-{name}.md
   b) Load referenced SKILL files (auto-parse "Load:" directives)
   c) Execute MENU [A][C] options
   d) Save step output to step-{N}-output.md + incident-context.yaml
   e) Move to next step or conclude

3. Final: Generate incident report + postmortem in outputs folder
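
The branching in item 1 can be sketched as a small function that picks the file to load from the saved session. This is a sketch under stated assumptions: `resolve_entry` is a hypothetical helper, a missing session is treated as a new incident, the saved `current_step` is taken as the resume point, and step-04-postmortem.md stands in for the postmortem template, whose actual file name is not given here.

```python
from typing import Optional

STEPS = ["step-01-triage", "step-02-diagnose", "step-03-fix", "step-04-postmortem"]

# Assumption: both of these statuses mean the incident itself is over.
CLOSED_STATUSES = {"resolved", "postmortem"}


def resolve_entry(session: Optional[dict]) -> str:
    """Return the step file to load for a new, ongoing, or resolved incident."""
    if session is None:
        # NEW incident: no saved session, start at triage.
        return f"{STEPS[0]}.md"
    context = session["incident_context"]
    if context["status"] in CLOSED_STATUSES:
        # RESOLVED: go straight to the postmortem step.
        return f"{STEPS[-1]}.md"
    current = context["current_step"]
    if current not in STEPS:
        raise ValueError(f"unknown step in session: {current!r}")
    # ONGOING: continue from the step recorded in the session.
    return f"{current}.md"


ongoing = {"incident_context": {"status": "diagnosing", "current_step": "step-02-diagnose"}}
entry_file = resolve_entry(ongoing)
```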

State Tracking

Incident session frontmatter tracks progress:

incident_context:
  incident_id: "INC-2026-03-17-001"
  severity: "SEV1"  # one of: SEV1, SEV2, SEV3, SEV4
  status: "diagnosing"  # progresses: triage → diagnosing → recovering → resolved → postmortem
  affected_services: ["service-1", "service-2"]
  started_at: "2026-03-17T14:30:00Z"
  timeline:
    detected_at: "2026-03-17T14:30:00Z"
    triage_completed_at: "2026-03-17T14:35:00Z"
    root_cause_identified_at: "2026-03-17T14:50:00Z"
    fix_applied_at: "2026-03-17T15:10:00Z"
    resolved_at: "2026-03-17T15:15:00Z"
  current_step: "step-02-diagnose"
  last_updated: "2026-03-17T14:50:00Z"
  incident_commander: "SRE Minh"
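
The timeline fields above are enough to derive how long each phase took and the overall time to recovery. The following sketch shows that arithmetic on the example timestamps; `parse_utc` and `phase_durations` are hypothetical helper names, and the phase labels are chosen for illustration.

```python
from datetime import datetime

# Timeline values copied from the incident_context example above.
timeline = {
    "detected_at": "2026-03-17T14:30:00Z",
    "triage_completed_at": "2026-03-17T14:35:00Z",
    "root_cause_identified_at": "2026-03-17T14:50:00Z",
    "fix_applied_at": "2026-03-17T15:10:00Z",
    "resolved_at": "2026-03-17T15:15:00Z",
}

MILESTONES = (
    "detected_at",
    "triage_completed_at",
    "root_cause_identified_at",
    "fix_applied_at",
    "resolved_at",
)
PHASES = ("triage", "diagnosis", "fix", "verification")


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def phase_durations(timeline: dict) -> dict:
    """Return the time spent between consecutive milestones, plus the total."""
    marks = [parse_utc(timeline[key]) for key in MILESTONES]
    durations = {
        name: later - earlier
        for name, earlier, later in zip(PHASES, marks, marks[1:])
    }
    durations["time_to_recovery"] = marks[-1] - marks[0]
    return durations


durations = phase_durations(timeline)
for name, value in durations.items():
    print(f"{name}: {value.total_seconds() / 60:.0f} min")
```

For the example incident this gives 5 minutes of triage, 15 of diagnosis, 20 to apply the fix, 5 to verify, and 45 minutes from detection to resolution.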

Mandatory Workflow Rules

Speed first — Triage must complete in < 5 minutes
Root cause identification — Must identify root cause before fix attempt
Verify before declaring resolved — Check metrics + user reports
Document everything — Every action logged for postmortem
Escalation protocol — SEV1 → Page on-call architect immediately
Communication — Update stakeholders every 5-10 minutes
No flying blind — All fixes must reference observability data
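
Two of these rules, the triage time limit and root cause before fix, can be checked mechanically from timeline timestamps. The sketch below is one possible check and does not cover the other rules; `check_timeline` is a hypothetical helper, the sample timeline is invented for illustration, and treating exactly 5 minutes as acceptable is an assumption that follows the "within 5 minutes" wording of the success criteria.

```python
from datetime import datetime, timedelta

TRIAGE_LIMIT = timedelta(minutes=5)


def parse_utc(stamp: str) -> datetime:
    """Parse an ISO 8601 timestamp with a trailing 'Z' into an aware datetime."""
    return datetime.fromisoformat(stamp.replace("Z", "+00:00"))


def check_timeline(timeline: dict) -> list[str]:
    """Return a list of rule violations found in a (possibly partial) timeline."""
    violations = []
    detected = parse_utc(timeline["detected_at"])

    # Speed first: triage should finish within the limit.
    triaged = timeline.get("triage_completed_at")
    if triaged is not None and parse_utc(triaged) - detected > TRIAGE_LIMIT:
        violations.append("triage exceeded 5 minutes")

    # Root cause identification: no fix before the root cause is recorded.
    root_cause = timeline.get("root_cause_identified_at")
    fixed = timeline.get("fix_applied_at")
    if fixed is not None:
        if root_cause is None or parse_utc(root_cause) > parse_utc(fixed):
            violations.append("fix applied before root cause was identified")

    return violations


# Hypothetical timeline: slow triage, and a fix with no recorded root cause.
sample = {
    "detected_at": "2026-03-17T14:30:00Z",
    "triage_completed_at": "2026-03-17T14:42:00Z",
    "fix_applied_at": "2026-03-17T14:45:00Z",
}
violations = check_timeline(sample)
```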

Severity Scale

SEV1 — Service completely down, revenue impact, > 1% users affected → Page all on-call
SEV2 — Major degradation, significant users affected, partial functionality down
SEV3 — Moderate impact, some users affected, workaround possible
SEV4 — Minor issue, limited users, can defer to business hours
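
The scale above maps naturally onto an ordered enumeration, which makes severity labels easy to validate and compare. This is a minimal sketch only: `Severity`, `parse_severity`, and `escalation_action` are hypothetical names, and the action returned for SEV2 and SEV3 is an assumption, since the scale states a response only for SEV1 and SEV4.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Severity levels; a lower number means a more severe incident."""
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


def parse_severity(label: str) -> Severity:
    """Convert a label such as 'sev2' into a Severity, rejecting unknown labels."""
    try:
        return Severity[label.strip().upper()]
    except KeyError:
        raise ValueError(f"unknown severity label: {label!r}") from None


def escalation_action(severity: Severity) -> str:
    """Return the first response for a severity level."""
    if severity is Severity.SEV1:
        return "page all on-call immediately"
    if severity is Severity.SEV4:
        return "defer to business hours"
    # Assumption: the scale does not state a response for SEV2 or SEV3.
    return "handle via on-call engineer"


action = escalation_action(parse_severity("sev1"))
```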

Choose how to start:

[NEW-INC] — Report new incident → Load step-01-triage
[RESUME-INC] — Continue existing incident (detect progress from incident-session.yaml)
[ESCALATE] — Escalate to on-call architect

Report the incident status, or choose [NEW-INC] to begin triage.
