---
workflow_id: W-INCIDENT-001
workflow_name: Production Incident Response
version: 6.2.0
lead_agent: "SRE Minh"
supporting_agents: ["Architect Khang", "Mary Analyst"]
phase: "3-Run: Emergency Response & Recovery"
created_date: 2026-03-17
last_modified: 2026-03-17
config_file: "_config/config.yaml"
estimated_duration: "15 minutes to 2 hours (depending on severity)"
outputFile: '{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md'
---

# Production Incident Response Workflow - BMAD Pattern

## Metadata & Context

**Goal**: Triage, diagnose, and resolve production incidents through systematic diagnosis, then apply and verify fixes. This is the most critical workflow: minimize MTTR (Mean Time To Recovery) while maintaining system stability.

**Lead Team**:
- SRE Minh (Incident Command, Recovery Orchestration)
- Architect Khang (Root Cause Analysis, System-wide Impact)
- Mary Analyst (Impact Assessment, Post-Incident Review)

**Success Criteria**:
- ✓ Incident severity classified within 5 minutes
- ✓ Root cause identified within first triage pass
- ✓ Fix applied and verified
- ✓ System metrics returned to baseline
- ✓ Incident postmortem documented with action items
- ✓ Prevention measures identified

## Workflow Overview

This workflow runs through 4 atomic steps, each focused on a different phase:

1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence

## Configuration Loading

Loaded automatically from `_config/config.yaml`:

```yaml
project_context:
  organization: "[loaded from config]"
  environment: "production"
  incident_channel: "slack:#incidents"

workflow_defaults:
  communication_language: "Vietnamese-English"
  severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
  escalation_contacts: "[loaded from config]"
  on_call_engineer: "[loaded from config]"
```

## Workflow Architecture - Micro-File Design

BMAD pattern: each step is a separate file, loaded just-in-time. Workflow chain:

```
workflow.md (entry point)
  ↓
step-01-triage.md (classify severity, initial assessment)
  ↓
step-02-diagnose.md (root cause analysis)
  ↓
step-03-fix.md (apply fix, verify)
  ↓
step-04-postmortem.md (document, prevent)
  ↓
incident-response-summary.md (final output)
```

**Key Benefits**:
- Single-step focus: the engineer concentrates on one phase at a time
- Knowledge isolation: load only the relevant SKILL docs per step
- State tracking: save progress after each step
- Easy resumption: if interrupted, restart from the exact step

## Skill References

This workflow loads knowledge from:

- **5.07 Reliability & Resilience** → Circuit breaker patterns, fallback strategies, timeout management
- **5.08 Observability & Monitoring** → Structured logging, metrics queries, distributed tracing
- **5.09 Error Handling & Recovery** → Error classification, graceful degradation patterns
- **5.10 Production Readiness** → Incident prevention checklist, alerting setup
- **5.14 Documentation & Runbooks** → Postmortem templates, incident reports

## Execution Model

### Entry Point Logic

```
1. Check if incident session exists
   → If NEW incident: Start from step-01-triage.md
   → If ONGOING: Load incident-session.yaml → continue from last completed step
   → If RESOLVED: Load postmortem template

2. For each step:
   a) Load step-{N}-{name}.md
   b) Load referenced SKILL files (auto-parse "Load:" directives)
   c) Execute MENU [A][C] options
   d) Save step output to step-{N}-output.md + incident-context.yaml
   e) Move to next step or conclude

3. Final: Generate incident report + postmortem in outputs folder
```

### State Tracking

The incident session frontmatter tracks progress:

```yaml
incident_context:
  incident_id: "INC-2026-03-17-001"
  severity: "SEV1"       # one of: SEV1, SEV2, SEV3, SEV4
  status: "diagnosing"   # triage → diagnosing → recovering → resolved → postmortem
  affected_services: ["service-1", "service-2"]
  started_at: "2026-03-17T14:30:00Z"
  timeline:              # example values; each timestamp is recorded as its step completes
    detected_at: "2026-03-17T14:30:00Z"
    triage_completed_at: "2026-03-17T14:35:00Z"
    root_cause_identified_at: "2026-03-17T14:50:00Z"
    fix_applied_at: "2026-03-17T15:10:00Z"
    resolved_at: "2026-03-17T15:15:00Z"
  current_step: "step-02-diagnose"
  last_updated: "2026-03-17T14:50:00Z"
  incident_commander: "SRE Minh"
```

## Mandatory Workflow Rules

1. **Speed first**: Triage must complete in < 5 minutes
2. **Root cause identification**: Identify the root cause before attempting a fix
3. **Verify before declaring resolved**: Check metrics + user reports
4. **Document everything**: Log every action for the postmortem
5. **Escalation protocol**: SEV1 → Page on-call architect immediately
6. **Communication**: Update stakeholders every 5-10 minutes
7. **No flying blind**: All fixes must reference observability data

## Severity Scale

- **SEV1**: Service completely down, revenue impact, > 1% users affected → Page all on-call
- **SEV2**: Major degradation, significant users affected, partial functionality down
- **SEV3**: Moderate impact, some users affected, workaround possible
- **SEV4**: Minor issue, limited users, can defer to business hours

## Navigation

Choose how to start:

- **[NEW-INC]**: Report new incident → Load step-01-triage
- **[RESUME-INC]**: Continue existing incident (detect progress from incident-session.yaml)
- **[ESCALATE]**: Escalate to on-call architect

---

**Report the incident status or choose [NEW-INC] to begin triage**
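
The Entry Point Logic described under Execution Model can be sketched in Python as a single lookup. This is a minimal illustration, not the workflow's actual loader. The function name `resolve_entry_point` is hypothetical, and the sketch assumes that the session has already been parsed into a dict with `status` and `current_step` nested under `incident_context`, that `current_step` names the step to resume, and that the postmortem template is reached through `step-04-postmortem.md`.

```python
from typing import Optional

# Step files in chain order, as listed in the workflow chain.
STEP_FILES = [
    "step-01-triage.md",
    "step-02-diagnose.md",
    "step-03-fix.md",
    "step-04-postmortem.md",
]


def resolve_entry_point(session: Optional[dict]) -> str:
    """Return the step file to load for an incident session.

    `session` is the parsed content of incident-session.yaml,
    or None when no session exists yet (a NEW incident).
    """
    # NEW incident: no session on disk, so start at triage.
    if session is None:
        return STEP_FILES[0]

    context = session.get("incident_context", {})

    # RESOLVED: go straight to the postmortem step (assumed location
    # of the postmortem template).
    if context.get("status") in ("resolved", "postmortem"):
        return STEP_FILES[-1]

    # ONGOING: resume at the step recorded in the session.
    candidate = f"{context.get('current_step')}.md"
    if candidate in STEP_FILES:
        return candidate

    # Missing or unrecognized step: fall back to triage rather than guess.
    return STEP_FILES[0]
```

Falling back to triage when the recorded step is missing or unknown is a design choice of this sketch: re-running a five-minute triage is cheaper than resuming from the wrong step during a live incident.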