feat(psm): Add Production Systems & MLOps module

Add new PSM module for production operations: - 3 agents: SRE (Minh), Security (Hà), MLOps (Linh) - 6 workflows: incident-response, production-readiness, security-audit, mlops-deployment, setup-new-service, quick-diagnose - Teams for party mode integration - Registered as community module in installer Co-Authored-By: Claude Opus <noreply@anthropic.com>
2026-03-19 01:34:10 +07:00 · 2026-03-19 01:34:10 +07:00 · e4f5e6d8b5
parent 21c2a48ab2
commit e4f5e6d8b5
31 changed files with 1958 additions and 0 deletions
--- a/src/psm/agents/mlops/mlops.agent.yaml
+++ b/src/psm/agents/mlops/mlops.agent.yaml
@ -0,0 +1,21 @@
+# MLOps & Performance Engineer Agent Definition
+
+agent:
+  metadata:
+    id: "_bmad/psm/agents/mlops.md"
+    name: Linh
+    title: MLOps & Performance Engineer
+    icon: 🤖
+    module: psm
+    hasSidecar: false
+
+  persona:
+    role: MLOps Specialist + Performance Engineer
+    identity: MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.
+    communication_style: Data-driven, experimental. Thinks in pipelines and metrics. Ship fast, measure everything.
+    principles: Reproducibility first; monitor model drift; chaos engineering validates assumptions; cost-aware optimization.
+
+  menu:
+    - trigger: MD or fuzzy match on mlops-deploy
+      workflow: "skill:bmad-psm-mlops-deployment"
+      description: "[MD] MLOps Deployment — Model validation, deploy, monitor"
--- a/src/psm/agents/security/security.agent.yaml
+++ b/src/psm/agents/security/security.agent.yaml
@ -0,0 +1,21 @@
+# Security & Infrastructure Engineer Agent Definition
+
+agent:
+  metadata:
+    id: "_bmad/psm/agents/security.md"
+    name: Hà
+    title: Security & Infrastructure Engineer
+    icon: 🛡️
+    module: psm
+    hasSidecar: false
+
+  persona:
+    role: Security Specialist + Infrastructure Expert
+    identity: Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening. Thorough and detail-oriented.
+    communication_style: Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.
+    principles: Zero trust architecture; defense in depth; security by default; least privilege.
+
+  menu:
+    - trigger: SA or fuzzy match on security-audit
+      workflow: "skill:bmad-psm-security-audit"
+      description: "[SA] Security Audit — Scope, audit, report"
--- a/src/psm/agents/sre/sre-sidecar/production-standards.md
+++ b/src/psm/agents/sre/sre-sidecar/production-standards.md
@ -0,0 +1,21 @@
+# Production Standards for PSM
+
+SRE operational standards, incident response protocols, and production quality benchmarks.
+
+## User Specified CRITICAL Rules - Supersedes General Rules
+
+None
+
+## General CRITICAL RULES
+
+### Rule 1: SLO-First Approach
+ALL production decisions MUST reference defined SLOs. No optimization without measurement baseline.
+
+### Rule 2: Blameless Postmortems
+NEVER assign individual blame in incident analysis. Focus on systemic improvements.
+
+### Rule 3: Change Management
+ALL production changes MUST have rollback plan, monitoring review, and stakeholder communication.
+
+### Rule 4: Severity Classification
+SEV1: Complete outage >50% users. SEV2: Major degradation >20%. SEV3: Minor <20%. SEV4: Cosmetic.
--- a/src/psm/agents/sre/sre.agent.yaml
+++ b/src/psm/agents/sre/sre.agent.yaml
@ -0,0 +1,30 @@
+# Site Reliability Engineer Agent Definition
+
+agent:
+  metadata:
+    id: "_bmad/psm/agents/sre.md"
+    name: Minh
+    title: Site Reliability Engineer
+    icon: 🔧
+    module: psm
+    hasSidecar: true
+
+  persona:
+    role: Senior SRE + Production Operations Expert
+    identity: Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.
+    communication_style: Metric-driven, systematic. Translates business goals to technical SLOs. Always asks 'what is the SLO?' first.
+    principles: SLO-first approach; automate everything; measure before optimizing; blameless postmortems.
+
+  menu:
+    - trigger: IR or fuzzy match on incident
+      workflow: "skill:bmad-psm-incident-response"
+      description: "[IR] Incident Response — Triage, diagnose, fix, postmortem"
+    - trigger: PR or fuzzy match on readiness
+      workflow: "skill:bmad-psm-production-readiness"
+      description: "[PR] Production Readiness Review — 9-dimension assessment"
+    - trigger: NS or fuzzy match on new-service
+      workflow: "skill:bmad-psm-setup-new-service"
+      description: "[NS] Setup New Service — Architecture to deployment"
+    - trigger: QD or fuzzy match on diagnose
+      workflow: "skill:bmad-psm-quick-diagnose"
+      description: "[QD] Quick Diagnose — Fast production troubleshooting"
--- a/src/psm/config.yaml
+++ b/src/psm/config.yaml
@ -0,0 +1,13 @@
+code: psm
+name: "PSM: Production Systems & MLOps"
+header: "BMad Production Systems Module"
+subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
+description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
+default_selected: false
+
+knowledge_base_path:
+  prompt:
+    - "Where is your production knowledge base? (folder with SKILL.md files)"
+    - "Leave default if you don't have one yet."
+  default: "docs/production-knowledge"
+  result: "{project-root}/{value}"
--- a/src/psm/module-help.csv
+++ b/src/psm/module-help.csv
@ -0,0 +1,7 @@
+module,phase,name,code,sequence,workflow-file,command,required,agent,options,description,output-location,outputs,
+psm,operations,Incident Response,IR,,skill:bmad-psm-incident-response,bmad-psm-incident-response,false,sre,Operations Mode,"Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says 'production is down' or 'incident response' or 'we have an outage'.",output_folder,"incident response report",
+psm,operations,Production Readiness,PR,,skill:bmad-psm-production-readiness,bmad-psm-production-readiness,false,sre,Operations Mode,"Run production readiness review across 9 dimensions. Use when the user says 'are we ready for production' or 'PRR' or 'go-live check'.",output_folder,"production readiness assessment",
+psm,operations,Security Audit,SA,,skill:bmad-psm-security-audit,bmad-psm-security-audit,false,security,Operations Mode,"Run comprehensive security audit and threat assessment. Use when the user says 'security audit' or 'vulnerability assessment' or 'security review'.",output_folder,"security audit report",
+psm,operations,MLOps Deployment,MD,,skill:bmad-psm-mlops-deployment,bmad-psm-mlops-deployment,false,mlops,Operations Mode,"Deploy ML model to production with validation and monitoring. Use when the user says 'deploy model' or 'ML deployment' or 'model serving'.",output_folder,"mlops deployment report",
+psm,operations,Setup New Service,NS,,skill:bmad-psm-setup-new-service,bmad-psm-setup-new-service,false,sre,Operations Mode,"Set up new production service from architecture through deployment. Use when the user says 'new service' or 'setup service' or 'new microservice'.",output_folder,"service setup plan",
+psm,operations,Quick Diagnose,QD,,skill:bmad-psm-quick-diagnose,bmad-psm-quick-diagnose,false,sre,Operations Mode,"Quick diagnosis of production issue with minimal latency. Use when the user says 'something is broken' or 'quick diagnose' or 'what is happening?'.",output_folder,"diagnostic report",
--- a/src/psm/module.yaml
+++ b/src/psm/module.yaml
@ -0,0 +1,13 @@
+code: psm
+name: "PSM: Production Systems & MLOps"
+header: "BMad Production Systems Module"
+subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
+description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
+default_selected: false
+
+knowledge_base_path:
+  prompt:
+    - "Where is your production knowledge base? (folder with SKILL.md files)"
+    - "Leave default if you don't have one yet."
+  default: "docs/production-knowledge"
+  result: "{project-root}/{value}"
--- a/src/psm/teams/default-party.csv
+++ b/src/psm/teams/default-party.csv
@ -0,0 +1,4 @@
+name,displayName,title,icon,role,identity,communicationStyle,principles,module,path
+"sre","Minh","Site Reliability Engineer","🔧","Senior SRE + Production Operations Expert","Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.","Metric-driven, systematic. Always asks 'what is the SLO?' first.","SLO-first; automate everything; measure before optimizing; blameless postmortems.","psm","bmad/psm/agents/sre.md"
+"security","Hà","Security & Infrastructure Engineer","🛡️","Security Specialist + Infrastructure Expert","Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening.","Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.","Zero trust; defense in depth; security by default; least privilege.","psm","bmad/psm/agents/security.md"
+"mlops","Linh","MLOps & Performance Engineer","🤖","MLOps Specialist + Performance Engineer","MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.","Data-driven, experimental. 'Ship fast, measure everything.'","Reproducibility first; monitor drift; chaos engineering validates; cost-aware optimization.","psm","bmad/psm/agents/mlops.md"
--- a/src/psm/teams/ops-team.yaml
+++ b/src/psm/teams/ops-team.yaml
@ -0,0 +1,7 @@
+# Powered by BMAD-CORE™
+bundle:
+  name: Production Operations Team
+  icon: ⚙️
+  description: Production engineering team for incident response, security, and MLOps
+agents: "*"
+party: "./default-party.csv"
--- a/src/psm/workflows/bmad-psm-incident-response/SKILL.md
+++ b/src/psm/workflows/bmad-psm-incident-response/SKILL.md
@ -0,0 +1,6 @@
+---
+name: bmad-psm-incident-response
+description: 'Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says "production is down" or "incident response" or "we have an outage"'
+---
+
+Follow the instructions in [workflow.md](workflow.md).
--- a/src/psm/workflows/bmad-psm-incident-response/bmad-skill-manifest.yaml
+++ b/src/psm/workflows/bmad-psm-incident-response/bmad-skill-manifest.yaml
@ -0,0 +1 @@
+type: skill
--- a/src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md
+++ b/src/psm/workflows/bmad-psm-incident-response/incident-postmortem.template.md
@ -0,0 +1,269 @@
+---
+template_name: incident-postmortem
+template_version: "1.0.0"
+created_date: 2026-03-17
+description: Standard postmortem template for incident analysis and learning
+---
+
+# Incident Postmortem: {{INCIDENT_TITLE}}
+
+**Date**: {{INCIDENT_DATE}}
+**Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION_MINUTES}} minutes)
+**Severity**: {{SEV1|SEV2|SEV3}} ({{IMPACT_DESCRIPTION}})
+**Lead**: {{INCIDENT_COMMANDER_NAME}}
+**Facilitator**: {{POSTMORTEM_FACILITATOR_NAME}}
+
+---
+
+## Summary
+
+[1-2 paragraph executive summary of what happened, impact, and resolution]
+
+**Timeline at a glance**:
+- T-0:00 — Normal operation
+- T-{{TIME1}} — {{EVENT1}}
+- T-{{TIME2}} — {{EVENT2}}
+- T-{{RESOLUTION_TIME}} — Incident resolved
+
+**Impact**: {{METRIC1}} affected {{X}} users, {{METRIC2}}, {{METRIC3}}
+
+---
+
+## Detailed Timeline
+
+| Time | Event | Notes |
+|------|-------|-------|
+| {{T}} | {{What happened}} | {{Who detected it}} |
+| {{T+X}} | {{Next event}} | {{Action taken}} |
+| {{T+Y}} | {{Root cause identified}} | {{By whom}} |
+| {{T+Z}} | {{Fix applied}} | {{Verification steps}} |
+| {{T+Final}} | {{Incident resolved}} | {{Verification}} |
+
+---
+
+## Root Cause Analysis
+
+### Primary Cause
+
+**{{ROOT_CAUSE_TITLE}}**
+
+{{Detailed explanation of the root cause}}
+
+**How it happened**:
+1. {{Precondition 1}} (why the system was vulnerable)
+2. {{Trigger event}} (what caused the failure)
+3. {{Failure cascade}} (why it got worse)
+4. {{Detection lag}} (why it took X minutes to detect)
+
+**Evidence**:
+- {{Log entry or metric showing the issue}}
+- {{Related system behavior}}
+- {{Impact indicator}}
+
+### Contributing Factors
+
+- {{Factor 1}} — {{Brief explanation}}
+- {{Factor 2}} — {{Brief explanation}}
+- {{Factor 3}} — {{Brief explanation}}
+
+### Why Didn't We Catch This?
+
+- {{Missing monitoring}} — {{What metric would have alerted}}
+- {{Testing gap}} — {{What test would have failed}}
+- {{Documentation gap}} — {{What runbook would have helped}}
+- {{Knowledge gap}} — {{What training would have helped}}
+
+---
+
+## Impact Assessment
+
+### User Impact
+
+- **Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION}} minutes)
+- **Scale**: {{X}}% of {{METRIC}} (e.g., 5% of payment requests)
+- **Users Affected**: {{APPROX_COUNT}} users
+- **Revenue Impact**: {{$X}} (if applicable)
+- **Customer Escalations**: {{NUMBER}} tickets opened
+
+**User-facing symptoms**:
+- {{Symptom 1}} (e.g., "Checkout returns 500 error")
+- {{Symptom 2}} (e.g., "Page loads slowly")
+- {{Symptom 3}}
+
+### Operational Impact
+
+- **System Recovery**: {{SERVICE/METRIC}} took {{TIME}} to recover
+- **Cascading Effects**: {{SERVICE_X}} also affected due to {{reason}}
+- **On-call Load**: {{NUMBER}} pages, {{NUMBER}} escalations
+- **Data Loss**: {{None | {{Description}}}}
+
+---
+
+## Resolution & Recovery
+
+### Immediate Actions Taken
+
+1. **{{Time T+X}}** — {{Action 1}}
+   - Rationale: {{Why this helped}}
+   - Result: {{What changed}}
+
+2. **{{Time T+Y}}** — {{Action 2}}
+   - Rationale: {{Why this helped}}
+   - Result: {{What changed}}
+
+3. **{{Time T+Z}}** — {{Root Fix Applied}}
+   - Details: {{Technical description}}
+   - Verification: {{How we confirmed it worked}}
+
+### Rollback/Rollforward Decision
+
+**Decision**: {{Rollback to version X | Rollforward with fix | Hybrid approach}}
+
+**Rationale**: {{Explain why this was the right choice}}
+
+**Verification**: {{How we confirmed the fix worked}}
+
+---
+
+## Lessons Learned
+
+### What Went Well
+
+- {{Thing we did right}} — This prevented {{worse outcome}}
+- {{Thing we did right}} — Team coordination was excellent
+- {{Thing we did right}} — Monitoring caught {{something}}
+
+### What We Can Improve
+
+| Issue | Category | Severity | Recommendation | Owner |
+|-------|----------|----------|-----------------|-------|
+| {{We didn't detect it for X minutes}} | Observability | HIGH | Add alert for {{metric}} when > {{threshold}} | DevOps |
+| {{Runbook was outdated}} | Runbooks | MEDIUM | Update {{runbook}} with new architecture | SRE |
+| {{New service not in alerting system}} | Process | MEDIUM | Add new services to alert config automatically | Platform |
+| {{Team didn't know about new feature}} | Knowledge | LOW | Document new features in wiki | Tech Lead |
+
+---
+
+## Action Items
+
+### Critical (Must Complete Before Similar Incident)
+
+- [ ] **{{Action 1}}** — {{Description}}
+  - Owner: {{NAME}}
+  - Deadline: {{DATE}} (within 1 week)
+  - Acceptance: {{How we verify it's done}}
+
+- [ ] **{{Action 2}}** — {{Description}}
+  - Owner: {{NAME}}
+  - Deadline: {{DATE}} (within 1 week)
+  - Acceptance: {{How we verify it's done}}
+
+### High Priority (Target Next 2 Weeks)
+
+- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
+- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
+- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
+
+### Medium Priority (Target This Sprint)
+
+- [ ] {{Action}} — Owner: {{NAME}}
+- [ ] {{Action}} — Owner: {{NAME}}
+
+### Backlog (Good to Have)
+
+- [ ] {{Action}} — {{Description}}
+- [ ] {{Action}} — {{Description}}
+
+---
+
+## Prevention Measures
+
+### Short-term (1-2 Weeks)
+
+1. **{{Mitigation 1}}** — Prevents {{this exact incident}} from happening again
+   - How: {{Technical approach}}
+   - Effort: {{Estimate}}
+   - Timeline: {{When}}
+
+2. **{{Mitigation 2}}** — Catches similar issues earlier
+   - How: {{Technical approach}}
+   - Effort: {{Estimate}}
+   - Timeline: {{When}}
+
+### Long-term (Next Quarter)
+
+1. **{{Large architectural change}}** — Eliminates root cause class
+   - Rationale: {{Why this is better}}
+   - Effort: {{Estimate}}
+   - Timeline: {{When}}
+
+---
+
+## Incident Stats
+
+```
+MTTD (Mean Time To Detect): {{MINUTES}} minutes
+  - Automatic detection: {{If applicable, how}}
+  - Manual detection: {{Who found it}}
+
+MTTR (Mean Time To Resolve): {{MINUTES}} minutes
+  - Investigation time: {{MINUTES}}
+  - Fix implementation time: {{MINUTES}}
+  - Verification time: {{MINUTES}}
+
+Severity: {{SEV1|SEV2|SEV3}} ({{Criteria}})
+```
+
+---
+
+## Distribution & Follow-up
+
+- [x] Postmortem shared with: {{TEAM_LIST}}
+- [x] Customer communication sent: {{YES|NO|TEMPLATE_USED}}
+- [x] Action items tracked in: {{JIRA/BACKLOG}}
+- [x] Follow-up review scheduled: {{DATE}}
+
+**Follow-up Review**: {{DATE}} with {{ATTENDEES}}
+- Confirm all critical action items completed
+- Verify prevention measures working
+- Check for recurring patterns
+
+---
+
+## Appendix: Supporting Evidence
+
+### Logs
+
+```
+[Relevant log entries showing the incident]
+
+{{TIMESTAMP}} ERROR: {{MESSAGE}}
+{{TIMESTAMP}} ERROR: {{MESSAGE}}
+```
+
+### Metrics
+
+[Include screenshots or links to metric dashboards showing the incident]
+
+- Error rate spike: [Chart or metric]
+- Latency spike: [Chart or metric]
+- Traffic pattern: [Chart or metric]
+
+### Configuration Changes
+
+```yaml
+# Changes made before incident
+- {{Change 1}} ({{TIMESTAMP}})
+- {{Change 2}} ({{TIMESTAMP}})
+```
+
+---
+
+**Document Completed By**: {{NAME}}
+**Date**: {{DATE}}
+**Review Status**: Draft | Final | Approved
+
+**Approvals**:
+- [ ] Incident Commander: {{NAME}} {{DATE}}
+- [ ] Service Owner: {{NAME}} {{DATE}}
+- [ ] VP Engineering (if SEV1): {{NAME}} {{DATE}}
--- a/src/psm/workflows/bmad-psm-incident-response/workflow.md
+++ b/src/psm/workflows/bmad-psm-incident-response/workflow.md
@ -0,0 +1,163 @@
+---
+workflow_id: W-INCIDENT-001
+workflow_name: Production Incident Response
+version: 6.2.0
+lead_agent: "SRE Minh"
+supporting_agents: ["Architect Khang", "Mary Analyst"]
+phase: "3-Run: Emergency Response & Recovery"
+created_date: 2026-03-17
+last_modified: 2026-03-17
+config_file: "_config/config.yaml"
+estimated_duration: "15 minutes to 2 hours (depending on severity)"
+outputFile: '{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md'
+---
+
+# Production Incident Response Workflow — BMAD Pattern
+
+## Metadata & Context
+
+**Goal**: Triage, diagnose, resolve production incidents through systematic diagnosis and apply fixes with verification. This is the most critical workflow - minimize MTTR (Mean Time To Recovery) while maintaining system stability.
+
+**Lead Team**:
+- SRE Minh (Incident Command, Recovery Orchestration)
+- Architect Khang (Root Cause Analysis, System-wide Impact)
+- Mary Analyst (Impact Assessment, Post-Incident Review)
+
+**Success Criteria**:
+- ✓ Incident severity classified within 5 minutes
+- ✓ Root cause identified within first triage pass
+- ✓ Fix applied and verified
+- ✓ System metrics returned to baseline
+- ✓ Incident postmortem documented with action items
+- ✓ Prevention measures identified
+
+## Workflow Overview
+
+Workflow này di qua 4 bước atomic, mỗi bước focus vào một phase khác nhau:
+
+1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
+2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
+3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
+4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence
+
+## Configuration Loading
+
+Tự động load từ `_config/config.yaml`:
+
+```yaml
+project_context:
+  organization: "[loaded from config]"
+  environment: "production"
+  incident_channel: "slack:#incidents"
+
+workflow_defaults:
+  communication_language: "Vietnamese-English"
+  severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
+  escalation_contacts: "[loaded from config]"
+  on_call_engineer: "[loaded from config]"
+```
+
+## Workflow Architecture - Micro-File Design
+
+BMAD pattern: Mỗi step là một file riêng, load just-in-time. Workflow chain:
+
+```
+workflow.md (entry point)
+    ↓
+step-01-triage.md (classify severity, initial assessment)
+    ↓
+step-02-diagnose.md (root cause analysis)
+    ↓
+step-03-fix.md (apply fix, verify)
+    ↓
+step-04-postmortem.md (document, prevent)
+    ↓
+incident-response-summary.md (final output)
+```
+
+**Key Benefits**:
+- Single-step focus — engineer concentrates on one phase
+- Knowledge isolation — load only relevant SKILL docs per step
+- State tracking — save progress after each step
+- Easy resumption — if interrupted, restart from exact step
+
+## Skill References
+
+Workflow này load knowledge từ:
+
+- **5.07 Reliability & Resilience** → Circuit breaker patterns, fallback strategies, timeout management
+- **5.08 Observability & Monitoring** → Structured logging, metrics queries, distributed tracing
+- **5.09 Error Handling & Recovery** → Error classification, graceful degradation patterns
+- **5.10 Production Readiness** → Incident prevention checklist, alerting setup
+- **5.14 Documentation & Runbooks** → Postmortem templates, incident reports
+
+## Execution Model
+
+### Entry Point Logic
+
+```
+1. Check if incident session exists
+   → If NEW incident: Start from step-01-triage.md
+   → If ONGOING: Load incident-session.yaml → continue from last completed step
+   → If RESOLVED: Load postmortem template
+
+2. For each step:
+   a) Load step-{N}-{name}.md
+   b) Load referenced SKILL files (auto-parse "Load:" directives)
+   c) Execute MENU [A][C] options
+   d) Save step output to step-{N}-output.md + incident-context.yaml
+   e) Move to next step or conclude
+
+3. Final: Generate incident report + postmortem in outputs folder
+```
+
+### State Tracking
+
+Incident session frontmatter tracks progress:
+
+```yaml
+incident_context:
+  incident_id: "INC-2026-03-17-001"
+  severity: "SEV1" | "SEV2" | "SEV3" | "SEV4"
+  status: "triage" → "diagnosing" → "recovering" → "resolved" → "postmortem"
+  affected_services: ["service-1", "service-2"]
+  started_at: "2026-03-17T14:30:00Z"
+  timeline:
+    detected_at: "2026-03-17T14:30:00Z"
+    triage_completed_at: "2026-03-17T14:35:00Z"
+    root_cause_identified_at: "2026-03-17T14:50:00Z"
+    fix_applied_at: "2026-03-17T15:10:00Z"
+    resolved_at: "2026-03-17T15:15:00Z"
+  current_step: "step-02-diagnose"
+  last_updated: "2026-03-17T14:50:00Z"
+  incident_commander: "SRE Minh"
+```
+
+## Mandatory Workflow Rules
+
+1. **Speed first** — Triage must complete in < 5 minutes
+2. **Root cause identification** — Must identify root cause before fix attempt
+3. **Verify before declaring resolved** — Check metrics + user reports
+4. **Document everything** — Every action logged for postmortem
+5. **Escalation protocol** — SEV1 → Page on-call architect immediately
+6. **Communication** — Update stakeholders every 5-10 minutes
+7. **No flying blind** — All fixes must reference observability data
+
+## Severity Scale
+
+- **SEV1** — Service completely down, revenue impact, > 1% users affected → Page all on-call
+- **SEV2** — Major degradation, significant users affected, partial functionality down
+- **SEV3** — Moderate impact, some users affected, workaround possible
+- **SEV4** — Minor issue, limited users, can defer to business hours
+
+## Navigation
+
+Hãy chọn cách bắt đầu:
+
+- **[NEW-INC]** — Report new incident → Load step-01-triage
+- **[RESUME-INC]** — Continue existing incident (detect progress from incident-session.yaml)
+- **[ESCALATE]** — Escalate to on-call architect
+
+---
+
+**Hãy báo cáo tình trạng incident hoặc chọn [NEW-INC] để bắt đầu triage**
--- a/src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md
+++ b/src/psm/workflows/bmad-psm-mlops-deployment/SKILL.md
@ -0,0 +1,6 @@
+---
+name: bmad-psm-mlops-deployment
+description: 'Deploy ML model to production with validation and monitoring. Use when the user says "deploy model" or "ML deployment" or "model serving"'
+---
+
+Follow the instructions in [workflow.md](workflow.md).
--- a/src/psm/workflows/bmad-psm-mlops-deployment/bmad-skill-manifest.yaml
+++ b/src/psm/workflows/bmad-psm-mlops-deployment/bmad-skill-manifest.yaml
@ -0,0 +1 @@
+type: skill
--- a/src/psm/workflows/bmad-psm-mlops-deployment/workflow.md
+++ b/src/psm/workflows/bmad-psm-mlops-deployment/workflow.md
@ -0,0 +1,89 @@
+---
+workflow_id: MLOPS001
+workflow_name: MLOps Deployment
+description: Deploy ML model to production with validation, serving, and monitoring
+entry_point: steps/step-01-model-validation.md
+phase: 5-specialized
+lead_agent: "Linh (MLOps)"
+status: "active"
+created_date: 2026-03-17
+version: "1.0.0"
+estimated_duration: "3-4 hours"
+outputFile: '{output_folder}/psm-artifacts/mlops-deploy-{{project_name}}-{{date}}.md'
+---
+
+# Workflow: MLOps Deployment
+
+## Goal
+Deploy machine learning models to production with comprehensive validation, infrastructure setup, and post-deployment monitoring.
+
+## Overview
+
+MLOps deployment ensures ML models are production-ready and continuously monitored for performance and data drift. The workflow:
+
+1. **Validates** model quality, performance metrics, and data drift detection
+2. **Deploys** model to serving infrastructure with versioning and A/B testing
+3. **Monitors** model performance, data drift, and cost metrics post-deployment
+
+## Execution Path
+
+```
+START
+  ↓
+[Step 01] Model Validation (Check metrics, data drift, A/B test plan)
+  ↓
+[Step 02] Deploy Model (Setup serving, infrastructure, GPU optimization)
+  ↓
+[Step 03] Monitor (Langfuse/MLflow, drift detection, cost tracking)
+  ↓
+END
+```
+
+## Key Roles
+
+| Role | Agent | Responsibility |
+|------|-------|-----------------|
+| Lead | Linh (MLOps) | Coordinate deployment, monitor model health |
+| Data Scientist | Data Lead | Validate model quality, approve for production |
+| DevOps | Platform Eng | Setup infrastructure, manage resources |
+
+## Validation Gates (3)
+
+1. **Model Quality** — Accuracy, precision, recall metrics meet SLO
+2. **Data Quality** — No data drift detected; training/production data distribution aligned
+3. **Business Readiness** — A/B test plan ready, rollback strategy defined
+
+## Input Requirements
+
+- **Trained model artifact** — Model checkpoint, weights, configuration
+- **Performance metrics** — Baseline accuracy, latency, throughput expectations
+- **Data validation** — Training dataset description, expected data distribution
+- **Serving infrastructure** — Compute requirements (GPU/CPU), latency targets
+
+## Output Deliverable
+
+- **MLOps Deployment Report**
+  - Model version and metadata
+  - Performance validation summary
+  - Serving infrastructure setup
+  - Monitoring dashboard and alerts
+  - Data drift detection configuration
+
+## Success Criteria
+
+1. Model passes all quality gates before deployment
+2. Serving infrastructure deployed and load-tested
+3. Monitoring and alerting configured and validated
+4. Rollback strategy tested and documented
+5. Team trained on model updates and incident response
+
+## Next Steps After Workflow
+
+- Monitor model performance daily for first week
+- Track data drift metrics; alert if detected
+- Plan model retraining based on performance degradation
+- Document lessons learned in MLOps runbook
+
+---
+
+**Navigation**: [← Back to 5-specialized](../), [Next: Step 01 →](steps/step-01-model-validation.md)
--- a/src/psm/workflows/bmad-psm-production-readiness/SKILL.md
+++ b/src/psm/workflows/bmad-psm-production-readiness/SKILL.md
@ -0,0 +1,6 @@
+---
+name: bmad-psm-production-readiness
+description: 'Run production readiness review across 9 dimensions. Use when the user says "are we ready for production" or "PRR" or "go-live check"'
+---
+
+Follow the instructions in [workflow.md](workflow.md).
--- a/src/psm/workflows/bmad-psm-production-readiness/bmad-skill-manifest.yaml
+++ b/src/psm/workflows/bmad-psm-production-readiness/bmad-skill-manifest.yaml
@ -0,0 +1 @@
+type: skill
--- a/src/psm/workflows/bmad-psm-production-readiness/production-readiness.template.md
+++ b/src/psm/workflows/bmad-psm-production-readiness/production-readiness.template.md
@ -0,0 +1,367 @@
+---
+template_name: production-readiness-checklist
+template_version: "1.0.0"
+created_date: 2026-03-17
+description: Production Readiness Review checklist and report template
+---
+
+# Production Readiness Review (PRR)
+
+**Service**: {{SERVICE_NAME}}
+**Owner**: {{SERVICE_OWNER}}
+**Reviewer**: {{SRE_LEAD}} (Minh)
+**Review Date**: {{DATE}}
+**Target Go-Live**: {{TARGET_DATE}}
+
+---
+
+## Executive Summary
+
+{{1-2 paragraphs summarizing the readiness assessment, decision, and key findings}}
+
+**Overall Assessment**: {{READY | CONDITIONAL | NOT_READY}}
+
+**Timeline**: Service {{can | can conditionally | cannot}} proceed to production {{on {{DATE}}}}
+
+---
+
+## Production Readiness Scorecard
+
+### 9-Dimension Assessment
+
+| # | Dimension | Score | Status | Key Finding |
+|---|-----------|-------|--------|-------------|
+| 1 | Reliability | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 2 | Observability | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 3 | Performance | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 4 | Security | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 5 | Capacity | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 6 | Data | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 7 | Runbooks | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 8 | Dependencies | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+| 9 | Rollback | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
+
+**Summary**: {{X}} GREEN, {{Y}} YELLOW, {{Z}} RED
+
+---
+
+## Detailed Findings by Dimension
+
+### 1. Reliability
+
+**Goal**: Service meets SLO targets with documented failure modes and incident response plan.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 2. Observability
+
+**Goal**: Service has comprehensive logging, metrics, tracing, and dashboards for operational visibility.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 3. Performance
+
+**Goal**: Service meets latency/throughput targets and scales under expected load.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 4. Security
+
+**Goal**: Authentication, authorization, encryption, and secrets management are implemented.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 5. Capacity
+
+**Goal**: Resource requirements defined with growth headroom and cost acceptable.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 6. Data
+
+**Goal**: Data governance, backup, retention, and disaster recovery documented and tested.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 7. Runbooks
+
+**Goal**: Incident response, deployment, troubleshooting procedures documented and drilled.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 8. Dependencies
+
+**Goal**: External/internal dependencies mapped, versioned, with fallback strategies.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+### 9. Rollback
+
+**Goal**: Safe rollback strategy tested; deployment is reversible.
+
+**Findings**:
+
+- [ ] {{Finding 1}} ({{Status}})
+- [ ] {{Finding 2}} ({{Status}})
+- [ ] {{Finding 3}} ({{Status}})
+
+**Assessment**: {{Detailed narrative, 3-5 sentences}}
+
+**Score**: {{GREEN|YELLOW|RED}}
+
+---
+
+## Critical Blockers (P0)
+
+{{If any P0 blockers exist:}}
+
+Service **CANNOT** proceed to production until these are resolved:
+
+### P0 Blocker #1: {{ISSUE_TITLE}}
+
+- **Dimension**: {{Which dimension}}
+- **Description**: {{What's the problem}}
+- **Impact**: {{Why it's critical}}
+- **Resolution**: {{How to fix}}
+- **Owner**: {{Who must fix it}}
+- **Deadline**: {{When it must be done}}
+- **Acceptance**: {{How we verify it's fixed}}
+
+### P0 Blocker #2: {{ISSUE_TITLE}}
+
+{{Repeat format}}
+
+---
+
+## Risks to Manage (P1)
+
+Service can proceed with documented monitoring and contingency plans:
+
+### P1 Risk #1: {{ISSUE_TITLE}}
+
+- **Dimension**: {{Which dimension}}
+- **Description**: {{What's the problem}}
+- **Impact**: {{If it happens, what's the consequence}}
+- **Likelihood**: {{HIGH|MEDIUM|LOW}}
+- **Mitigation**: {{How we'll manage it}}
+- **Monitoring**: {{What metrics to watch}}
+- **Contingency**: {{What we'll do if it occurs}}
+- **Owner**: {{Who owns this risk}}
+- **Target Fix**: {{Timeline to resolve permanently}}
+
+### P1 Risk #2: {{ISSUE_TITLE}}
+
+{{Repeat format}}
+
+---
+
+## Recommendations
+
+**High Priority** (Next sprint):
+- {{Recommendation 1}}
+- {{Recommendation 2}}
+
+**Medium Priority** (Within 1 month):
+- {{Recommendation 1}}
+- {{Recommendation 2}}
+
+**Nice to Have** (Backlog):
+- {{Recommendation 1}}
+- {{Recommendation 2}}
+
+---
+
+## Final Decision
+
+### Decision
+
+**{{ ✅ GO | ⚠️ CONDITIONAL-GO | ❌ NO-GO }}**
+
+### Rationale
+
+{{Explain the decision. Why can/can't we proceed?}}
+
+### Conditions (If CONDITIONAL-GO)
+
+If proceeding despite P1 risks, document conditions:
+
+1. **{{Condition 1}}**: {{Description}}
+   - Owner: {{Who oversees this}}
+   - Success Criteria: {{How we verify it}}
+   - Escalation: {{Who to contact if issues}}
+
+2. **{{Condition 2}}**: {{Description}}
+   - Owner: {{Who oversees this}}
+   - Success Criteria: {{How we verify it}}
+   - Escalation: {{Who to contact if issues}}
+
+### Deployment Timeline
+
+{{If GO or CONDITIONAL-GO:}}
+
+- **Approved for deployment**: {{DATE}}
+- **Earliest go-live**: {{DATE}}
+- **Recommended window**: {{DATE/TIME}}
+- **On-call coverage required**: {{YES|NO}}
+- **Emergency rollback plan**: {{REFERENCE TO RUNBOOK}}
+
+---
+
+## Sign-offs & Approvals
+
+### Approval Chain
+
+- [ ] **SRE Lead** ({{NAME}}) — Review completed and findings approved
+  - Signature: ________________________ Date: __________
+
+- [ ] **Architecture Lead** ({{NAME}}) — Architecture validated
+  - Signature: ________________________ Date: __________
+
+- [ ] **Service Owner** ({{NAME}}) — Acknowledged findings and committed to actions
+  - Signature: ________________________ Date: __________
+
+- [ ] **VP Engineering** ({{NAME}}) — Risk accepted (if CONDITIONAL-GO)
+  - Signature: ________________________ Date: __________
+
+---
+
+## Post-Production Plan
+
+### First 24 Hours
+
+- [ ] SRE on-call monitoring closely
+- [ ] Daily standup with service team
+- [ ] Monitor for any unusual patterns
+- [ ] Be ready to rollback if needed
+
+### First Week
+
+- [ ] Daily metrics review
+- [ ] Watch for data drift or unusual behavior
+- [ ] Follow up on any P1 risks
+
+### Ongoing
+
+- [ ] Monthly PRR follow-ups to verify improvements
+- [ ] Track action items to completion
+- [ ] Update this PRR if significant changes made
+
+---
+
+## Action Items
+
+| ID | Action | Owner | Deadline | Type | Status |
+|----|--------|-------|----------|------|--------|
+| A1 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
+| A2 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
+| A3 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
+
+---
+
+## Appendix
+
+### A. Load Test Results
+
+[Link to or summary of load test results showing service meets performance targets]
+
+### B. Security Review Results
+
+[Link to or summary of security audit findings]
+
+### C. Architecture Diagrams
+
+[Include or link to system architecture, data flow, and deployment topology]
+
+### D. SLO Definition
+
+[Document the agreed-upon SLO targets for availability, latency, error rate]
+
+### E. Runbooks
+
+[Link to or list of key runbooks: incident response, deployment, rollback, troubleshooting]
+
+---
+
+**Report prepared by**: {{SRE_LEAD}}
+**Report date**: {{DATE}}
+**Last updated**: {{DATE}}
--- a/src/psm/workflows/bmad-psm-production-readiness/workflow.md
+++ b/src/psm/workflows/bmad-psm-production-readiness/workflow.md
@ -0,0 +1,92 @@
+---
+workflow_id: PRR001
+workflow_name: Production Readiness Review
+description: Validate service is ready for production using comprehensive readiness checklist
+entry_point: steps/step-01-init-checklist.md
+phase: 3-run
+lead_agent: "Minh (SRE)"
+status: "active"
+created_date: 2026-03-17
+version: "1.0.0"
+estimated_duration: "2-3 hours"
+outputFile: '{output_folder}/psm-artifacts/prr-{{project_name}}-{{date}}.md'
+---
+
+# Workflow: Production Readiness Review (PRR)
+
+## Goal
+Validate and certify that a service meets production readiness standards across 9 key dimensions before deployment.
+
+## Overview
+
+This workflow systematically evaluates a service against production readiness criteria defined in the Production Systems BMAD skill framework. Using SRE expertise and architectural patterns, the workflow:
+
+1. **Initializes** the PRR process with service context and dimensional overview
+2. **Deep reviews** each dimension (reliability, observability, performance, security, capacity, data, runbooks, dependencies, rollback)
+3. **Renders final decision** with GO/NO-GO/CONDITIONAL-GO recommendation
+
+## Execution Path
+
+```
+START
+  ↓
+[Step 01] Init Checklist (Load framework, gather service context, present dimensions)
+  ↓
+[Step 02] Deep Review (Score each dimension, identify blockers, recommendations)
+  ↓
+[Step 03] Final Decision (Scorecard, decision, action items, DONE)
+  ↓
+END
+```
+
+## Key Roles
+
+| Role | Agent | Responsibility |
+|------|-------|-----------------|
+| Lead | Minh (SRE) | Navigate workflow, coordinate review, make final call |
+| Subject Matter | Service Owner | Provide service context, clarify architecture |
+| Review Committee | Arch, SecOps, MLOps | Contribute expertise on specific dimensions |
+
+## Dimensions Evaluated (9)
+
+1. **Reliability** — SLA/SLO definition, error budgets, failure modes, incident response
+2. **Observability** — Logging, metrics, tracing, dashboards, alerting
+3. **Performance** — Latency targets, throughput, P99 tail behavior, optimization opportunities
+4. **Security** — Auth/authz, secrets management, encryption, audit logging, compliance
+5. **Capacity** — Resource limits, scaling policies, burst capacity, cost projections
+6. **Data** — Schema versioning, backup/restore, data governance, retention policies
+7. **Runbooks** — Incident runbooks, operational playbooks, troubleshooting guides
+8. **Dependencies** — External services, internal libraries, database versioning, API contracts
+9. **Rollback** — Rollback strategy, canary deployment, feature flags, smoke tests
+
+## Input Requirements
+
+- **Service name and owner** — Which service are we evaluating?
+- **Current architecture** — High-level design, tech stack, topology
+- **Existing metrics/dashboards** — Links to monitoring, SLO definitions
+- **Known gaps/risks** — Already identified issues to address
+
+## Output Deliverable
+
+- **Production Readiness Checklist** (template: `production-readiness.template.md`)
+  - Scorecard with 9 dimensions (red/yellow/green)
+  - Blockers and recommendations per dimension
+  - Final GO/NO-GO/CONDITIONAL-GO decision
+  - Explicit action items with owners and deadlines
+
+## Success Criteria
+
+1. All 9 dimensions evaluated with clear rationale
+2. Blockers categorized as P0 (must fix) or P1 (should fix)
+3. Team alignment on decision (documented in PRR report)
+4. Action plan with clear accountability and timeline
+
+## Next Steps After Workflow
+
+- If **GO**: Proceed to deployment; document in CHANGELOG
+- If **NO-GO**: Reschedule PRR once blockers addressed; track in backlog
+- If **CONDITIONAL-GO**: Deploy with documented caveats; setup monitoring for risk areas
+
+---
+
+**Navigation**: [← Back to 3-run](../), [Next: Step 01 →](steps/step-01-init-checklist.md)
--- a/src/psm/workflows/bmad-psm-quick-diagnose/SKILL.md
+++ b/src/psm/workflows/bmad-psm-quick-diagnose/SKILL.md
@ -0,0 +1,6 @@
+---
+name: bmad-psm-quick-diagnose
+description: 'Quick diagnosis of production issue with minimal latency. Use when the user says "something is broken" or "quick diagnose" or "what is happening?"'
+---
+
+Follow the instructions in [workflow.md](workflow.md).
--- a/src/psm/workflows/bmad-psm-quick-diagnose/bmad-skill-manifest.yaml
+++ b/src/psm/workflows/bmad-psm-quick-diagnose/bmad-skill-manifest.yaml
@ -0,0 +1 @@
+type: skill
--- a/src/psm/workflows/bmad-psm-quick-diagnose/workflow.md
+++ b/src/psm/workflows/bmad-psm-quick-diagnose/workflow.md
@ -0,0 +1,80 @@
+---
+workflow_id: QD001
+workflow_name: Quick Diagnose
+description: Fast diagnosis of production issue with root cause and fix suggestion
+entry_point: steps/step-01-gather.md
+phase: quick-flow
+lead_agent: "Minh (SRE)"
+status: "active"
+created_date: 2026-03-17
+version: "1.0.0"
+estimated_duration: "15-25 minutes"
+outputFile: '{output_folder}/psm-artifacts/quick-diagnose-{{date}}.md'
+---
+
+# Workflow: Quick Diagnose Production Issue
+
+## Goal
+Rapidly diagnose production issues by gathering symptom data, checking metrics, and suggesting fixes.
+
+## Overview
+
+Quick Diagnose is a lightweight workflow for time-sensitive production troubleshooting:
+
+1. **Gathers** symptom description and quick metrics check
+2. **Diagnoses** root cause using observability data
+3. **Suggests** fix or mitigation immediately
+
+## Execution Path
+
+```
+START
+  ↓
+[Step 01] Gather Context (What's broken? Check metrics)
+  ↓
+[Step 02] Diagnose & Fix (Root cause analysis → fix suggestion → verify)
+  ↓
+END
+```
+
+## Key Roles
+
+| Role | Agent |
+|------|-------|
+| Lead | Minh (SRE) |
+
+## Input Requirements
+
+- **Symptom description** — What is failing? (error message, behavior, timeline)
+- **Affected service/component** — What system is broken?
+- **Timeline** — When did it start? Is it ongoing?
+- **Impact** — How many users affected? Is revenue impacted?
+
+## Output Deliverable
+
+- **Quick Diagnosis Report** (markdown, 1-2 pages)
+  - Symptom analysis
+  - Root cause hypothesis
+  - Immediate mitigation (if needed)
+  - Fix suggestion with effort
+  - Follow-up actions
+
+## Success Criteria
+
+1. Root cause identified within 15-20 minutes
+2. Immediate mitigation available (if needed)
+3. Fix suggestion documented with clear steps
+4. Team knows what to do next
+
+## Quick Diagnose vs Full Production Readiness Review
+
+| Aspect | Quick Diagnose | Full PRR |
+|--------|---|---|
+| Trigger | Active incident | Pre-deployment |
+| Duration | 15-25 min | 2-3 hours |
+| Scope | Single issue | All 9 dimensions |
+| Goal | Fix now | Prevent issues |
+
+---
+
+**Navigation**: [← Back to quick-flow](../), [Next: Step 01 →](steps/step-01-gather.md)
--- a/src/psm/workflows/bmad-psm-security-audit/SKILL.md
+++ b/src/psm/workflows/bmad-psm-security-audit/SKILL.md
@ -0,0 +1,6 @@
+---
+name: bmad-psm-security-audit
+description: 'Run comprehensive security audit and threat assessment. Use when the user says "security audit" or "vulnerability assessment" or "security review"'
+---
+
+Follow the instructions in [workflow.md](workflow.md).
--- a/src/psm/workflows/bmad-psm-security-audit/bmad-skill-manifest.yaml
+++ b/src/psm/workflows/bmad-psm-security-audit/bmad-skill-manifest.yaml
@ -0,0 +1 @@
+type: skill
--- a/src/psm/workflows/bmad-psm-security-audit/security-audit-report.template.md
+++ b/src/psm/workflows/bmad-psm-security-audit/security-audit-report.template.md
@ -0,0 +1,502 @@
+---
+template_name: security-audit-report
+template_version: "1.0.0"
+created_date: 2026-03-17
+description: Security audit report with findings, severity levels, and remediation plan
+---
+
+# Security Audit Report
+
+**Service**: {{SERVICE_NAME}}
+**Service Owner**: {{SERVICE_OWNER}}
+**Auditor**: {{SECURITY_LEAD}} (Hà)
+**Audit Date**: {{START_DATE}} — {{END_DATE}}
+**Report Date**: {{REPORT_DATE}}
+**Scope**: {{SCOPE_DESCRIPTION}}
+
+---
+
+## Executive Summary
+
+This security audit evaluated {{SERVICE_NAME}} against security best practices and compliance requirements. The assessment identified {{X}} findings across {{Y}} security domains.
+
+**Overall Security Posture**: {{COMPLIANT | FINDINGS | CRITICAL}}
+
+{{1-2 paragraph summary of key findings, critical issues if any, and recommendations}}
+
+---
+
+## Audit Scope
+
+### Services Reviewed
+
+- {{Service 1}} ({{Description}})
+- {{Service 2}} ({{Description}})
+- {{Service 3}} ({{Description}})
+
+### Assessment Domains
+
+- ✅ Authentication & Authorization
+- ✅ API Security
+- ✅ Secrets Management
+- ✅ Encryption (in-transit & at-rest)
+- ✅ PII & Data Protection
+
+### Exclusions
+
+{{Any out-of-scope areas:}}
+- {{Item}} (reason)
+- {{Item}} (reason)
+
+---
+
+## Findings Summary
+
+### By Severity
+
+| Severity | Count | Trend |
+|----------|-------|-------|
+| **Critical** | {{X}} | {{↑/→/↓}} |
+| **High** | {{Y}} | {{↑/→/↓}} |
+| **Medium** | {{Z}} | {{↑/→/↓}} |
+| **Low** | {{W}} | {{↑/→/↓}} |
+| **Total** | {{X+Y+Z+W}} | |
+
+### By Domain
+
+| Domain | Critical | High | Medium | Low | Status |
+|--------|----------|------|--------|-----|--------|
+| Auth & Authz | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
+| API Security | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
+| Secrets Mgmt | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
+| Encryption | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
+| PII & Data | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
+
+---
+
+## Critical Severity Findings
+
+### [F1] {{Finding Title}}
+
+**Severity**: CRITICAL (CVSS {{8.0-10.0}})
+**Domain**: {{Which domain}}
+**Status**: {{Open | In Progress | Resolved}}
+
+**Description**:
+{{Detailed description of the vulnerability, how it could be exploited, and the impact}}
+
+**Evidence**:
+- {{Evidence 1}}
+- {{Evidence 2}}
+- {{Testing confirmation}}
+
+**Impact**:
+- {{Business impact}}
+- {{Technical impact}}
+- {{Compliance impact}}
+
+**Remediation**:
+1. {{Step 1}} ({{Estimated time}})
+2. {{Step 2}} ({{Estimated time}})
+3. {{Step 3}} ({{Estimated time}})
+
+**Owner**: {{Name}}
+**Target Fix Date**: {{DATE}}
+**Effort**: {{Est. hours/days}}
+**Verification**: {{How we'll confirm it's fixed}}
+
+---
+
+### [F2] {{Finding Title}}
+
+{{Repeat Critical severity format}}
+
+---
+
+## High Severity Findings
+
+### [F3] {{Finding Title}}
+
+**Severity**: HIGH (CVSS {{7.0-7.9}})
+**Domain**: {{Which domain}}
+**Status**: {{Open | In Progress | Resolved}}
+
+**Description**: {{Brief description}}
+
+**Impact**: {{Why it matters}}
+
+**Remediation**:
+1. {{Step 1}}
+2. {{Step 2}}
+
+**Owner**: {{Name}}
+**Target Date**: {{DATE}}
+
+---
+
+### [F4] {{Finding Title}}
+
+{{Repeat High severity format}}
+
+---
+
+## Medium Severity Findings
+
+### [F5] {{Finding Title}}
+
+**Severity**: MEDIUM (CVSS {{4.0-6.9}})
+**Domain**: {{Which domain}}
+**Description**: {{Brief description}}
+**Remediation**: {{Brief fix}}
+**Owner**: {{Name}} | **Target Date**: {{DATE}}
+
+---
+
+### [F6] {{Finding Title}}
+
+{{Repeat Medium severity format}}
+
+---
+
+## Low Severity Findings
+
+### [F7] {{Finding Title}}
+
+**Severity**: LOW (CVSS {{0.1-3.9}})
+**Description**: {{Brief description}}
+**Remediation**: {{Brief fix}}
+
+---
+
+### [F8] {{Finding Title}}
+
+{{Repeat Low severity format}}
+
+---
+
+## Domain-Specific Assessment
+
+### Domain 1: Authentication & Authorization
+
+**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
+
+**Strengths**:
+- {{Positive finding 1}}
+- {{Positive finding 2}}
+
+**Gaps**:
+- {{Gap 1}} — {{Impact}}
+- {{Gap 2}} — {{Impact}}
+
+**Recommendations**:
+1. {{Recommendation 1}}
+2. {{Recommendation 2}}
+
+---
+
+### Domain 2: API Security
+
+**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
+
+**Strengths**:
+- {{Positive finding 1}}
+- {{Positive finding 2}}
+
+**Gaps**:
+- {{Gap 1}} — {{Impact}}
+- {{Gap 2}} — {{Impact}}
+
+**Recommendations**:
+1. {{Recommendation 1}}
+2. {{Recommendation 2}}
+
+---
+
+### Domain 3: Secrets Management
+
+**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
+
+**Strengths**:
+- {{Positive finding 1}}
+- {{Positive finding 2}}
+
+**Gaps**:
+- {{Gap 1}} — {{Impact}}
+- {{Gap 2}} — {{Impact}}
+
+**Recommendations**:
+1. {{Recommendation 1}}
+2. {{Recommendation 2}}
+
+---
+
+### Domain 4: Encryption
+
+**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
+
+**Strengths**:
+- {{Positive finding 1}}
+- {{Positive finding 2}}
+
+**Gaps**:
+- {{Gap 1}} — {{Impact}}
+- {{Gap 2}} — {{Impact}}
+
+**Recommendations**:
+1. {{Recommendation 1}}
+2. {{Recommendation 2}}
+
+---
+
+### Domain 5: PII & Data Protection
+
+**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
+
+**Strengths**:
+- {{Positive finding 1}}
+- {{Positive finding 2}}
+
+**Gaps**:
+- {{Gap 1}} — {{Impact}}
+- {{Gap 2}} — {{Impact}}
+
+**Recommendations**:
+1. {{Recommendation 1}}
+2. {{Recommendation 2}}
+
+---
+
+## Compliance Assessment
+
+### GDPR (General Data Protection Regulation)
+
+**Applicable**: {{YES | NO | PARTIAL}}
+**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
+
+| Requirement | Status | Finding | Gap Fix |
+|-------------|--------|---------|---------|
+| Data Encryption | {{✅/❌}} | {{Description}} | {{Remediation}} |
+| Access Control | {{✅/❌}} | {{Description}} | {{Remediation}} |
+| Retention Policy | {{✅/❌}} | {{Description}} | {{Remediation}} |
+| Right to Deletion | {{✅/❌}} | {{Description}} | {{Remediation}} |
+| Data Processing Agreement | {{✅/❌}} | {{Description}} | {{Remediation}} |
+
+**Timeline to Compliance**: {{DATE or "Already compliant"}}
+
+---
+
+### PCI-DSS (Payment Card Industry Data Security Standard)
+
+**Applicable**: {{YES | NO | PARTIAL}}
+**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
+
+| Requirement | Status | Finding | Gap Fix |
+|-------------|--------|---------|---------|
+| TLS 1.2+ | {{✅/❌}} | {{Description}} | {{Remediation}} |
+| Secrets Management | {{✅/❌}} | {{Description}} | {{Remediation}} |
+| Input Validation | {{✅/❌}} | {{Description}} | {{Remediation}} |
+
+**Timeline to Compliance**: {{DATE or "Already compliant"}}
+
+---
+
+### SOC 2 Type II
+
+**Applicable**: {{YES | NO | PARTIAL}}
+**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
+
+**Gap Summary**: {{Description of gaps or "No gaps identified"}}
+
+**Timeline**: {{When audit can be conducted}}
+
+---
+
+### Other Regulations
+
+{{Any other applicable standards (HIPAA, FINRA, etc.)}}
+
+---
+
+## Remediation Roadmap
+
+### Critical Path (Week 1-2)
+
+**All Critical findings must be fixed before production deployment.**
+
+- [ ] {{F1}} — Owner: {{Name}}, Deadline: {{DATE}}
+- [ ] {{F2}} — Owner: {{Name}}, Deadline: {{DATE}}
+
+**Milestone**: Security re-scan on {{DATE}} to verify fixes
+
+---
+
+### Phase 2 (Week 3-4)
+
+Complete High-severity findings:
+
+- [ ] {{F3}} — Owner: {{Name}}, Deadline: {{DATE}}
+- [ ] {{F4}} — Owner: {{Name}}, Deadline: {{DATE}}
+
+**Milestone**: Second security review on {{DATE}}
+
+---
+
+### Phase 3 (Weeks 5-8)
+
+Address Medium-severity findings (can be post-production with monitoring):
+
+- [ ] {{F5}} — Owner: {{Name}}, Target: {{DATE}}
+- [ ] {{F6}} — Owner: {{Name}}, Target: {{DATE}}
+
+---
+
+### Backlog (Next Sprint)
+
+Low-severity items:
+
+- [ ] {{F7}} — {{Brief description}}
+- [ ] {{F8}} — {{Brief description}}
+
+---
+
+## Remediation Status Tracking
+
+| Finding | Owner | Deadline | Status | Last Update | Notes |
+|---------|-------|----------|--------|-------------|-------|
+| F1 | {{Name}} | {{Date}} | 🔴 Pending | {{Date}} | {{Notes}} |
+| F2 | {{Name}} | {{Date}} | 🟡 In Progress | {{Date}} | {{Notes}} |
+| F3 | {{Name}} | {{Date}} | 🟢 Complete | {{Date}} | {{Notes}} |
+
+---
+
+## Post-Audit Monitoring
+
+### Controls to Monitor
+
+{{If service proceeds to production despite findings:}}
+
+- **{{Control 1}}** — Monitor via {{method}}, alert if {{threshold}}
+- **{{Control 2}}** — Monitor via {{method}}, alert if {{threshold}}
+- **{{Control 3}}** — Monitor via {{method}}, alert if {{threshold}}
+
+### Incident Response
+
+If a security incident occurs:
+1. Activate incident response team
+2. Notify {{Escalation contacts}}
+3. Follow {{Incident response runbook}}
+4. Conduct post-incident security review
+
+---
+
+## Risk Assessment Matrix
+
+```
+             LIKELIHOOD
+             Low    Med    High
+    CRITICAL  H      C      C
+IMPACT
+    HIGH      M      H      C
+    MEDIUM    L      M      H
+    LOW       L      L      M
+
+Legend: C=Critical, H=High, M=Medium, L=Low
+```
+
+**Our findings map**:
+- {{F1}} — {{Position on matrix}}
+- {{F2}} — {{Position on matrix}}
+
+---
+
+## Positive Findings
+
+**Strengths to maintain:**
+
+- {{Positive 1}} — Keep doing this
+- {{Positive 2}} — Keep doing this
+- {{Positive 3}} — Keep doing this
+
+---
+
+## Recommendations Summary
+
+### Immediate (Critical)
+- {{Fix all Critical findings}} ({{effort}})
+
+### Short-term (High Priority)
+- {{Fix all High findings}} ({{effort}})
+- {{Implement automated scanning}} ({{effort}})
+- {{Setup security monitoring}} ({{effort}})
+
+### Medium-term
+- {{Implement {{technology}} for {{purpose}}}} ({{effort}})
+- {{Security training for team}} ({{effort}})
+
+### Long-term (Next 6 Months)
+- {{Major security initiative}} ({{effort}})
+- {{Penetration testing}} ({{effort}})
+
+---
+
+## Sign-offs & Approvals
+
+### Audit Approval
+
+- [ ] **Security Lead** ({{AUDITOR_NAME}})
+  - Signature: ________________________ Date: __________
+  - Assessment complete and findings documented
+
+### Service Owner Acknowledgment
+
+- [ ] **Service Owner** ({{SERVICE_OWNER}})
+  - Signature: ________________________ Date: __________
+  - Acknowledged findings and committed to remediation
+
+### Compliance Officer Review
+
+- [ ] **Compliance Officer** ({{NAME}})
+  - Signature: ________________________ Date: __________
+  - Compliance requirements verified
+
+### Executive Approval (If Production Clearance Needed)
+
+- [ ] **VP Engineering / Security** ({{NAME}})
+  - Signature: ________________________ Date: __________
+  - Risk accepted; approved for production
+
+---
+
+## Distribution
+
+- [x] Shared with: {{Service team, Leadership, Compliance}}
+- [x] Date shared: {{DATE}}
+- [x] Follow-up review scheduled: {{DATE}}
+
+---
+
+## Appendix: Testing Evidence
+
+### Code Review Findings
+
+```
+{{Code snippets demonstrating vulnerabilities}}
+```
+
+### Configuration Issues
+
+```
+{{Configuration examples showing gaps}}
+```
+
+### Dependencies Scan
+
+```
+{{Vulnerable dependencies identified}}
+```
+
+---
+
+**Report Prepared By**: {{AUDITOR_NAME}}
+**Report Date**: {{DATE}}
+**Review Status**: Draft | Final | Approved
--- a/src/psm/workflows/bmad-psm-security-audit/workflow.md
+++ b/src/psm/workflows/bmad-psm-security-audit/workflow.md
@ -0,0 +1,91 @@
+---
+workflow_id: SA001
+workflow_name: Security Audit
+description: Comprehensive security review using security patterns, config management, and compliance framework
+entry_point: steps/step-01-scope.md
+phase: 4-cross
+lead_agent: "Hà (Security)"
+status: "active"
+created_date: 2026-03-17
+version: "1.0.0"
+estimated_duration: "2-3 hours"
+outputFile: '{output_folder}/psm-artifacts/security-audit-{{project_name}}-{{date}}.md'
+---
+
+# Workflow: Security Audit
+
+## Goal
+Perform comprehensive security evaluation using Production Systems BMAD framework, covering threat modeling, vulnerability assessment, compliance, and security controls.
+
+## Overview
+
+Security audit is a critical cross-functional workflow that evaluates service security posture before production deployment or for ongoing compliance verification. The audit:
+
+1. **Scopes** the audit engagement, defines threat model, and identifies compliance requirements
+2. **Executes** detailed security assessment across multiple domains (authentication, data protection, infrastructure, API security)
+3. **Reports** findings with severity levels, remediation recommendations, and compliance status
+
+## Execution Path
+
+```
+START
+  ↓
+[Step 01] Scope & Threat Model (Define audit scope, identify threats, compliance reqs)
+  ↓
+[Step 02] Security Assessment (Execute checklist across domains, identify vulns)
+  ↓
+[Step 03] Security Report (Findings report, severity, recommendations, compliance)
+  ↓
+END
+```
+
+## Key Roles
+
+| Role | Agent | Responsibility |
+|------|-------|-----------------|
+| Lead | Hà (Security) | Lead audit, coordinate assessment, synthesize findings |
+| Subject Matter | Service Owner + Platform Eng | Provide architecture, answer security questions |
+| Compliance | Security/Compliance Team | Validate compliance mapping, sign-off |
+
+## Assessment Domains (5)
+
+1. **Authentication & Authorization** — Identity verification, access control, session management
+2. **API Security** — Input validation, rate limiting, API key management, CORS
+3. **Secrets Management** — Credential storage, rotation, access logging
+4. **Encryption** — In-transit (TLS), at-rest, key management
+5. **PII & Data Protection** — Classification, access controls, audit logging, retention
+
+## Input Requirements
+
+- **Service architecture diagram** — Components, data flows, external integrations
+- **Authentication/authorization approach** — OAuth2, JWT, SAML, custom
+- **Secrets storage mechanism** — Vault, cloud provider, environment variables
+- **Compliance requirements** — GDPR, CCPA, SOC2, industry-specific
+- **Known security controls** — WAF, TLS config, authentication libraries
+
+## Output Deliverable
+
+- **Security Audit Report** (template: `security-audit-report.template.md`)
+  - Audit scope and threat model
+  - Findings organized by domain with severity (Critical/High/Medium/Low)
+  - Remediation recommendations with priority and effort
+  - Compliance status matrix
+  - Sign-off
+
+## Success Criteria
+
+1. All security domains assessed with clear findings
+2. Severity levels assigned (using CVSS or similar framework)
+3. Remediation plan with owners and deadlines
+4. Compliance requirements verified (if applicable)
+5. Team alignment on security posture
+
+## Next Steps After Workflow
+
+- If **COMPLIANT**: Document in security registry; schedule periodic re-audit
+- If **NON-COMPLIANT**: Add remediation items to backlog; track closure
+- If **CRITICAL ISSUES**: Consider production pause until resolved
+
+---
+
+**Navigation**: [← Back to 4-cross](../), [Next: Step 01 →](steps/step-01-scope.md)
--- a/src/psm/workflows/bmad-psm-setup-new-service/SKILL.md
+++ b/src/psm/workflows/bmad-psm-setup-new-service/SKILL.md
@ -0,0 +1,6 @@
+---
+name: bmad-psm-setup-new-service
+description: 'Set up new production service from architecture through deployment. Use when the user says "new service" or "setup service" or "new microservice"'
+---
+
+Follow the instructions in [workflow.md](workflow.md).
--- a/src/psm/workflows/bmad-psm-setup-new-service/bmad-skill-manifest.yaml
+++ b/src/psm/workflows/bmad-psm-setup-new-service/bmad-skill-manifest.yaml
@ -0,0 +1 @@
+type: skill
--- a/src/psm/workflows/bmad-psm-setup-new-service/workflow.md
+++ b/src/psm/workflows/bmad-psm-setup-new-service/workflow.md
@ -0,0 +1,116 @@
+---
+workflow_id: W-SETUP-SVC-001
+workflow_name: Setup Production Service for BMAD
+version: 6.2.0
+lead_agent: "Architect Khang"
+supporting_agents: ["SRE Minh", "Mary Analyst"]
+phase: "1-Analysis → 2-Planning → 3-Solutioning → 4-Implementation"
+created_date: 2026-03-17
+last_modified: 2026-03-17
+config_file: "_config/config.yaml"
+estimated_duration: "12-20 hours"
+outputFile: '{output_folder}/psm-artifacts/service-setup-{{project_name}}-{{date}}.md'
+---
+
+# Setup Production Service Workflow — BMAD Pattern
+
+## Metadata & Context
+
+**Goal**: Xây dựng production-grade service từ scratch, với đầy đủ architecture, API design, deployment pipeline, reliability patterns, security, và production readiness.
+
+**Lead Team**:
+- SRE Minh (Reliability, Infrastructure, Operations)
+- Architect Khang (System Design, Technology Selection)
+- Mary Analyst (Requirements, Risk Assessment)
+
+**Success Criteria**:
+- ✓ Architecture design document approved
+- ✓ API contracts defined & validated
+- ✓ Database schema designed & indexed
+- ✓ CI/CD pipeline operational
+- ✓ Resilience & observability in place
+- ✓ Security & compliance verified
+- ✓ Production readiness checklist passed
+
+## Workflow Overview
+
+Workflow này di qua 6 bước atomic, mỗi bước focus vào một domain riêng:
+
+1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection
+2. **Step-02-API-Database** → API Design + Database Selection + Schema
+3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy
+4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling
+5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config
+6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision
+
+## Configuration Loading
+
+Tự động load từ `_config/config.yaml`:
+
+```yaml
+project_context:
+  user_name: "[loaded from config]"
+  organization: "[loaded from config]"
+  environment: "production"
+
+workflow_defaults:
+  communication_language: "Vietnamese"
+  output_folder: "./outputs/setup-new-service-{service_name}"
+  timestamp: "2026-03-17"
+```
+
+## Execution Model
+
+### Entry Point Logic
+
+```
+1. Check if workflow.md exists in outputs folder
+   → If NEW: Start from step-01-architecture.md
+   → If RESUME: Load progress.yaml → auto-skip completed steps
+   → If PARTIAL: Load step-N-context.yaml → resume from step N
+
+2. For each step:
+   a) Load step-{N}-{name}.md
+   b) Load referenced SKILL files (auto-parse "Load:" directives)
+   c) Execute MENU [A][C] options
+   d) Save step output to step-{N}-output.md
+   e) Move to next step
+
+3. Final: Generate comprehensive outputs in outputs folder
+```
+
+### State Tracking
+
+Output document frontmatter tracks progress:
+
+```yaml
+workflow_progress:
+  step_01_architecture: "completed"
+  step_02_api_database: "completed"
+  step_03_build_deploy: "in_progress"
+  step_04_reliability: "pending"
+  step_05_security_infra: "pending"
+  step_06_readiness: "pending"
+  last_updated: "2026-03-17T14:30:00Z"
+  current_agent: "Architect Khang"
+```
+
+## Mandatory Workflow Rules
+
+1. **No skipping steps** — Mỗi step phải được execute theo order
+2. **Validate assumptions** — Mỗi decision phải được document
+3. **Cross-phase collaboration** — Architects + SRE + Analysts work together
+4. **Output artifacts** — Mỗi step produce tangible output documents
+5. **Handoff protocol** — Context được transfer giữa steps rõ ràng
+
+## Navigation
+
+Hãy chọn cách bắt đầu:
+
+- **[NEW]** — Bắt đầu workflow mới → Load step-01
+- **[RESUME]** — Quay lại workflow đã từng chạy (detect progress)
+- **[SKIP-TO]** — Nhảy tới step cụ thể (dev-only, requires confirmation)
+
+---
+
+**Tiếp tục bằng cách chọn [NEW] hoặc [RESUME]**
--- a/tools/cli/external-official-modules.yaml
+++ b/tools/cli/external-official-modules.yaml
@ -42,6 +42,16 @@ modules:
    type: bmad-org
    npmPackage: bmad-method-test-architecture-enterprise

+  bmad-production-systems:
+    url: https://github.com/DoanNgocCuong/bmad-module-production-systems
+    module-definition: src/module.yaml
+    code: psm
+    name: "Production Systems & MLOps (BMad Community Module)"
+    description: "Production engineering with SRE, Security, and MLOps agents for incident response, PRR, and deployment"
+    defaultSelected: false
+    type: community
+    npmPackage: bmad-production-systems
+
  whiteport-design-studio:
    url: https://github.com/bmad-code-org/bmad-method-wds-expansion
    module-definition: src/module.yaml