refactor(psm): Remove embedded PSM module - use external repo instead

PSM is now a standalone module at:
https://github.com/DoanNgocCuong/bmad-module-production-systems

It's registered in external-official-modules.yaml for installer integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
Doan Ngoc Cuong 2026-03-19 06:08:44 +07:00
parent a2df51ccce
commit 0579a4f55e
30 changed files with 0 additions and 1948 deletions

@@ -1,21 +0,0 @@
# MLOps & Performance Engineer Agent Definition
agent:
metadata:
id: "_bmad/psm/agents/mlops.md"
name: Linh
title: MLOps & Performance Engineer
icon: 🤖
module: psm
hasSidecar: false
persona:
role: MLOps Specialist + Performance Engineer
identity: MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.
communication_style: Data-driven, experimental. Thinks in pipelines and metrics. Ship fast, measure everything.
principles: Reproducibility first; monitor model drift; chaos engineering validates assumptions; cost-aware optimization.
menu:
- trigger: MD or fuzzy match on mlops-deploy
workflow: "skill:bmad-psm-mlops-deployment"
description: "[MD] MLOps Deployment — Model validation, deploy, monitor"

@@ -1,21 +0,0 @@
# Security & Infrastructure Engineer Agent Definition
agent:
metadata:
id: "_bmad/psm/agents/security.md"
name: Hà
title: Security & Infrastructure Engineer
icon: 🛡️
module: psm
hasSidecar: false
persona:
role: Security Specialist + Infrastructure Expert
identity: Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening. Thorough and detail-oriented.
communication_style: Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.
principles: Zero trust architecture; defense in depth; security by default; least privilege.
menu:
- trigger: SA or fuzzy match on security-audit
workflow: "skill:bmad-psm-security-audit"
description: "[SA] Security Audit — Scope, audit, report"

@@ -1,21 +0,0 @@
# Production Standards for PSM
SRE operational standards, incident response protocols, and production quality benchmarks.
## User Specified CRITICAL Rules - Supersedes General Rules
None
## General CRITICAL RULES
### Rule 1: SLO-First Approach
ALL production decisions MUST reference defined SLOs. No optimization without measurement baseline.
### Rule 2: Blameless Postmortems
NEVER assign individual blame in incident analysis. Focus on systemic improvements.
### Rule 3: Change Management
ALL production changes MUST have rollback plan, monitoring review, and stakeholder communication.
### Rule 4: Severity Classification
SEV1: Complete outage >50% users. SEV2: Major degradation >20%. SEV3: Minor <20%. SEV4: Cosmetic.
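The Rule 4 thresholds lend themselves to a mechanical check; a minimal sketch (the function name and signature are illustrative, not part of the module):

```python
def classify_severity(affected_user_pct: float, cosmetic: bool = False) -> str:
    """Map user impact to a severity level per Rule 4 (illustrative only)."""
    if cosmetic:
        return "SEV4"  # cosmetic issue
    if affected_user_pct > 50:
        return "SEV1"  # complete outage, >50% of users
    if affected_user_pct > 20:
        return "SEV2"  # major degradation, >20% of users
    return "SEV3"      # minor impact, <20% of users
```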

@@ -1,30 +0,0 @@
# Site Reliability Engineer Agent Definition
agent:
metadata:
id: "_bmad/psm/agents/sre.md"
name: Minh
title: Site Reliability Engineer
icon: 🔧
module: psm
hasSidecar: true
persona:
role: Senior SRE + Production Operations Expert
identity: Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.
communication_style: Metric-driven, systematic. Translates business goals to technical SLOs. Always asks 'what is the SLO?' first.
principles: SLO-first approach; automate everything; measure before optimizing; blameless postmortems.
menu:
- trigger: IR or fuzzy match on incident
workflow: "skill:bmad-psm-incident-response"
description: "[IR] Incident Response — Triage, diagnose, fix, postmortem"
- trigger: PR or fuzzy match on readiness
workflow: "skill:bmad-psm-production-readiness"
description: "[PR] Production Readiness Review — 9-dimension assessment"
- trigger: NS or fuzzy match on new-service
workflow: "skill:bmad-psm-setup-new-service"
description: "[NS] Setup New Service — Architecture to deployment"
- trigger: QD or fuzzy match on diagnose
workflow: "skill:bmad-psm-quick-diagnose"
description: "[QD] Quick Diagnose — Fast production troubleshooting"

@@ -1,13 +0,0 @@
code: psm
name: "PSM: Production Systems & MLOps"
header: "BMad Production Systems Module"
subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
default_selected: false
knowledge_base_path:
prompt:
- "Where is your production knowledge base? (folder with SKILL.md files)"
- "Leave default if you don't have one yet."
default: "docs/production-knowledge"
result: "{project-root}/{value}"

@@ -1,7 +0,0 @@
module,phase,name,code,sequence,workflow-file,command,required,agent,options,description,output-location,outputs,
psm,operations,Incident Response,IR,,skill:bmad-psm-incident-response,bmad-psm-incident-response,false,sre,Operations Mode,"Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says 'production is down' or 'incident response' or 'we have an outage'.",output_folder,"incident response report",
psm,operations,Production Readiness,PR,,skill:bmad-psm-production-readiness,bmad-psm-production-readiness,false,sre,Operations Mode,"Run production readiness review across 9 dimensions. Use when the user says 'are we ready for production' or 'PRR' or 'go-live check'.",output_folder,"production readiness assessment",
psm,operations,Security Audit,SA,,skill:bmad-psm-security-audit,bmad-psm-security-audit,false,security,Operations Mode,"Run comprehensive security audit and threat assessment. Use when the user says 'security audit' or 'vulnerability assessment' or 'security review'.",output_folder,"security audit report",
psm,operations,MLOps Deployment,MD,,skill:bmad-psm-mlops-deployment,bmad-psm-mlops-deployment,false,mlops,Operations Mode,"Deploy ML model to production with validation and monitoring. Use when the user says 'deploy model' or 'ML deployment' or 'model serving'.",output_folder,"mlops deployment report",
psm,operations,Setup New Service,NS,,skill:bmad-psm-setup-new-service,bmad-psm-setup-new-service,false,sre,Operations Mode,"Set up new production service from architecture through deployment. Use when the user says 'new service' or 'setup service' or 'new microservice'.",output_folder,"service setup plan",
psm,operations,Quick Diagnose,QD,,skill:bmad-psm-quick-diagnose,bmad-psm-quick-diagnose,false,sre,Operations Mode,"Quick diagnosis of production issue with minimal latency. Use when the user says 'something is broken' or 'quick diagnose' or 'what is happening?'.",output_folder,"diagnostic report",
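One way an installer could consume this manifest with the standard `csv` module (column names are taken from the header row above; the inlined rows are abbreviated for illustration):

```python
import csv
import io

# In the real installer this would be read from the module's manifest file;
# two abbreviated rows are inlined here.
MANIFEST = """\
module,phase,name,code,sequence,workflow-file,command,required,agent,options,description,output-location,outputs
psm,operations,Incident Response,IR,,skill:bmad-psm-incident-response,bmad-psm-incident-response,false,sre,Operations Mode,Handle production incidents.,output_folder,incident response report
psm,operations,Security Audit,SA,,skill:bmad-psm-security-audit,bmad-psm-security-audit,false,security,Operations Mode,Run security audit.,output_folder,security audit report
"""

workflows = list(csv.DictReader(io.StringIO(MANIFEST)))
# Index workflows by their two-letter trigger code
by_code = {row["code"]: row for row in workflows}
```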

@@ -1,13 +0,0 @@
code: psm
name: "PSM: Production Systems & MLOps"
header: "BMad Production Systems Module"
subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
default_selected: false
knowledge_base_path:
prompt:
- "Where is your production knowledge base? (folder with SKILL.md files)"
- "Leave default if you don't have one yet."
default: "docs/production-knowledge"
result: "{project-root}/{value}"

@@ -1,4 +0,0 @@
name,displayName,title,icon,role,identity,communicationStyle,principles,module,path
"sre","Minh","Site Reliability Engineer","🔧","Senior SRE + Production Operations Expert","Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.","Metric-driven, systematic. Always asks 'what is the SLO?' first.","SLO-first; automate everything; measure before optimizing; blameless postmortems.","psm","bmad/psm/agents/sre.md"
"security","Hà","Security & Infrastructure Engineer","🛡️","Security Specialist + Infrastructure Expert","Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening.","Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.","Zero trust; defense in depth; security by default; least privilege.","psm","bmad/psm/agents/security.md"
"mlops","Linh","MLOps & Performance Engineer","🤖","MLOps Specialist + Performance Engineer","MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.","Data-driven, experimental. 'Ship fast, measure everything.'","Reproducibility first; monitor drift; chaos engineering validates; cost-aware optimization.","psm","bmad/psm/agents/mlops.md"

@@ -1,7 +0,0 @@
# Powered by BMAD-CORE™
bundle:
name: Production Operations Team
icon: ⚙️
description: Production engineering team for incident response, security, and MLOps
agents: "*"
party: "./default-party.csv"

@@ -1,6 +0,0 @@
---
name: bmad-psm-incident-response
description: 'Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says "production is down" or "incident response" or "we have an outage"'
---
Follow the instructions in [workflow.md](workflow.md).

@@ -1,269 +0,0 @@
---
template_name: incident-postmortem
template_version: "1.0.0"
created_date: 2026-03-17
description: Standard postmortem template for incident analysis and learning
---
# Incident Postmortem: {{INCIDENT_TITLE}}
**Date**: {{INCIDENT_DATE}}
**Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION_MINUTES}} minutes)
**Severity**: {{SEV1|SEV2|SEV3}} ({{IMPACT_DESCRIPTION}})
**Lead**: {{INCIDENT_COMMANDER_NAME}}
**Facilitator**: {{POSTMORTEM_FACILITATOR_NAME}}
---
## Summary
[1-2 paragraph executive summary of what happened, impact, and resolution]
**Timeline at a glance**:
- T-0:00 — Normal operation
- T-{{TIME1}} — {{EVENT1}}
- T-{{TIME2}} — {{EVENT2}}
- T-{{RESOLUTION_TIME}} — Incident resolved
**Impact**: {{METRIC1}} affected {{X}} users, {{METRIC2}}, {{METRIC3}}
---
## Detailed Timeline
| Time | Event | Notes |
|------|-------|-------|
| {{T}} | {{What happened}} | {{Who detected it}} |
| {{T+X}} | {{Next event}} | {{Action taken}} |
| {{T+Y}} | {{Root cause identified}} | {{By whom}} |
| {{T+Z}} | {{Fix applied}} | {{Verification steps}} |
| {{T+Final}} | {{Incident resolved}} | {{Verification}} |
---
## Root Cause Analysis
### Primary Cause
**{{ROOT_CAUSE_TITLE}}**
{{Detailed explanation of the root cause}}
**How it happened**:
1. {{Precondition 1}} (why the system was vulnerable)
2. {{Trigger event}} (what caused the failure)
3. {{Failure cascade}} (why it got worse)
4. {{Detection lag}} (why it took X minutes to detect)
**Evidence**:
- {{Log entry or metric showing the issue}}
- {{Related system behavior}}
- {{Impact indicator}}
### Contributing Factors
- {{Factor 1}} — {{Brief explanation}}
- {{Factor 2}} — {{Brief explanation}}
- {{Factor 3}} — {{Brief explanation}}
### Why Didn't We Catch This?
- {{Missing monitoring}} — {{What metric would have alerted}}
- {{Testing gap}} — {{What test would have failed}}
- {{Documentation gap}} — {{What runbook would have helped}}
- {{Knowledge gap}} — {{What training would have helped}}
---
## Impact Assessment
### User Impact
- **Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION}} minutes)
- **Scale**: {{X}}% of {{METRIC}} (e.g., 5% of payment requests)
- **Users Affected**: {{APPROX_COUNT}} users
- **Revenue Impact**: {{$X}} (if applicable)
- **Customer Escalations**: {{NUMBER}} tickets opened
**User-facing symptoms**:
- {{Symptom 1}} (e.g., "Checkout returns 500 error")
- {{Symptom 2}} (e.g., "Page loads slowly")
- {{Symptom 3}}
### Operational Impact
- **System Recovery**: {{SERVICE/METRIC}} took {{TIME}} to recover
- **Cascading Effects**: {{SERVICE_X}} also affected due to {{reason}}
- **On-call Load**: {{NUMBER}} pages, {{NUMBER}} escalations
- **Data Loss**: {{None | {{Description}}}}
---
## Resolution & Recovery
### Immediate Actions Taken
1. **{{Time T+X}}** — {{Action 1}}
- Rationale: {{Why this helped}}
- Result: {{What changed}}
2. **{{Time T+Y}}** — {{Action 2}}
- Rationale: {{Why this helped}}
- Result: {{What changed}}
3. **{{Time T+Z}}** — {{Root Fix Applied}}
- Details: {{Technical description}}
- Verification: {{How we confirmed it worked}}
### Rollback/Rollforward Decision
**Decision**: {{Rollback to version X | Rollforward with fix | Hybrid approach}}
**Rationale**: {{Explain why this was the right choice}}
**Verification**: {{How we confirmed the fix worked}}
---
## Lessons Learned
### What Went Well
- {{Thing we did right}} — This prevented {{worse outcome}}
- {{Thing we did right}} — Team coordination was excellent
- {{Thing we did right}} — Monitoring caught {{something}}
### What We Can Improve
| Issue | Category | Severity | Recommendation | Owner |
|-------|----------|----------|-----------------|-------|
| {{We didn't detect it for X minutes}} | Observability | HIGH | Add alert for {{metric}} when > {{threshold}} | DevOps |
| {{Runbook was outdated}} | Runbooks | MEDIUM | Update {{runbook}} with new architecture | SRE |
| {{New service not in alerting system}} | Process | MEDIUM | Add new services to alert config automatically | Platform |
| {{Team didn't know about new feature}} | Knowledge | LOW | Document new features in wiki | Tech Lead |
---
## Action Items
### Critical (Must Complete Before Similar Incident)
- [ ] **{{Action 1}}** — {{Description}}
- Owner: {{NAME}}
- Deadline: {{DATE}} (within 1 week)
- Acceptance: {{How we verify it's done}}
- [ ] **{{Action 2}}** — {{Description}}
- Owner: {{NAME}}
- Deadline: {{DATE}} (within 1 week)
- Acceptance: {{How we verify it's done}}
### High Priority (Target Next 2 Weeks)
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
### Medium Priority (Target This Sprint)
- [ ] {{Action}} — Owner: {{NAME}}
- [ ] {{Action}} — Owner: {{NAME}}
### Backlog (Good to Have)
- [ ] {{Action}} — {{Description}}
- [ ] {{Action}} — {{Description}}
---
## Prevention Measures
### Short-term (1-2 Weeks)
1. **{{Mitigation 1}}** — Prevents {{this exact incident}} from happening again
- How: {{Technical approach}}
- Effort: {{Estimate}}
- Timeline: {{When}}
2. **{{Mitigation 2}}** — Catches similar issues earlier
- How: {{Technical approach}}
- Effort: {{Estimate}}
- Timeline: {{When}}
### Long-term (Next Quarter)
1. **{{Large architectural change}}** — Eliminates root cause class
- Rationale: {{Why this is better}}
- Effort: {{Estimate}}
- Timeline: {{When}}
---
## Incident Stats
```
MTTD (Mean Time To Detect): {{MINUTES}} minutes
- Automatic detection: {{If applicable, how}}
- Manual detection: {{Who found it}}
MTTR (Mean Time To Resolve): {{MINUTES}} minutes
- Investigation time: {{MINUTES}}
- Fix implementation time: {{MINUTES}}
- Verification time: {{MINUTES}}
Severity: {{SEV1|SEV2|SEV3}} ({{Criteria}})
```
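The MTTD/MTTR figures can be computed directly from timeline timestamps; a small sketch (the helper and the example timestamps are illustrative, in the ISO-8601 format this workflow's session files use):

```python
from datetime import datetime

def minutes_between(start_iso: str, end_iso: str) -> float:
    """Elapsed minutes between two ISO-8601 timestamps."""
    start = datetime.fromisoformat(start_iso)
    end = datetime.fromisoformat(end_iso)
    return (end - start).total_seconds() / 60

# Example timeline (values are illustrative)
detected = "2026-03-17T14:30:00+00:00"
resolved = "2026-03-17T15:15:00+00:00"
mttr = minutes_between(detected, resolved)
```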
---
## Distribution & Follow-up
- [x] Postmortem shared with: {{TEAM_LIST}}
- [x] Customer communication sent: {{YES|NO|TEMPLATE_USED}}
- [x] Action items tracked in: {{JIRA/BACKLOG}}
- [x] Follow-up review scheduled: {{DATE}}
**Follow-up Review**: {{DATE}} with {{ATTENDEES}}
- Confirm all critical action items completed
- Verify prevention measures working
- Check for recurring patterns
---
## Appendix: Supporting Evidence
### Logs
```
[Relevant log entries showing the incident]
{{TIMESTAMP}} ERROR: {{MESSAGE}}
{{TIMESTAMP}} ERROR: {{MESSAGE}}
```
### Metrics
[Include screenshots or links to metric dashboards showing the incident]
- Error rate spike: [Chart or metric]
- Latency spike: [Chart or metric]
- Traffic pattern: [Chart or metric]
### Configuration Changes
```yaml
# Changes made before incident
- {{Change 1}} ({{TIMESTAMP}})
- {{Change 2}} ({{TIMESTAMP}})
```
---
**Document Completed By**: {{NAME}}
**Date**: {{DATE}}
**Review Status**: Draft | Final | Approved
**Approvals**:
- [ ] Incident Commander: {{NAME}} {{DATE}}
- [ ] Service Owner: {{NAME}} {{DATE}}
- [ ] VP Engineering (if SEV1): {{NAME}} {{DATE}}

@@ -1,163 +0,0 @@
---
workflow_id: W-INCIDENT-001
workflow_name: Production Incident Response
version: 6.2.0
lead_agent: "SRE Minh"
supporting_agents: ["Architect Khang", "Mary Analyst"]
phase: "3-Run: Emergency Response & Recovery"
created_date: 2026-03-17
last_modified: 2026-03-17
config_file: "_config/config.yaml"
estimated_duration: "15 minutes to 2 hours (depending on severity)"
outputFile: '{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md'
---
# Production Incident Response Workflow — BMAD Pattern
## Metadata & Context
**Goal**: Triage, diagnose, and resolve production incidents systematically, applying fixes with verification. This is the most critical workflow: minimize MTTR (Mean Time To Recovery) while maintaining system stability.
**Lead Team**:
- SRE Minh (Incident Command, Recovery Orchestration)
- Architect Khang (Root Cause Analysis, System-wide Impact)
- Mary Analyst (Impact Assessment, Post-Incident Review)
**Success Criteria**:
- ✓ Incident severity classified within 5 minutes
- ✓ Root cause identified within first triage pass
- ✓ Fix applied and verified
- ✓ System metrics returned to baseline
- ✓ Incident postmortem documented with action items
- ✓ Prevention measures identified
## Workflow Overview
This workflow moves through 4 atomic steps, each focused on a different phase:
1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence
## Configuration Loading
Automatically loaded from `_config/config.yaml`:
```yaml
project_context:
organization: "[loaded from config]"
environment: "production"
incident_channel: "slack:#incidents"
workflow_defaults:
communication_language: "Vietnamese-English"
severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
escalation_contacts: "[loaded from config]"
on_call_engineer: "[loaded from config]"
```
## Workflow Architecture - Micro-File Design
BMAD pattern: each step is its own file, loaded just-in-time. Workflow chain:
```
workflow.md (entry point)
step-01-triage.md (classify severity, initial assessment)
step-02-diagnose.md (root cause analysis)
step-03-fix.md (apply fix, verify)
step-04-postmortem.md (document, prevent)
incident-response-summary.md (final output)
```
**Key Benefits**:
- Single-step focus — engineer concentrates on one phase
- Knowledge isolation — load only relevant SKILL docs per step
- State tracking — save progress after each step
- Easy resumption — if interrupted, restart from exact step
## Skill References
Workflow này load knowledge từ:
- **5.07 Reliability & Resilience** → Circuit breaker patterns, fallback strategies, timeout management
- **5.08 Observability & Monitoring** → Structured logging, metrics queries, distributed tracing
- **5.09 Error Handling & Recovery** → Error classification, graceful degradation patterns
- **5.10 Production Readiness** → Incident prevention checklist, alerting setup
- **5.14 Documentation & Runbooks** → Postmortem templates, incident reports
## Execution Model
### Entry Point Logic
```
1. Check if incident session exists
→ If NEW incident: Start from step-01-triage.md
→ If ONGOING: Load incident-session.yaml → continue from last completed step
→ If RESOLVED: Load postmortem template
2. For each step:
a) Load step-{N}-{name}.md
b) Load referenced SKILL files (auto-parse "Load:" directives)
c) Execute MENU [A][C] options
d) Save step output to step-{N}-output.md + incident-context.yaml
e) Move to next step or conclude
3. Final: Generate incident report + postmortem in outputs folder
```
### State Tracking
Incident session frontmatter tracks progress:
```yaml
incident_context:
incident_id: "INC-2026-03-17-001"
severity: "SEV1" | "SEV2" | "SEV3" | "SEV4"
status: "triage" → "diagnosing" → "recovering" → "resolved" → "postmortem"
affected_services: ["service-1", "service-2"]
started_at: "2026-03-17T14:30:00Z"
timeline:
detected_at: "2026-03-17T14:30:00Z"
triage_completed_at: "2026-03-17T14:35:00Z"
root_cause_identified_at: "2026-03-17T14:50:00Z"
fix_applied_at: "2026-03-17T15:10:00Z"
resolved_at: "2026-03-17T15:15:00Z"
current_step: "step-02-diagnose"
last_updated: "2026-03-17T14:50:00Z"
incident_commander: "SRE Minh"
```
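Updating that session state after each step could look like this sketch (field names follow the frontmatter above; the helper itself is illustrative, and the real workflow would persist the result to `incident-session.yaml`):

```python
from datetime import datetime, timezone

def advance_incident(ctx: dict, new_status: str, step: str) -> dict:
    """Record a status transition in the incident session context."""
    now = datetime.now(timezone.utc).isoformat()
    ctx = dict(ctx)  # shallow copy; persistence is out of scope here
    ctx["status"] = new_status
    ctx["current_step"] = step
    ctx["last_updated"] = now
    # Stamp the transition into the timeline, e.g. "diagnosing_at"
    ctx.setdefault("timeline", {})[f"{new_status}_at"] = now
    return ctx

ctx = {"incident_id": "INC-2026-03-17-001", "severity": "SEV1", "status": "triage"}
ctx = advance_incident(ctx, "diagnosing", "step-02-diagnose")
```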
## Mandatory Workflow Rules
1. **Speed first** — Triage must complete in < 5 minutes
2. **Root cause identification** — Must identify root cause before fix attempt
3. **Verify before declaring resolved** — Check metrics + user reports
4. **Document everything** — Every action logged for postmortem
5. **Escalation protocol** — SEV1 → Page on-call architect immediately
6. **Communication** — Update stakeholders every 5-10 minutes
7. **No flying blind** — All fixes must reference observability data
## Severity Scale
- **SEV1** — Service completely down, revenue impact, > 1% users affected → Page all on-call
- **SEV2** — Major degradation, significant users affected, partial functionality down
- **SEV3** — Moderate impact, some users affected, workaround possible
- **SEV4** — Minor issue, limited users, can defer to business hours
## Navigation
Choose how to begin:
- **[NEW-INC]** — Report new incident → Load step-01-triage
- **[RESUME-INC]** — Continue existing incident (detect progress from incident-session.yaml)
- **[ESCALATE]** — Escalate to on-call architect
---
**Report the incident status, or choose [NEW-INC] to begin triage**

@@ -1,6 +0,0 @@
---
name: bmad-psm-mlops-deployment
description: 'Deploy ML model to production with validation and monitoring. Use when the user says "deploy model" or "ML deployment" or "model serving"'
---
Follow the instructions in [workflow.md](workflow.md).

@@ -1,89 +0,0 @@
---
workflow_id: MLOPS001
workflow_name: MLOps Deployment
description: Deploy ML model to production with validation, serving, and monitoring
entry_point: steps/step-01-model-validation.md
phase: 5-specialized
lead_agent: "Linh (MLOps)"
status: "active"
created_date: 2026-03-17
version: "1.0.0"
estimated_duration: "3-4 hours"
outputFile: '{output_folder}/psm-artifacts/mlops-deploy-{{project_name}}-{{date}}.md'
---
# Workflow: MLOps Deployment
## Goal
Deploy machine learning models to production with comprehensive validation, infrastructure setup, and post-deployment monitoring.
## Overview
MLOps deployment ensures ML models are production-ready and continuously monitored for performance and data drift. The workflow:
1. **Validates** model quality, performance metrics, and data drift detection
2. **Deploys** model to serving infrastructure with versioning and A/B testing
3. **Monitors** model performance, data drift, and cost metrics post-deployment
## Execution Path
```
START
[Step 01] Model Validation (Check metrics, data drift, A/B test plan)
[Step 02] Deploy Model (Setup serving, infrastructure, GPU optimization)
[Step 03] Monitor (Langfuse/MLflow, drift detection, cost tracking)
END
```
## Key Roles
| Role | Agent | Responsibility |
|------|-------|-----------------|
| Lead | Linh (MLOps) | Coordinate deployment, monitor model health |
| Data Scientist | Data Lead | Validate model quality, approve for production |
| DevOps | Platform Eng | Setup infrastructure, manage resources |
## Validation Gates (3)
1. **Model Quality** — Accuracy, precision, recall metrics meet SLO
2. **Data Quality** — No data drift detected; training/production data distribution aligned
3. **Business Readiness** — A/B test plan ready, rollback strategy defined
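Gate 1 can be expressed as a simple threshold check; a minimal sketch, where the metric names and SLO targets are placeholders rather than module-defined values:

```python
def passes_quality_gate(metrics: dict, slo: dict) -> bool:
    """Gate 1: every tracked metric must meet or beat its SLO target."""
    return all(metrics.get(name, 0.0) >= target for name, target in slo.items())

slo = {"accuracy": 0.90, "precision": 0.85, "recall": 0.80}      # placeholder SLOs
candidate = {"accuracy": 0.93, "precision": 0.88, "recall": 0.82}  # model under review
```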
## Input Requirements
- **Trained model artifact** — Model checkpoint, weights, configuration
- **Performance metrics** — Baseline accuracy, latency, throughput expectations
- **Data validation** — Training dataset description, expected data distribution
- **Serving infrastructure** — Compute requirements (GPU/CPU), latency targets
## Output Deliverable
- **MLOps Deployment Report**
- Model version and metadata
- Performance validation summary
- Serving infrastructure setup
- Monitoring dashboard and alerts
- Data drift detection configuration
## Success Criteria
1. Model passes all quality gates before deployment
2. Serving infrastructure deployed and load-tested
3. Monitoring and alerting configured and validated
4. Rollback strategy tested and documented
5. Team trained on model updates and incident response
## Next Steps After Workflow
- Monitor model performance daily for first week
- Track data drift metrics; alert if detected
- Plan model retraining based on performance degradation
- Document lessons learned in MLOps runbook
---
**Navigation**: [← Back to 5-specialized](../), [Next: Step 01 →](steps/step-01-model-validation.md)

@@ -1,6 +0,0 @@
---
name: bmad-psm-production-readiness
description: 'Run production readiness review across 9 dimensions. Use when the user says "are we ready for production" or "PRR" or "go-live check"'
---
Follow the instructions in [workflow.md](workflow.md).

@@ -1,367 +0,0 @@
---
template_name: production-readiness-checklist
template_version: "1.0.0"
created_date: 2026-03-17
description: Production Readiness Review checklist and report template
---
# Production Readiness Review (PRR)
**Service**: {{SERVICE_NAME}}
**Owner**: {{SERVICE_OWNER}}
**Reviewer**: {{SRE_LEAD}} (Minh)
**Review Date**: {{DATE}}
**Target Go-Live**: {{TARGET_DATE}}
---
## Executive Summary
{{1-2 paragraphs summarizing the readiness assessment, decision, and key findings}}
**Overall Assessment**: {{READY | CONDITIONAL | NOT_READY}}
**Timeline**: Service {{can | can conditionally | cannot}} proceed to production {{on {{DATE}}}}
---
## Production Readiness Scorecard
### 9-Dimension Assessment
| # | Dimension | Score | Status | Key Finding |
|---|-----------|-------|--------|-------------|
| 1 | Reliability | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 2 | Observability | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 3 | Performance | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 4 | Security | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 5 | Capacity | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 6 | Data | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 7 | Runbooks | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 8 | Dependencies | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 9 | Rollback | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
**Summary**: {{X}} GREEN, {{Y}} YELLOW, {{Z}} RED
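One possible roll-up rule for the overall assessment, sketched under the assumption that any RED blocks go-live and any YELLOW makes it conditional (the template itself does not prescribe this mapping):

```python
def overall_assessment(scores: list[str]) -> str:
    """Roll the 9 dimension scores (GREEN/YELLOW/RED) into one verdict."""
    if "RED" in scores:
        return "NOT_READY"
    if "YELLOW" in scores:
        return "CONDITIONAL"
    return "READY"
```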
---
## Detailed Findings by Dimension
### 1. Reliability
**Goal**: Service meets SLO targets with documented failure modes and incident response plan.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 2. Observability
**Goal**: Service has comprehensive logging, metrics, tracing, and dashboards for operational visibility.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 3. Performance
**Goal**: Service meets latency/throughput targets and scales under expected load.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 4. Security
**Goal**: Authentication, authorization, encryption, and secrets management are implemented.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 5. Capacity
**Goal**: Resource requirements defined with growth headroom and cost acceptable.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 6. Data
**Goal**: Data governance, backup, retention, and disaster recovery documented and tested.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 7. Runbooks
**Goal**: Incident response, deployment, troubleshooting procedures documented and drilled.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 8. Dependencies
**Goal**: External/internal dependencies mapped, versioned, with fallback strategies.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
### 9. Rollback
**Goal**: Safe rollback strategy tested; deployment is reversible.
**Findings**:
- [ ] {{Finding 1}} ({{Status}})
- [ ] {{Finding 2}} ({{Status}})
- [ ] {{Finding 3}} ({{Status}})
**Assessment**: {{Detailed narrative, 3-5 sentences}}
**Score**: {{GREEN|YELLOW|RED}}
---
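Taken together, the nine dimension scores above determine the headline decision. A minimal roll-up sketch (the aggregation rule here is an illustrative assumption, not part of this template: any RED forces NO-GO, and more than two YELLOWs forces CONDITIONAL-GO):

```python
# Sketch: roll up the nine dimension scores into a headline decision.
# The thresholds below are illustrative assumptions, not template rules.

def prr_decision(scores: dict[str, str]) -> str:
    """scores maps dimension name -> 'GREEN' | 'YELLOW' | 'RED'."""
    values = list(scores.values())
    if "RED" in values:
        return "NO-GO"           # any RED dimension is a P0-level gap
    if values.count("YELLOW") > 2:
        return "CONDITIONAL-GO"  # too many open risks to proceed unconditionally
    return "GO"

scores = {
    "Reliability": "GREEN", "Observability": "YELLOW", "Performance": "GREEN",
    "Security": "GREEN", "Capacity": "GREEN", "Data": "GREEN",
    "Runbooks": "YELLOW", "Dependencies": "GREEN", "Rollback": "GREEN",
}
print(prr_decision(scores))  # GO
```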
## Critical Blockers (P0)
{{If any P0 blockers exist:}}
Service **CANNOT** proceed to production until these are resolved:
### P0 Blocker #1: {{ISSUE_TITLE}}
- **Dimension**: {{Which dimension}}
- **Description**: {{What's the problem}}
- **Impact**: {{Why it's critical}}
- **Resolution**: {{How to fix}}
- **Owner**: {{Who must fix it}}
- **Deadline**: {{When it must be done}}
- **Acceptance**: {{How we verify it's fixed}}
### P0 Blocker #2: {{ISSUE_TITLE}}
{{Repeat format}}
---
## Risks to Manage (P1)
Service can proceed with documented monitoring and contingency plans:
### P1 Risk #1: {{ISSUE_TITLE}}
- **Dimension**: {{Which dimension}}
- **Description**: {{What's the problem}}
- **Impact**: {{If it happens, what's the consequence}}
- **Likelihood**: {{HIGH|MEDIUM|LOW}}
- **Mitigation**: {{How we'll manage it}}
- **Monitoring**: {{What metrics to watch}}
- **Contingency**: {{What we'll do if it occurs}}
- **Owner**: {{Who owns this risk}}
- **Target Fix**: {{Timeline to resolve permanently}}
### P1 Risk #2: {{ISSUE_TITLE}}
{{Repeat format}}
---
## Recommendations
**High Priority** (Next sprint):
- {{Recommendation 1}}
- {{Recommendation 2}}
**Medium Priority** (Within 1 month):
- {{Recommendation 1}}
- {{Recommendation 2}}
**Nice to Have** (Backlog):
- {{Recommendation 1}}
- {{Recommendation 2}}
---
## Final Decision
### Decision
**{{ ✅ GO | ⚠️ CONDITIONAL-GO | ❌ NO-GO }}**
### Rationale
{{Explain the decision. Why can/can't we proceed?}}
### Conditions (If CONDITIONAL-GO)
If proceeding despite P1 risks, document conditions:
1. **{{Condition 1}}**: {{Description}}
- Owner: {{Who oversees this}}
- Success Criteria: {{How we verify it}}
- Escalation: {{Who to contact if issues}}
2. **{{Condition 2}}**: {{Description}}
- Owner: {{Who oversees this}}
- Success Criteria: {{How we verify it}}
- Escalation: {{Who to contact if issues}}
### Deployment Timeline
{{If GO or CONDITIONAL-GO:}}
- **Approved for deployment**: {{DATE}}
- **Earliest go-live**: {{DATE}}
- **Recommended window**: {{DATE/TIME}}
- **On-call coverage required**: {{YES|NO}}
- **Emergency rollback plan**: {{REFERENCE TO RUNBOOK}}
---
## Sign-offs & Approvals
### Approval Chain
- [ ] **SRE Lead** ({{NAME}}) — Review completed and findings approved
- Signature: ________________________ Date: __________
- [ ] **Architecture Lead** ({{NAME}}) — Architecture validated
- Signature: ________________________ Date: __________
- [ ] **Service Owner** ({{NAME}}) — Acknowledged findings and committed to actions
- Signature: ________________________ Date: __________
- [ ] **VP Engineering** ({{NAME}}) — Risk accepted (if CONDITIONAL-GO)
- Signature: ________________________ Date: __________
---
## Post-Production Plan
### First 24 Hours
- [ ] SRE on-call monitoring closely
- [ ] Daily standup with service team
- [ ] Monitor for any unusual patterns
- [ ] Be ready to rollback if needed
### First Week
- [ ] Daily metrics review
- [ ] Watch for data drift or unusual behavior
- [ ] Follow up on any P1 risks
### Ongoing
- [ ] Monthly PRR follow-ups to verify improvements
- [ ] Track action items to completion
- [ ] Update this PRR if significant changes made
---
## Action Items
| ID | Action | Owner | Deadline | Type | Status |
|----|--------|-------|----------|------|--------|
| A1 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
| A2 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
| A3 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
---
## Appendix
### A. Load Test Results
[Link to or summary of load test results showing service meets performance targets]
### B. Security Review Results
[Link to or summary of security audit findings]
### C. Architecture Diagrams
[Include or link to system architecture, data flow, and deployment topology]
### D. SLO Definition
[Document the agreed-upon SLO targets for availability, latency, error rate]
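When filling in this appendix, it helps to state the error budget an availability target implies. A minimal sketch (the 99.9% target and 30-day window are example numbers, not template defaults):

```python
# Sketch: convert an availability SLO into a downtime error budget.
# The 99.9% target and 30-day window are example values.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) for a given availability SLO over the window."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

print(round(error_budget_minutes(0.999), 1))   # 43.2 minutes per 30 days
print(round(error_budget_minutes(0.9999), 2))  # 4.32 minutes per 30 days
```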
### E. Runbooks
[Link to or list of key runbooks: incident response, deployment, rollback, troubleshooting]
---
**Report prepared by**: {{SRE_LEAD}}
**Report date**: {{DATE}}
**Last updated**: {{DATE}}

@ -1,92 +0,0 @@
---
workflow_id: PRR001
workflow_name: Production Readiness Review
description: Validate service is ready for production using comprehensive readiness checklist
entry_point: steps/step-01-init-checklist.md
phase: 3-run
lead_agent: "Minh (SRE)"
status: "active"
created_date: 2026-03-17
version: "1.0.0"
estimated_duration: "2-3 hours"
outputFile: '{output_folder}/psm-artifacts/prr-{{project_name}}-{{date}}.md'
---
# Workflow: Production Readiness Review (PRR)
## Goal
Validate and certify that a service meets production readiness standards across 9 key dimensions before deployment.
## Overview
This workflow systematically evaluates a service against production readiness criteria defined in the Production Systems BMAD skill framework. Using SRE expertise and architectural patterns, the workflow:
1. **Initializes** the PRR process with service context and dimensional overview
2. **Deep reviews** each dimension (reliability, observability, performance, security, capacity, data, runbooks, dependencies, rollback)
3. **Renders final decision** with GO/NO-GO/CONDITIONAL-GO recommendation
## Execution Path
```
START
  ↓
[Step 01] Init Checklist (Load framework, gather service context, present dimensions)
  ↓
[Step 02] Deep Review (Score each dimension, identify blockers, recommendations)
  ↓
[Step 03] Final Decision (Scorecard, decision, action items, DONE)
  ↓
END
```
## Key Roles
| Role | Agent | Responsibility |
|------|-------|-----------------|
| Lead | Minh (SRE) | Navigate workflow, coordinate review, make final call |
| Subject Matter | Service Owner | Provide service context, clarify architecture |
| Review Committee | Arch, SecOps, MLOps | Contribute expertise on specific dimensions |
## Dimensions Evaluated (9)
1. **Reliability** — SLA/SLO definition, error budgets, failure modes, incident response
2. **Observability** — Logging, metrics, tracing, dashboards, alerting
3. **Performance** — Latency targets, throughput, P99 tail behavior, optimization opportunities
4. **Security** — Auth/authz, secrets management, encryption, audit logging, compliance
5. **Capacity** — Resource limits, scaling policies, burst capacity, cost projections
6. **Data** — Schema versioning, backup/restore, data governance, retention policies
7. **Runbooks** — Incident runbooks, operational playbooks, troubleshooting guides
8. **Dependencies** — External services, internal libraries, database versioning, API contracts
9. **Rollback** — Rollback strategy, canary deployment, feature flags, smoke tests
## Input Requirements
- **Service name and owner** — Which service are we evaluating?
- **Current architecture** — High-level design, tech stack, topology
- **Existing metrics/dashboards** — Links to monitoring, SLO definitions
- **Known gaps/risks** — Already identified issues to address
## Output Deliverable
- **Production Readiness Checklist** (template: `production-readiness.template.md`)
- Scorecard with 9 dimensions (red/yellow/green)
- Blockers and recommendations per dimension
- Final GO/NO-GO/CONDITIONAL-GO decision
- Explicit action items with owners and deadlines
## Success Criteria
1. All 9 dimensions evaluated with clear rationale
2. Blockers categorized as P0 (must fix) or P1 (should fix)
3. Team alignment on decision (documented in PRR report)
4. Action plan with clear accountability and timeline
## Next Steps After Workflow
- If **GO**: Proceed to deployment; document in CHANGELOG
- If **NO-GO**: Reschedule PRR once blockers addressed; track in backlog
- If **CONDITIONAL-GO**: Deploy with documented caveats; set up monitoring for risk areas
---
**Navigation**: [← Back to 3-run](../), [Next: Step 01 →](steps/step-01-init-checklist.md)


@ -1,6 +0,0 @@
---
name: bmad-psm-quick-diagnose
description: 'Quick diagnosis of production issue with minimal latency. Use when the user says "something is broken" or "quick diagnose" or "what is happening?"'
---
Follow the instructions in [workflow.md](workflow.md).

@ -1,80 +0,0 @@
---
workflow_id: QD001
workflow_name: Quick Diagnose
description: Fast diagnosis of production issue with root cause and fix suggestion
entry_point: steps/step-01-gather.md
phase: quick-flow
lead_agent: "Minh (SRE)"
status: "active"
created_date: 2026-03-17
version: "1.0.0"
estimated_duration: "15-25 minutes"
outputFile: '{output_folder}/psm-artifacts/quick-diagnose-{{date}}.md'
---
# Workflow: Quick Diagnose Production Issue
## Goal
Rapidly diagnose production issues by gathering symptom data, checking metrics, and suggesting fixes.
## Overview
Quick Diagnose is a lightweight workflow for time-sensitive production troubleshooting:
1. **Gathers** symptom description and quick metrics check
2. **Diagnoses** root cause using observability data
3. **Suggests** fix or mitigation immediately
## Execution Path
```
START
  ↓
[Step 01] Gather Context (What's broken? Check metrics)
  ↓
[Step 02] Diagnose & Fix (Root cause analysis → fix suggestion → verify)
  ↓
END
```
## Key Roles
| Role | Agent |
|------|-------|
| Lead | Minh (SRE) |
## Input Requirements
- **Symptom description** — What is failing? (error message, behavior, timeline)
- **Affected service/component** — What system is broken?
- **Timeline** — When did it start? Is it ongoing?
- **Impact** — How many users affected? Is revenue impacted?
## Output Deliverable
- **Quick Diagnosis Report** (markdown, 1-2 pages)
- Symptom analysis
- Root cause hypothesis
- Immediate mitigation (if needed)
- Fix suggestion with effort
- Follow-up actions
## Success Criteria
1. Root cause identified within 15-20 minutes
2. Immediate mitigation available (if needed)
3. Fix suggestion documented with clear steps
4. Team knows what to do next
## Quick Diagnose vs Full Production Readiness Review
| Aspect | Quick Diagnose | Full PRR |
|--------|---|---|
| Trigger | Active incident | Pre-deployment |
| Duration | 15-25 min | 2-3 hours |
| Scope | Single issue | All 9 dimensions |
| Goal | Fix now | Prevent issues |
---
**Navigation**: [← Back to quick-flow](../), [Next: Step 01 →](steps/step-01-gather.md)

@ -1,6 +0,0 @@
---
name: bmad-psm-security-audit
description: 'Run comprehensive security audit and threat assessment. Use when the user says "security audit" or "vulnerability assessment" or "security review"'
---
Follow the instructions in [workflow.md](workflow.md).

@ -1,502 +0,0 @@
---
template_name: security-audit-report
template_version: "1.0.0"
created_date: 2026-03-17
description: Security audit report with findings, severity levels, and remediation plan
---
# Security Audit Report
**Service**: {{SERVICE_NAME}}
**Service Owner**: {{SERVICE_OWNER}}
**Auditor**: {{SECURITY_LEAD}} (Hà)
**Audit Date**: {{START_DATE}} — {{END_DATE}}
**Report Date**: {{REPORT_DATE}}
**Scope**: {{SCOPE_DESCRIPTION}}
---
## Executive Summary
This security audit evaluated {{SERVICE_NAME}} against security best practices and compliance requirements. The assessment identified {{X}} findings across {{Y}} security domains.
**Overall Security Posture**: {{COMPLIANT | FINDINGS | CRITICAL}}
{{1-2 paragraph summary of key findings, critical issues if any, and recommendations}}
---
## Audit Scope
### Services Reviewed
- {{Service 1}} ({{Description}})
- {{Service 2}} ({{Description}})
- {{Service 3}} ({{Description}})
### Assessment Domains
- ✅ Authentication & Authorization
- ✅ API Security
- ✅ Secrets Management
- ✅ Encryption (in-transit & at-rest)
- ✅ PII & Data Protection
### Exclusions
{{Any out-of-scope areas:}}
- {{Item}} (reason)
- {{Item}} (reason)
---
## Findings Summary
### By Severity
| Severity | Count | Trend |
|----------|-------|-------|
| **Critical** | {{X}} | {{↑/→/↓}} |
| **High** | {{Y}} | {{↑/→/↓}} |
| **Medium** | {{Z}} | {{↑/→/↓}} |
| **Low** | {{W}} | {{↑/→/↓}} |
| **Total** | {{X+Y+Z+W}} | |
### By Domain
| Domain | Critical | High | Medium | Low | Status |
|--------|----------|------|--------|-----|--------|
| Auth & Authz | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| API Security | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| Secrets Mgmt | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| Encryption | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| PII & Data | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
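The severity buckets used throughout this report can be derived mechanically from CVSS base scores; a small helper, assuming CVSS v3.1 thresholds:

```python
# Sketch: map a CVSS v3.1 base score to the severity buckets in this report.

def cvss_severity(score: float) -> str:
    if not 0.0 <= score <= 10.0:
        raise ValueError("CVSS base scores range from 0.0 to 10.0")
    if score >= 9.0:
        return "Critical"
    if score >= 7.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    if score > 0.0:
        return "Low"
    return "None"

print(cvss_severity(9.8))  # Critical
print(cvss_severity(5.3))  # Medium
```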
---
## Critical Severity Findings
### [F1] {{Finding Title}}
**Severity**: CRITICAL (CVSS {{9.0-10.0}})
**Domain**: {{Which domain}}
**Status**: {{Open | In Progress | Resolved}}
**Description**:
{{Detailed description of the vulnerability, how it could be exploited, and the impact}}
**Evidence**:
- {{Evidence 1}}
- {{Evidence 2}}
- {{Testing confirmation}}
**Impact**:
- {{Business impact}}
- {{Technical impact}}
- {{Compliance impact}}
**Remediation**:
1. {{Step 1}} ({{Estimated time}})
2. {{Step 2}} ({{Estimated time}})
3. {{Step 3}} ({{Estimated time}})
**Owner**: {{Name}}
**Target Fix Date**: {{DATE}}
**Effort**: {{Est. hours/days}}
**Verification**: {{How we'll confirm it's fixed}}
---
### [F2] {{Finding Title}}
{{Repeat Critical severity format}}
---
## High Severity Findings
### [F3] {{Finding Title}}
**Severity**: HIGH (CVSS {{7.0-8.9}})
**Domain**: {{Which domain}}
**Status**: {{Open | In Progress | Resolved}}
**Description**: {{Brief description}}
**Impact**: {{Why it matters}}
**Remediation**:
1. {{Step 1}}
2. {{Step 2}}
**Owner**: {{Name}}
**Target Date**: {{DATE}}
---
### [F4] {{Finding Title}}
{{Repeat High severity format}}
---
## Medium Severity Findings
### [F5] {{Finding Title}}
**Severity**: MEDIUM (CVSS {{4.0-6.9}})
**Domain**: {{Which domain}}
**Description**: {{Brief description}}
**Remediation**: {{Brief fix}}
**Owner**: {{Name}} | **Target Date**: {{DATE}}
---
### [F6] {{Finding Title}}
{{Repeat Medium severity format}}
---
## Low Severity Findings
### [F7] {{Finding Title}}
**Severity**: LOW (CVSS {{0.1-3.9}})
**Description**: {{Brief description}}
**Remediation**: {{Brief fix}}
---
### [F8] {{Finding Title}}
{{Repeat Low severity format}}
---
## Domain-Specific Assessment
### Domain 1: Authentication & Authorization
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
**Strengths**:
- {{Positive finding 1}}
- {{Positive finding 2}}
**Gaps**:
- {{Gap 1}} — {{Impact}}
- {{Gap 2}} — {{Impact}}
**Recommendations**:
1. {{Recommendation 1}}
2. {{Recommendation 2}}
---
### Domain 2: API Security
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
**Strengths**:
- {{Positive finding 1}}
- {{Positive finding 2}}
**Gaps**:
- {{Gap 1}} — {{Impact}}
- {{Gap 2}} — {{Impact}}
**Recommendations**:
1. {{Recommendation 1}}
2. {{Recommendation 2}}
---
### Domain 3: Secrets Management
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
**Strengths**:
- {{Positive finding 1}}
- {{Positive finding 2}}
**Gaps**:
- {{Gap 1}} — {{Impact}}
- {{Gap 2}} — {{Impact}}
**Recommendations**:
1. {{Recommendation 1}}
2. {{Recommendation 2}}
---
### Domain 4: Encryption
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
**Strengths**:
- {{Positive finding 1}}
- {{Positive finding 2}}
**Gaps**:
- {{Gap 1}} — {{Impact}}
- {{Gap 2}} — {{Impact}}
**Recommendations**:
1. {{Recommendation 1}}
2. {{Recommendation 2}}
---
### Domain 5: PII & Data Protection
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
**Strengths**:
- {{Positive finding 1}}
- {{Positive finding 2}}
**Gaps**:
- {{Gap 1}} — {{Impact}}
- {{Gap 2}} — {{Impact}}
**Recommendations**:
1. {{Recommendation 1}}
2. {{Recommendation 2}}
---
## Compliance Assessment
### GDPR (General Data Protection Regulation)
**Applicable**: {{YES | NO | PARTIAL}}
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
| Requirement | Status | Finding | Gap Fix |
|-------------|--------|---------|---------|
| Data Encryption | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Access Control | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Retention Policy | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Right to Deletion | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Data Processing Agreement | {{✅/❌}} | {{Description}} | {{Remediation}} |
**Timeline to Compliance**: {{DATE or "Already compliant"}}
---
### PCI-DSS (Payment Card Industry Data Security Standard)
**Applicable**: {{YES | NO | PARTIAL}}
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
| Requirement | Status | Finding | Gap Fix |
|-------------|--------|---------|---------|
| TLS 1.2+ | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Secrets Management | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Input Validation | {{✅/❌}} | {{Description}} | {{Remediation}} |
**Timeline to Compliance**: {{DATE or "Already compliant"}}
---
### SOC 2 Type II
**Applicable**: {{YES | NO | PARTIAL}}
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
**Gap Summary**: {{Description of gaps or "No gaps identified"}}
**Timeline**: {{When audit can be conducted}}
---
### Other Regulations
{{Any other applicable standards (HIPAA, FINRA, etc.)}}
---
## Remediation Roadmap
### Critical Path (Week 1-2)
**All Critical findings must be fixed before production deployment.**
- [ ] {{F1}} — Owner: {{Name}}, Deadline: {{DATE}}
- [ ] {{F2}} — Owner: {{Name}}, Deadline: {{DATE}}
**Milestone**: Security re-scan on {{DATE}} to verify fixes
---
### Phase 2 (Week 3-4)
Complete High-severity findings:
- [ ] {{F3}} — Owner: {{Name}}, Deadline: {{DATE}}
- [ ] {{F4}} — Owner: {{Name}}, Deadline: {{DATE}}
**Milestone**: Second security review on {{DATE}}
---
### Phase 3 (Weeks 5-8)
Address Medium-severity findings (can be post-production with monitoring):
- [ ] {{F5}} — Owner: {{Name}}, Target: {{DATE}}
- [ ] {{F6}} — Owner: {{Name}}, Target: {{DATE}}
---
### Backlog (Next Sprint)
Low-severity items:
- [ ] {{F7}} — {{Brief description}}
- [ ] {{F8}} — {{Brief description}}
---
## Remediation Status Tracking
| Finding | Owner | Deadline | Status | Last Update | Notes |
|---------|-------|----------|--------|-------------|-------|
| F1 | {{Name}} | {{Date}} | 🔴 Pending | {{Date}} | {{Notes}} |
| F2 | {{Name}} | {{Date}} | 🟡 In Progress | {{Date}} | {{Notes}} |
| F3 | {{Name}} | {{Date}} | 🟢 Complete | {{Date}} | {{Notes}} |
---
## Post-Audit Monitoring
### Controls to Monitor
{{If service proceeds to production despite findings:}}
- **{{Control 1}}** — Monitor via {{method}}, alert if {{threshold}}
- **{{Control 2}}** — Monitor via {{method}}, alert if {{threshold}}
- **{{Control 3}}** — Monitor via {{method}}, alert if {{threshold}}
### Incident Response
If a security incident occurs:
1. Activate incident response team
2. Notify {{Escalation contacts}}
3. Follow {{Incident response runbook}}
4. Conduct post-incident security review
---
## Risk Assessment Matrix
```
                     LIKELIHOOD
                     Low    Med    High
          CRITICAL    H      C      C
IMPACT    HIGH        M      H      C
          MEDIUM      L      M      H
          LOW         L      L      M

Legend: C=Critical, H=High, M=Medium, L=Low
```
**Our findings map**:
- {{F1}} — {{Position on matrix}}
- {{F2}} — {{Position on matrix}}
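The matrix lookup can be expressed directly; a sketch (the mapping mirrors the matrix above; input normalization is an assumption):

```python
# Sketch: look up risk level from the impact x likelihood matrix above.

RISK_MATRIX = {
    "CRITICAL": {"Low": "High",   "Med": "Critical", "High": "Critical"},
    "HIGH":     {"Low": "Medium", "Med": "High",     "High": "Critical"},
    "MEDIUM":   {"Low": "Low",    "Med": "Medium",   "High": "High"},
    "LOW":      {"Low": "Low",    "Med": "Low",      "High": "Medium"},
}

def risk_level(impact: str, likelihood: str) -> str:
    return RISK_MATRIX[impact.upper()][likelihood.capitalize()]

print(risk_level("HIGH", "High"))   # Critical
print(risk_level("MEDIUM", "Low"))  # Low
```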
---
## Positive Findings
**Strengths to maintain:**
- {{Positive 1}} — Keep doing this
- {{Positive 2}} — Keep doing this
- {{Positive 3}} — Keep doing this
---
## Recommendations Summary
### Immediate (Critical)
- {{Fix all Critical findings}} ({{effort}})
### Short-term (High Priority)
- {{Fix all High findings}} ({{effort}})
- {{Implement automated scanning}} ({{effort}})
- {{Setup security monitoring}} ({{effort}})
### Medium-term
- {{Implement {{technology}} for {{purpose}}}} ({{effort}})
- {{Security training for team}} ({{effort}})
### Long-term (Next 6 Months)
- {{Major security initiative}} ({{effort}})
- {{Penetration testing}} ({{effort}})
---
## Sign-offs & Approvals
### Audit Approval
- [ ] **Security Lead** ({{AUDITOR_NAME}})
- Signature: ________________________ Date: __________
- Assessment complete and findings documented
### Service Owner Acknowledgment
- [ ] **Service Owner** ({{SERVICE_OWNER}})
- Signature: ________________________ Date: __________
- Acknowledged findings and committed to remediation
### Compliance Officer Review
- [ ] **Compliance Officer** ({{NAME}})
- Signature: ________________________ Date: __________
- Compliance requirements verified
### Executive Approval (If Production Clearance Needed)
- [ ] **VP Engineering / Security** ({{NAME}})
- Signature: ________________________ Date: __________
- Risk accepted; approved for production
---
## Distribution
- [x] Shared with: {{Service team, Leadership, Compliance}}
- [x] Date shared: {{DATE}}
- [x] Follow-up review scheduled: {{DATE}}
---
## Appendix: Testing Evidence
### Code Review Findings
```
{{Code snippets demonstrating vulnerabilities}}
```
### Configuration Issues
```
{{Configuration examples showing gaps}}
```
### Dependencies Scan
```
{{Vulnerable dependencies identified}}
```
---
**Report Prepared By**: {{AUDITOR_NAME}}
**Report Date**: {{DATE}}
**Review Status**: Draft | Final | Approved

@ -1,91 +0,0 @@
---
workflow_id: SA001
workflow_name: Security Audit
description: Comprehensive security review using security patterns, config management, and compliance framework
entry_point: steps/step-01-scope.md
phase: 4-cross
lead_agent: "Hà (Security)"
status: "active"
created_date: 2026-03-17
version: "1.0.0"
estimated_duration: "2-3 hours"
outputFile: '{output_folder}/psm-artifacts/security-audit-{{project_name}}-{{date}}.md'
---
# Workflow: Security Audit
## Goal
Perform comprehensive security evaluation using Production Systems BMAD framework, covering threat modeling, vulnerability assessment, compliance, and security controls.
## Overview
Security audit is a critical cross-functional workflow that evaluates service security posture before production deployment or for ongoing compliance verification. The audit:
1. **Scopes** the audit engagement, defines threat model, and identifies compliance requirements
2. **Executes** detailed security assessment across multiple domains (authentication, data protection, infrastructure, API security)
3. **Reports** findings with severity levels, remediation recommendations, and compliance status
## Execution Path
```
START
  ↓
[Step 01] Scope & Threat Model (Define audit scope, identify threats, compliance reqs)
  ↓
[Step 02] Security Assessment (Execute checklist across domains, identify vulns)
  ↓
[Step 03] Security Report (Findings report, severity, recommendations, compliance)
  ↓
END
```
## Key Roles
| Role | Agent | Responsibility |
|------|-------|-----------------|
| Lead | Hà (Security) | Lead audit, coordinate assessment, synthesize findings |
| Subject Matter | Service Owner + Platform Eng | Provide architecture, answer security questions |
| Compliance | Security/Compliance Team | Validate compliance mapping, sign-off |
## Assessment Domains (5)
1. **Authentication & Authorization** — Identity verification, access control, session management
2. **API Security** — Input validation, rate limiting, API key management, CORS
3. **Secrets Management** — Credential storage, rotation, access logging
4. **Encryption** — In-transit (TLS), at-rest, key management
5. **PII & Data Protection** — Classification, access controls, audit logging, retention
## Input Requirements
- **Service architecture diagram** — Components, data flows, external integrations
- **Authentication/authorization approach** — OAuth2, JWT, SAML, custom
- **Secrets storage mechanism** — Vault, cloud provider, environment variables
- **Compliance requirements** — GDPR, CCPA, SOC2, industry-specific
- **Known security controls** — WAF, TLS config, authentication libraries
## Output Deliverable
- **Security Audit Report** (template: `security-audit-report.template.md`)
- Audit scope and threat model
- Findings organized by domain with severity (Critical/High/Medium/Low)
- Remediation recommendations with priority and effort
- Compliance status matrix
- Sign-off
## Success Criteria
1. All security domains assessed with clear findings
2. Severity levels assigned (using CVSS or similar framework)
3. Remediation plan with owners and deadlines
4. Compliance requirements verified (if applicable)
5. Team alignment on security posture
## Next Steps After Workflow
- If **COMPLIANT**: Document in security registry; schedule periodic re-audit
- If **NON-COMPLIANT**: Add remediation items to backlog; track closure
- If **CRITICAL ISSUES**: Consider production pause until resolved
---
**Navigation**: [← Back to 4-cross](../), [Next: Step 01 →](steps/step-01-scope.md)

@ -1,6 +0,0 @@
---
name: bmad-psm-setup-new-service
description: 'Set up new production service from architecture through deployment. Use when the user says "new service" or "setup service" or "new microservice"'
---
Follow the instructions in [workflow.md](workflow.md).

@ -1,116 +0,0 @@
---
workflow_id: W-SETUP-SVC-001
workflow_name: Setup Production Service for BMAD
version: 6.2.0
lead_agent: "Architect Khang"
supporting_agents: ["SRE Minh", "Mary Analyst"]
phase: "1-Analysis → 2-Planning → 3-Solutioning → 4-Implementation"
created_date: 2026-03-17
last_modified: 2026-03-17
config_file: "_config/config.yaml"
estimated_duration: "12-20 hours"
outputFile: '{output_folder}/psm-artifacts/service-setup-{{project_name}}-{{date}}.md'
---
# Setup Production Service Workflow — BMAD Pattern
## Metadata & Context
**Goal**: Build a production-grade service from scratch, complete with architecture, API design, deployment pipeline, reliability patterns, security, and production readiness.
**Lead Team**:
- SRE Minh (Reliability, Infrastructure, Operations)
- Architect Khang (System Design, Technology Selection)
- Mary Analyst (Requirements, Risk Assessment)
**Success Criteria**:
- ✓ Architecture design document approved
- ✓ API contracts defined & validated
- ✓ Database schema designed & indexed
- ✓ CI/CD pipeline operational
- ✓ Resilience & observability in place
- ✓ Security & compliance verified
- ✓ Production readiness checklist passed
## Workflow Overview
This workflow walks through 6 atomic steps, each focused on a separate domain:
1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection
2. **Step-02-API-Database** → API Design + Database Selection + Schema
3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy
4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling
5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config
6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision
## Configuration Loading
Automatically loaded from `_config/config.yaml`:
```yaml
project_context:
user_name: "[loaded from config]"
organization: "[loaded from config]"
environment: "production"
workflow_defaults:
communication_language: "Vietnamese"
output_folder: "./outputs/setup-new-service-{service_name}"
timestamp: "2026-03-17"
```
## Execution Model
### Entry Point Logic
```
1. Check if workflow.md exists in outputs folder
→ If NEW: Start from step-01-architecture.md
→ If RESUME: Load progress.yaml → auto-skip completed steps
→ If PARTIAL: Load step-N-context.yaml → resume from step N
2. For each step:
a) Load step-{N}-{name}.md
b) Load referenced SKILL files (auto-parse "Load:" directives)
c) Execute MENU [A][C] options
d) Save step output to step-{N}-output.md
e) Move to next step
3. Final: Generate comprehensive outputs in outputs folder
```
### State Tracking
Output document frontmatter tracks progress:
```yaml
workflow_progress:
step_01_architecture: "completed"
step_02_api_database: "completed"
step_03_build_deploy: "in_progress"
step_04_reliability: "pending"
step_05_security_infra: "pending"
step_06_readiness: "pending"
last_updated: "2026-03-17T14:30:00Z"
current_agent: "Architect Khang"
```
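Progress tracked this way is easy to summarize; a sketch (field names follow the frontmatter above):

```python
# Sketch: summarize workflow progress from the frontmatter fields above.

def progress_summary(workflow_progress: dict[str, str]) -> str:
    steps = {k: v for k, v in workflow_progress.items() if k.startswith("step_")}
    done = sum(1 for v in steps.values() if v == "completed")
    return f"{done}/{len(steps)} steps completed"

state = {
    "step_01_architecture": "completed",
    "step_02_api_database": "completed",
    "step_03_build_deploy": "in_progress",
    "step_04_reliability": "pending",
    "step_05_security_infra": "pending",
    "step_06_readiness": "pending",
    "last_updated": "2026-03-17T14:30:00Z",
}
print(progress_summary(state))  # 2/6 steps completed
```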
## Mandatory Workflow Rules
1. **No skipping steps** — Each step must be executed in order
2. **Validate assumptions** — Every decision must be documented
3. **Cross-phase collaboration** — Architects + SRE + Analysts work together
4. **Output artifacts** — Each step produces tangible output documents
5. **Handoff protocol** — Context is transferred explicitly between steps
## Navigation
Choose how to begin:
- **[NEW]** — Start a new workflow → Load step-01
- **[RESUME]** — Resume a previously run workflow (detect progress)
- **[SKIP-TO]** — Jump to a specific step (dev-only, requires confirmation)
---
**Continue by selecting [NEW] or [RESUME]**