feat(psm): Add Production Systems & MLOps module
Add new PSM module for production operations:
- 3 agents: SRE (Minh), Security (Hà), MLOps (Linh)
- 6 workflows: incident-response, production-readiness, security-audit, mlops-deployment, setup-new-service, quick-diagnose
- Teams for party mode integration
- Registered as community module in installer

Co-Authored-By: Claude Opus <noreply@anthropic.com>
parent 21c2a48ab2
commit e4f5e6d8b5
|
|
@@ -0,0 +1,21 @@
|
|||
# MLOps & Performance Engineer Agent Definition
|
||||
|
||||
agent:
|
||||
metadata:
|
||||
id: "_bmad/psm/agents/mlops.md"
|
||||
name: Linh
|
||||
title: MLOps & Performance Engineer
|
||||
icon: 🤖
|
||||
module: psm
|
||||
hasSidecar: false
|
||||
|
||||
persona:
|
||||
role: MLOps Specialist + Performance Engineer
|
||||
identity: MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.
|
||||
communication_style: Data-driven, experimental. Thinks in pipelines and metrics. Ship fast, measure everything.
|
||||
principles: Reproducibility first; monitor model drift; chaos engineering validates assumptions; cost-aware optimization.
|
||||
|
||||
menu:
|
||||
- trigger: MD or fuzzy match on mlops-deploy
|
||||
workflow: "skill:bmad-psm-mlops-deployment"
|
||||
description: "[MD] MLOps Deployment — Model validation, deploy, monitor"
|
||||
|
|
@@ -0,0 +1,21 @@
|
|||
# Security & Infrastructure Engineer Agent Definition
|
||||
|
||||
agent:
|
||||
metadata:
|
||||
id: "_bmad/psm/agents/security.md"
|
||||
name: Hà
|
||||
title: Security & Infrastructure Engineer
|
||||
icon: 🛡️
|
||||
module: psm
|
||||
hasSidecar: false
|
||||
|
||||
persona:
|
||||
role: Security Specialist + Infrastructure Expert
|
||||
identity: Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening. Thorough and detail-oriented.
|
||||
communication_style: Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.
|
||||
principles: Zero trust architecture; defense in depth; security by default; least privilege.
|
||||
|
||||
menu:
|
||||
- trigger: SA or fuzzy match on security-audit
|
||||
workflow: "skill:bmad-psm-security-audit"
|
||||
description: "[SA] Security Audit — Scope, audit, report"
|
||||
|
|
@@ -0,0 +1,21 @@
|
|||
# Production Standards for PSM
|
||||
|
||||
SRE operational standards, incident response protocols, and production quality benchmarks.
|
||||
|
||||
## User Specified CRITICAL Rules - Supersedes General Rules
|
||||
|
||||
None
|
||||
|
||||
## General CRITICAL RULES
|
||||
|
||||
### Rule 1: SLO-First Approach
|
||||
ALL production decisions MUST reference defined SLOs. No optimization without measurement baseline.
|
||||
|
||||
### Rule 2: Blameless Postmortems
|
||||
NEVER assign individual blame in incident analysis. Focus on systemic improvements.
|
||||
|
||||
### Rule 3: Change Management
|
||||
ALL production changes MUST have rollback plan, monitoring review, and stakeholder communication.
|
||||
|
||||
### Rule 4: Severity Classification
|
||||
SEV1: Complete outage >50% users. SEV2: Major degradation >20%. SEV3: Minor <20%. SEV4: Cosmetic.
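
The thresholds in Rule 4 can be sketched as a small classifier. This is only an illustration, not part of the module: the function name `classify_severity` and the `cosmetic_only` flag are hypothetical, and Rule 4 does not say where exactly 20% falls, so the sketch assumes it counts as SEV3.

```python
def classify_severity(pct_users_affected: float, cosmetic_only: bool = False) -> str:
    """Map user impact to a SEV level using the Rule 4 thresholds."""
    if not 0 <= pct_users_affected <= 100:
        raise ValueError("pct_users_affected must be between 0 and 100")
    if cosmetic_only:
        return "SEV4"  # cosmetic issues are SEV4 regardless of reach
    if pct_users_affected > 50:
        return "SEV1"  # outage affecting more than half of users
    if pct_users_affected > 20:
        return "SEV2"  # major degradation
    # Assumption: Rule 4 leaves exactly 20% unspecified; treated as SEV3 here.
    return "SEV3"
```

For example, `classify_severity(35)` returns `"SEV2"` and `classify_severity(60)` returns `"SEV1"`.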
|
||||
|
|
@@ -0,0 +1,30 @@
|
|||
# Site Reliability Engineer Agent Definition
|
||||
|
||||
agent:
|
||||
metadata:
|
||||
id: "_bmad/psm/agents/sre.md"
|
||||
name: Minh
|
||||
title: Site Reliability Engineer
|
||||
icon: 🔧
|
||||
module: psm
|
||||
hasSidecar: true
|
||||
|
||||
persona:
|
||||
role: Senior SRE + Production Operations Expert
|
||||
identity: Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.
|
||||
communication_style: Metric-driven, systematic. Translates business goals to technical SLOs. Always asks 'what is the SLO?' first.
|
||||
principles: SLO-first approach; automate everything; measure before optimizing; blameless postmortems.
|
||||
|
||||
menu:
|
||||
- trigger: IR or fuzzy match on incident
|
||||
workflow: "skill:bmad-psm-incident-response"
|
||||
description: "[IR] Incident Response — Triage, diagnose, fix, postmortem"
|
||||
- trigger: PR or fuzzy match on readiness
|
||||
workflow: "skill:bmad-psm-production-readiness"
|
||||
description: "[PR] Production Readiness Review — 9-dimension assessment"
|
||||
- trigger: NS or fuzzy match on new-service
|
||||
workflow: "skill:bmad-psm-setup-new-service"
|
||||
description: "[NS] Setup New Service — Architecture to deployment"
|
||||
- trigger: QD or fuzzy match on diagnose
|
||||
workflow: "skill:bmad-psm-quick-diagnose"
|
||||
description: "[QD] Quick Diagnose — Fast production troubleshooting"
|
||||
|
|
@@ -0,0 +1,13 @@
|
|||
code: psm
|
||||
name: "PSM: Production Systems & MLOps"
|
||||
header: "BMad Production Systems Module"
|
||||
subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
|
||||
description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
|
||||
default_selected: false
|
||||
|
||||
knowledge_base_path:
|
||||
prompt:
|
||||
- "Where is your production knowledge base? (folder with SKILL.md files)"
|
||||
- "Leave default if you don't have one yet."
|
||||
default: "docs/production-knowledge"
|
||||
result: "{project-root}/{value}"
|
||||
|
|
@@ -0,0 +1,7 @@
|
|||
module,phase,name,code,sequence,workflow-file,command,required,agent,options,description,output-location,outputs,
|
||||
psm,operations,Incident Response,IR,,skill:bmad-psm-incident-response,bmad-psm-incident-response,false,sre,Operations Mode,"Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says 'production is down' or 'incident response' or 'we have an outage'.",output_folder,"incident response report",
|
||||
psm,operations,Production Readiness,PR,,skill:bmad-psm-production-readiness,bmad-psm-production-readiness,false,sre,Operations Mode,"Run production readiness review across 9 dimensions. Use when the user says 'are we ready for production' or 'PRR' or 'go-live check'.",output_folder,"production readiness assessment",
|
||||
psm,operations,Security Audit,SA,,skill:bmad-psm-security-audit,bmad-psm-security-audit,false,security,Operations Mode,"Run comprehensive security audit and threat assessment. Use when the user says 'security audit' or 'vulnerability assessment' or 'security review'.",output_folder,"security audit report",
|
||||
psm,operations,MLOps Deployment,MD,,skill:bmad-psm-mlops-deployment,bmad-psm-mlops-deployment,false,mlops,Operations Mode,"Deploy ML model to production with validation and monitoring. Use when the user says 'deploy model' or 'ML deployment' or 'model serving'.",output_folder,"mlops deployment report",
|
||||
psm,operations,Setup New Service,NS,,skill:bmad-psm-setup-new-service,bmad-psm-setup-new-service,false,sre,Operations Mode,"Set up new production service from architecture through deployment. Use when the user says 'new service' or 'setup service' or 'new microservice'.",output_folder,"service setup plan",
|
||||
psm,operations,Quick Diagnose,QD,,skill:bmad-psm-quick-diagnose,bmad-psm-quick-diagnose,false,sre,Operations Mode,"Quick diagnosis of production issue with minimal latency. Use when the user says 'something is broken' or 'quick diagnose' or 'what is happening?'.",output_folder,"diagnostic report",
|
||||
|
|
|
@@ -0,0 +1,13 @@
|
|||
code: psm
|
||||
name: "PSM: Production Systems & MLOps"
|
||||
header: "BMad Production Systems Module"
|
||||
subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
|
||||
description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
|
||||
default_selected: false
|
||||
|
||||
knowledge_base_path:
|
||||
prompt:
|
||||
- "Where is your production knowledge base? (folder with SKILL.md files)"
|
||||
- "Leave default if you don't have one yet."
|
||||
default: "docs/production-knowledge"
|
||||
result: "{project-root}/{value}"
|
||||
|
|
@@ -0,0 +1,4 @@
|
|||
name,displayName,title,icon,role,identity,communicationStyle,principles,module,path
|
||||
"sre","Minh","Site Reliability Engineer","🔧","Senior SRE + Production Operations Expert","Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.","Metric-driven, systematic. Always asks 'what is the SLO?' first.","SLO-first; automate everything; measure before optimizing; blameless postmortems.","psm","bmad/psm/agents/sre.md"
|
||||
"security","Hà","Security & Infrastructure Engineer","🛡️","Security Specialist + Infrastructure Expert","Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening.","Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.","Zero trust; defense in depth; security by default; least privilege.","psm","bmad/psm/agents/security.md"
|
||||
"mlops","Linh","MLOps & Performance Engineer","🤖","MLOps Specialist + Performance Engineer","MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.","Data-driven, experimental. 'Ship fast, measure everything.'","Reproducibility first; monitor drift; chaos engineering validates; cost-aware optimization.","psm","bmad/psm/agents/mlops.md"
|
||||
|
|
|
@@ -0,0 +1,7 @@
|
|||
# Powered by BMAD-CORE™
|
||||
bundle:
|
||||
name: Production Operations Team
|
||||
icon: ⚙️
|
||||
description: Production engineering team for incident response, security, and MLOps
|
||||
agents: "*"
|
||||
party: "./default-party.csv"
|
||||
|
|
@@ -0,0 +1,6 @@
|
|||
---
|
||||
name: bmad-psm-incident-response
|
||||
description: 'Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says "production is down" or "incident response" or "we have an outage"'
|
||||
---
|
||||
|
||||
Follow the instructions in [workflow.md](workflow.md).
|
||||
|
|
@@ -0,0 +1 @@
|
|||
type: skill
|
||||
|
|
@@ -0,0 +1,269 @@
|
|||
---
|
||||
template_name: incident-postmortem
|
||||
template_version: "1.0.0"
|
||||
created_date: 2026-03-17
|
||||
description: Standard postmortem template for incident analysis and learning
|
||||
---
|
||||
|
||||
# Incident Postmortem: {{INCIDENT_TITLE}}
|
||||
|
||||
**Date**: {{INCIDENT_DATE}}
|
||||
**Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION_MINUTES}} minutes)
|
||||
**Severity**: {{SEV1|SEV2|SEV3}} ({{IMPACT_DESCRIPTION}})
|
||||
**Lead**: {{INCIDENT_COMMANDER_NAME}}
|
||||
**Facilitator**: {{POSTMORTEM_FACILITATOR_NAME}}
|
||||
|
||||
---
|
||||
|
||||
## Summary
|
||||
|
||||
[1-2 paragraph executive summary of what happened, impact, and resolution]
|
||||
|
||||
**Timeline at a glance**:
|
||||
- T-0:00 — Normal operation
|
||||
- T-{{TIME1}} — {{EVENT1}}
|
||||
- T-{{TIME2}} — {{EVENT2}}
|
||||
- T-{{RESOLUTION_TIME}} — Incident resolved
|
||||
|
||||
**Impact**: {{METRIC1}} affected {{X}} users, {{METRIC2}}, {{METRIC3}}
|
||||
|
||||
---
|
||||
|
||||
## Detailed Timeline
|
||||
|
||||
| Time | Event | Notes |
|
||||
|------|-------|-------|
|
||||
| {{T}} | {{What happened}} | {{Who detected it}} |
|
||||
| {{T+X}} | {{Next event}} | {{Action taken}} |
|
||||
| {{T+Y}} | {{Root cause identified}} | {{By whom}} |
|
||||
| {{T+Z}} | {{Fix applied}} | {{Verification steps}} |
|
||||
| {{T+Final}} | {{Incident resolved}} | {{Verification}} |
|
||||
|
||||
---
|
||||
|
||||
## Root Cause Analysis
|
||||
|
||||
### Primary Cause
|
||||
|
||||
**{{ROOT_CAUSE_TITLE}}**
|
||||
|
||||
{{Detailed explanation of the root cause}}
|
||||
|
||||
**How it happened**:
|
||||
1. {{Precondition 1}} (why the system was vulnerable)
|
||||
2. {{Trigger event}} (what caused the failure)
|
||||
3. {{Failure cascade}} (why it got worse)
|
||||
4. {{Detection lag}} (why it took X minutes to detect)
|
||||
|
||||
**Evidence**:
|
||||
- {{Log entry or metric showing the issue}}
|
||||
- {{Related system behavior}}
|
||||
- {{Impact indicator}}
|
||||
|
||||
### Contributing Factors
|
||||
|
||||
- {{Factor 1}} — {{Brief explanation}}
|
||||
- {{Factor 2}} — {{Brief explanation}}
|
||||
- {{Factor 3}} — {{Brief explanation}}
|
||||
|
||||
### Why Didn't We Catch This?
|
||||
|
||||
- {{Missing monitoring}} — {{What metric would have alerted}}
|
||||
- {{Testing gap}} — {{What test would have failed}}
|
||||
- {{Documentation gap}} — {{What runbook would have helped}}
|
||||
- {{Knowledge gap}} — {{What training would have helped}}
|
||||
|
||||
---
|
||||
|
||||
## Impact Assessment
|
||||
|
||||
### User Impact
|
||||
|
||||
- **Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION}} minutes)
|
||||
- **Scale**: {{X}}% of {{METRIC}} (e.g., 5% of payment requests)
|
||||
- **Users Affected**: {{APPROX_COUNT}} users
|
||||
- **Revenue Impact**: {{$X}} (if applicable)
|
||||
- **Customer Escalations**: {{NUMBER}} tickets opened
|
||||
|
||||
**User-facing symptoms**:
|
||||
- {{Symptom 1}} (e.g., "Checkout returns 500 error")
|
||||
- {{Symptom 2}} (e.g., "Page loads slowly")
|
||||
- {{Symptom 3}}
|
||||
|
||||
### Operational Impact
|
||||
|
||||
- **System Recovery**: {{SERVICE/METRIC}} took {{TIME}} to recover
|
||||
- **Cascading Effects**: {{SERVICE_X}} also affected due to {{reason}}
|
||||
- **On-call Load**: {{NUMBER}} pages, {{NUMBER}} escalations
|
||||
- **Data Loss**: {{None | Description}}
|
||||
|
||||
---
|
||||
|
||||
## Resolution & Recovery
|
||||
|
||||
### Immediate Actions Taken
|
||||
|
||||
1. **{{Time T+X}}** — {{Action 1}}
|
||||
- Rationale: {{Why this helped}}
|
||||
- Result: {{What changed}}
|
||||
|
||||
2. **{{Time T+Y}}** — {{Action 2}}
|
||||
- Rationale: {{Why this helped}}
|
||||
- Result: {{What changed}}
|
||||
|
||||
3. **{{Time T+Z}}** — {{Root Fix Applied}}
|
||||
- Details: {{Technical description}}
|
||||
- Verification: {{How we confirmed it worked}}
|
||||
|
||||
### Rollback/Rollforward Decision
|
||||
|
||||
**Decision**: {{Rollback to version X | Rollforward with fix | Hybrid approach}}
|
||||
|
||||
**Rationale**: {{Explain why this was the right choice}}
|
||||
|
||||
**Verification**: {{How we confirmed the fix worked}}
|
||||
|
||||
---
|
||||
|
||||
## Lessons Learned
|
||||
|
||||
### What Went Well
|
||||
|
||||
- {{Thing we did right}} — This prevented {{worse outcome}}
|
||||
- {{Thing we did right}} — Team coordination was excellent
|
||||
- {{Thing we did right}} — Monitoring caught {{something}}
|
||||
|
||||
### What We Can Improve
|
||||
|
||||
| Issue | Category | Severity | Recommendation | Owner |
|
||||
|-------|----------|----------|-----------------|-------|
|
||||
| {{We didn't detect it for X minutes}} | Observability | HIGH | Add alert for {{metric}} when > {{threshold}} | DevOps |
|
||||
| {{Runbook was outdated}} | Runbooks | MEDIUM | Update {{runbook}} with new architecture | SRE |
|
||||
| {{New service not in alerting system}} | Process | MEDIUM | Add new services to alert config automatically | Platform |
|
||||
| {{Team didn't know about new feature}} | Knowledge | LOW | Document new features in wiki | Tech Lead |
|
||||
|
||||
---
|
||||
|
||||
## Action Items
|
||||
|
||||
### Critical (Must Complete Before Similar Incident)
|
||||
|
||||
- [ ] **{{Action 1}}** — {{Description}}
|
||||
- Owner: {{NAME}}
|
||||
- Deadline: {{DATE}} (within 1 week)
|
||||
- Acceptance: {{How we verify it's done}}
|
||||
|
||||
- [ ] **{{Action 2}}** — {{Description}}
|
||||
- Owner: {{NAME}}
|
||||
- Deadline: {{DATE}} (within 1 week)
|
||||
- Acceptance: {{How we verify it's done}}
|
||||
|
||||
### High Priority (Target Next 2 Weeks)
|
||||
|
||||
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
|
||||
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
|
||||
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
|
||||
|
||||
### Medium Priority (Target This Sprint)
|
||||
|
||||
- [ ] {{Action}} — Owner: {{NAME}}
|
||||
- [ ] {{Action}} — Owner: {{NAME}}
|
||||
|
||||
### Backlog (Good to Have)
|
||||
|
||||
- [ ] {{Action}} — {{Description}}
|
||||
- [ ] {{Action}} — {{Description}}
|
||||
|
||||
---
|
||||
|
||||
## Prevention Measures
|
||||
|
||||
### Short-term (1-2 Weeks)
|
||||
|
||||
1. **{{Mitigation 1}}** — Prevents {{this exact incident}} from happening again
|
||||
- How: {{Technical approach}}
|
||||
- Effort: {{Estimate}}
|
||||
- Timeline: {{When}}
|
||||
|
||||
2. **{{Mitigation 2}}** — Catches similar issues earlier
|
||||
- How: {{Technical approach}}
|
||||
- Effort: {{Estimate}}
|
||||
- Timeline: {{When}}
|
||||
|
||||
### Long-term (Next Quarter)
|
||||
|
||||
1. **{{Large architectural change}}** — Eliminates root cause class
|
||||
- Rationale: {{Why this is better}}
|
||||
- Effort: {{Estimate}}
|
||||
- Timeline: {{When}}
|
||||
|
||||
---
|
||||
|
||||
## Incident Stats
|
||||
|
||||
```
|
||||
MTTD (Mean Time To Detect): {{MINUTES}} minutes
|
||||
- Automatic detection: {{If applicable, how}}
|
||||
- Manual detection: {{Who found it}}
|
||||
|
||||
MTTR (Mean Time To Resolve): {{MINUTES}} minutes
|
||||
- Investigation time: {{MINUTES}}
|
||||
- Fix implementation time: {{MINUTES}}
|
||||
- Verification time: {{MINUTES}}
|
||||
|
||||
Severity: {{SEV1|SEV2|SEV3}} ({{Criteria}})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Distribution & Follow-up
|
||||
|
||||
- [ ] Postmortem shared with: {{TEAM_LIST}}
|
||||
- [ ] Customer communication sent: {{YES|NO|TEMPLATE_USED}}
|
||||
- [ ] Action items tracked in: {{JIRA/BACKLOG}}
|
||||
- [ ] Follow-up review scheduled: {{DATE}}
|
||||
|
||||
**Follow-up Review**: {{DATE}} with {{ATTENDEES}}
|
||||
- Confirm all critical action items completed
|
||||
- Verify prevention measures working
|
||||
- Check for recurring patterns
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Supporting Evidence
|
||||
|
||||
### Logs
|
||||
|
||||
```
|
||||
[Relevant log entries showing the incident]
|
||||
|
||||
{{TIMESTAMP}} ERROR: {{MESSAGE}}
|
||||
{{TIMESTAMP}} ERROR: {{MESSAGE}}
|
||||
```
|
||||
|
||||
### Metrics
|
||||
|
||||
[Include screenshots or links to metric dashboards showing the incident]
|
||||
|
||||
- Error rate spike: [Chart or metric]
|
||||
- Latency spike: [Chart or metric]
|
||||
- Traffic pattern: [Chart or metric]
|
||||
|
||||
### Configuration Changes
|
||||
|
||||
```yaml
|
||||
# Changes made before incident
|
||||
- {{Change 1}} ({{TIMESTAMP}})
|
||||
- {{Change 2}} ({{TIMESTAMP}})
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Document Completed By**: {{NAME}}
|
||||
**Date**: {{DATE}}
|
||||
**Review Status**: Draft | Final | Approved
|
||||
|
||||
**Approvals**:
|
||||
- [ ] Incident Commander: {{NAME}} {{DATE}}
|
||||
- [ ] Service Owner: {{NAME}} {{DATE}}
|
||||
- [ ] VP Engineering (if SEV1): {{NAME}} {{DATE}}
|
||||
|
|
@@ -0,0 +1,163 @@
|
|||
---
|
||||
workflow_id: W-INCIDENT-001
|
||||
workflow_name: Production Incident Response
|
||||
version: 6.2.0
|
||||
lead_agent: "SRE Minh"
|
||||
supporting_agents: ["Architect Khang", "Mary Analyst"]
|
||||
phase: "3-Run: Emergency Response & Recovery"
|
||||
created_date: 2026-03-17
|
||||
last_modified: 2026-03-17
|
||||
config_file: "_config/config.yaml"
|
||||
estimated_duration: "15 minutes to 2 hours (depending on severity)"
|
||||
outputFile: '{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md'
|
||||
---
|
||||
|
||||
# Production Incident Response Workflow — BMAD Pattern
|
||||
|
||||
## Metadata & Context
|
||||
|
||||
**Goal**: Triage, diagnose, and resolve production incidents through systematic diagnosis, then apply fixes with verification. This is the most critical workflow: minimize MTTR (Mean Time To Recovery) while maintaining system stability.
|
||||
|
||||
**Lead Team**:
|
||||
- SRE Minh (Incident Command, Recovery Orchestration)
|
||||
- Architect Khang (Root Cause Analysis, System-wide Impact)
|
||||
- Mary Analyst (Impact Assessment, Post-Incident Review)
|
||||
|
||||
**Success Criteria**:
|
||||
- ✓ Incident severity classified within 5 minutes
|
||||
- ✓ Root cause identified within first triage pass
|
||||
- ✓ Fix applied and verified
|
||||
- ✓ System metrics returned to baseline
|
||||
- ✓ Incident postmortem documented with action items
|
||||
- ✓ Prevention measures identified
|
||||
|
||||
## Workflow Overview
|
||||
|
||||
This workflow runs through 4 atomic steps, each focused on a different phase:
|
||||
|
||||
1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
|
||||
2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
|
||||
3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
|
||||
4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence
|
||||
|
||||
## Configuration Loading
|
||||
|
||||
Loaded automatically from `_config/config.yaml`:
|
||||
|
||||
```yaml
|
||||
project_context:
|
||||
organization: "[loaded from config]"
|
||||
environment: "production"
|
||||
incident_channel: "slack:#incidents"
|
||||
|
||||
workflow_defaults:
|
||||
communication_language: "Vietnamese-English"
|
||||
severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
|
||||
escalation_contacts: "[loaded from config]"
|
||||
on_call_engineer: "[loaded from config]"
|
||||
```
|
||||
|
||||
## Workflow Architecture - Micro-File Design
|
||||
|
||||
BMAD pattern: each step is a separate file, loaded just-in-time. Workflow chain:
|
||||
|
||||
```
|
||||
workflow.md (entry point)
|
||||
↓
|
||||
step-01-triage.md (classify severity, initial assessment)
|
||||
↓
|
||||
step-02-diagnose.md (root cause analysis)
|
||||
↓
|
||||
step-03-fix.md (apply fix, verify)
|
||||
↓
|
||||
step-04-postmortem.md (document, prevent)
|
||||
↓
|
||||
incident-response-summary.md (final output)
|
||||
```
|
||||
|
||||
**Key Benefits**:
|
||||
- Single-step focus — engineer concentrates on one phase
|
||||
- Knowledge isolation — load only relevant SKILL docs per step
|
||||
- State tracking — save progress after each step
|
||||
- Easy resumption — if interrupted, restart from exact step
|
||||
|
||||
## Skill References
|
||||
|
||||
This workflow loads knowledge from:
|
||||
|
||||
- **5.07 Reliability & Resilience** → Circuit breaker patterns, fallback strategies, timeout management
|
||||
- **5.08 Observability & Monitoring** → Structured logging, metrics queries, distributed tracing
|
||||
- **5.09 Error Handling & Recovery** → Error classification, graceful degradation patterns
|
||||
- **5.10 Production Readiness** → Incident prevention checklist, alerting setup
|
||||
- **5.14 Documentation & Runbooks** → Postmortem templates, incident reports
|
||||
|
||||
## Execution Model
|
||||
|
||||
### Entry Point Logic
|
||||
|
||||
```
|
||||
1. Check if incident session exists
|
||||
→ If NEW incident: Start from step-01-triage.md
|
||||
→ If ONGOING: Load incident-session.yaml → continue from last completed step
|
||||
→ If RESOLVED: Load postmortem template
|
||||
|
||||
2. For each step:
|
||||
a) Load step-{N}-{name}.md
|
||||
b) Load referenced SKILL files (auto-parse "Load:" directives)
|
||||
c) Execute MENU [A][C] options
|
||||
d) Save step output to step-{N}-output.md + incident-context.yaml
|
||||
e) Move to next step or conclude
|
||||
|
||||
3. Final: Generate incident report + postmortem in outputs folder
|
||||
```
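
The resume decision in step 1 above can be sketched as follows. This is a minimal illustration under stated assumptions: the session is modeled as a plain dict, `next_step` is a hypothetical helper, and the `last_completed_step` key is an assumed field name (the workflow itself only says progress is read from `incident-session.yaml`).

```python
from typing import Optional

# Step files in the order given by the workflow chain.
STEPS = [
    "step-01-triage",
    "step-02-diagnose",
    "step-03-fix",
    "step-04-postmortem",
]


def next_step(session: Optional[dict]) -> Optional[str]:
    """Return the step to load next, or None when every step is done."""
    if session is None:
        # NEW incident: no session exists yet, so start with triage.
        return STEPS[0]
    if session.get("status") == "resolved":
        # RESOLVED: go straight to the postmortem.
        return STEPS[-1]
    last_done = session.get("last_completed_step")  # hypothetical field name
    if last_done is None:
        return STEPS[0]
    # ONGOING: continue after the last completed step.
    # STEPS.index raises ValueError if the session names an unknown step.
    idx = STEPS.index(last_done) + 1
    return STEPS[idx] if idx < len(STEPS) else None
```

Keeping the step order in one list means the resume logic never needs to know what each step does, only where it sits in the chain.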
|
||||
|
||||
### State Tracking
|
||||
|
||||
Incident session frontmatter tracks progress:
|
||||
|
||||
```yaml
|
||||
incident_context:
|
||||
incident_id: "INC-2026-03-17-001"
|
||||
severity: "SEV1"  # one of: SEV1, SEV2, SEV3, SEV4
|
||||
status: "diagnosing"  # one of, in order: triage, diagnosing, recovering, resolved, postmortem
|
||||
affected_services: ["service-1", "service-2"]
|
||||
started_at: "2026-03-17T14:30:00Z"
|
||||
timeline:
|
||||
detected_at: "2026-03-17T14:30:00Z"
|
||||
triage_completed_at: "2026-03-17T14:35:00Z"
|
||||
root_cause_identified_at: "2026-03-17T14:50:00Z"
|
||||
fix_applied_at: "2026-03-17T15:10:00Z"
|
||||
resolved_at: "2026-03-17T15:15:00Z"
|
||||
current_step: "step-02-diagnose"
|
||||
last_updated: "2026-03-17T14:50:00Z"
|
||||
incident_commander: "SRE Minh"
|
||||
```
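
The `timeline` timestamps are enough to derive per-phase durations for the postmortem. The sketch below assumes the timestamps are UTC strings in the format shown above; `minutes_between` and `phase_durations` are hypothetical helper names, and the phase labels are illustrative rather than defined by the workflow.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def minutes_between(start: str, end: str) -> float:
    """Minutes elapsed between two timestamps such as 2026-03-17T14:30:00Z."""
    delta = datetime.strptime(end, TIMESTAMP_FORMAT) - datetime.strptime(start, TIMESTAMP_FORMAT)
    return delta.total_seconds() / 60


def phase_durations(timeline: dict) -> dict:
    """Break an incident timeline into per-phase durations, in minutes."""
    return {
        "triage": minutes_between(timeline["detected_at"], timeline["triage_completed_at"]),
        "diagnosis": minutes_between(timeline["triage_completed_at"], timeline["root_cause_identified_at"]),
        "fix": minutes_between(timeline["root_cause_identified_at"], timeline["fix_applied_at"]),
        "verification": minutes_between(timeline["fix_applied_at"], timeline["resolved_at"]),
        "time_to_resolve": minutes_between(timeline["detected_at"], timeline["resolved_at"]),
    }


# The sample values from the incident context above.
timeline = {
    "detected_at": "2026-03-17T14:30:00Z",
    "triage_completed_at": "2026-03-17T14:35:00Z",
    "root_cause_identified_at": "2026-03-17T14:50:00Z",
    "fix_applied_at": "2026-03-17T15:10:00Z",
    "resolved_at": "2026-03-17T15:15:00Z",
}
durations = phase_durations(timeline)
# triage 5.0, diagnosis 15.0, fix 20.0, verification 5.0, time_to_resolve 45.0
```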
|
||||
|
||||
## Mandatory Workflow Rules
|
||||
|
||||
1. **Speed first** — Triage must complete in < 5 minutes
|
||||
2. **Root cause identification** — Must identify root cause before fix attempt
|
||||
3. **Verify before declaring resolved** — Check metrics + user reports
|
||||
4. **Document everything** — Every action logged for postmortem
|
||||
5. **Escalation protocol** — SEV1 → Page on-call architect immediately
|
||||
6. **Communication** — Update stakeholders every 5-10 minutes
|
||||
7. **No flying blind** — All fixes must reference observability data
|
||||
|
||||
## Severity Scale
|
||||
|
||||
- **SEV1** — Service completely down, revenue impact, > 1% users affected → Page all on-call
|
||||
- **SEV2** — Major degradation, significant users affected, partial functionality down
|
||||
- **SEV3** — Moderate impact, some users affected, workaround possible
|
||||
- **SEV4** — Minor issue, limited users, can defer to business hours
|
||||
|
||||
## Navigation
|
||||
|
||||
Choose how to start:
|
||||
|
||||
- **[NEW-INC]** — Report new incident → Load step-01-triage
|
||||
- **[RESUME-INC]** — Continue existing incident (detect progress from incident-session.yaml)
|
||||
- **[ESCALATE]** — Escalate to on-call architect
|
||||
|
||||
---
|
||||
|
||||
**Report the incident status or choose [NEW-INC] to begin triage**
|
||||
|
|
@@ -0,0 +1,6 @@
|
|||
---
|
||||
name: bmad-psm-mlops-deployment
|
||||
description: 'Deploy ML model to production with validation and monitoring. Use when the user says "deploy model" or "ML deployment" or "model serving"'
|
||||
---
|
||||
|
||||
Follow the instructions in [workflow.md](workflow.md).
|
||||
|
|
@@ -0,0 +1 @@
|
|||
type: skill
|
||||
|
|
@@ -0,0 +1,89 @@
|
|||
---
|
||||
workflow_id: MLOPS001
|
||||
workflow_name: MLOps Deployment
|
||||
description: Deploy ML model to production with validation, serving, and monitoring
|
||||
entry_point: steps/step-01-model-validation.md
|
||||
phase: 5-specialized
|
||||
lead_agent: "Linh (MLOps)"
|
||||
status: "active"
|
||||
created_date: 2026-03-17
|
||||
version: "1.0.0"
|
||||
estimated_duration: "3-4 hours"
|
||||
outputFile: '{output_folder}/psm-artifacts/mlops-deploy-{{project_name}}-{{date}}.md'
|
||||
---
|
||||
|
||||
# Workflow: MLOps Deployment
|
||||
|
||||
## Goal
|
||||
Deploy machine learning models to production with comprehensive validation, infrastructure setup, and post-deployment monitoring.
|
||||
|
||||
## Overview
|
||||
|
||||
MLOps deployment ensures ML models are production-ready and continuously monitored for performance and data drift. The workflow:
|
||||
|
||||
1. **Validates** model quality, performance metrics, and data drift detection
|
||||
2. **Deploys** model to serving infrastructure with versioning and A/B testing
|
||||
3. **Monitors** model performance, data drift, and cost metrics post-deployment
|
||||
|
||||
## Execution Path
|
||||
|
||||
```
|
||||
START
|
||||
↓
|
||||
[Step 01] Model Validation (Check metrics, data drift, A/B test plan)
|
||||
↓
|
||||
[Step 02] Deploy Model (Setup serving, infrastructure, GPU optimization)
|
||||
↓
|
||||
[Step 03] Monitor (Langfuse/MLflow, drift detection, cost tracking)
|
||||
↓
|
||||
END
|
||||
```
|
||||
|
||||
## Key Roles
|
||||
|
||||
| Role | Agent | Responsibility |
|
||||
|------|-------|-----------------|
|
||||
| Lead | Linh (MLOps) | Coordinate deployment, monitor model health |
|
||||
| Data Scientist | Data Lead | Validate model quality, approve for production |
|
||||
| DevOps | Platform Eng | Setup infrastructure, manage resources |
|
||||
|
||||
## Validation Gates (3)
|
||||
|
||||
1. **Model Quality** — Accuracy, precision, recall metrics meet SLO
|
||||
2. **Data Quality** — No data drift detected; training/production data distribution aligned
|
||||
3. **Business Readiness** — A/B test plan ready, rollback strategy defined
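
One way to make the three gates explicit is a small check that returns a pass/fail result per gate. This is a sketch under assumptions, not the workflow's implementation: `GateInputs`, `evaluate_gates`, and all field names are hypothetical, and the SLO targets are supplied by the caller because the workflow does not fix any values.

```python
from dataclasses import dataclass


@dataclass
class GateInputs:
    metrics: dict             # measured values, e.g. {"accuracy": 0.93, "recall": 0.88}
    slo_targets: dict         # minimum acceptable value per metric
    drift_detected: bool      # result of whatever drift check the team uses
    ab_test_plan_ready: bool
    rollback_defined: bool


def evaluate_gates(inputs: GateInputs) -> dict:
    """Return True or False for each of the three validation gates."""
    # Gate 1: every metric with an SLO target must meet or exceed it.
    # A metric that is missing from the measurements counts as failing.
    model_quality = all(
        name in inputs.metrics and inputs.metrics[name] >= target
        for name, target in inputs.slo_targets.items()
    )
    return {
        "model_quality": model_quality,
        "data_quality": not inputs.drift_detected,
        "business_readiness": inputs.ab_test_plan_ready and inputs.rollback_defined,
    }


def ready_to_deploy(inputs: GateInputs) -> bool:
    """Deployment proceeds only when all three gates pass."""
    return all(evaluate_gates(inputs).values())
```

Returning a result per gate, rather than a single boolean, lets the deployment report state which gate blocked a release.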
|
||||
|
||||
## Input Requirements
|
||||
|
||||
- **Trained model artifact** — Model checkpoint, weights, configuration
|
||||
- **Performance metrics** — Baseline accuracy, latency, throughput expectations
|
||||
- **Data validation** — Training dataset description, expected data distribution
|
||||
- **Serving infrastructure** — Compute requirements (GPU/CPU), latency targets
|
||||
|
||||
## Output Deliverable
|
||||
|
||||
- **MLOps Deployment Report**
|
||||
- Model version and metadata
|
||||
- Performance validation summary
|
||||
- Serving infrastructure setup
|
||||
- Monitoring dashboard and alerts
|
||||
- Data drift detection configuration
|
||||
|
||||
## Success Criteria
|
||||
|
||||
1. Model passes all quality gates before deployment
|
||||
2. Serving infrastructure deployed and load-tested
|
||||
3. Monitoring and alerting configured and validated
|
||||
4. Rollback strategy tested and documented
|
||||
5. Team trained on model updates and incident response
|
||||
|
||||
## Next Steps After Workflow
|
||||
|
||||
- Monitor model performance daily for first week
|
||||
- Track data drift metrics; alert if detected
|
||||
- Plan model retraining based on performance degradation
|
||||
- Document lessons learned in MLOps runbook
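
The workflow does not name a drift metric. As one possible illustration, the sketch below uses the Population Stability Index (PSI), a technique chosen here as an assumption rather than taken from the source. It compares the binned distribution of a feature in training data against production data; the function name and the `eps` floor for empty bins are hypothetical choices.

```python
import math


def population_stability_index(expected, actual, eps=1e-6):
    """PSI between two binned distributions given as lists of proportions.

    expected: share of training samples per bin
    actual:   share of production samples per bin (same bin edges)
    """
    if len(expected) != len(actual):
        raise ValueError("both distributions need the same number of bins")
    psi = 0.0
    for e, a in zip(expected, actual):
        # Floor empty bins at eps to avoid log(0) and division by zero.
        e, a = max(e, eps), max(a, eps)
        psi += (a - e) * math.log(a / e)
    return psi
```

Identical distributions give a PSI of 0, and the value grows as production data moves away from training data. Alert thresholds such as 0.1 or 0.25 are common rules of thumb, not values defined by this workflow, and should be tuned per feature.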
|
||||
|
||||
---
|
||||
|
||||
**Navigation**: [← Back to 5-specialized](../), [Next: Step 01 →](steps/step-01-model-validation.md)
|
||||
|
|
@@ -0,0 +1,6 @@
|
|||
---
|
||||
name: bmad-psm-production-readiness
|
||||
description: 'Run production readiness review across 9 dimensions. Use when the user says "are we ready for production" or "PRR" or "go-live check"'
|
||||
---
|
||||
|
||||
Follow the instructions in [workflow.md](workflow.md).
|
||||
|
|
@@ -0,0 +1 @@
|
|||
type: skill
|
||||
|
|
@@ -0,0 +1,367 @@
|
|||
---
|
||||
template_name: production-readiness-checklist
|
||||
template_version: "1.0.0"
|
||||
created_date: 2026-03-17
|
||||
description: Production Readiness Review checklist and report template
|
||||
---
|
||||
|
||||
# Production Readiness Review (PRR)
|
||||
|
||||
**Service**: {{SERVICE_NAME}}
|
||||
**Owner**: {{SERVICE_OWNER}}
|
||||
**Reviewer**: {{SRE_LEAD}} (Minh)
|
||||
**Review Date**: {{DATE}}
|
||||
**Target Go-Live**: {{TARGET_DATE}}
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
{{1-2 paragraphs summarizing the readiness assessment, decision, and key findings}}
|
||||
|
||||
**Overall Assessment**: {{READY | CONDITIONAL | NOT_READY}}
|
||||
|
||||
**Timeline**: Service {{can | can conditionally | cannot}} proceed to production on {{DATE}}
|
||||
|
||||
---
|
||||
|
||||
## Production Readiness Scorecard
|
||||
|
||||
### 9-Dimension Assessment
|
||||
|
||||
| # | Dimension | Score | Status | Key Finding |
|
||||
|---|-----------|-------|--------|-------------|
|
||||
| 1 | Reliability | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 2 | Observability | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 3 | Performance | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 4 | Security | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 5 | Capacity | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 6 | Data | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 7 | Runbooks | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 8 | Dependencies | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
| 9 | Rollback | {{GREEN|YELLOW|RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
||||
|
||||
**Summary**: {{X}} GREEN, {{Y}} YELLOW, {{Z}} RED
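
The summary line and an overall assessment can be derived from the nine scores. The template does not state how scores map to READY, CONDITIONAL, or NOT_READY, so the rule in this sketch is an assumption: any RED blocks go-live, any YELLOW makes it conditional. The function name `summarize_scorecard` is hypothetical.

```python
from collections import Counter

VALID_SCORES = {"GREEN", "YELLOW", "RED"}


def summarize_scorecard(scores: dict) -> tuple:
    """Return (summary line, overall assessment) for a dimension -> score mapping."""
    unknown = set(scores.values()) - VALID_SCORES
    if unknown:
        raise ValueError(f"unexpected scores: {sorted(unknown)}")
    counts = Counter(scores.values())
    # Assumed rule: any RED blocks go-live, any YELLOW makes it conditional.
    if counts["RED"]:
        overall = "NOT_READY"
    elif counts["YELLOW"]:
        overall = "CONDITIONAL"
    else:
        overall = "READY"
    summary = f'{counts["GREEN"]} GREEN, {counts["YELLOW"]} YELLOW, {counts["RED"]} RED'
    return summary, overall
```

With seven GREEN and two YELLOW dimensions, this returns `("7 GREEN, 2 YELLOW, 0 RED", "CONDITIONAL")`.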
|
||||
|
||||
---
|
||||
|
||||
## Detailed Findings by Dimension
|
||||
|
||||
### 1. Reliability
|
||||
|
||||
**Goal**: Service meets SLO targets with documented failure modes and incident response plan.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 2. Observability
|
||||
|
||||
**Goal**: Service has comprehensive logging, metrics, tracing, and dashboards for operational visibility.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 3. Performance
|
||||
|
||||
**Goal**: Service meets latency/throughput targets and scales under expected load.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 4. Security
|
||||
|
||||
**Goal**: Authentication, authorization, encryption, and secrets management are implemented.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 5. Capacity
|
||||
|
||||
**Goal**: Resource requirements are defined with growth headroom, and projected cost is acceptable.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 6. Data
|
||||
|
||||
**Goal**: Data governance, backup, retention, and disaster recovery documented and tested.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 7. Runbooks
|
||||
|
||||
**Goal**: Incident response, deployment, and troubleshooting procedures are documented and drilled.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 8. Dependencies
|
||||
|
||||
**Goal**: External/internal dependencies mapped, versioned, with fallback strategies.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
### 9. Rollback
|
||||
|
||||
**Goal**: Safe rollback strategy tested; deployment is reversible.
|
||||
|
||||
**Findings**:
|
||||
|
||||
- [ ] {{Finding 1}} ({{Status}})
|
||||
- [ ] {{Finding 2}} ({{Status}})
|
||||
- [ ] {{Finding 3}} ({{Status}})
|
||||
|
||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
||||
|
||||
**Score**: {{GREEN|YELLOW|RED}}
|
||||
|
||||
---
|
||||
|
||||
## Critical Blockers (P0)
|
||||
|
||||
{{If any P0 blockers exist:}}
|
||||
|
||||
Service **CANNOT** proceed to production until these are resolved:
|
||||
|
||||
### P0 Blocker #1: {{ISSUE_TITLE}}
|
||||
|
||||
- **Dimension**: {{Which dimension}}
|
||||
- **Description**: {{What's the problem}}
|
||||
- **Impact**: {{Why it's critical}}
|
||||
- **Resolution**: {{How to fix}}
|
||||
- **Owner**: {{Who must fix it}}
|
||||
- **Deadline**: {{When it must be done}}
|
||||
- **Acceptance**: {{How we verify it's fixed}}
|
||||
|
||||
### P0 Blocker #2: {{ISSUE_TITLE}}
|
||||
|
||||
{{Repeat format}}
|
||||
|
||||
---
|
||||
|
||||
## Risks to Manage (P1)
|
||||
|
||||
Service can proceed with documented monitoring and contingency plans:
|
||||
|
||||
### P1 Risk #1: {{ISSUE_TITLE}}
|
||||
|
||||
- **Dimension**: {{Which dimension}}
|
||||
- **Description**: {{What's the problem}}
|
||||
- **Impact**: {{If it happens, what's the consequence}}
|
||||
- **Likelihood**: {{HIGH|MEDIUM|LOW}}
|
||||
- **Mitigation**: {{How we'll manage it}}
|
||||
- **Monitoring**: {{What metrics to watch}}
|
||||
- **Contingency**: {{What we'll do if it occurs}}
|
||||
- **Owner**: {{Who owns this risk}}
|
||||
- **Target Fix**: {{Timeline to resolve permanently}}
|
||||
|
||||
### P1 Risk #2: {{ISSUE_TITLE}}
|
||||
|
||||
{{Repeat format}}
|
||||
|
||||
---
|
||||
|
||||
## Recommendations
|
||||
|
||||
**High Priority** (Next sprint):
|
||||
- {{Recommendation 1}}
|
||||
- {{Recommendation 2}}
|
||||
|
||||
**Medium Priority** (Within 1 month):
|
||||
- {{Recommendation 1}}
|
||||
- {{Recommendation 2}}
|
||||
|
||||
**Nice to Have** (Backlog):
|
||||
- {{Recommendation 1}}
|
||||
- {{Recommendation 2}}
|
||||
|
||||
---
|
||||
|
||||
## Final Decision
|
||||
|
||||
### Decision
|
||||
|
||||
**{{ ✅ GO | ⚠️ CONDITIONAL-GO | ❌ NO-GO }}**
|
||||
|
||||
### Rationale
|
||||
|
||||
{{Explain the decision. Why can/can't we proceed?}}
|
||||
|
||||
### Conditions (If CONDITIONAL-GO)
|
||||
|
||||
If proceeding despite P1 risks, document conditions:
|
||||
|
||||
1. **{{Condition 1}}**: {{Description}}
|
||||
- Owner: {{Who oversees this}}
|
||||
- Success Criteria: {{How we verify it}}
|
||||
- Escalation: {{Who to contact if issues}}
|
||||
|
||||
2. **{{Condition 2}}**: {{Description}}
|
||||
- Owner: {{Who oversees this}}
|
||||
- Success Criteria: {{How we verify it}}
|
||||
- Escalation: {{Who to contact if issues}}
|
||||
|
||||
### Deployment Timeline
|
||||
|
||||
{{If GO or CONDITIONAL-GO:}}
|
||||
|
||||
- **Approved for deployment**: {{DATE}}
|
||||
- **Earliest go-live**: {{DATE}}
|
||||
- **Recommended window**: {{DATE/TIME}}
|
||||
- **On-call coverage required**: {{YES|NO}}
|
||||
- **Emergency rollback plan**: {{REFERENCE TO RUNBOOK}}
|
||||
|
||||
---
|
||||
|
||||
## Sign-offs & Approvals
|
||||
|
||||
### Approval Chain
|
||||
|
||||
- [ ] **SRE Lead** ({{NAME}}) — Review completed and findings approved
|
||||
- Signature: ________________________ Date: __________
|
||||
|
||||
- [ ] **Architecture Lead** ({{NAME}}) — Architecture validated
|
||||
- Signature: ________________________ Date: __________
|
||||
|
||||
- [ ] **Service Owner** ({{NAME}}) — Acknowledged findings and committed to actions
|
||||
- Signature: ________________________ Date: __________
|
||||
|
||||
- [ ] **VP Engineering** ({{NAME}}) — Risk accepted (if CONDITIONAL-GO)
|
||||
- Signature: ________________________ Date: __________
|
||||
|
||||
---
|
||||
|
||||
## Post-Production Plan
|
||||
|
||||
### First 24 Hours
|
||||
|
||||
- [ ] SRE on-call monitoring closely
|
||||
- [ ] Daily standup with service team
|
||||
- [ ] Monitor for any unusual patterns
|
||||
- [ ] Be ready to rollback if needed
|
||||
|
||||
### First Week
|
||||
|
||||
- [ ] Daily metrics review
|
||||
- [ ] Watch for data drift or unusual behavior
|
||||
- [ ] Follow up on any P1 risks
|
||||
|
||||
### Ongoing
|
||||
|
||||
- [ ] Monthly PRR follow-ups to verify improvements
|
||||
- [ ] Track action items to completion
|
||||
- [ ] Update this PRR if significant changes made
|
||||
|
||||
---
|
||||
|
||||
## Action Items
|
||||
|
||||
| ID | Action | Owner | Deadline | Type | Status |
|
||||
|----|--------|-------|----------|------|--------|
|
||||
| A1 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
|
||||
| A2 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
|
||||
| A3 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER|RISK|RECOMMENDATION}} | ☐ |
|
||||
|
||||
---
|
||||
|
||||
## Appendix
|
||||
|
||||
### A. Load Test Results
|
||||
|
||||
[Link to or summary of load test results showing service meets performance targets]
|
||||
|
||||
### B. Security Review Results
|
||||
|
||||
[Link to or summary of security audit findings]
|
||||
|
||||
### C. Architecture Diagrams
|
||||
|
||||
[Include or link to system architecture, data flow, and deployment topology]
|
||||
|
||||
### D. SLO Definition
|
||||
|
||||
[Document the agreed-upon SLO targets for availability, latency, error rate]
|
||||
|
||||
### E. Runbooks
|
||||
|
||||
[Link to or list of key runbooks: incident response, deployment, rollback, troubleshooting]
|
||||
|
||||
---
|
||||
|
||||
**Report prepared by**: {{SRE_LEAD}}
|
||||
**Report date**: {{DATE}}
|
||||
**Last updated**: {{DATE}}
|
||||
|
|
@ -0,0 +1,92 @@
|
|||
---
|
||||
workflow_id: PRR001
|
||||
workflow_name: Production Readiness Review
|
||||
description: Validate service is ready for production using comprehensive readiness checklist
|
||||
entry_point: steps/step-01-init-checklist.md
|
||||
phase: 3-run
|
||||
lead_agent: "Minh (SRE)"
|
||||
status: "active"
|
||||
created_date: 2026-03-17
|
||||
version: "1.0.0"
|
||||
estimated_duration: "2-3 hours"
|
||||
outputFile: '{output_folder}/psm-artifacts/prr-{{project_name}}-{{date}}.md'
|
||||
---
|
||||
|
||||
# Workflow: Production Readiness Review (PRR)
|
||||
|
||||
## Goal
|
||||
Validate and certify that a service meets production readiness standards across 9 key dimensions before deployment.
|
||||
|
||||
## Overview
|
||||
|
||||
This workflow systematically evaluates a service against production readiness criteria defined in the Production Systems BMAD skill framework. Using SRE expertise and architectural patterns, the workflow:
|
||||
|
||||
1. **Initializes** the PRR process with service context and dimensional overview
|
||||
2. **Deep reviews** each dimension (reliability, observability, performance, security, capacity, data, runbooks, dependencies, rollback)
|
||||
3. **Renders final decision** with GO/NO-GO/CONDITIONAL-GO recommendation
|
||||
|
||||
## Execution Path
|
||||
|
||||
```
|
||||
START
|
||||
↓
|
||||
[Step 01] Init Checklist (Load framework, gather service context, present dimensions)
|
||||
↓
|
||||
[Step 02] Deep Review (Score each dimension, identify blockers, recommendations)
|
||||
↓
|
||||
[Step 03] Final Decision (Scorecard, decision, action items, DONE)
|
||||
↓
|
||||
END
|
||||
```
|
||||
|
||||
## Key Roles
|
||||
|
||||
| Role | Agent | Responsibility |
|
||||
|------|-------|-----------------|
|
||||
| Lead | Minh (SRE) | Navigate workflow, coordinate review, make final call |
|
||||
| Subject Matter | Service Owner | Provide service context, clarify architecture |
|
||||
| Review Committee | Arch, SecOps, MLOps | Contribute expertise on specific dimensions |
|
||||
|
||||
## Dimensions Evaluated (9)
|
||||
|
||||
1. **Reliability** — SLA/SLO definition, error budgets, failure modes, incident response
|
||||
2. **Observability** — Logging, metrics, tracing, dashboards, alerting
|
||||
3. **Performance** — Latency targets, throughput, P99 tail behavior, optimization opportunities
|
||||
4. **Security** — Auth/authz, secrets management, encryption, audit logging, compliance
|
||||
5. **Capacity** — Resource limits, scaling policies, burst capacity, cost projections
|
||||
6. **Data** — Schema versioning, backup/restore, data governance, retention policies
|
||||
7. **Runbooks** — Incident runbooks, operational playbooks, troubleshooting guides
|
||||
8. **Dependencies** — External services, internal libraries, database versioning, API contracts
|
||||
9. **Rollback** — Rollback strategy, canary deployment, feature flags, smoke tests
|
||||
|
||||
## Input Requirements
|
||||
|
||||
- **Service name and owner** — Which service are we evaluating?
|
||||
- **Current architecture** — High-level design, tech stack, topology
|
||||
- **Existing metrics/dashboards** — Links to monitoring, SLO definitions
|
||||
- **Known gaps/risks** — Already identified issues to address
|
||||
|
||||
## Output Deliverable
|
||||
|
||||
- **Production Readiness Checklist** (template: `production-readiness.template.md`)
|
||||
- Scorecard with 9 dimensions (red/yellow/green)
|
||||
- Blockers and recommendations per dimension
|
||||
- Final GO/NO-GO/CONDITIONAL-GO decision
|
||||
- Explicit action items with owners and deadlines
|
||||
|
||||
## Success Criteria
|
||||
|
||||
1. All 9 dimensions evaluated with clear rationale
|
||||
2. Blockers categorized as P0 (must fix) or P1 (should fix)
|
||||
3. Team alignment on decision (documented in PRR report)
|
||||
4. Action plan with clear accountability and timeline
|
||||
|
||||
## Next Steps After Workflow
|
||||
|
||||
- If **GO**: Proceed to deployment; document in CHANGELOG
|
||||
- If **NO-GO**: Reschedule PRR once blockers addressed; track in backlog
|
||||
- If **CONDITIONAL-GO**: Deploy with documented caveats; set up monitoring for risk areas
|
||||
|
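One way to read the decision logic is as a function of open blockers and risks. The sketch below is an assumption based on the PRR template wording (P0 blockers stop the release, P1 risks allow a conditional release); the workflow itself leaves the final call to the lead agent, and `prr_decision` is a hypothetical helper:

```python
# Next steps per decision, paraphrasing the list above.
NEXT_STEPS = {
    "GO": "Proceed to deployment; document in CHANGELOG",
    "NO-GO": "Reschedule PRR once blockers are addressed; track in backlog",
    "CONDITIONAL-GO": "Deploy with documented caveats; monitor risk areas",
}


def prr_decision(open_p0: int, open_p1: int) -> str:
    """Map open P0 blocker and P1 risk counts to a decision (assumed rule)."""
    if open_p0 < 0 or open_p1 < 0:
        raise ValueError("counts must be non-negative")
    if open_p0 > 0:
        return "NO-GO"           # any unresolved P0 blocks production
    if open_p1 > 0:
        return "CONDITIONAL-GO"  # P1 risks need documented conditions
    return "GO"


decision = prr_decision(open_p0=0, open_p1=2)
print(decision, "->", NEXT_STEPS[decision])
```

A real review may still override this rule, for example when a single P1 risk is judged too severe to accept.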
||||
---
|
||||
|
||||
**Navigation**: [← Back to 3-run](../), [Next: Step 01 →](steps/step-01-init-checklist.md)
|
||||
|
|
@ -0,0 +1,6 @@
|
|||
---
|
||||
name: bmad-psm-quick-diagnose
|
||||
description: 'Quick diagnosis of production issue with minimal latency. Use when the user says "something is broken" or "quick diagnose" or "what is happening?"'
|
||||
---
|
||||
|
||||
Follow the instructions in [workflow.md](workflow.md).
|
||||
|
|
@ -0,0 +1 @@
|
|||
type: skill
|
||||
|
|
@ -0,0 +1,80 @@
|
|||
---
|
||||
workflow_id: QD001
|
||||
workflow_name: Quick Diagnose
|
||||
description: Fast diagnosis of production issue with root cause and fix suggestion
|
||||
entry_point: steps/step-01-gather.md
|
||||
phase: quick-flow
|
||||
lead_agent: "Minh (SRE)"
|
||||
status: "active"
|
||||
created_date: 2026-03-17
|
||||
version: "1.0.0"
|
||||
estimated_duration: "15-25 minutes"
|
||||
outputFile: '{output_folder}/psm-artifacts/quick-diagnose-{{date}}.md'
|
||||
---
|
||||
|
||||
# Workflow: Quick Diagnose Production Issue
|
||||
|
||||
## Goal
|
||||
Rapidly diagnose production issues by gathering symptom data, checking metrics, and suggesting fixes.
|
||||
|
||||
## Overview
|
||||
|
||||
Quick Diagnose is a lightweight workflow for time-sensitive production troubleshooting:
|
||||
|
||||
1. **Gathers** symptom description and quick metrics check
|
||||
2. **Diagnoses** root cause using observability data
|
||||
3. **Suggests** fix or mitigation immediately
|
||||
|
||||
## Execution Path
|
||||
|
||||
```
|
||||
START
|
||||
↓
|
||||
[Step 01] Gather Context (What's broken? Check metrics)
|
||||
↓
|
||||
[Step 02] Diagnose & Fix (Root cause analysis → fix suggestion → verify)
|
||||
↓
|
||||
END
|
||||
```
|
||||
|
||||
## Key Roles
|
||||
|
||||
| Role | Agent |
|
||||
|------|-------|
|
||||
| Lead | Minh (SRE) |
|
||||
|
||||
## Input Requirements
|
||||
|
||||
- **Symptom description** — What is failing? (error message, behavior, timeline)
|
||||
- **Affected service/component** — What system is broken?
|
||||
- **Timeline** — When did it start? Is it ongoing?
|
||||
- **Impact** — How many users affected? Is revenue impacted?
|
||||
|
||||
## Output Deliverable
|
||||
|
||||
- **Quick Diagnosis Report** (markdown, 1-2 pages)
|
||||
- Symptom analysis
|
||||
- Root cause hypothesis
|
||||
- Immediate mitigation (if needed)
|
||||
- Fix suggestion with effort
|
||||
- Follow-up actions
|
||||
|
||||
## Success Criteria
|
||||
|
||||
1. Root cause identified within 15-20 minutes
|
||||
2. Immediate mitigation available (if needed)
|
||||
3. Fix suggestion documented with clear steps
|
||||
4. Team knows what to do next
|
||||
|
||||
## Quick Diagnose vs Full Production Readiness Review
|
||||
|
||||
| Aspect | Quick Diagnose | Full PRR |
|
||||
|--------|---|---|
|
||||
| Trigger | Active incident | Pre-deployment |
|
||||
| Duration | 15-25 min | 2-3 hours |
|
||||
| Scope | Single issue | All 9 dimensions |
|
||||
| Goal | Fix now | Prevent issues |
|
||||
|
||||
---
|
||||
|
||||
**Navigation**: [← Back to quick-flow](../), [Next: Step 01 →](steps/step-01-gather.md)
|
||||
|
|
@ -0,0 +1,6 @@
|
|||
---
|
||||
name: bmad-psm-security-audit
|
||||
description: 'Run comprehensive security audit and threat assessment. Use when the user says "security audit" or "vulnerability assessment" or "security review"'
|
||||
---
|
||||
|
||||
Follow the instructions in [workflow.md](workflow.md).
|
||||
|
|
@ -0,0 +1 @@
|
|||
type: skill
|
||||
|
|
@ -0,0 +1,502 @@
|
|||
---
|
||||
template_name: security-audit-report
|
||||
template_version: "1.0.0"
|
||||
created_date: 2026-03-17
|
||||
description: Security audit report with findings, severity levels, and remediation plan
|
||||
---
|
||||
|
||||
# Security Audit Report
|
||||
|
||||
**Service**: {{SERVICE_NAME}}
|
||||
**Service Owner**: {{SERVICE_OWNER}}
|
||||
**Auditor**: {{SECURITY_LEAD}} (Hà)
|
||||
**Audit Date**: {{START_DATE}} — {{END_DATE}}
|
||||
**Report Date**: {{REPORT_DATE}}
|
||||
**Scope**: {{SCOPE_DESCRIPTION}}
|
||||
|
||||
---
|
||||
|
||||
## Executive Summary
|
||||
|
||||
This security audit evaluated {{SERVICE_NAME}} against security best practices and compliance requirements. The assessment identified {{X}} findings across {{Y}} security domains.
|
||||
|
||||
**Overall Security Posture**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
||||
|
||||
{{1-2 paragraph summary of key findings, critical issues if any, and recommendations}}
|
||||
|
||||
---
|
||||
|
||||
## Audit Scope
|
||||
|
||||
### Services Reviewed
|
||||
|
||||
- {{Service 1}} ({{Description}})
|
||||
- {{Service 2}} ({{Description}})
|
||||
- {{Service 3}} ({{Description}})
|
||||
|
||||
### Assessment Domains
|
||||
|
||||
- ✅ Authentication & Authorization
|
||||
- ✅ API Security
|
||||
- ✅ Secrets Management
|
||||
- ✅ Encryption (in-transit & at-rest)
|
||||
- ✅ PII & Data Protection
|
||||
|
||||
### Exclusions
|
||||
|
||||
{{Any out-of-scope areas:}}
|
||||
- {{Item}} (reason)
|
||||
- {{Item}} (reason)
|
||||
|
||||
---
|
||||
|
||||
## Findings Summary
|
||||
|
||||
### By Severity
|
||||
|
||||
| Severity | Count | Trend |
|
||||
|----------|-------|-------|
|
||||
| **Critical** | {{X}} | {{↑/→/↓}} |
|
||||
| **High** | {{Y}} | {{↑/→/↓}} |
|
||||
| **Medium** | {{Z}} | {{↑/→/↓}} |
|
||||
| **Low** | {{W}} | {{↑/→/↓}} |
|
||||
| **Total** | {{X+Y+Z+W}} | |
|
||||
|
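The severity counts can be produced from raw scores. A minimal sketch, assuming findings carry a CVSS v3.x base score and using that standard's qualitative bands (0.1-3.9 Low, 4.0-6.9 Medium, 7.0-8.9 High, 9.0-10.0 Critical); the `cvss_severity` helper and the sample scores are hypothetical:

```python
from collections import Counter


def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS score out of range: {score}")
    if score == 0.0:
        return "NONE"
    if score < 4.0:
        return "LOW"
    if score < 7.0:
        return "MEDIUM"
    if score < 9.0:
        return "HIGH"
    return "CRITICAL"


# Hypothetical finding IDs and scores, for illustration only.
sample_findings = {"F1": 9.8, "F2": 7.5, "F3": 5.3, "F4": 2.1}

by_severity = Counter(cvss_severity(s) for s in sample_findings.values())
for level in ("CRITICAL", "HIGH", "MEDIUM", "LOW"):
    print(f"{level}: {by_severity[level]}")
print("Total:", sum(by_severity.values()))
```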
||||
### By Domain
|
||||
|
||||
| Domain | Critical | High | Medium | Low | Status |
|
||||
|--------|----------|------|--------|-----|--------|
|
||||
| Auth & Authz | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
|
||||
| API Security | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
|
||||
| Secrets Mgmt | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
|
||||
| Encryption | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
|
||||
| PII & Data | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
|
||||
|
||||
---
|
||||
|
||||
## Critical Severity Findings
|
||||
|
||||
### [F1] {{Finding Title}}
|
||||
|
||||
**Severity**: CRITICAL (CVSS {{9.0-10.0}})
|
||||
**Domain**: {{Which domain}}
|
||||
**Status**: {{Open | In Progress | Resolved}}
|
||||
|
||||
**Description**:
|
||||
{{Detailed description of the vulnerability, how it could be exploited, and the impact}}
|
||||
|
||||
**Evidence**:
|
||||
- {{Evidence 1}}
|
||||
- {{Evidence 2}}
|
||||
- {{Testing confirmation}}
|
||||
|
||||
**Impact**:
|
||||
- {{Business impact}}
|
||||
- {{Technical impact}}
|
||||
- {{Compliance impact}}
|
||||
|
||||
**Remediation**:
|
||||
1. {{Step 1}} ({{Estimated time}})
|
||||
2. {{Step 2}} ({{Estimated time}})
|
||||
3. {{Step 3}} ({{Estimated time}})
|
||||
|
||||
**Owner**: {{Name}}
|
||||
**Target Fix Date**: {{DATE}}
|
||||
**Effort**: {{Est. hours/days}}
|
||||
**Verification**: {{How we'll confirm it's fixed}}
|
||||
|
||||
---
|
||||
|
||||
### [F2] {{Finding Title}}
|
||||
|
||||
{{Repeat Critical severity format}}
|
||||
|
||||
---
|
||||
|
||||
## High Severity Findings
|
||||
|
||||
### [F3] {{Finding Title}}
|
||||
|
||||
**Severity**: HIGH (CVSS {{7.0-8.9}})
|
||||
**Domain**: {{Which domain}}
|
||||
**Status**: {{Open | In Progress | Resolved}}
|
||||
|
||||
**Description**: {{Brief description}}
|
||||
|
||||
**Impact**: {{Why it matters}}
|
||||
|
||||
**Remediation**:
|
||||
1. {{Step 1}}
|
||||
2. {{Step 2}}
|
||||
|
||||
**Owner**: {{Name}}
|
||||
**Target Date**: {{DATE}}
|
||||
|
||||
---
|
||||
|
||||
### [F4] {{Finding Title}}
|
||||
|
||||
{{Repeat High severity format}}
|
||||
|
||||
---
|
||||
|
||||
## Medium Severity Findings
|
||||
|
||||
### [F5] {{Finding Title}}
|
||||
|
||||
**Severity**: MEDIUM (CVSS {{4.0-6.9}})
|
||||
**Domain**: {{Which domain}}
|
||||
**Description**: {{Brief description}}
|
||||
**Remediation**: {{Brief fix}}
|
||||
**Owner**: {{Name}} | **Target Date**: {{DATE}}
|
||||
|
||||
---
|
||||
|
||||
### [F6] {{Finding Title}}
|
||||
|
||||
{{Repeat Medium severity format}}
|
||||
|
||||
---
|
||||
|
||||
## Low Severity Findings
|
||||
|
||||
### [F7] {{Finding Title}}
|
||||
|
||||
**Severity**: LOW (CVSS {{0.1-3.9}})
|
||||
**Description**: {{Brief description}}
|
||||
**Remediation**: {{Brief fix}}
|
||||
|
||||
---
|
||||
|
||||
### [F8] {{Finding Title}}
|
||||
|
||||
{{Repeat Low severity format}}
|
||||
|
||||
---
|
||||
|
||||
## Domain-Specific Assessment
|
||||
|
||||
### Domain 1: Authentication & Authorization
|
||||
|
||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
||||
|
||||
**Strengths**:
|
||||
- {{Positive finding 1}}
|
||||
- {{Positive finding 2}}
|
||||
|
||||
**Gaps**:
|
||||
- {{Gap 1}} — {{Impact}}
|
||||
- {{Gap 2}} — {{Impact}}
|
||||
|
||||
**Recommendations**:
|
||||
1. {{Recommendation 1}}
|
||||
2. {{Recommendation 2}}
|
||||
|
||||
---
|
||||
|
||||
### Domain 2: API Security
|
||||
|
||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
||||
|
||||
**Strengths**:
|
||||
- {{Positive finding 1}}
|
||||
- {{Positive finding 2}}
|
||||
|
||||
**Gaps**:
|
||||
- {{Gap 1}} — {{Impact}}
|
||||
- {{Gap 2}} — {{Impact}}
|
||||
|
||||
**Recommendations**:
|
||||
1. {{Recommendation 1}}
|
||||
2. {{Recommendation 2}}
|
||||
|
||||
---
|
||||
|
||||
### Domain 3: Secrets Management
|
||||
|
||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
||||
|
||||
**Strengths**:
|
||||
- {{Positive finding 1}}
|
||||
- {{Positive finding 2}}
|
||||
|
||||
**Gaps**:
|
||||
- {{Gap 1}} — {{Impact}}
|
||||
- {{Gap 2}} — {{Impact}}
|
||||
|
||||
**Recommendations**:
|
||||
1. {{Recommendation 1}}
|
||||
2. {{Recommendation 2}}
|
||||
|
||||
---
|
||||
|
||||
### Domain 4: Encryption
|
||||
|
||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
||||
|
||||
**Strengths**:
|
||||
- {{Positive finding 1}}
|
||||
- {{Positive finding 2}}
|
||||
|
||||
**Gaps**:
|
||||
- {{Gap 1}} — {{Impact}}
|
||||
- {{Gap 2}} — {{Impact}}
|
||||
|
||||
**Recommendations**:
|
||||
1. {{Recommendation 1}}
|
||||
2. {{Recommendation 2}}
|
||||
|
||||
---
|
||||
|
||||
### Domain 5: PII & Data Protection
|
||||
|
||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
||||
|
||||
**Strengths**:
|
||||
- {{Positive finding 1}}
|
||||
- {{Positive finding 2}}
|
||||
|
||||
**Gaps**:
|
||||
- {{Gap 1}} — {{Impact}}
|
||||
- {{Gap 2}} — {{Impact}}
|
||||
|
||||
**Recommendations**:
|
||||
1. {{Recommendation 1}}
|
||||
2. {{Recommendation 2}}
|
||||
|
||||
---
|
||||
|
||||
## Compliance Assessment
|
||||
|
||||
### GDPR (General Data Protection Regulation)
|
||||
|
||||
**Applicable**: {{YES | NO | PARTIAL}}
|
||||
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
|
||||
|
||||
| Requirement | Status | Finding | Gap Fix |
|
||||
|-------------|--------|---------|---------|
|
||||
| Data Encryption | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
| Access Control | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
| Retention Policy | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
| Right to Deletion | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
| Data Processing Agreement | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
|
||||
**Timeline to Compliance**: {{DATE or "Already compliant"}}
|
||||
|
||||
---
|
||||
|
||||
### PCI-DSS (Payment Card Industry Data Security Standard)
|
||||
|
||||
**Applicable**: {{YES | NO | PARTIAL}}
|
||||
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
|
||||
|
||||
| Requirement | Status | Finding | Gap Fix |
|
||||
|-------------|--------|---------|---------|
|
||||
| TLS 1.2+ | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
| Secrets Management | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
| Input Validation | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
||||
|
||||
**Timeline to Compliance**: {{DATE or "Already compliant"}}
|
||||
|
||||
---
|
||||
|
||||
### SOC 2 Type II
|
||||
|
||||
**Applicable**: {{YES | NO | PARTIAL}}
|
||||
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
|
||||
|
||||
**Gap Summary**: {{Description of gaps or "No gaps identified"}}
|
||||
|
||||
**Timeline**: {{When audit can be conducted}}
|
||||
|
||||
---
|
||||
|
||||
### Other Regulations
|
||||
|
||||
{{Any other applicable standards (HIPAA, FINRA, etc.)}}
|
||||
|
||||
---
|
||||
|
||||
## Remediation Roadmap
|
||||
|
||||
### Critical Path (Weeks 1-2)
|
||||
|
||||
**All Critical findings must be fixed before production deployment.**
|
||||
|
||||
- [ ] {{F1}} — Owner: {{Name}}, Deadline: {{DATE}}
|
||||
- [ ] {{F2}} — Owner: {{Name}}, Deadline: {{DATE}}
|
||||
|
||||
**Milestone**: Security re-scan on {{DATE}} to verify fixes
|
||||
|
||||
---
|
||||
|
||||
### Phase 2 (Weeks 3-4)
|
||||
|
||||
Complete High-severity findings:
|
||||
|
||||
- [ ] {{F3}} — Owner: {{Name}}, Deadline: {{DATE}}
|
||||
- [ ] {{F4}} — Owner: {{Name}}, Deadline: {{DATE}}
|
||||
|
||||
**Milestone**: Second security review on {{DATE}}
|
||||
|
||||
---
|
||||
|
||||
### Phase 3 (Weeks 5-8)
|
||||
|
||||
Address Medium-severity findings (can be post-production with monitoring):
|
||||
|
||||
- [ ] {{F5}} — Owner: {{Name}}, Target: {{DATE}}
|
||||
- [ ] {{F6}} — Owner: {{Name}}, Target: {{DATE}}
|
||||
|
||||
---
|
||||
|
||||
### Backlog (Next Sprint)
|
||||
|
||||
Low-severity items:
|
||||
|
||||
- [ ] {{F7}} — {{Brief description}}
|
||||
- [ ] {{F8}} — {{Brief description}}
|
||||
|
||||
---
|
||||
|
||||
## Remediation Status Tracking
|
||||
|
||||
| Finding | Owner | Deadline | Status | Last Update | Notes |
|
||||
|---------|-------|----------|--------|-------------|-------|
|
||||
| F1 | {{Name}} | {{Date}} | 🔴 Pending | {{Date}} | {{Notes}} |
|
||||
| F2 | {{Name}} | {{Date}} | 🟡 In Progress | {{Date}} | {{Notes}} |
|
||||
| F3 | {{Name}} | {{Date}} | 🟢 Complete | {{Date}} | {{Notes}} |
|
||||
|
||||
---
|
||||
|
||||
## Post-Audit Monitoring
|
||||
|
||||
### Controls to Monitor
|
||||
|
||||
{{If service proceeds to production despite findings:}}
|
||||
|
||||
- **{{Control 1}}** — Monitor via {{method}}, alert if {{threshold}}
|
||||
- **{{Control 2}}** — Monitor via {{method}}, alert if {{threshold}}
|
||||
- **{{Control 3}}** — Monitor via {{method}}, alert if {{threshold}}
|
||||
|
||||
### Incident Response
|
||||
|
||||
If a security incident occurs:
|
||||
1. Activate incident response team
|
||||
2. Notify {{Escalation contacts}}
|
||||
3. Follow {{Incident response runbook}}
|
||||
4. Conduct post-incident security review
|
||||
|
||||
---
|
||||
|
||||
## Risk Assessment Matrix
|
||||
|
||||
```
|
||||
                 LIKELIHOOD
              Low    Med    High
IMPACT
  CRITICAL     H      C      C
  HIGH         M      H      C
  MEDIUM       L      M      H
  LOW          L      L      M

Legend: C=Critical, H=High, M=Medium, L=Low
|
||||
```
|
||||
|
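The matrix can also be used as a lookup table when placing findings. A minimal sketch that encodes the cells exactly as shown above; the `risk_rating` helper and its argument spellings are hypothetical:

```python
# Rows are impact levels, columns are likelihood levels, as in the matrix.
RISK_MATRIX = {
    "CRITICAL": {"LOW": "H", "MED": "C", "HIGH": "C"},
    "HIGH":     {"LOW": "M", "MED": "H", "HIGH": "C"},
    "MEDIUM":   {"LOW": "L", "MED": "M", "HIGH": "H"},
    "LOW":      {"LOW": "L", "MED": "L", "HIGH": "M"},
}

LEGEND = {"C": "Critical", "H": "High", "M": "Medium", "L": "Low"}


def risk_rating(impact: str, likelihood: str) -> str:
    """Return the overall risk rating for an impact/likelihood pair."""
    try:
        cell = RISK_MATRIX[impact.upper()][likelihood.upper()]
    except KeyError as exc:
        raise ValueError(f"unknown impact or likelihood: {exc}") from None
    return LEGEND[cell]


print(risk_rating("High", "Med"))
```

Keeping the table as data means a change to the matrix does not require touching the lookup logic.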
||||
**Our findings map**:
|
||||
- {{F1}} — {{Position on matrix}}
|
||||
- {{F2}} — {{Position on matrix}}
|
||||
|
||||
---
|
||||
|
||||
## Positive Findings
|
||||
|
||||
**Strengths to maintain:**
|
||||
|
||||
- {{Positive 1}} — Keep doing this
|
||||
- {{Positive 2}} — Keep doing this
|
||||
- {{Positive 3}} — Keep doing this
|
||||
|
||||
---
|
||||
|
||||
## Recommendations Summary
|
||||
|
||||
### Immediate (Critical)
|
||||
- {{Fix all Critical findings}} ({{effort}})
|
||||
|
||||
### Short-term (High Priority)
|
||||
- {{Fix all High findings}} ({{effort}})
|
||||
- {{Implement automated scanning}} ({{effort}})
|
||||
- {{Set up security monitoring}} ({{effort}})
|
||||
|
||||
### Medium-term
|
||||
- Implement {{technology}} for {{purpose}} ({{effort}})
|
||||
- {{Security training for team}} ({{effort}})
|
||||
|
||||
### Long-term (Next 6 Months)
|
||||
- {{Major security initiative}} ({{effort}})
|
||||
- {{Penetration testing}} ({{effort}})
|
||||
|
||||
---
|
||||
|
||||
## Sign-offs & Approvals
|
||||
|
||||
### Audit Approval
|
||||
|
||||
- [ ] **Security Lead** ({{AUDITOR_NAME}})
|
||||
- Signature: ________________________ Date: __________
|
||||
- Assessment complete and findings documented
|
||||
|
||||
### Service Owner Acknowledgment
|
||||
|
||||
- [ ] **Service Owner** ({{SERVICE_OWNER}})
|
||||
- Signature: ________________________ Date: __________
|
||||
- Acknowledged findings and committed to remediation
|
||||
|
||||
### Compliance Officer Review
|
||||
|
||||
- [ ] **Compliance Officer** ({{NAME}})
|
||||
- Signature: ________________________ Date: __________
|
||||
- Compliance requirements verified
|
||||
|
||||
### Executive Approval (If Production Clearance Needed)
|
||||
|
||||
- [ ] **VP Engineering / Security** ({{NAME}})
|
||||
- Signature: ________________________ Date: __________
|
||||
- Risk accepted; approved for production
|
||||
|
||||
---
|
||||
|
||||
## Distribution
|
||||
|
||||
- [ ] Shared with: {{Service team, Leadership, Compliance}}
|
||||
- [ ] Date shared: {{DATE}}
|
||||
- [ ] Follow-up review scheduled: {{DATE}}
|
||||
|
||||
---
|
||||
|
||||
## Appendix: Testing Evidence
|
||||
|
||||
### Code Review Findings
|
||||
|
||||
```
|
||||
{{Code snippets demonstrating vulnerabilities}}
|
||||
```
|
||||
|
||||
### Configuration Issues
|
||||
|
||||
```
|
||||
{{Configuration examples showing gaps}}
|
||||
```
|
||||
|
||||
### Dependencies Scan
|
||||
|
||||
```
|
||||
{{Vulnerable dependencies identified}}
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
**Report Prepared By**: {{AUDITOR_NAME}}
|
||||
**Report Date**: {{DATE}}
|
||||
**Review Status**: {{Draft | Final | Approved}}
|
||||
|
|
@ -0,0 +1,91 @@
|
|||
---
|
||||
workflow_id: SA001
|
||||
workflow_name: Security Audit
|
||||
description: Comprehensive security review using security patterns, config management, and compliance framework
|
||||
entry_point: steps/step-01-scope.md
|
||||
phase: 4-cross
|
||||
lead_agent: "Hà (Security)"
|
||||
status: "active"
|
||||
created_date: 2026-03-17
|
||||
version: "1.0.0"
|
||||
estimated_duration: "2-3 hours"
|
||||
outputFile: '{output_folder}/psm-artifacts/security-audit-{{project_name}}-{{date}}.md'
|
||||
---
|
||||
|
||||
# Workflow: Security Audit
|
||||
|
||||
## Goal
|
||||
Perform a comprehensive security evaluation using the Production Systems BMAD framework, covering threat modeling, vulnerability assessment, compliance, and security controls.
|
||||
|
||||
## Overview
|
||||
|
||||
Security audit is a critical cross-functional workflow that evaluates service security posture before production deployment or for ongoing compliance verification. The audit:
|
||||
|
||||
1. **Scopes** the audit engagement, defines threat model, and identifies compliance requirements
|
||||
2. **Executes** detailed security assessment across the five domains (authentication and authorization, API security, secrets management, encryption, PII and data protection)
|
||||
3. **Reports** findings with severity levels, remediation recommendations, and compliance status
|
||||
|
||||
## Execution Path

```
START
↓
[Step 01] Scope & Threat Model (Define audit scope, identify threats, compliance reqs)
↓
[Step 02] Security Assessment (Execute checklist across domains, identify vulns)
↓
[Step 03] Security Report (Findings report, severity, recommendations, compliance)
↓
END
```

## Key Roles

| Role | Agent | Responsibility |
|------|-------|----------------|
| Lead | Hà (Security) | Lead audit, coordinate assessment, synthesize findings |
| Subject Matter | Service Owner + Platform Eng | Provide architecture, answer security questions |
| Compliance | Security/Compliance Team | Validate compliance mapping, sign-off |

## Assessment Domains (5)

1. **Authentication & Authorization** — Identity verification, access control, session management
2. **API Security** — Input validation, rate limiting, API key management, CORS
3. **Secrets Management** — Credential storage, rotation, access logging
4. **Encryption** — In-transit (TLS), at-rest, key management
5. **PII & Data Protection** — Classification, access controls, audit logging, retention

## Input Requirements

- **Service architecture diagram** — Components, data flows, external integrations
- **Authentication/authorization approach** — OAuth2, JWT, SAML, custom
- **Secrets storage mechanism** — Vault, cloud provider, environment variables
- **Compliance requirements** — GDPR, CCPA, SOC2, industry-specific
- **Known security controls** — WAF, TLS config, authentication libraries

## Output Deliverable

- **Security Audit Report** (template: `security-audit-report.template.md`)
  - Audit scope and threat model
  - Findings organized by domain with severity (Critical/High/Medium/Low)
  - Remediation recommendations with priority and effort
  - Compliance status matrix
  - Sign-off

## Success Criteria

1. All security domains assessed with clear findings
2. Severity levels assigned (using CVSS or similar framework)
3. Remediation plan with owners and deadlines
4. Compliance requirements verified (if applicable)
5. Team alignment on security posture
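
For criterion 2, one possible way to assign severity levels is to map a CVSS base score onto the qualitative rating scale defined by CVSS v3.x, which matches the Critical/High/Medium/Low levels used in the report. The function name `cvss_to_severity` is hypothetical and not part of the workflow; this is a minimal sketch, assuming the team scores findings with CVSS v3.x.

```python
def cvss_to_severity(score: float) -> str:
    """Map a CVSS v3.x base score to its qualitative severity rating.

    Bands follow the CVSS v3.x rating scale:
    0.0 None, 0.1-3.9 Low, 4.0-6.9 Medium, 7.0-8.9 High, 9.0-10.0 Critical.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS base score must be within 0.0-10.0, got {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


if __name__ == "__main__":
    # Hypothetical scores, for illustration only.
    for score in (9.8, 7.5, 5.3, 2.1):
        print(score, cvss_to_severity(score))
```

A team that uses a different framework would swap the bands but could keep the same shape: one function that turns a numeric score into the label recorded in the report.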

## Next Steps After Workflow

- If **COMPLIANT**: Document in security registry; schedule periodic re-audit
- If **NON-COMPLIANT**: Add remediation items to backlog; track closure
- If **CRITICAL ISSUES**: Consider production pause until resolved
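
The three outcomes above can be sketched as a small decision function. The `AuditResult` type and `next_action` name are hypothetical, and the precedence (critical findings are checked before compliance status) is an assumption; the workflow itself does not state an order.

```python
from dataclasses import dataclass, field


@dataclass
class AuditResult:
    """Hypothetical summary of a finished audit (not part of the workflow)."""
    compliant: bool
    severities: list = field(default_factory=list)  # e.g. ["High", "Low"]


def next_action(result: AuditResult) -> str:
    """Pick the follow-up listed under 'Next Steps After Workflow'."""
    # Assumption: any Critical finding outranks the compliance verdict.
    if "Critical" in result.severities:
        return "Consider production pause until resolved"
    if not result.compliant:
        return "Add remediation items to backlog; track closure"
    return "Document in security registry; schedule periodic re-audit"


if __name__ == "__main__":
    print(next_action(AuditResult(compliant=False, severities=["High"])))
```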

---

**Navigation**: [← Back to 4-cross](../), [Next: Step 01 →](steps/step-01-scope.md)
@ -0,0 +1,6 @@
---
name: bmad-psm-setup-new-service
description: 'Set up new production service from architecture through deployment. Use when the user says "new service" or "setup service" or "new microservice"'
---

Follow the instructions in [workflow.md](workflow.md).
@ -0,0 +1 @@
type: skill
@ -0,0 +1,116 @@
---
workflow_id: W-SETUP-SVC-001
workflow_name: Setup Production Service for BMAD
version: 6.2.0
lead_agent: "Architect Khang"
supporting_agents: ["SRE Minh", "Mary Analyst"]
phase: "1-Analysis → 2-Planning → 3-Solutioning → 4-Implementation"
created_date: 2026-03-17
last_modified: 2026-03-17
config_file: "_config/config.yaml"
estimated_duration: "12-20 hours"
outputFile: '{output_folder}/psm-artifacts/service-setup-{{project_name}}-{{date}}.md'
---

# Setup Production Service Workflow — BMAD Pattern

## Metadata & Context

**Goal**: Build a production-grade service from scratch, complete with architecture, API design, deployment pipeline, reliability patterns, security, and production readiness.

**Lead Team**:
- SRE Minh (Reliability, Infrastructure, Operations)
- Architect Khang (System Design, Technology Selection)
- Mary Analyst (Requirements, Risk Assessment)

**Success Criteria**:
- ✓ Architecture design document approved
- ✓ API contracts defined & validated
- ✓ Database schema designed & indexed
- ✓ CI/CD pipeline operational
- ✓ Resilience & observability in place
- ✓ Security & compliance verified
- ✓ Production readiness checklist passed

## Workflow Overview

This workflow runs through 6 atomic steps, each focused on its own domain:

1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection
2. **Step-02-API-Database** → API Design + Database Selection + Schema
3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy
4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling
5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config
6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision

## Configuration Loading

Loaded automatically from `_config/config.yaml`:

```yaml
project_context:
  user_name: "[loaded from config]"
  organization: "[loaded from config]"
  environment: "production"

workflow_defaults:
  communication_language: "Vietnamese"
  output_folder: "./outputs/setup-new-service-{service_name}"
  timestamp: "2026-03-17"
```

## Execution Model

### Entry Point Logic

```
1. Check if workflow.md exists in outputs folder
   → If NEW: Start from step-01-architecture.md
   → If RESUME: Load progress.yaml → auto-skip completed steps
   → If PARTIAL: Load step-N-context.yaml → resume from step N

2. For each step:
   a) Load step-{N}-{name}.md
   b) Load referenced SKILL files (auto-parse "Load:" directives)
   c) Execute MENU [A][C] options
   d) Save step output to step-{N}-output.md
   e) Move to next step

3. Final: Generate comprehensive outputs in outputs folder
```
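
The first branch of the entry point logic can be sketched in Python as follows. This is an illustration only: the function name `detect_entry_mode` is hypothetical, and the rule that `progress.yaml` takes precedence over `step-N-context.yaml` files is an assumption, since the pseudocode above does not say how RESUME and PARTIAL are told apart.

```python
import re
from pathlib import Path
from typing import Optional, Tuple


def detect_entry_mode(output_dir: Path) -> Tuple[str, Optional[int]]:
    """Return (mode, step) where mode is "NEW", "RESUME", or "PARTIAL".

    step is the step number to start from, or None when the caller must
    read progress.yaml to work out which steps are already completed.
    """
    # No workflow.md in the outputs folder means nothing has run yet.
    if not (output_dir / "workflow.md").exists():
        return ("NEW", 1)

    # Assumption: a progress.yaml file signals a resumable run.
    if (output_dir / "progress.yaml").exists():
        return ("RESUME", None)

    # Otherwise look for step-N-context.yaml files and resume at the highest N.
    steps = []
    for path in output_dir.glob("step-*-context.yaml"):
        match = re.fullmatch(r"step-(\d+)-context\.yaml", path.name)
        if match:
            steps.append(int(match.group(1)))
    if steps:
        return ("PARTIAL", max(steps))

    # workflow.md exists but no state was saved: treat as a fresh start.
    return ("NEW", 1)
```

Parsing `progress.yaml` is left to the caller so the sketch stays within the standard library.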

### State Tracking

Output document frontmatter tracks progress:

```yaml
workflow_progress:
  step_01_architecture: "completed"
  step_02_api_database: "completed"
  step_03_build_deploy: "in_progress"
  step_04_reliability: "pending"
  step_05_security_infra: "pending"
  step_06_readiness: "pending"
  last_updated: "2026-03-17T14:30:00Z"
  current_agent: "Architect Khang"
```
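
Given progress data shaped like the frontmatter above, the step to resume from is the first one that is not yet completed. A minimal sketch, assuming the frontmatter has already been parsed into a dictionary; `STEP_ORDER` and `next_step` are hypothetical names.

```python
from typing import Dict, Optional

# Step keys in execution order, taken from the frontmatter example.
STEP_ORDER = [
    "step_01_architecture",
    "step_02_api_database",
    "step_03_build_deploy",
    "step_04_reliability",
    "step_05_security_infra",
    "step_06_readiness",
]


def next_step(workflow_progress: Dict[str, str]) -> Optional[str]:
    """Return the first step that is not "completed", or None if all are done."""
    for step in STEP_ORDER:
        # A step missing from the frontmatter is treated as pending.
        if workflow_progress.get(step, "pending") != "completed":
            return step
    return None


if __name__ == "__main__":
    progress = {
        "step_01_architecture": "completed",
        "step_02_api_database": "completed",
        "step_03_build_deploy": "in_progress",
    }
    print(next_step(progress))
```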

## Mandatory Workflow Rules

1. **No skipping steps** — Each step must be executed in order
2. **Validate assumptions** — Every decision must be documented
3. **Cross-phase collaboration** — Architects + SRE + Analysts work together
4. **Output artifacts** — Each step produces tangible output documents
5. **Handoff protocol** — Context is transferred clearly between steps
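
Rule 1 can be checked mechanically against the tracked progress. The sketch below reports any step that was left unfinished while a later step is already marked completed; `find_skipped_steps` is a hypothetical helper, not something the workflow defines.

```python
from typing import Dict, List


def find_skipped_steps(progress: Dict[str, str], order: List[str]) -> List[str]:
    """Return steps that are not completed although a later step is."""
    skipped = []
    for i, step in enumerate(order):
        if progress.get(step) == "completed":
            continue
        # An unfinished step counts as skipped only if something after it finished.
        if any(progress.get(later) == "completed" for later in order[i + 1:]):
            skipped.append(step)
    return skipped


if __name__ == "__main__":
    order = ["step_01_architecture", "step_02_api_database", "step_03_build_deploy"]
    progress = {
        "step_01_architecture": "completed",
        "step_02_api_database": "pending",
        "step_03_build_deploy": "completed",
    }
    print(find_skipped_steps(progress, order))
```

An empty result means the recorded progress is consistent with in-order execution.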

## Navigation

Choose how to start:

- **[NEW]** — Start a new workflow → Load step-01
- **[RESUME]** — Return to a previously run workflow (detect progress)
- **[SKIP-TO]** — Jump to a specific step (dev-only, requires confirmation)

---

**Continue by choosing [NEW] or [RESUME]**
@ -42,6 +42,16 @@ modules:
    type: bmad-org
    npmPackage: bmad-method-test-architecture-enterprise

  bmad-production-systems:
    url: https://github.com/DoanNgocCuong/bmad-module-production-systems
    module-definition: src/module.yaml
    code: psm
    name: "Production Systems & MLOps (BMad Community Module)"
    description: "Production engineering with SRE, Security, and MLOps agents for incident response, PRR, and deployment"
    defaultSelected: false
    type: community
    npmPackage: bmad-production-systems

  whiteport-design-studio:
    url: https://github.com/bmad-code-org/bmad-method-wds-expansion
    module-definition: src/module.yaml