refactor(psm): Remove embedded PSM module - use external repo instead
PSM is now a standalone module at:
https://github.com/DoanNgocCuong/bmad-module-production-systems

It's registered in external-official-modules.yaml for installer integration.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
parent a2df51ccce
commit 0579a4f55e
@@ -1,21 +0,0 @@
# MLOps & Performance Engineer Agent Definition

agent:
  metadata:
    id: "_bmad/psm/agents/mlops.md"
    name: Linh
    title: MLOps & Performance Engineer
    icon: 🤖
    module: psm
    hasSidecar: false

  persona:
    role: MLOps Specialist + Performance Engineer
    identity: MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.
    communication_style: Data-driven, experimental. Thinks in pipelines and metrics. Ship fast, measure everything.
    principles: Reproducibility first; monitor model drift; chaos engineering validates assumptions; cost-aware optimization.

  menu:
    - trigger: MD or fuzzy match on mlops-deploy
      workflow: "skill:bmad-psm-mlops-deployment"
      description: "[MD] MLOps Deployment — Model validation, deploy, monitor"
@@ -1,21 +0,0 @@
# Security & Infrastructure Engineer Agent Definition

agent:
  metadata:
    id: "_bmad/psm/agents/security.md"
    name: Hà
    title: Security & Infrastructure Engineer
    icon: 🛡️
    module: psm
    hasSidecar: false

  persona:
    role: Security Specialist + Infrastructure Expert
    identity: Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening. Thorough and detail-oriented.
    communication_style: Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.
    principles: Zero trust architecture; defense in depth; security by default; least privilege.

  menu:
    - trigger: SA or fuzzy match on security-audit
      workflow: "skill:bmad-psm-security-audit"
      description: "[SA] Security Audit — Scope, audit, report"
@@ -1,21 +0,0 @@
# Production Standards for PSM

SRE operational standards, incident response protocols, and production quality benchmarks.

## User Specified CRITICAL Rules - Supersedes General Rules

None

## General CRITICAL RULES

### Rule 1: SLO-First Approach
ALL production decisions MUST reference defined SLOs. No optimization without measurement baseline.

### Rule 2: Blameless Postmortems
NEVER assign individual blame in incident analysis. Focus on systemic improvements.

### Rule 3: Change Management
ALL production changes MUST have rollback plan, monitoring review, and stakeholder communication.

### Rule 4: Severity Classification
SEV1: Complete outage >50% users. SEV2: Major degradation >20%. SEV3: Minor <20%. SEV4: Cosmetic.
|
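The thresholds in Rule 4 can be read as a simple decision function. The following is only a minimal sketch of that reading, not part of the removed module: the function name `classify_severity`, the `cosmetic_only` flag, and the choice to treat exactly 20% as SEV3 are assumptions, since the rule does not say which level the boundary belongs to.

```python
def classify_severity(pct_users_affected: float, cosmetic_only: bool = False) -> str:
    """Map the share of affected users (0-100) to a SEV level per Rule 4.

    Hypothetical helper: the rule text only gives percentage thresholds,
    so outage vs. degradation is approximated by the percentage alone.
    """
    if not 0 <= pct_users_affected <= 100:
        raise ValueError("pct_users_affected must be between 0 and 100")
    if cosmetic_only:
        return "SEV4"
    if pct_users_affected > 50:
        return "SEV1"
    if pct_users_affected > 20:
        return "SEV2"
    # Assumption: exactly 20% falls into SEV3; Rule 4 leaves this boundary open.
    return "SEV3"


if __name__ == "__main__":
    for pct in (75, 30, 20, 3):
        print(pct, classify_severity(pct))
```

Keeping the thresholds in one function makes it easier to notice when another document defines a different scale for the same labels.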
@@ -1,30 +0,0 @@
# Site Reliability Engineer Agent Definition

agent:
  metadata:
    id: "_bmad/psm/agents/sre.md"
    name: Minh
    title: Site Reliability Engineer
    icon: 🔧
    module: psm
    hasSidecar: true

  persona:
    role: Senior SRE + Production Operations Expert
    identity: Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.
    communication_style: Metric-driven, systematic. Translates business goals to technical SLOs. Always asks 'what is the SLO?' first.
    principles: SLO-first approach; automate everything; measure before optimizing; blameless postmortems.

  menu:
    - trigger: IR or fuzzy match on incident
      workflow: "skill:bmad-psm-incident-response"
      description: "[IR] Incident Response — Triage, diagnose, fix, postmortem"
    - trigger: PR or fuzzy match on readiness
      workflow: "skill:bmad-psm-production-readiness"
      description: "[PR] Production Readiness Review — 9-dimension assessment"
    - trigger: NS or fuzzy match on new-service
      workflow: "skill:bmad-psm-setup-new-service"
      description: "[NS] Setup New Service — Architecture to deployment"
    - trigger: QD or fuzzy match on diagnose
      workflow: "skill:bmad-psm-quick-diagnose"
      description: "[QD] Quick Diagnose — Fast production troubleshooting"
|
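Each menu entry above pairs a two-letter code with a keyword ("IR or fuzzy match on incident"). How the fuzzy match is actually performed is not stated in this diff. The sketch below assumes one plausible rule: an exact, case-insensitive match on the code first, then a substring match on the keyword with hyphens treated as spaces. The names `MENU` and `resolve_trigger` are hypothetical.

```python
from typing import Optional

# (code, keyword, workflow) triples transcribed from the SRE agent menu.
MENU = [
    ("IR", "incident", "skill:bmad-psm-incident-response"),
    ("PR", "readiness", "skill:bmad-psm-production-readiness"),
    ("NS", "new-service", "skill:bmad-psm-setup-new-service"),
    ("QD", "diagnose", "skill:bmad-psm-quick-diagnose"),
]


def resolve_trigger(user_input: str) -> Optional[str]:
    """Return the workflow for a menu code or keyword, or None if nothing matches."""
    text = user_input.strip().lower()
    # Pass 1: exact match on the two-letter code.
    for code, _keyword, workflow in MENU:
        if text == code.lower():
            return workflow
    # Pass 2 (assumed fuzzy rule): keyword appears anywhere in the input.
    normalized = text.replace("-", " ")
    for _code, keyword, workflow in MENU:
        if keyword.replace("-", " ") in normalized:
            return workflow
    return None


if __name__ == "__main__":
    print(resolve_trigger("qd"))
    print(resolve_trigger("we need a new service"))
```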
@@ -1,13 +0,0 @@
code: psm
name: "PSM: Production Systems & MLOps"
header: "BMad Production Systems Module"
subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
default_selected: false

knowledge_base_path:
  prompt:
    - "Where is your production knowledge base? (folder with SKILL.md files)"
    - "Leave default if you don't have one yet."
  default: "docs/production-knowledge"
  result: "{project-root}/{value}"
|
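The `result` field above is a template with two placeholders. As a rough illustration of how such a template could be expanded, here is a minimal sketch; the installer's real substitution logic is not shown in this diff, and the function `resolve_result` and the project root `/home/user/my-project` are hypothetical.

```python
def resolve_result(template: str, project_root: str, value: str) -> str:
    """Substitute the two placeholders seen in the `result` field."""
    return template.replace("{project-root}", project_root).replace("{value}", value)


DEFAULT_VALUE = "docs/production-knowledge"  # `default` from the config above
RESULT_TEMPLATE = "{project-root}/{value}"   # `result` from the config above

# Hypothetical project root, used only for illustration.
resolved = resolve_result(RESULT_TEMPLATE, "/home/user/my-project", DEFAULT_VALUE)

if __name__ == "__main__":
    print(resolved)
```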
@@ -1,7 +0,0 @@
module,phase,name,code,sequence,workflow-file,command,required,agent,options,description,output-location,outputs,
psm,operations,Incident Response,IR,,skill:bmad-psm-incident-response,bmad-psm-incident-response,false,sre,Operations Mode,"Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says 'production is down' or 'incident response' or 'we have an outage'.",output_folder,"incident response report",
psm,operations,Production Readiness,PR,,skill:bmad-psm-production-readiness,bmad-psm-production-readiness,false,sre,Operations Mode,"Run production readiness review across 9 dimensions. Use when the user says 'are we ready for production' or 'PRR' or 'go-live check'.",output_folder,"production readiness assessment",
psm,operations,Security Audit,SA,,skill:bmad-psm-security-audit,bmad-psm-security-audit,false,security,Operations Mode,"Run comprehensive security audit and threat assessment. Use when the user says 'security audit' or 'vulnerability assessment' or 'security review'.",output_folder,"security audit report",
psm,operations,MLOps Deployment,MD,,skill:bmad-psm-mlops-deployment,bmad-psm-mlops-deployment,false,mlops,Operations Mode,"Deploy ML model to production with validation and monitoring. Use when the user says 'deploy model' or 'ML deployment' or 'model serving'.",output_folder,"mlops deployment report",
psm,operations,Setup New Service,NS,,skill:bmad-psm-setup-new-service,bmad-psm-setup-new-service,false,sre,Operations Mode,"Set up new production service from architecture through deployment. Use when the user says 'new service' or 'setup service' or 'new microservice'.",output_folder,"service setup plan",
psm,operations,Quick Diagnose,QD,,skill:bmad-psm-quick-diagnose,bmad-psm-quick-diagnose,false,sre,Operations Mode,"Quick diagnosis of production issue with minimal latency. Use when the user says 'something is broken' or 'quick diagnose' or 'what is happening?'.",output_folder,"diagnostic report",
@@ -1,13 +0,0 @@
code: psm
name: "PSM: Production Systems & MLOps"
header: "BMad Production Systems Module"
subheader: "Production engineering workflows for incident response, production readiness, security, and MLOps."
description: "AI-driven production engineering framework with SRE, Security, and MLOps agents."
default_selected: false

knowledge_base_path:
  prompt:
    - "Where is your production knowledge base? (folder with SKILL.md files)"
    - "Leave default if you don't have one yet."
  default: "docs/production-knowledge"
  result: "{project-root}/{value}"
@@ -1,4 +0,0 @@
name,displayName,title,icon,role,identity,communicationStyle,principles,module,path
"sre","Minh","Site Reliability Engineer","🔧","Senior SRE + Production Operations Expert","Senior SRE with deep expertise in reliability, observability, and operational excellence. Obsessed with SLOs, automation, and incident response.","Metric-driven, systematic. Always asks 'what is the SLO?' first.","SLO-first; automate everything; measure before optimizing; blameless postmortems.","psm","bmad/psm/agents/sre.md"
"security","Hà","Security & Infrastructure Engineer","🛡️","Security Specialist + Infrastructure Expert","Security specialist with expertise in defense-in-depth, compliance frameworks, and infrastructure hardening.","Thorough, detail-oriented. Asks 'what if' scenarios. Thinks about edge cases and threat models.","Zero trust; defense in depth; security by default; least privilege.","psm","bmad/psm/agents/security.md"
"mlops","Linh","MLOps & Performance Engineer","🤖","MLOps Specialist + Performance Engineer","MLOps specialist bridging ML research and production. Expert in model serving, pipeline optimization, and chaos engineering.","Data-driven, experimental. 'Ship fast, measure everything.'","Reproducibility first; monitor drift; chaos engineering validates; cost-aware optimization.","psm","bmad/psm/agents/mlops.md"
@@ -1,7 +0,0 @@
# Powered by BMAD-CORE™
bundle:
  name: Production Operations Team
  icon: ⚙️
  description: Production engineering team for incident response, security, and MLOps
agents: "*"
party: "./default-party.csv"
@@ -1,6 +0,0 @@
---
name: bmad-psm-incident-response
description: 'Handle production incidents with systematic triage, diagnosis, and recovery. Use when the user says "production is down" or "incident response" or "we have an outage"'
---

Follow the instructions in [workflow.md](workflow.md).
@@ -1 +0,0 @@
type: skill
@@ -1,269 +0,0 @@
---
template_name: incident-postmortem
template_version: "1.0.0"
created_date: 2026-03-17
description: Standard postmortem template for incident analysis and learning
---

# Incident Postmortem: {{INCIDENT_TITLE}}

**Date**: {{INCIDENT_DATE}}
**Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION_MINUTES}} minutes)
**Severity**: {{SEV1|SEV2|SEV3}} ({{IMPACT_DESCRIPTION}})
**Lead**: {{INCIDENT_COMMANDER_NAME}}
**Facilitator**: {{POSTMORTEM_FACILITATOR_NAME}}

---

## Summary

[1-2 paragraph executive summary of what happened, impact, and resolution]

**Timeline at a glance**:
- T-0:00 — Normal operation
- T-{{TIME1}} — {{EVENT1}}
- T-{{TIME2}} — {{EVENT2}}
- T-{{RESOLUTION_TIME}} — Incident resolved

**Impact**: {{METRIC1}} affected {{X}} users, {{METRIC2}}, {{METRIC3}}

---

## Detailed Timeline

| Time | Event | Notes |
|------|-------|-------|
| {{T}} | {{What happened}} | {{Who detected it}} |
| {{T+X}} | {{Next event}} | {{Action taken}} |
| {{T+Y}} | {{Root cause identified}} | {{By whom}} |
| {{T+Z}} | {{Fix applied}} | {{Verification steps}} |
| {{T+Final}} | {{Incident resolved}} | {{Verification}} |

---

## Root Cause Analysis

### Primary Cause

**{{ROOT_CAUSE_TITLE}}**

{{Detailed explanation of the root cause}}

**How it happened**:
1. {{Precondition 1}} (why the system was vulnerable)
2. {{Trigger event}} (what caused the failure)
3. {{Failure cascade}} (why it got worse)
4. {{Detection lag}} (why it took X minutes to detect)

**Evidence**:
- {{Log entry or metric showing the issue}}
- {{Related system behavior}}
- {{Impact indicator}}

### Contributing Factors

- {{Factor 1}} — {{Brief explanation}}
- {{Factor 2}} — {{Brief explanation}}
- {{Factor 3}} — {{Brief explanation}}

### Why Didn't We Catch This?

- {{Missing monitoring}} — {{What metric would have alerted}}
- {{Testing gap}} — {{What test would have failed}}
- {{Documentation gap}} — {{What runbook would have helped}}
- {{Knowledge gap}} — {{What training would have helped}}

---

## Impact Assessment

### User Impact

- **Duration**: {{START_TIME}} — {{END_TIME}} ({{DURATION}} minutes)
- **Scale**: {{X}}% of {{METRIC}} (e.g., 5% of payment requests)
- **Users Affected**: {{APPROX_COUNT}} users
- **Revenue Impact**: {{$X}} (if applicable)
- **Customer Escalations**: {{NUMBER}} tickets opened

**User-facing symptoms**:
- {{Symptom 1}} (e.g., "Checkout returns 500 error")
- {{Symptom 2}} (e.g., "Page loads slowly")
- {{Symptom 3}}

### Operational Impact

- **System Recovery**: {{SERVICE/METRIC}} took {{TIME}} to recover
- **Cascading Effects**: {{SERVICE_X}} also affected due to {{reason}}
- **On-call Load**: {{NUMBER}} pages, {{NUMBER}} escalations
- **Data Loss**: {{None | {{Description}}}}

---

## Resolution & Recovery

### Immediate Actions Taken

1. **{{Time T+X}}** — {{Action 1}}
   - Rationale: {{Why this helped}}
   - Result: {{What changed}}

2. **{{Time T+Y}}** — {{Action 2}}
   - Rationale: {{Why this helped}}
   - Result: {{What changed}}

3. **{{Time T+Z}}** — {{Root Fix Applied}}
   - Details: {{Technical description}}
   - Verification: {{How we confirmed it worked}}

### Rollback/Rollforward Decision

**Decision**: {{Rollback to version X | Rollforward with fix | Hybrid approach}}

**Rationale**: {{Explain why this was the right choice}}

**Verification**: {{How we confirmed the fix worked}}

---

## Lessons Learned

### What Went Well

- {{Thing we did right}} — This prevented {{worse outcome}}
- {{Thing we did right}} — Team coordination was excellent
- {{Thing we did right}} — Monitoring caught {{something}}

### What We Can Improve

| Issue | Category | Severity | Recommendation | Owner |
|-------|----------|----------|-----------------|-------|
| {{We didn't detect it for X minutes}} | Observability | HIGH | Add alert for {{metric}} when > {{threshold}} | DevOps |
| {{Runbook was outdated}} | Runbooks | MEDIUM | Update {{runbook}} with new architecture | SRE |
| {{New service not in alerting system}} | Process | MEDIUM | Add new services to alert config automatically | Platform |
| {{Team didn't know about new feature}} | Knowledge | LOW | Document new features in wiki | Tech Lead |

---

## Action Items

### Critical (Must Complete Before Similar Incident)

- [ ] **{{Action 1}}** — {{Description}}
  - Owner: {{NAME}}
  - Deadline: {{DATE}} (within 1 week)
  - Acceptance: {{How we verify it's done}}

- [ ] **{{Action 2}}** — {{Description}}
  - Owner: {{NAME}}
  - Deadline: {{DATE}} (within 1 week)
  - Acceptance: {{How we verify it's done}}

### High Priority (Target Next 2 Weeks)

- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}
- [ ] {{Action}} — Owner: {{NAME}}, Deadline: {{DATE}}

### Medium Priority (Target This Sprint)

- [ ] {{Action}} — Owner: {{NAME}}
- [ ] {{Action}} — Owner: {{NAME}}

### Backlog (Good to Have)

- [ ] {{Action}} — {{Description}}
- [ ] {{Action}} — {{Description}}

---

## Prevention Measures

### Short-term (1-2 Weeks)

1. **{{Mitigation 1}}** — Prevents {{this exact incident}} from happening again
   - How: {{Technical approach}}
   - Effort: {{Estimate}}
   - Timeline: {{When}}

2. **{{Mitigation 2}}** — Catches similar issues earlier
   - How: {{Technical approach}}
   - Effort: {{Estimate}}
   - Timeline: {{When}}

### Long-term (Next Quarter)

1. **{{Large architectural change}}** — Eliminates root cause class
   - Rationale: {{Why this is better}}
   - Effort: {{Estimate}}
   - Timeline: {{When}}

---

## Incident Stats

```
MTTD (Mean Time To Detect): {{MINUTES}} minutes
  - Automatic detection: {{If applicable, how}}
  - Manual detection: {{Who found it}}

MTTR (Mean Time To Resolve): {{MINUTES}} minutes
  - Investigation time: {{MINUTES}}
  - Fix implementation time: {{MINUTES}}
  - Verification time: {{MINUTES}}

Severity: {{SEV1|SEV2|SEV3}} ({{Criteria}})
```

---

## Distribution & Follow-up

- [x] Postmortem shared with: {{TEAM_LIST}}
- [x] Customer communication sent: {{YES|NO|TEMPLATE_USED}}
- [x] Action items tracked in: {{JIRA/BACKLOG}}
- [x] Follow-up review scheduled: {{DATE}}

**Follow-up Review**: {{DATE}} with {{ATTENDEES}}
- Confirm all critical action items completed
- Verify prevention measures working
- Check for recurring patterns

---

## Appendix: Supporting Evidence

### Logs

```
[Relevant log entries showing the incident]

{{TIMESTAMP}} ERROR: {{MESSAGE}}
{{TIMESTAMP}} ERROR: {{MESSAGE}}
```

### Metrics

[Include screenshots or links to metric dashboards showing the incident]

- Error rate spike: [Chart or metric]
- Latency spike: [Chart or metric]
- Traffic pattern: [Chart or metric]

### Configuration Changes

```yaml
# Changes made before incident
- {{Change 1}} ({{TIMESTAMP}})
- {{Change 2}} ({{TIMESTAMP}})
```

---

**Document Completed By**: {{NAME}}
**Date**: {{DATE}}
**Review Status**: Draft | Final | Approved

**Approvals**:
- [ ] Incident Commander: {{NAME}} {{DATE}}
- [ ] Service Owner: {{NAME}} {{DATE}}
- [ ] VP Engineering (if SEV1): {{NAME}} {{DATE}}
@@ -1,163 +0,0 @@
---
workflow_id: W-INCIDENT-001
workflow_name: Production Incident Response
version: 6.2.0
lead_agent: "SRE Minh"
supporting_agents: ["Architect Khang", "Mary Analyst"]
phase: "3-Run: Emergency Response & Recovery"
created_date: 2026-03-17
last_modified: 2026-03-17
config_file: "_config/config.yaml"
estimated_duration: "15 minutes to 2 hours (depending on severity)"
outputFile: '{output_folder}/psm-artifacts/incident-{{project_name}}-{{date}}.md'
---

# Production Incident Response Workflow — BMAD Pattern

## Metadata & Context

**Goal**: Triage, diagnose, and resolve production incidents systematically, applying fixes with verification. This is the most critical workflow: minimize MTTR (Mean Time To Recovery) while maintaining system stability.

**Lead Team**:
- SRE Minh (Incident Command, Recovery Orchestration)
- Architect Khang (Root Cause Analysis, System-wide Impact)
- Mary Analyst (Impact Assessment, Post-Incident Review)

**Success Criteria**:
- ✓ Incident severity classified within 5 minutes
- ✓ Root cause identified within first triage pass
- ✓ Fix applied and verified
- ✓ System metrics returned to baseline
- ✓ Incident postmortem documented with action items
- ✓ Prevention measures identified

## Workflow Overview

This workflow runs through 4 atomic steps, each focused on a different phase:

1. **Step-01-Triage** → Gather initial info, assess severity, classify impact
2. **Step-02-Diagnose** → Systematic diagnosis using observability data (logs, metrics, traces)
3. **Step-03-Fix** → Apply fix, verify resolution, validate recovery
4. **Step-04-Postmortem** → Document incident, identify action items, prevent recurrence

## Configuration Loading

Loaded automatically from `_config/config.yaml`:

```yaml
project_context:
  organization: "[loaded from config]"
  environment: "production"
  incident_channel: "slack:#incidents"

workflow_defaults:
  communication_language: "Vietnamese-English"
  severity_levels: ["SEV1", "SEV2", "SEV3", "SEV4"]
  escalation_contacts: "[loaded from config]"
  on_call_engineer: "[loaded from config]"
```

## Workflow Architecture - Micro-File Design

BMAD pattern: each step is a separate file, loaded just-in-time. Workflow chain:

```
workflow.md (entry point)
    ↓
step-01-triage.md (classify severity, initial assessment)
    ↓
step-02-diagnose.md (root cause analysis)
    ↓
step-03-fix.md (apply fix, verify)
    ↓
step-04-postmortem.md (document, prevent)
    ↓
incident-response-summary.md (final output)
```

**Key Benefits**:
- Single-step focus — engineer concentrates on one phase
- Knowledge isolation — load only relevant SKILL docs per step
- State tracking — save progress after each step
- Easy resumption — if interrupted, restart from exact step

## Skill References

This workflow loads knowledge from:

- **5.07 Reliability & Resilience** → Circuit breaker patterns, fallback strategies, timeout management
- **5.08 Observability & Monitoring** → Structured logging, metrics queries, distributed tracing
- **5.09 Error Handling & Recovery** → Error classification, graceful degradation patterns
- **5.10 Production Readiness** → Incident prevention checklist, alerting setup
- **5.14 Documentation & Runbooks** → Postmortem templates, incident reports

## Execution Model

### Entry Point Logic

```
1. Check if incident session exists
   → If NEW incident: Start from step-01-triage.md
   → If ONGOING: Load incident-session.yaml → continue from last completed step
   → If RESOLVED: Load postmortem template

2. For each step:
   a) Load step-{N}-{name}.md
   b) Load referenced SKILL files (auto-parse "Load:" directives)
   c) Execute MENU [A][C] options
   d) Save step output to step-{N}-output.md + incident-context.yaml
   e) Move to next step or conclude

3. Final: Generate incident report + postmortem in outputs folder
```
|
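The NEW / ONGOING / RESOLVED branching in the entry point logic can be sketched as a small function. This is a rough illustration under stated assumptions, not the workflow's actual loader: the function `next_step`, the session dictionary shape, and the `last_completed_step` key are hypothetical, while the step names come from the workflow chain.

```python
from typing import Optional

# Step file stems, in the order given by the workflow chain.
STEPS = [
    "step-01-triage",
    "step-02-diagnose",
    "step-03-fix",
    "step-04-postmortem",
]


def next_step(session: Optional[dict]) -> Optional[str]:
    """Pick the step to load next; None means every step is already complete."""
    if session is None:
        # NEW incident: no session file exists yet.
        return STEPS[0]
    if session.get("status") == "resolved":
        # RESOLVED: go straight to the postmortem.
        return STEPS[-1]
    # ONGOING: continue after the last completed step (hypothetical key).
    last_done = session.get("last_completed_step")
    if last_done is None:
        return STEPS[0]
    idx = STEPS.index(last_done) + 1
    return STEPS[idx] if idx < len(STEPS) else None


if __name__ == "__main__":
    print(next_step(None))
    print(next_step({"status": "diagnosing", "last_completed_step": "step-01-triage"}))
```

Deriving the next step from saved state, rather than from memory of the conversation, is what makes resumption after an interruption deterministic.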
### State Tracking

Incident session frontmatter tracks progress:

```yaml
incident_context:
  incident_id: "INC-2026-03-17-001"
  severity: "SEV1" | "SEV2" | "SEV3" | "SEV4"
  status: "triage" → "diagnosing" → "recovering" → "resolved" → "postmortem"
  affected_services: ["service-1", "service-2"]
  started_at: "2026-03-17T14:30:00Z"
  timeline:
    detected_at: "2026-03-17T14:30:00Z"
    triage_completed_at: "2026-03-17T14:35:00Z"
    root_cause_identified_at: "2026-03-17T14:50:00Z"
    fix_applied_at: "2026-03-17T15:10:00Z"
    resolved_at: "2026-03-17T15:15:00Z"
  current_step: "step-02-diagnose"
  last_updated: "2026-03-17T14:50:00Z"
  incident_commander: "SRE Minh"
```
|
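The timestamps in the `timeline` block are enough to derive per-phase durations for the postmortem. Below is a minimal sketch using only the sample values shown in the state-tracking example; the helper names `parse_utc` and `minutes_between` are hypothetical, and the phase breakdown (investigation, fix, verification) is one reasonable way to slice the timeline, not something the workflow prescribes.

```python
from datetime import datetime

# Sample values copied from the state-tracking example.
timeline = {
    "detected_at": "2026-03-17T14:30:00Z",
    "triage_completed_at": "2026-03-17T14:35:00Z",
    "root_cause_identified_at": "2026-03-17T14:50:00Z",
    "fix_applied_at": "2026-03-17T15:10:00Z",
    "resolved_at": "2026-03-17T15:15:00Z",
}


def parse_utc(ts: str) -> datetime:
    """Parse an ISO 8601 timestamp; a trailing 'Z' is rewritten for older Pythons."""
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


def minutes_between(tl: dict, start_key: str, end_key: str) -> float:
    """Elapsed minutes between two timeline entries."""
    delta = parse_utc(tl[end_key]) - parse_utc(tl[start_key])
    return delta.total_seconds() / 60


triage_minutes = minutes_between(timeline, "detected_at", "triage_completed_at")
investigation_minutes = minutes_between(timeline, "detected_at", "root_cause_identified_at")
fix_minutes = minutes_between(timeline, "root_cause_identified_at", "fix_applied_at")
verification_minutes = minutes_between(timeline, "fix_applied_at", "resolved_at")
time_to_resolve = minutes_between(timeline, "detected_at", "resolved_at")

if __name__ == "__main__":
    print(f"triage: {triage_minutes:.0f} min, resolve: {time_to_resolve:.0f} min")
```

Because the three phase durations are measured between consecutive timestamps, they sum to the overall time to resolve.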
## Mandatory Workflow Rules

1. **Speed first** — Triage must complete in < 5 minutes
2. **Root cause identification** — Must identify root cause before fix attempt
3. **Verify before declaring resolved** — Check metrics + user reports
4. **Document everything** — Every action logged for postmortem
5. **Escalation protocol** — SEV1 → Page on-call architect immediately
6. **Communication** — Update stakeholders every 5-10 minutes
7. **No flying blind** — All fixes must reference observability data

## Severity Scale

- **SEV1** — Service completely down, revenue impact, > 1% users affected → Page all on-call
- **SEV2** — Major degradation, significant users affected, partial functionality down
- **SEV3** — Moderate impact, some users affected, workaround possible
- **SEV4** — Minor issue, limited users, can defer to business hours

## Navigation

Choose how to start:

- **[NEW-INC]** — Report new incident → Load step-01-triage
- **[RESUME-INC]** — Continue existing incident (detect progress from incident-session.yaml)
- **[ESCALATE]** — Escalate to on-call architect

---

**Report the incident status or choose [NEW-INC] to begin triage**
@@ -1,6 +0,0 @@
---
name: bmad-psm-mlops-deployment
description: 'Deploy ML model to production with validation and monitoring. Use when the user says "deploy model" or "ML deployment" or "model serving"'
---

Follow the instructions in [workflow.md](workflow.md).
@@ -1 +0,0 @@
type: skill
@@ -1,89 +0,0 @@
---
workflow_id: MLOPS001
workflow_name: MLOps Deployment
description: Deploy ML model to production with validation, serving, and monitoring
entry_point: steps/step-01-model-validation.md
phase: 5-specialized
lead_agent: "Linh (MLOps)"
status: "active"
created_date: 2026-03-17
version: "1.0.0"
estimated_duration: "3-4 hours"
outputFile: '{output_folder}/psm-artifacts/mlops-deploy-{{project_name}}-{{date}}.md'
---

# Workflow: MLOps Deployment

## Goal
Deploy machine learning models to production with comprehensive validation, infrastructure setup, and post-deployment monitoring.

## Overview

MLOps deployment ensures ML models are production-ready and continuously monitored for performance and data drift. The workflow:

1. **Validates** model quality, performance metrics, and data drift detection
2. **Deploys** model to serving infrastructure with versioning and A/B testing
3. **Monitors** model performance, data drift, and cost metrics post-deployment

## Execution Path

```
START
  ↓
[Step 01] Model Validation (Check metrics, data drift, A/B test plan)
  ↓
[Step 02] Deploy Model (Setup serving, infrastructure, GPU optimization)
  ↓
[Step 03] Monitor (Langfuse/MLflow, drift detection, cost tracking)
  ↓
END
```

## Key Roles

| Role | Agent | Responsibility |
|------|-------|-----------------|
| Lead | Linh (MLOps) | Coordinate deployment, monitor model health |
| Data Scientist | Data Lead | Validate model quality, approve for production |
| DevOps | Platform Eng | Setup infrastructure, manage resources |

## Validation Gates (3)

1. **Model Quality** — Accuracy, precision, recall metrics meet SLO
2. **Data Quality** — No data drift detected; training/production data distribution aligned
3. **Business Readiness** — A/B test plan ready, rollback strategy defined
|
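The three validation gates can be expressed as a single pre-deployment check. The sketch below is one possible shape for that check, offered under stated assumptions: `GateResult`, `evaluate_gates`, and `ready_to_deploy` are hypothetical names, the SLO thresholds and metric values are made up for illustration, and drift detection is reduced to a boolean that some upstream job is assumed to supply.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class GateResult:
    name: str
    passed: bool
    detail: str


def evaluate_gates(
    metrics: Dict[str, float],
    slo: Dict[str, float],
    drift_detected: bool,
    ab_test_plan_ready: bool,
    rollback_defined: bool,
) -> List[GateResult]:
    """Evaluate the three gates; each metric must be at or above its SLO floor."""
    failing = [k for k, floor in slo.items() if metrics.get(k, 0.0) < floor]
    return [
        GateResult("Model Quality", not failing,
                   "all metrics meet SLO" if not failing else f"below SLO: {failing}"),
        GateResult("Data Quality", not drift_detected,
                   "drift detected" if drift_detected else "no drift detected"),
        GateResult("Business Readiness", ab_test_plan_ready and rollback_defined,
                   "A/B plan and rollback strategy checked"),
    ]


def ready_to_deploy(results: List[GateResult]) -> bool:
    """Deployment proceeds only if every gate passed."""
    return all(r.passed for r in results)


# Hypothetical thresholds and measurements, for illustration only.
SLO = {"accuracy": 0.90, "precision": 0.85, "recall": 0.80}
candidate = {"accuracy": 0.93, "precision": 0.88, "recall": 0.79}

results = evaluate_gates(candidate, SLO, drift_detected=False,
                         ab_test_plan_ready=True, rollback_defined=True)

if __name__ == "__main__":
    for r in results:
        print(f"{r.name}: {'PASS' if r.passed else 'FAIL'} ({r.detail})")
    print("ready:", ready_to_deploy(results))
```

Returning a result per gate, instead of one boolean, keeps the failure reason available for the deployment report.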
## Input Requirements

- **Trained model artifact** — Model checkpoint, weights, configuration
- **Performance metrics** — Baseline accuracy, latency, throughput expectations
- **Data validation** — Training dataset description, expected data distribution
- **Serving infrastructure** — Compute requirements (GPU/CPU), latency targets

## Output Deliverable

- **MLOps Deployment Report**
  - Model version and metadata
  - Performance validation summary
  - Serving infrastructure setup
  - Monitoring dashboard and alerts
  - Data drift detection configuration

## Success Criteria

1. Model passes all quality gates before deployment
2. Serving infrastructure deployed and load-tested
3. Monitoring and alerting configured and validated
4. Rollback strategy tested and documented
5. Team trained on model updates and incident response

## Next Steps After Workflow

- Monitor model performance daily for first week
- Track data drift metrics; alert if detected
- Plan model retraining based on performance degradation
- Document lessons learned in MLOps runbook

---

**Navigation**: [← Back to 5-specialized](../), [Next: Step 01 →](steps/step-01-model-validation.md)
@@ -1,6 +0,0 @@
|
||||||
---
|
|
||||||
name: bmad-psm-production-readiness
|
|
||||||
description: 'Run production readiness review across 9 dimensions. Use when the user says "are we ready for production" or "PRR" or "go-live check"'
|
|
||||||
---
|
|
||||||
|
|
||||||
Follow the instructions in [workflow.md](workflow.md).
|
|
||||||
|
|
@@ -1 +0,0 @@
|
||||||
type: skill
|
|
||||||
|
|
@@ -1,367 +0,0 @@
|
||||||
---
|
|
||||||
template_name: production-readiness-checklist
|
|
||||||
template_version: "1.0.0"
|
|
||||||
created_date: 2026-03-17
|
|
||||||
description: Production Readiness Review checklist and report template
|
|
||||||
---
|
|
||||||
|
|
||||||
# Production Readiness Review (PRR)
|
|
||||||
|
|
||||||
**Service**: {{SERVICE_NAME}}
|
|
||||||
**Owner**: {{SERVICE_OWNER}}
|
|
||||||
**Reviewer**: {{SRE_LEAD}} (Minh)
|
|
||||||
**Review Date**: {{DATE}}
|
|
||||||
**Target Go-Live**: {{TARGET_DATE}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Executive Summary
|
|
||||||
|
|
||||||
{{1-2 paragraphs summarizing the readiness assessment, decision, and key findings}}
|
|
||||||
|
|
||||||
**Overall Assessment**: {{READY | CONDITIONAL | NOT_READY}}
|
|
||||||
|
|
||||||
**Timeline**: Service {{can | can conditionally | cannot}} proceed to production {{on {{DATE}}}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Production Readiness Scorecard
|
|
||||||
|
|
||||||
### 9-Dimension Assessment
|
|
||||||
|
|
||||||
| # | Dimension | Score | Status | Key Finding |
|---|-----------|-------|--------|-------------|
| 1 | Reliability | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 2 | Observability | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 3 | Performance | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 4 | Security | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 5 | Capacity | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 6 | Data | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 7 | Runbooks | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 8 | Dependencies | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
| 9 | Rollback | {{GREEN/YELLOW/RED}} | ✅/⚠️/❌ | {{Brief finding}} |
|
|
||||||
|
|
||||||
**Summary**: {{X}} GREEN, {{Y}} YELLOW, {{Z}} RED
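The summary line can be derived mechanically from the nine dimension scores. The sketch below tallies them; `summarize_scorecard` and the example scores are hypothetical, and the template itself does not prescribe any tooling.

```python
from collections import Counter

DIMENSIONS = [
    "Reliability", "Observability", "Performance", "Security", "Capacity",
    "Data", "Runbooks", "Dependencies", "Rollback",
]


def summarize_scorecard(scores: dict[str, str]) -> str:
    """Build the 'X GREEN, Y YELLOW, Z RED' summary line."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"Unscored dimensions: {', '.join(missing)}")
    counts = Counter(scores[d] for d in DIMENSIONS)
    unknown = set(counts) - {"GREEN", "YELLOW", "RED"}
    if unknown:
        raise ValueError(f"Unknown scores: {sorted(unknown)}")
    return f"{counts['GREEN']} GREEN, {counts['YELLOW']} YELLOW, {counts['RED']} RED"


# Illustrative input only: seven green dimensions, one yellow, one red.
example = dict.fromkeys(DIMENSIONS, "GREEN") | {"Security": "YELLOW", "Rollback": "RED"}
print(summarize_scorecard(example))  # 7 GREEN, 1 YELLOW, 1 RED
```

Rejecting unscored dimensions up front keeps the summary from silently undercounting when a reviewer skips a row.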
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Detailed Findings by Dimension
|
|
||||||
|
|
||||||
### 1. Reliability
|
|
||||||
|
|
||||||
**Goal**: Service meets SLO targets with documented failure modes and incident response plan.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 2. Observability
|
|
||||||
|
|
||||||
**Goal**: Service has comprehensive logging, metrics, tracing, and dashboards for operational visibility.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 3. Performance
|
|
||||||
|
|
||||||
**Goal**: Service meets latency/throughput targets and scales under expected load.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 4. Security
|
|
||||||
|
|
||||||
**Goal**: Authentication, authorization, encryption, and secrets management are implemented.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 5. Capacity
|
|
||||||
|
|
||||||
**Goal**: Resource requirements defined with growth headroom and cost acceptable.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 6. Data
|
|
||||||
|
|
||||||
**Goal**: Data governance, backup, retention, and disaster recovery documented and tested.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 7. Runbooks
|
|
||||||
|
|
||||||
**Goal**: Incident response, deployment, troubleshooting procedures documented and drilled.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 8. Dependencies
|
|
||||||
|
|
||||||
**Goal**: External/internal dependencies mapped, versioned, with fallback strategies.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### 9. Rollback
|
|
||||||
|
|
||||||
**Goal**: Safe rollback strategy tested; deployment is reversible.
|
|
||||||
|
|
||||||
**Findings**:
|
|
||||||
|
|
||||||
- [ ] {{Finding 1}} ({{Status}})
|
|
||||||
- [ ] {{Finding 2}} ({{Status}})
|
|
||||||
- [ ] {{Finding 3}} ({{Status}})
|
|
||||||
|
|
||||||
**Assessment**: {{Detailed narrative, 3-5 sentences}}
|
|
||||||
|
|
||||||
**Score**: {{GREEN|YELLOW|RED}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Critical Blockers (P0)
|
|
||||||
|
|
||||||
{{If any P0 blockers exist:}}
|
|
||||||
|
|
||||||
Service **CANNOT** proceed to production until these are resolved:
|
|
||||||
|
|
||||||
### P0 Blocker #1: {{ISSUE_TITLE}}
|
|
||||||
|
|
||||||
- **Dimension**: {{Which dimension}}
|
|
||||||
- **Description**: {{What's the problem}}
|
|
||||||
- **Impact**: {{Why it's critical}}
|
|
||||||
- **Resolution**: {{How to fix}}
|
|
||||||
- **Owner**: {{Who must fix it}}
|
|
||||||
- **Deadline**: {{When it must be done}}
|
|
||||||
- **Acceptance**: {{How we verify it's fixed}}
|
|
||||||
|
|
||||||
### P0 Blocker #2: {{ISSUE_TITLE}}
|
|
||||||
|
|
||||||
{{Repeat format}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Risks to Manage (P1)
|
|
||||||
|
|
||||||
Service can proceed with documented monitoring and contingency plans:
|
|
||||||
|
|
||||||
### P1 Risk #1: {{ISSUE_TITLE}}
|
|
||||||
|
|
||||||
- **Dimension**: {{Which dimension}}
|
|
||||||
- **Description**: {{What's the problem}}
|
|
||||||
- **Impact**: {{If it happens, what's the consequence}}
|
|
||||||
- **Likelihood**: {{HIGH|MEDIUM|LOW}}
|
|
||||||
- **Mitigation**: {{How we'll manage it}}
|
|
||||||
- **Monitoring**: {{What metrics to watch}}
|
|
||||||
- **Contingency**: {{What we'll do if it occurs}}
|
|
||||||
- **Owner**: {{Who owns this risk}}
|
|
||||||
- **Target Fix**: {{Timeline to resolve permanently}}
|
|
||||||
|
|
||||||
### P1 Risk #2: {{ISSUE_TITLE}}
|
|
||||||
|
|
||||||
{{Repeat format}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Recommendations
|
|
||||||
|
|
||||||
**High Priority** (Next sprint):
|
|
||||||
- {{Recommendation 1}}
|
|
||||||
- {{Recommendation 2}}
|
|
||||||
|
|
||||||
**Medium Priority** (Within 1 month):
|
|
||||||
- {{Recommendation 1}}
|
|
||||||
- {{Recommendation 2}}
|
|
||||||
|
|
||||||
**Nice to Have** (Backlog):
|
|
||||||
- {{Recommendation 1}}
|
|
||||||
- {{Recommendation 2}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Final Decision
|
|
||||||
|
|
||||||
### Decision
|
|
||||||
|
|
||||||
**{{ ✅ GO | ⚠️ CONDITIONAL-GO | ❌ NO-GO }}**
|
|
||||||
|
|
||||||
### Rationale
|
|
||||||
|
|
||||||
{{Explain the decision. Why can/can't we proceed?}}
|
|
||||||
|
|
||||||
### Conditions (If CONDITIONAL-GO)
|
|
||||||
|
|
||||||
If proceeding despite P1 risks, document conditions:
|
|
||||||
|
|
||||||
1. **{{Condition 1}}**: {{Description}}
|
|
||||||
- Owner: {{Who oversees this}}
|
|
||||||
- Success Criteria: {{How we verify it}}
|
|
||||||
- Escalation: {{Who to contact if issues}}
|
|
||||||
|
|
||||||
2. **{{Condition 2}}**: {{Description}}
|
|
||||||
- Owner: {{Who oversees this}}
|
|
||||||
- Success Criteria: {{How we verify it}}
|
|
||||||
- Escalation: {{Who to contact if issues}}
|
|
||||||
|
|
||||||
### Deployment Timeline
|
|
||||||
|
|
||||||
{{If GO or CONDITIONAL-GO:}}
|
|
||||||
|
|
||||||
- **Approved for deployment**: {{DATE}}
|
|
||||||
- **Earliest go-live**: {{DATE}}
|
|
||||||
- **Recommended window**: {{DATE/TIME}}
|
|
||||||
- **On-call coverage required**: {{YES|NO}}
|
|
||||||
- **Emergency rollback plan**: {{REFERENCE TO RUNBOOK}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Sign-offs & Approvals
|
|
||||||
|
|
||||||
### Approval Chain
|
|
||||||
|
|
||||||
- [ ] **SRE Lead** ({{NAME}}) — Review completed and findings approved
|
|
||||||
- Signature: ________________________ Date: __________
|
|
||||||
|
|
||||||
- [ ] **Architecture Lead** ({{NAME}}) — Architecture validated
|
|
||||||
- Signature: ________________________ Date: __________
|
|
||||||
|
|
||||||
- [ ] **Service Owner** ({{NAME}}) — Acknowledged findings and committed to actions
|
|
||||||
- Signature: ________________________ Date: __________
|
|
||||||
|
|
||||||
- [ ] **VP Engineering** ({{NAME}}) — Risk accepted (if CONDITIONAL-GO)
|
|
||||||
- Signature: ________________________ Date: __________
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Post-Production Plan
|
|
||||||
|
|
||||||
### First 24 Hours
|
|
||||||
|
|
||||||
- [ ] SRE on-call monitoring closely
|
|
||||||
- [ ] Daily standup with service team
|
|
||||||
- [ ] Monitor for any unusual patterns
|
|
||||||
- [ ] Be ready to roll back if needed
|
|
||||||
|
|
||||||
### First Week
|
|
||||||
|
|
||||||
- [ ] Daily metrics review
|
|
||||||
- [ ] Watch for data drift or unusual behavior
|
|
||||||
- [ ] Follow up on any P1 risks
|
|
||||||
|
|
||||||
### Ongoing
|
|
||||||
|
|
||||||
- [ ] Monthly PRR follow-ups to verify improvements
|
|
||||||
- [ ] Track action items to completion
|
|
||||||
- [ ] Update this PRR if significant changes are made
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Action Items
|
|
||||||
|
|
||||||
| ID | Action | Owner | Deadline | Type | Status |
|----|--------|-------|----------|------|--------|
| A1 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER/RISK/RECOMMENDATION}} | ☐ |
| A2 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER/RISK/RECOMMENDATION}} | ☐ |
| A3 | {{Action}} | {{Name}} | {{Date}} | {{BLOCKER/RISK/RECOMMENDATION}} | ☐ |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Appendix
|
|
||||||
|
|
||||||
### A. Load Test Results
|
|
||||||
|
|
||||||
[Link to or summary of load test results showing service meets performance targets]
|
|
||||||
|
|
||||||
### B. Security Review Results
|
|
||||||
|
|
||||||
[Link to or summary of security audit findings]
|
|
||||||
|
|
||||||
### C. Architecture Diagrams
|
|
||||||
|
|
||||||
[Include or link to system architecture, data flow, and deployment topology]
|
|
||||||
|
|
||||||
### D. SLO Definition
|
|
||||||
|
|
||||||
[Document the agreed-upon SLO targets for availability, latency, error rate]
|
|
||||||
|
|
||||||
### E. Runbooks
|
|
||||||
|
|
||||||
[Link to or list of key runbooks: incident response, deployment, rollback, troubleshooting]
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Report prepared by**: {{SRE_LEAD}}
|
|
||||||
**Report date**: {{DATE}}
|
|
||||||
**Last updated**: {{DATE}}
|
|
||||||
|
|
@@ -1,92 +0,0 @@
|
||||||
---
|
|
||||||
workflow_id: PRR001
|
|
||||||
workflow_name: Production Readiness Review
|
|
||||||
description: Validate service is ready for production using comprehensive readiness checklist
|
|
||||||
entry_point: steps/step-01-init-checklist.md
|
|
||||||
phase: 3-run
|
|
||||||
lead_agent: "Minh (SRE)"
|
|
||||||
status: "active"
|
|
||||||
created_date: 2026-03-17
|
|
||||||
version: "1.0.0"
|
|
||||||
estimated_duration: "2-3 hours"
|
|
||||||
outputFile: '{output_folder}/psm-artifacts/prr-{{project_name}}-{{date}}.md'
|
|
||||||
---
|
|
||||||
|
|
||||||
# Workflow: Production Readiness Review (PRR)
|
|
||||||
|
|
||||||
## Goal
|
|
||||||
Validate and certify that a service meets production readiness standards across 9 key dimensions before deployment.
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
This workflow systematically evaluates a service against production readiness criteria defined in the Production Systems BMAD skill framework. Using SRE expertise and architectural patterns, the workflow:
|
|
||||||
|
|
||||||
1. **Initializes** the PRR process with service context and dimensional overview
|
|
||||||
2. **Deep reviews** each dimension (reliability, observability, performance, security, capacity, data, runbooks, dependencies, rollback)
|
|
||||||
3. **Renders final decision** with GO/NO-GO/CONDITIONAL-GO recommendation
|
|
||||||
|
|
||||||
## Execution Path
|
|
||||||
|
|
||||||
```
START
↓
[Step 01] Init Checklist (Load framework, gather service context, present dimensions)
↓
[Step 02] Deep Review (Score each dimension, identify blockers, recommendations)
↓
[Step 03] Final Decision (Scorecard, decision, action items, DONE)
↓
END
```
|
|
||||||
|
|
||||||
## Key Roles
|
|
||||||
|
|
||||||
| Role | Agent | Responsibility |
|------|-------|-----------------|
| Lead | Minh (SRE) | Navigate workflow, coordinate review, make final call |
| Subject Matter | Service Owner | Provide service context, clarify architecture |
| Review Committee | Arch, SecOps, MLOps | Contribute expertise on specific dimensions |
|
|
||||||
|
|
||||||
## Dimensions Evaluated (9)
|
|
||||||
|
|
||||||
1. **Reliability** — SLA/SLO definition, error budgets, failure modes, incident response
|
|
||||||
2. **Observability** — Logging, metrics, tracing, dashboards, alerting
|
|
||||||
3. **Performance** — Latency targets, throughput, P99 tail behavior, optimization opportunities
|
|
||||||
4. **Security** — Auth/authz, secrets management, encryption, audit logging, compliance
|
|
||||||
5. **Capacity** — Resource limits, scaling policies, burst capacity, cost projections
|
|
||||||
6. **Data** — Schema versioning, backup/restore, data governance, retention policies
|
|
||||||
7. **Runbooks** — Incident runbooks, operational playbooks, troubleshooting guides
|
|
||||||
8. **Dependencies** — External services, internal libraries, database versioning, API contracts
|
|
||||||
9. **Rollback** — Rollback strategy, canary deployment, feature flags, smoke tests
|
|
||||||
|
|
||||||
## Input Requirements
|
|
||||||
|
|
||||||
- **Service name and owner** — Which service are we evaluating?
|
|
||||||
- **Current architecture** — High-level design, tech stack, topology
|
|
||||||
- **Existing metrics/dashboards** — Links to monitoring, SLO definitions
|
|
||||||
- **Known gaps/risks** — Already identified issues to address
|
|
||||||
|
|
||||||
## Output Deliverable
|
|
||||||
|
|
||||||
- **Production Readiness Checklist** (template: `production-readiness.template.md`)
|
|
||||||
- Scorecard with 9 dimensions (red/yellow/green)
|
|
||||||
- Blockers and recommendations per dimension
|
|
||||||
- Final GO/NO-GO/CONDITIONAL-GO decision
|
|
||||||
- Explicit action items with owners and deadlines
|
|
||||||
|
|
||||||
## Success Criteria
|
|
||||||
|
|
||||||
1. All 9 dimensions evaluated with clear rationale
|
|
||||||
2. Blockers categorized as P0 (must fix) or P1 (should fix)
|
|
||||||
3. Team alignment on decision (documented in PRR report)
|
|
||||||
4. Action plan with clear accountability and timeline
|
|
||||||
|
|
||||||
## Next Steps After Workflow
|
|
||||||
|
|
||||||
- If **GO**: Proceed to deployment; document in CHANGELOG
- If **NO-GO**: Reschedule PRR once blockers are addressed; track in backlog
- If **CONDITIONAL-GO**: Deploy with documented caveats; set up monitoring for risk areas
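The three outcomes follow from how findings are classified. As a rough sketch, assuming the rule that any open P0 (must fix) blocks go-live and any open P1 (should fix) makes approval conditional, the decision could be derived as follows. The `Finding` type and the `prr_decision` function are illustrative names, not part of the workflow.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    title: str
    priority: str  # "P0" (must fix) or "P1" (should fix)
    resolved: bool = False


def prr_decision(findings: list[Finding]) -> str:
    """Map open findings to GO / CONDITIONAL-GO / NO-GO (assumed rule)."""
    open_priorities = {f.priority for f in findings if not f.resolved}
    if "P0" in open_priorities:
        return "NO-GO"
    if "P1" in open_priorities:
        return "CONDITIONAL-GO"
    return "GO"
```

In practice the lead agent makes the final call; a helper like this only makes the default outcome explicit so that exceptions have to be justified in the report.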
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Navigation**: [← Back to 3-run](../), [Next: Step 01 →](steps/step-01-init-checklist.md)
|
|
||||||
|
|
@@ -1,6 +0,0 @@
|
||||||
---
|
|
||||||
name: bmad-psm-quick-diagnose
|
|
||||||
description: 'Quick diagnosis of production issue with minimal latency. Use when the user says "something is broken" or "quick diagnose" or "what is happening?"'
|
|
||||||
---
|
|
||||||
|
|
||||||
Follow the instructions in [workflow.md](workflow.md).
|
|
||||||
|
|
@@ -1 +0,0 @@
|
||||||
type: skill
|
|
||||||
|
|
@@ -1,80 +0,0 @@
|
||||||
---
|
|
||||||
workflow_id: QD001
|
|
||||||
workflow_name: Quick Diagnose
|
|
||||||
description: Fast diagnosis of production issue with root cause and fix suggestion
|
|
||||||
entry_point: steps/step-01-gather.md
|
|
||||||
phase: quick-flow
|
|
||||||
lead_agent: "Minh (SRE)"
|
|
||||||
status: "active"
|
|
||||||
created_date: 2026-03-17
|
|
||||||
version: "1.0.0"
|
|
||||||
estimated_duration: "15-25 minutes"
|
|
||||||
outputFile: '{output_folder}/psm-artifacts/quick-diagnose-{{date}}.md'
|
|
||||||
---
|
|
||||||
|
|
||||||
# Workflow: Quick Diagnose Production Issue
|
|
||||||
|
|
||||||
## Goal
|
|
||||||
Rapidly diagnose production issues by gathering symptom data, checking metrics, and suggesting fixes.
|
|
||||||
|
|
||||||
## Overview
|
|
||||||
|
|
||||||
Quick Diagnose is a lightweight workflow for time-sensitive production troubleshooting:
|
|
||||||
|
|
||||||
1. **Gathers** symptom description and quick metrics check
|
|
||||||
2. **Diagnoses** root cause using observability data
|
|
||||||
3. **Suggests** fix or mitigation immediately
|
|
||||||
|
|
||||||
## Execution Path
|
|
||||||
|
|
||||||
```
START
↓
[Step 01] Gather Context (What's broken? Check metrics)
↓
[Step 02] Diagnose & Fix (Root cause analysis → fix suggestion → verify)
↓
END
```
|
|
||||||
|
|
||||||
## Key Roles
|
|
||||||
|
|
||||||
| Role | Agent |
|------|-------|
| Lead | Minh (SRE) |
|
|
||||||
|
|
||||||
## Input Requirements
|
|
||||||
|
|
||||||
- **Symptom description** — What is failing? (error message, behavior, timeline)
|
|
||||||
- **Affected service/component** — What system is broken?
|
|
||||||
- **Timeline** — When did it start? Is it ongoing?
|
|
||||||
- **Impact** — How many users affected? Is revenue impacted?
|
|
||||||
|
|
||||||
## Output Deliverable
|
|
||||||
|
|
||||||
- **Quick Diagnosis Report** (markdown, 1-2 pages)
|
|
||||||
- Symptom analysis
|
|
||||||
- Root cause hypothesis
|
|
||||||
- Immediate mitigation (if needed)
|
|
||||||
- Fix suggestion with effort
|
|
||||||
- Follow-up actions
|
|
||||||
|
|
||||||
## Success Criteria
|
|
||||||
|
|
||||||
1. Root cause identified within 15-20 minutes
|
|
||||||
2. Immediate mitigation available (if needed)
|
|
||||||
3. Fix suggestion documented with clear steps
|
|
||||||
4. Team knows what to do next
|
|
||||||
|
|
||||||
## Quick Diagnose vs Full Production Readiness Review
|
|
||||||
|
|
||||||
| Aspect | Quick Diagnose | Full PRR |
|--------|---|---|
| Trigger | Active incident | Pre-deployment |
| Duration | 15-25 min | 2-3 hours |
| Scope | Single issue | All 9 dimensions |
| Goal | Fix now | Prevent issues |
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
**Navigation**: [← Back to quick-flow](../), [Next: Step 01 →](steps/step-01-gather.md)
|
|
||||||
|
|
@@ -1,6 +0,0 @@
|
||||||
---
|
|
||||||
name: bmad-psm-security-audit
|
|
||||||
description: 'Run comprehensive security audit and threat assessment. Use when the user says "security audit" or "vulnerability assessment" or "security review"'
|
|
||||||
---
|
|
||||||
|
|
||||||
Follow the instructions in [workflow.md](workflow.md).
|
|
||||||
|
|
@@ -1 +0,0 @@
|
||||||
type: skill
|
|
||||||
|
|
@@ -1,502 +0,0 @@
|
||||||
---
|
|
||||||
template_name: security-audit-report
|
|
||||||
template_version: "1.0.0"
|
|
||||||
created_date: 2026-03-17
|
|
||||||
description: Security audit report with findings, severity levels, and remediation plan
|
|
||||||
---
|
|
||||||
|
|
||||||
# Security Audit Report
|
|
||||||
|
|
||||||
**Service**: {{SERVICE_NAME}}
|
|
||||||
**Service Owner**: {{SERVICE_OWNER}}
|
|
||||||
**Auditor**: {{SECURITY_LEAD}} (Hà)
|
|
||||||
**Audit Date**: {{START_DATE}} — {{END_DATE}}
|
|
||||||
**Report Date**: {{REPORT_DATE}}
|
|
||||||
**Scope**: {{SCOPE_DESCRIPTION}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Executive Summary
|
|
||||||
|
|
||||||
This security audit evaluated {{SERVICE_NAME}} against security best practices and compliance requirements. The assessment identified {{X}} findings across {{Y}} security domains.
|
|
||||||
|
|
||||||
**Overall Security Posture**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
|
||||||
|
|
||||||
{{1-2 paragraph summary of key findings, critical issues if any, and recommendations}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Audit Scope
|
|
||||||
|
|
||||||
### Services Reviewed
|
|
||||||
|
|
||||||
- {{Service 1}} ({{Description}})
|
|
||||||
- {{Service 2}} ({{Description}})
|
|
||||||
- {{Service 3}} ({{Description}})
|
|
||||||
|
|
||||||
### Assessment Domains
|
|
||||||
|
|
||||||
- ✅ Authentication & Authorization
|
|
||||||
- ✅ API Security
|
|
||||||
- ✅ Secrets Management
|
|
||||||
- ✅ Encryption (in-transit & at-rest)
|
|
||||||
- ✅ PII & Data Protection
|
|
||||||
|
|
||||||
### Exclusions
|
|
||||||
|
|
||||||
{{Any out-of-scope areas:}}
|
|
||||||
- {{Item}} (reason)
|
|
||||||
- {{Item}} (reason)
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Findings Summary
|
|
||||||
|
|
||||||
### By Severity
|
|
||||||
|
|
||||||
| Severity | Count | Trend |
|----------|-------|-------|
| **Critical** | {{X}} | {{↑/→/↓}} |
| **High** | {{Y}} | {{↑/→/↓}} |
| **Medium** | {{Z}} | {{↑/→/↓}} |
| **Low** | {{W}} | {{↑/→/↓}} |
| **Total** | {{X+Y+Z+W}} | |
|
|
||||||
|
|
||||||
### By Domain
|
|
||||||
|
|
||||||
| Domain | Critical | High | Medium | Low | Status |
|--------|----------|------|--------|-----|--------|
| Auth & Authz | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| API Security | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| Secrets Mgmt | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| Encryption | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
| PII & Data | {{#}} | {{#}} | {{#}} | {{#}} | ✅/⚠️/❌ |
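Both summary tables can be filled from one flat list of findings. The following is a minimal sketch, assuming each finding is recorded as a `(domain, severity)` pair; `tally_findings` is a hypothetical helper and not something the template ships with.

```python
SEVERITIES = ["Critical", "High", "Medium", "Low"]


def tally_findings(findings: list[tuple[str, str]]) -> tuple[dict, dict]:
    """Count findings for the 'By Severity' and 'By Domain' tables."""
    by_severity = {s: 0 for s in SEVERITIES}
    by_domain: dict[str, dict[str, int]] = {}
    for domain, severity in findings:
        if severity not in SEVERITIES:
            raise ValueError(f"Unknown severity: {severity}")
        by_severity[severity] += 1
        # Each domain row starts with a zero count for every severity.
        row = by_domain.setdefault(domain, {s: 0 for s in SEVERITIES})
        row[severity] += 1
    by_severity["Total"] = sum(by_severity.values())
    return by_severity, by_domain
```

Counting from a single source keeps the two tables consistent: the severity totals always equal the column sums of the domain table.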
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Critical Severity Findings
|
|
||||||
|
|
||||||
### [F1] {{Finding Title}}
|
|
||||||
|
|
||||||
**Severity**: CRITICAL (CVSS {{9.0-10.0}})
**Domain**: {{Which domain}}
**Status**: {{Open | In Progress | Resolved}}

**Description**:
{{Detailed description of the vulnerability, how it could be exploited, and the impact}}

**Evidence**:
- {{Evidence 1}}
- {{Evidence 2}}
- {{Testing confirmation}}

**Impact**:
- {{Business impact}}
- {{Technical impact}}
- {{Compliance impact}}

**Remediation**:
1. {{Step 1}} ({{Estimated time}})
2. {{Step 2}} ({{Estimated time}})
3. {{Step 3}} ({{Estimated time}})

**Owner**: {{Name}}
**Target Fix Date**: {{DATE}}
**Effort**: {{Est. hours/days}}
**Verification**: {{How we'll confirm it's fixed}}

---

### [F2] {{Finding Title}}

{{Repeat Critical severity format}}

---

## High Severity Findings

### [F3] {{Finding Title}}

**Severity**: HIGH (CVSS {{7.0-8.9}})
|
|
||||||
**Domain**: {{Which domain}}
|
|
||||||
**Status**: {{Open | In Progress | Resolved}}
|
|
||||||
|
|
||||||
**Description**: {{Brief description}}
|
|
||||||
|
|
||||||
**Impact**: {{Why it matters}}
|
|
||||||
|
|
||||||
**Remediation**:
|
|
||||||
1. {{Step 1}}
|
|
||||||
2. {{Step 2}}
|
|
||||||
|
|
||||||
**Owner**: {{Name}}
|
|
||||||
**Target Date**: {{DATE}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### [F4] {{Finding Title}}
|
|
||||||
|
|
||||||
{{Repeat High severity format}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Medium Severity Findings
|
|
||||||
|
|
||||||
### [F5] {{Finding Title}}
|
|
||||||
|
|
||||||
**Severity**: MEDIUM (CVSS {{4.0-6.9}})
|
|
||||||
**Domain**: {{Which domain}}
|
|
||||||
**Description**: {{Brief description}}
|
|
||||||
**Remediation**: {{Brief fix}}
|
|
||||||
**Owner**: {{Name}} | **Target Date**: {{DATE}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### [F6] {{Finding Title}}
|
|
||||||
|
|
||||||
{{Repeat Medium severity format}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Low Severity Findings
|
|
||||||
|
|
||||||
### [F7] {{Finding Title}}
|
|
||||||
|
|
||||||
**Severity**: LOW (CVSS {{0.1-3.9}})
|
|
||||||
**Description**: {{Brief description}}
|
|
||||||
**Remediation**: {{Brief fix}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### [F8] {{Finding Title}}
|
|
||||||
|
|
||||||
{{Repeat Low severity format}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Domain-Specific Assessment
|
|
||||||
|
|
||||||
### Domain 1: Authentication & Authorization
|
|
||||||
|
|
||||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
|
||||||
|
|
||||||
**Strengths**:
|
|
||||||
- {{Positive finding 1}}
|
|
||||||
- {{Positive finding 2}}
|
|
||||||
|
|
||||||
**Gaps**:
|
|
||||||
- {{Gap 1}} — {{Impact}}
|
|
||||||
- {{Gap 2}} — {{Impact}}
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. {{Recommendation 1}}
|
|
||||||
2. {{Recommendation 2}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Domain 2: API Security
|
|
||||||
|
|
||||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
|
||||||
|
|
||||||
**Strengths**:
|
|
||||||
- {{Positive finding 1}}
|
|
||||||
- {{Positive finding 2}}
|
|
||||||
|
|
||||||
**Gaps**:
|
|
||||||
- {{Gap 1}} — {{Impact}}
|
|
||||||
- {{Gap 2}} — {{Impact}}
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. {{Recommendation 1}}
|
|
||||||
2. {{Recommendation 2}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Domain 3: Secrets Management
|
|
||||||
|
|
||||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
|
||||||
|
|
||||||
**Strengths**:
|
|
||||||
- {{Positive finding 1}}
|
|
||||||
- {{Positive finding 2}}
|
|
||||||
|
|
||||||
**Gaps**:
|
|
||||||
- {{Gap 1}} — {{Impact}}
|
|
||||||
- {{Gap 2}} — {{Impact}}
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. {{Recommendation 1}}
|
|
||||||
2. {{Recommendation 2}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Domain 4: Encryption
|
|
||||||
|
|
||||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
|
||||||
|
|
||||||
**Strengths**:
|
|
||||||
- {{Positive finding 1}}
|
|
||||||
- {{Positive finding 2}}
|
|
||||||
|
|
||||||
**Gaps**:
|
|
||||||
- {{Gap 1}} — {{Impact}}
|
|
||||||
- {{Gap 2}} — {{Impact}}
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. {{Recommendation 1}}
|
|
||||||
2. {{Recommendation 2}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Domain 5: PII & Data Protection
|
|
||||||
|
|
||||||
**Status**: {{COMPLIANT | FINDINGS | CRITICAL}}
|
|
||||||
|
|
||||||
**Strengths**:
|
|
||||||
- {{Positive finding 1}}
|
|
||||||
- {{Positive finding 2}}
|
|
||||||
|
|
||||||
**Gaps**:
|
|
||||||
- {{Gap 1}} — {{Impact}}
|
|
||||||
- {{Gap 2}} — {{Impact}}
|
|
||||||
|
|
||||||
**Recommendations**:
|
|
||||||
1. {{Recommendation 1}}
|
|
||||||
2. {{Recommendation 2}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Compliance Assessment
|
|
||||||
|
|
||||||
### GDPR (General Data Protection Regulation)
|
|
||||||
|
|
||||||
**Applicable**: {{YES | NO | PARTIAL}}
|
|
||||||
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
|
|
||||||
|
|
||||||
| Requirement | Status | Finding | Gap Fix |
|-------------|--------|---------|---------|
| Data Encryption | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Access Control | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Retention Policy | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Right to Deletion | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Data Processing Agreement | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
|
||||||
|
|
||||||
**Timeline to Compliance**: {{DATE or "Already compliant"}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### PCI-DSS (Payment Card Industry Data Security Standard)
|
|
||||||
|
|
||||||
**Applicable**: {{YES | NO | PARTIAL}}
|
|
||||||
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
|
|
||||||
|
|
||||||
| Requirement | Status | Finding | Gap Fix |
|-------------|--------|---------|---------|
| TLS 1.2+ | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Secrets Management | {{✅/❌}} | {{Description}} | {{Remediation}} |
| Input Validation | {{✅/❌}} | {{Description}} | {{Remediation}} |
|
|
||||||
|
|
||||||
**Timeline to Compliance**: {{DATE or "Already compliant"}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### SOC 2 Type II
|
|
||||||
|
|
||||||
**Applicable**: {{YES | NO | PARTIAL}}
|
|
||||||
**Status**: {{COMPLIANT | NON-COMPLIANT | CONDITIONAL}}
|
|
||||||
|
|
||||||
**Gap Summary**: {{Description of gaps or "No gaps identified"}}
|
|
||||||
|
|
||||||
**Timeline**: {{When audit can be conducted}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
### Other Regulations
|
|
||||||
|
|
||||||
{{Any other applicable standards (HIPAA, FINRA, etc.)}}
|
|
||||||
|
|
||||||
---
|
|
||||||
|
|
||||||
## Remediation Roadmap

### Critical Path (Weeks 1-2)

**All Critical findings must be fixed before production deployment.**

- [ ] {{F1}} — Owner: {{Name}}, Deadline: {{DATE}}
- [ ] {{F2}} — Owner: {{Name}}, Deadline: {{DATE}}

**Milestone**: Security re-scan on {{DATE}} to verify fixes

---

### Phase 2 (Weeks 3-4)

Complete High-severity findings:

- [ ] {{F3}} — Owner: {{Name}}, Deadline: {{DATE}}
- [ ] {{F4}} — Owner: {{Name}}, Deadline: {{DATE}}

**Milestone**: Second security review on {{DATE}}

---

### Phase 3 (Weeks 5-8)

Address Medium-severity findings (can be post-production with monitoring):

- [ ] {{F5}} — Owner: {{Name}}, Target: {{DATE}}
- [ ] {{F6}} — Owner: {{Name}}, Target: {{DATE}}

---

### Backlog (Next Sprint)

Low-severity items:

- [ ] {{F7}} — {{Brief description}}
- [ ] {{F8}} — {{Brief description}}

---

## Remediation Status Tracking

| Finding | Owner | Deadline | Status | Last Update | Notes |
|---------|-------|----------|--------|-------------|-------|
| F1 | {{Name}} | {{Date}} | 🔴 Pending | {{Date}} | {{Notes}} |
| F2 | {{Name}} | {{Date}} | 🟡 In Progress | {{Date}} | {{Notes}} |
| F3 | {{Name}} | {{Date}} | 🟢 Complete | {{Date}} | {{Notes}} |

---

## Post-Audit Monitoring

### Controls to Monitor

{{If service proceeds to production despite findings:}}

- **{{Control 1}}** — Monitor via {{method}}, alert if {{threshold}}
- **{{Control 2}}** — Monitor via {{method}}, alert if {{threshold}}
- **{{Control 3}}** — Monitor via {{method}}, alert if {{threshold}}

### Incident Response

If a security incident occurs:
1. Activate incident response team
2. Notify {{Escalation contacts}}
3. Follow {{Incident response runbook}}
4. Conduct post-incident security review

---

## Risk Assessment Matrix

```
                     LIKELIHOOD
                 Low    Med    High
      CRITICAL    H      C      C
IMPACT
      HIGH        M      H      C
      MEDIUM      L      M      H
      LOW         L      L      M

Legend: C=Critical, H=High, M=Medium, L=Low
```

**Our findings map**:
- {{F1}} — {{Position on matrix}}
- {{F2}} — {{Position on matrix}}

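The matrix above can also be read as a plain lookup from impact and likelihood to a rating. The following is a minimal Python sketch under that reading; `RISK_MATRIX`, `LEGEND`, and `risk_rating` are hypothetical names introduced for illustration and are not part of the template.

```python
# Hypothetical helper (not part of the template): the risk matrix as a lookup table.
# Rows are impact levels, columns are likelihood levels, cells use the legend codes.
RISK_MATRIX = {
    "CRITICAL": {"Low": "H", "Med": "C", "High": "C"},
    "HIGH": {"Low": "M", "Med": "H", "High": "C"},
    "MEDIUM": {"Low": "L", "Med": "M", "High": "H"},
    "LOW": {"Low": "L", "Med": "L", "High": "M"},
}
LEGEND = {"C": "Critical", "H": "High", "M": "Medium", "L": "Low"}


def risk_rating(impact: str, likelihood: str) -> str:
    """Return the rating for one finding; input case is normalised first."""
    code = RISK_MATRIX[impact.upper()][likelihood.capitalize()]
    return LEGEND[code]


if __name__ == "__main__":
    # Example: a HIGH-impact finding with Med likelihood.
    print(risk_rating("HIGH", "Med"))
```
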
---

## Positive Findings

**Strengths to maintain:**

- {{Positive 1}} — Keep doing this
- {{Positive 2}} — Keep doing this
- {{Positive 3}} — Keep doing this

---

## Recommendations Summary

### Immediate (Critical)
- {{Fix all Critical findings}} ({{effort}})

### Short-term (High Priority)
- {{Fix all High findings}} ({{effort}})
- {{Implement automated scanning}} ({{effort}})
- {{Setup security monitoring}} ({{effort}})

### Medium-term
- {{Implement {{technology}} for {{purpose}}}} ({{effort}})
- {{Security training for team}} ({{effort}})

### Long-term (Next 6 Months)
- {{Major security initiative}} ({{effort}})
- {{Penetration testing}} ({{effort}})

---

## Sign-offs & Approvals

### Audit Approval

- [ ] **Security Lead** ({{AUDITOR_NAME}})
  - Signature: ________________________ Date: __________
  - Assessment complete and findings documented

### Service Owner Acknowledgment

- [ ] **Service Owner** ({{SERVICE_OWNER}})
  - Signature: ________________________ Date: __________
  - Acknowledged findings and committed to remediation

### Compliance Officer Review

- [ ] **Compliance Officer** ({{NAME}})
  - Signature: ________________________ Date: __________
  - Compliance requirements verified

### Executive Approval (If Production Clearance Needed)

- [ ] **VP Engineering / Security** ({{NAME}})
  - Signature: ________________________ Date: __________
  - Risk accepted; approved for production

---

## Distribution

- [x] Shared with: {{Service team, Leadership, Compliance}}
- [x] Date shared: {{DATE}}
- [x] Follow-up review scheduled: {{DATE}}

---

## Appendix: Testing Evidence

### Code Review Findings

```
{{Code snippets demonstrating vulnerabilities}}
```

### Configuration Issues

```
{{Configuration examples showing gaps}}
```

### Dependencies Scan

```
{{Vulnerable dependencies identified}}
```

---

**Report Prepared By**: {{AUDITOR_NAME}}
**Report Date**: {{DATE}}
**Review Status**: Draft | Final | Approved

@ -1,91 +0,0 @@
---
workflow_id: SA001
workflow_name: Security Audit
description: Comprehensive security review using security patterns, config management, and compliance framework
entry_point: steps/step-01-scope.md
phase: 4-cross
lead_agent: "Hà (Security)"
status: "active"
created_date: 2026-03-17
version: "1.0.0"
estimated_duration: "2-3 hours"
outputFile: '{output_folder}/psm-artifacts/security-audit-{{project_name}}-{{date}}.md'
---

# Workflow: Security Audit

## Goal
Perform a comprehensive security evaluation using the Production Systems BMAD framework, covering threat modeling, vulnerability assessment, compliance, and security controls.

## Overview

The security audit is a cross-functional workflow that evaluates a service's security posture, either before production deployment or as part of ongoing compliance verification. The audit:

1. **Scopes** the audit engagement, defines the threat model, and identifies compliance requirements
2. **Executes** a detailed security assessment across multiple domains (authentication, data protection, infrastructure, API security)
3. **Reports** findings with severity levels, remediation recommendations, and compliance status

## Execution Path

```
START
  ↓
[Step 01] Scope & Threat Model (Define audit scope, identify threats, compliance reqs)
  ↓
[Step 02] Security Assessment (Execute checklist across domains, identify vulns)
  ↓
[Step 03] Security Report (Findings report, severity, recommendations, compliance)
  ↓
END
```

## Key Roles

| Role | Agent | Responsibility |
|------|-------|-----------------|
| Lead | Hà (Security) | Lead audit, coordinate assessment, synthesize findings |
| Subject Matter | Service Owner + Platform Eng | Provide architecture, answer security questions |
| Compliance | Security/Compliance Team | Validate compliance mapping, sign-off |

## Assessment Domains (5)

1. **Authentication & Authorization** — Identity verification, access control, session management
2. **API Security** — Input validation, rate limiting, API key management, CORS
3. **Secrets Management** — Credential storage, rotation, access logging
4. **Encryption** — In-transit (TLS), at-rest, key management
5. **PII & Data Protection** — Classification, access controls, audit logging, retention

## Input Requirements

- **Service architecture diagram** — Components, data flows, external integrations
- **Authentication/authorization approach** — OAuth2, JWT, SAML, custom
- **Secrets storage mechanism** — Vault, cloud provider, environment variables
- **Compliance requirements** — GDPR, CCPA, SOC2, industry-specific
- **Known security controls** — WAF, TLS config, authentication libraries

## Output Deliverable

- **Security Audit Report** (template: `security-audit-report.template.md`)
  - Audit scope and threat model
  - Findings organized by domain with severity (Critical/High/Medium/Low)
  - Remediation recommendations with priority and effort
  - Compliance status matrix
  - Sign-off

## Success Criteria

1. All security domains assessed with clear findings
2. Severity levels assigned (using CVSS or similar framework)
3. Remediation plan with owners and deadlines
4. Compliance requirements verified (if applicable)
5. Team alignment on security posture

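Criterion 2 names CVSS as one option for assigning severity. As a sketch only, the function below maps a CVSS v3.x base score to the qualitative rating scale published with that standard (0.0 None, 0.1-3.9 Low, 4.0-6.9 Medium, 7.0-8.9 High, 9.0-10.0 Critical). The name `cvss_severity` is hypothetical, and the workflow does not state which CVSS version or alternative framework an audit would actually use.

```python
def cvss_severity(score: float) -> str:
    """Map a CVSS v3.x base score (0.0 to 10.0) to its qualitative rating."""
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"CVSS base score out of range: {score}")
    if score == 0.0:
        return "None"
    if score < 4.0:
        return "Low"
    if score < 7.0:
        return "Medium"
    if score < 9.0:
        return "High"
    return "Critical"


if __name__ == "__main__":
    for example in (3.9, 6.5, 8.1, 9.8):
        print(example, cvss_severity(example))
```

A score of 0.0 maps to "None", which has no counterpart in the report's Critical/High/Medium/Low scale, so such items would presumably be recorded as informational rather than as findings.
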
## Next Steps After Workflow

- If **COMPLIANT**: Document in security registry; schedule periodic re-audit
- If **NON-COMPLIANT**: Add remediation items to backlog; track closure
- If **CRITICAL ISSUES**: Consider production pause until resolved

---

**Navigation**: [← Back to 4-cross](../), [Next: Step 01 →](steps/step-01-scope.md)

@ -1,6 +0,0 @@
---
name: bmad-psm-setup-new-service
description: 'Set up new production service from architecture through deployment. Use when the user says "new service" or "setup service" or "new microservice"'
---

Follow the instructions in [workflow.md](workflow.md).

@ -1 +0,0 @@
type: skill

@ -1,116 +0,0 @@
---
workflow_id: W-SETUP-SVC-001
workflow_name: Setup Production Service for BMAD
version: 6.2.0
lead_agent: "Architect Khang"
supporting_agents: ["SRE Minh", "Mary Analyst"]
phase: "1-Analysis → 2-Planning → 3-Solutioning → 4-Implementation"
created_date: 2026-03-17
last_modified: 2026-03-17
config_file: "_config/config.yaml"
estimated_duration: "12-20 hours"
outputFile: '{output_folder}/psm-artifacts/service-setup-{{project_name}}-{{date}}.md'
---

# Setup Production Service Workflow — BMAD Pattern

## Metadata & Context

**Goal**: Build a production-grade service from scratch, complete with architecture, API design, deployment pipeline, reliability patterns, security, and production readiness.

**Lead Team**:
- SRE Minh (Reliability, Infrastructure, Operations)
- Architect Khang (System Design, Technology Selection)
- Mary Analyst (Requirements, Risk Assessment)

**Success Criteria**:
- ✓ Architecture design document approved
- ✓ API contracts defined & validated
- ✓ Database schema designed & indexed
- ✓ CI/CD pipeline operational
- ✓ Resilience & observability in place
- ✓ Security & compliance verified
- ✓ Production readiness checklist passed

## Workflow Overview

This workflow proceeds through 6 atomic steps, each focused on a single domain:

1. **Step-01-Architecture** → Requirements + Architecture Pattern Selection
2. **Step-02-API-Database** → API Design + Database Selection + Schema
3. **Step-03-Build-Deploy** → CI/CD + Containerization + Testing Strategy
4. **Step-04-Reliability** → Resilience Patterns + Observability + Error Handling
5. **Step-05-Security-Infra** → Auth/Authz + Secrets + K8s Config
6. **Step-06-Readiness** → PRR Checklist + Runbook + Go/No-Go Decision

## Configuration Loading

Loaded automatically from `_config/config.yaml`:

```yaml
project_context:
  user_name: "[loaded from config]"
  organization: "[loaded from config]"
  environment: "production"

workflow_defaults:
  communication_language: "Vietnamese"
  output_folder: "./outputs/setup-new-service-{service_name}"
  timestamp: "2026-03-17"
```

## Execution Model

### Entry Point Logic

```
1. Check if workflow.md exists in outputs folder
   → If NEW: Start from step-01-architecture.md
   → If RESUME: Load progress.yaml → auto-skip completed steps
   → If PARTIAL: Load step-N-context.yaml → resume from step N

2. For each step:
   a) Load step-{N}-{name}.md
   b) Load referenced SKILL files (auto-parse "Load:" directives)
   c) Execute MENU [A][C] options
   d) Save step output to step-{N}-output.md
   e) Move to next step

3. Final: Generate comprehensive outputs in outputs folder
```

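The entry-point check in item 1 does not spell out how NEW, RESUME, and PARTIAL are told apart. One possible reading, sketched in Python: a missing `workflow.md` means NEW, a `progress.yaml` means RESUME, and a `step-N-context.yaml` file without `progress.yaml` means PARTIAL. That precedence, the fallback to NEW, and the name `detect_mode` are assumptions made for illustration and are not taken from the workflow itself.

```python
import re
from typing import Iterable

# Assumed file-name pattern for per-step context files, e.g. "step-3-context.yaml".
STEP_CONTEXT = re.compile(r"^step-\d+-context\.yaml$")


def detect_mode(filenames: Iterable[str]) -> str:
    """Classify an outputs folder, given the names of the files it contains."""
    names = set(filenames)
    if "workflow.md" not in names:
        return "NEW"  # nothing has been run yet
    if "progress.yaml" in names:
        return "RESUME"  # progress file lets completed steps be skipped
    if any(STEP_CONTEXT.match(name) for name in names):
        return "PARTIAL"  # only a per-step context file is available
    return "NEW"  # assumed fallback: nothing to resume from


if __name__ == "__main__":
    print(detect_mode(["workflow.md", "progress.yaml"]))
```

With `pathlib`, the same function could be fed from a real folder via `detect_mode(p.name for p in Path(folder).iterdir())`.
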
### State Tracking

Output document frontmatter tracks progress:

```yaml
workflow_progress:
  step_01_architecture: "completed"
  step_02_api_database: "completed"
  step_03_build_deploy: "in_progress"
  step_04_reliability: "pending"
  step_05_security_infra: "pending"
  step_06_readiness: "pending"
  last_updated: "2026-03-17T14:30:00Z"
  current_agent: "Architect Khang"
```

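Given a progress mapping like the one above, the step to resume from is the first one, in order, that is not yet marked completed. The sketch below assumes the six step keys shown in the frontmatter example; `STEP_ORDER` and `next_step` are hypothetical names, and the real workflow may resolve this differently.

```python
from typing import Mapping, Optional

# Step keys in execution order, copied from the frontmatter example.
STEP_ORDER = [
    "step_01_architecture",
    "step_02_api_database",
    "step_03_build_deploy",
    "step_04_reliability",
    "step_05_security_infra",
    "step_06_readiness",
]


def next_step(progress: Mapping[str, str]) -> Optional[str]:
    """Return the first step not marked "completed", or None if all are done."""
    for step in STEP_ORDER:
        # A step missing from the mapping is treated as pending (assumption).
        if progress.get(step, "pending") != "completed":
            return step
    return None


if __name__ == "__main__":
    example = {
        "step_01_architecture": "completed",
        "step_02_api_database": "completed",
        "step_03_build_deploy": "in_progress",
    }
    print(next_step(example))
```

Walking the steps in a fixed order, rather than trusting any single status field, keeps the resume point consistent with a strictly sequential workflow.
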
## Mandatory Workflow Rules

1. **No skipping steps** — Every step must be executed in order
2. **Validate assumptions** — Every decision must be documented
3. **Cross-phase collaboration** — Architects + SRE + Analysts work together
4. **Output artifacts** — Every step produces tangible output documents
5. **Handoff protocol** — Context is transferred explicitly between steps

## Navigation

Choose how to start:

- **[NEW]** — Start a new workflow → Load step-01
- **[RESUME]** — Return to a previously started workflow (detects progress)
- **[SKIP-TO]** — Jump to a specific step (dev-only, requires confirmation)

---

**Continue by choosing [NEW] or [RESUME]**