BMAD-METHOD/bmad-agent/checklists/infrastructure-checklist.md

12 KiB

Infrastructure Change Validation Checklist

This checklist serves as a comprehensive framework for validating infrastructure changes before deployment to production. The DevOps/Platform Engineer should systematically work through each item, ensuring the infrastructure is secure, compliant, resilient, and properly implemented according to organizational standards.

1. SECURITY & COMPLIANCE

1.1 Access Management

  • RBAC principles applied with least privilege access
  • Service accounts have minimal required permissions
  • Secrets management solution properly implemented
  • IAM policies and roles documented and reviewed
  • Access audit mechanisms configured

1.2 Data Protection

  • Data at rest encryption enabled for all applicable services
  • Data in transit encryption (TLS 1.2+) enforced
  • Sensitive data identified and protected appropriately
  • Backup encryption configured where required
  • Data access audit trails implemented where required

1.3 Network Security

  • Network security groups configured with minimal required access
  • Private endpoints used for PaaS services where available
  • Public-facing services protected with WAF policies
  • Network traffic flows documented and secured
  • Network segmentation properly implemented

1.4 Compliance Requirements

  • Regulatory compliance requirements verified and met
  • Security scanning integrated into pipeline
  • Compliance evidence collection automated where possible
  • Privacy requirements addressed in infrastructure design
  • Security monitoring and alerting enabled

2. INFRASTRUCTURE AS CODE

2.1 IaC Implementation

  • All resources defined in IaC (Terraform/Bicep/ARM)
  • IaC code follows organizational standards and best practices
  • No manual configuration changes permitted
  • Dependencies explicitly defined and documented
  • Modules and resource naming follow conventions

2.2 IaC Quality & Management

  • IaC code reviewed by at least one other engineer
  • State files securely stored and backed up
  • Version control best practices followed
  • IaC changes tested in non-production environment
  • Documentation for IaC updated

2.3 Resource Organization

  • Resources organized in appropriate resource groups
  • Tags applied consistently per tagging strategy
  • Resource locks applied where appropriate
  • Naming conventions followed consistently
  • Resource dependencies explicitly managed

3. RESILIENCE & AVAILABILITY

3.1 High Availability

  • Resources deployed across appropriate availability zones
  • SLAs for each component documented and verified
  • Load balancing configured properly
  • Failover mechanisms tested and verified
  • Single points of failure identified and mitigated

3.2 Fault Tolerance

  • Auto-scaling configured where appropriate
  • Health checks implemented for all services
  • Circuit breakers implemented where necessary
  • Retry policies configured for transient failures
  • Graceful degradation mechanisms implemented

3.3 Recovery Metrics & Testing

  • Recovery time objectives (RTOs) verified
  • Recovery point objectives (RPOs) verified
  • Resilience testing completed and documented
  • Chaos engineering principles applied where appropriate
  • Recovery procedures documented and tested

4. BACKUP & DISASTER RECOVERY

4.1 Backup Strategy

  • Backup strategy defined and implemented
  • Backup retention periods aligned with requirements
  • Backup recovery tested and validated
  • Point-in-time recovery configured where needed
  • Backup access controls implemented

4.2 Disaster Recovery

  • DR plan documented and accessible
  • DR runbooks created and tested
  • Cross-region recovery strategy implemented (if required)
  • Regular DR drills scheduled
  • Dependencies considered in DR planning

4.3 Recovery Procedures

  • System state recovery procedures documented
  • Data recovery procedures documented
  • Application recovery procedures aligned with infrastructure
  • Recovery roles and responsibilities defined
  • Communication plan for recovery scenarios established

5. MONITORING & OBSERVABILITY

5.1 Monitoring Implementation

  • Monitoring coverage for all critical components
  • Appropriate metrics collected and dashboarded
  • Log aggregation implemented
  • Distributed tracing implemented (if applicable)
  • User experience/synthetics monitoring configured

5.2 Alerting & Response

  • Alerts configured for critical thresholds
  • Alert routing and escalation paths defined
  • Service health integration configured
  • On-call procedures documented
  • Incident response playbooks created

5.3 Operational Visibility

  • Custom queries/dashboards created for key scenarios
  • Resource utilization tracking configured
  • Cost monitoring implemented
  • Performance baselines established
  • Operational runbooks available for common issues

6. PERFORMANCE & OPTIMIZATION

6.1 Performance Testing

  • Performance testing completed and baseline established
  • Resource sizing appropriate for workload
  • Performance bottlenecks identified and addressed
  • Latency requirements verified
  • Throughput requirements verified

6.2 Resource Optimization

  • Cost optimization opportunities identified
  • Auto-scaling rules validated
  • Resource reservation used where appropriate
  • Storage tier selection optimized
  • Idle/unused resources identified for cleanup

6.3 Efficiency Mechanisms

  • Caching strategy implemented where appropriate
  • CDN/edge caching configured for content
  • Network latency optimized
  • Database performance tuned
  • Compute resource efficiency validated

7. OPERATIONS & GOVERNANCE

7.1 Documentation

  • Change documentation updated
  • Runbooks created or updated
  • Architecture diagrams updated
  • Configuration values documented
  • Service dependencies mapped and documented

7.2 Governance Controls

  • Cost controls implemented
  • Resource quota limits configured
  • Policy compliance verified
  • Audit logging enabled
  • Management access reviewed

7.3 Knowledge Transfer

  • Cross-team impacts documented and communicated
  • Required training/knowledge transfer completed
  • Architectural decision records updated
  • Post-implementation review scheduled
  • Operations team handover completed

8. CI/CD & DEPLOYMENT

8.1 Pipeline Configuration

  • CI/CD pipelines configured and tested
  • Environment promotion strategy defined
  • Deployment notifications configured
  • Pipeline security scanning enabled
  • Artifact management properly configured

8.2 Deployment Strategy

  • Rollback procedures documented and tested
  • Zero-downtime deployment strategy implemented
  • Deployment windows identified and scheduled
  • Progressive deployment approach used (if applicable)
  • Feature flags implemented where appropriate

8.3 Verification & Validation

  • Post-deployment verification tests defined
  • Smoke tests automated
  • Configuration validation automated
  • Integration tests with dependent systems
  • Canary/blue-green deployment configured (if applicable)

9. NETWORKING & CONNECTIVITY

9.1 Network Design

  • VNet/subnet design follows least-privilege principles
  • Network security groups rules audited
  • Public IP addresses minimized and justified
  • DNS configuration verified
  • Network diagram updated and accurate

9.2 Connectivity

  • VNet peering configured correctly
  • Service endpoints configured where needed
  • Private link/private endpoints implemented
  • External connectivity requirements verified
  • Load balancer configuration verified

9.3 Traffic Management

  • Inbound/outbound traffic flows documented
  • Firewall rules reviewed and minimized
  • Traffic routing optimized
  • Network monitoring configured
  • DDoS protection implemented where needed

10. COMPLIANCE & DOCUMENTATION

10.1 Compliance Verification

  • Required compliance evidence collected
  • Non-functional requirements verified
  • License compliance verified
  • Third-party dependencies documented
  • Security posture reviewed

10.2 Documentation Completeness

  • All documentation updated
  • Architecture diagrams updated
  • Technical debt documented (if any accepted)
  • Cost estimates updated and approved
  • Capacity planning documented

10.3 Cross-Team Collaboration

  • Development team impact assessed and communicated
  • Operations team handover completed
  • Security team reviews completed
  • Business stakeholders informed of changes
  • Feedback loops established for continuous improvement

11. BMAD WORKFLOW INTEGRATION

11.1 Development Agent Alignment

  • Infrastructure changes support Frontend Dev (Mira) and Fullstack Dev (Enrique) requirements
  • Backend requirements from Backend Dev (Lily) and Fullstack Dev (Enrique) accommodated
  • Local development environment compatibility verified for all dev agents
  • Infrastructure changes support automated testing frameworks
  • Development agent feedback incorporated into infrastructure design

11.2 Product Alignment

  • Infrastructure changes mapped to PRD requirements maintained by Product Owner
  • Non-functional requirements from PRD verified in implementation
  • Infrastructure capabilities and limitations communicated to Product teams
  • Infrastructure release timeline aligned with product roadmap
  • Technical constraints documented and shared with Product Owner

11.3 Architecture Alignment

  • Infrastructure implementation validated against architecture documentation
  • Architecture Decision Records (ADRs) reflected in infrastructure
  • Technical debt identified by Architect addressed or documented
  • Infrastructure changes support documented design patterns
  • Performance requirements from architecture verified in implementation

12. ARCHITECTURE DOCUMENTATION VALIDATION

12.1 Completeness Assessment

  • All required sections of architecture template completed
  • Architecture decisions documented with clear rationales
  • Technical diagrams included for all major components
  • Integration points with application architecture defined
  • Non-functional requirements addressed with specific solutions

12.2 Consistency Verification

  • Architecture aligns with broader system architecture
  • Terminology used consistently throughout documentation
  • Component relationships clearly defined
  • Environment differences explicitly documented
  • No contradictions between different sections

12.3 Stakeholder Usability

  • Documentation accessible to both technical and non-technical stakeholders
  • Complex concepts explained with appropriate analogies or examples
  • Implementation guidance clear for development teams
  • Operations considerations explicitly addressed
  • Future evolution pathways documented

Prerequisites Verified

  • All checklist sections reviewed
  • No outstanding critical or high-severity issues
  • All infrastructure changes tested in non-production environment
  • Rollback plan documented and tested
  • Required approvals obtained
  • Infrastructure changes verified against architectural decisions documented by Architect agent
  • Development environment impacts identified and mitigated
  • Infrastructure changes mapped to relevant user stories and epics
  • Release coordination planned with development teams
  • Local development environment compatibility verified