BMAD-METHOD/src/modules/bmb/docs/workflows/csv-data-file-standards.md

207 lines
9.8 KiB
Markdown

# CSV Data File Standards for BMAD Workflows
## Purpose and Usage
CSV data files in BMAD workflows serve specific purposes for different workflow types:
**For Agents:** Provide structured data that agents need to reference but cannot realistically generate (such as specific configurations, domain-specific data, or structured knowledge bases).
**For Expert Agents:** Supply specialized knowledge bases, reference data, or persistent information that the expert agent needs to access consistently across sessions.
**For Workflows:** Include reference data, configuration parameters, or structured inputs that guide workflow execution and decision-making.
**Key Principle:** CSV files should contain data that is essential, structured, and not easily generated by LLMs during execution.
## Intent-Based Design Principle
**Core Philosophy:** The closer workflows stay to **intent** rather than **prescriptive** instructions, the more creative and adaptive the LLM experience becomes.
**CSV Enables Intent-Based Design:**
- **Instead of:** Hardcoded scripts with exact phrases LLM must say
- **CSV Provides:** Clear goals and patterns that LLM adapts creatively to context
- **Result:** Natural, contextual conversations rather than rigid scripts
**Example - Advanced Elicitation:**
- **Prescriptive Alternative:** 50 separate files with exact conversation scripts
- **Intent-Based Reality:** One CSV row with method goal + pattern → LLM adapts to user
- **Benefit:** Same method works differently for different users while maintaining essence
**Intent vs Prescriptive Spectrum:**
- **Highly Prescriptive:** "Say exactly: 'Based on my analysis, I recommend...'"
- **Balanced Intent:** "Help the user understand the implications using your professional judgment"
- **CSV Goal:** Provide just enough guidance to enable creative, context-aware execution
## Primary Use Cases
### 1. Knowledge Base Indexing (Document Lookup Optimization)
**Problem:** Large knowledge bases with hundreds of documents cause context blowup and missed details when LLMs try to process them all.
**CSV Solution:** Create a knowledge base index with:
- **Column 1:** Keywords and topics
- **Column 2:** Document file path/location
- **Column 3:** Section or line number where relevant content starts
- **Column 4:** Content type or summary (optional)
**Result:** Transform from context-blowing document loads to surgical precision lookups, creating agents with near-infinite knowledge bases while maintaining optimal context usage.
### 2. Workflow Sequence Optimization
**Problem:** Complex workflows (e.g., game development) with hundreds of potential steps for different scenarios become unwieldy and context-heavy.
**CSV Solution:** Create a workflow routing table:
- **Column 1:** Scenario type (e.g., "2D Platformer", "RPG", "Puzzle Game")
- **Column 2:** Required step sequence (e.g., "step-01,step-03,step-07,step-12")
- **Column 3:** Document sections to include
- **Column 4:** Specialized parameters or configurations
**Result:** Step 1 determines user needs, finds closest match in CSV, confirms with user, then follows optimized sequence - truly optimal for context usage.
### 3. Method Registry (Dynamic Technique Selection)
**Problem:** Tasks need to select optimal techniques from dozens of options based on context, without hardcoding selection logic.
**CSV Solution:** Create a method registry with:
- **Column 1:** Category (collaboration, advanced, technical, creative, etc.)
- **Column 2:** Method name and rich description
- **Column 3:** Execution pattern or flow guide (e.g., "analysis → insights → action")
- **Column 4:** Complexity level or use case indicators
**Example:** Advanced Elicitation task analyzes content context, selects 5 best-matched methods from 50 options, then executes dynamically using CSV descriptions.
**Result:** Smart, context-aware technique selection without hardcoded logic - infinitely extensible method libraries.
### 4. Configuration Management
**Problem:** Complex systems with many configuration options that vary by use case.
**CSV Solution:** Configuration lookup tables mapping scenarios to specific parameter sets.
## What NOT to Include in CSV Files
**Avoid Web-Searchable Data:** Do not include information that LLMs can readily access through web search or that exists in their training data, such as:
- Common programming syntax or standard library functions
- General knowledge about widely used technologies
- Historical facts or commonly available information
- Basic terminology or standard definitions
**Include Specialized Data:** Focus on data that is:
- Specific to your project or domain
- Not readily available through web search
- Essential for consistent workflow execution
- Too voluminous for LLM context windows
## CSV Data File Standards
### 1. Purpose Validation
- **Essential Data Only:** CSV must contain data that cannot be reasonably generated by LLMs
- **Domain Specific:** Data should be specific to the workflow's domain or purpose
- **Consistent Usage:** All columns and data must be referenced and used somewhere in the workflow
- **No Redundancy:** Avoid data that duplicates functionality already available to LLMs
### 2. Structural Standards
- **Valid CSV Format:** Proper comma-separated values with quoted fields where needed
- **Consistent Columns:** All rows must have the same number of columns
- **No Missing Data:** Empty values should be explicitly marked (e.g., "", "N/A", or NULL)
- **Header Row:** First row must contain clear, descriptive column headers
- **Proper Encoding:** UTF-8 encoding required for special characters
### 3. Content Standards
- **No LLM-Generated Content:** Avoid data that LLMs can easily generate (e.g., generic phrases, common knowledge)
- **Specific and Concrete:** Use specific values rather than vague descriptions
- **Verifiable Data:** Data should be factual and verifiable when possible
- **Consistent Formatting:** Date formats, numbers, and text should follow consistent patterns
### 4. Column Standards
- **Clear Headers:** Column names must be descriptive and self-explanatory
- **Consistent Data Types:** Each column should contain consistent data types
- **No Unused Columns:** Every column must be referenced and used in the workflow
- **Appropriate Width:** Columns should be reasonably narrow and focused
### 5. File Size Standards
- **Efficient Structure:** CSV files should be as small as possible while maintaining functionality
- **No Redundant Rows:** Avoid duplicate or nearly identical rows
- **Compressed Data:** Use efficient data representation (e.g., codes instead of full descriptions)
- **Maximum Size:** Individual CSV files should not exceed 1MB unless absolutely necessary
### 6. Documentation Standards
- **Documentation Required:** Each CSV file should have documentation explaining its purpose
- **Column Descriptions:** Each column must be documented with its usage and format
- **Data Sources:** Source of data should be documented when applicable
- **Update Procedures:** Process for updating CSV data should be documented
### 7. Integration Standards
- **File References:** CSV files must be properly referenced in workflow configuration
- **Access Patterns:** Workflow must clearly define how and when CSV data is accessed
- **Error Handling:** Workflow must handle cases where CSV files are missing or corrupted
- **Version Control:** CSV files should be versioned when changes occur
### 8. Quality Assurance
- **Data Validation:** CSV data should be validated for correctness and completeness
- **Format Consistency:** Consistent formatting across all rows and columns
- **No Ambiguity:** Data entries should be clear and unambiguous
- **Regular Review:** CSV content should be reviewed periodically for relevance
### 9. Security Considerations
- **No Sensitive Data:** Avoid including sensitive, personal, or confidential information
- **Data Sanitization:** CSV data should be sanitized for security issues
- **Access Control:** Access to CSV files should be controlled when necessary
- **Audit Trail:** Changes to CSV files should be logged when appropriate
### 10. Performance Standards
- **Fast Loading:** CSV files must load quickly within workflow execution
- **Memory Efficient:** Structure should minimize memory usage during processing
- **Optimized Queries:** If data lookup is needed, optimize for efficient access
- **Caching Strategy**: Consider whether data can be cached for performance
## Implementation Guidelines
When creating CSV data files for BMAD workflows:
1. **Start with Purpose:** Clearly define why CSV is needed instead of LLM generation
2. **Design Structure:** Plan columns and data types before creating the file
3. **Test Integration:** Ensure workflow properly accesses and uses CSV data
4. **Document Thoroughly:** Provide complete documentation for future maintenance
5. **Validate Quality:** Check data quality, format consistency, and integration
6. **Monitor Usage:** Track how CSV data is used and optimize as needed
## Common Anti-Patterns to Avoid
- **Generic Phrases:** CSV files containing common phrases or LLM-generated content
- **Redundant Data:** Duplicating information easily available to LLMs
- **Overly Complex:** Unnecessarily complex CSV structures when simple data suffices
- **Unused Columns:** Columns that are defined but never referenced in workflows
- **Poor Formatting:** Inconsistent data formats, missing values, or structural issues
- **No Documentation:** CSV files without clear purpose or usage documentation
## Validation Checklist
For each CSV file, verify:
- [ ] Purpose is essential and cannot be replaced by LLM generation
- [ ] All columns are used in the workflow
- [ ] Data is properly formatted and consistent
- [ ] File is efficiently sized and structured
- [ ] Documentation is complete and clear
- [ ] Integration with workflow is tested and working
- [ ] Security considerations are addressed
- [ ] Performance requirements are met