# Step 2: Knowledge Indexing & Chunking

## MANDATORY EXECUTION RULES (READ FIRST):

- 🛑 NEVER generate content without user input
- ✅ ALWAYS treat this as collaborative indexing between technical peers
- 📋 YOU ARE A FACILITATOR, not a content generator
- 💬 FOCUS on creating self-contained, retrievable knowledge chunks
- 🎯 EACH CHUNK must be independently useful without requiring full document context
- ⚠️ ABSOLUTELY NO TIME ESTIMATES - AI development speed has fundamentally changed
- ✅ YOU MUST ALWAYS SPEAK OUTPUT In your Agent communication style with the config `{communication_language}`

## EXECUTION PROTOCOLS:

- 🎯 Show your analysis before taking any action
- 📝 Focus on creating atomic, self-contained knowledge chunks
- ⚠️ Present A/P/C menu after each major category
- 💾 ONLY save when user chooses C (Continue)
- 📖 Update frontmatter with completed categories
- 🚫 FORBIDDEN to load next step until all categories are indexed

## COLLABORATION MENUS (A/P/C):

This step will generate content and present choices for each knowledge category:

- **A (Advanced Elicitation)**: Use discovery protocols to explore nuanced knowledge relationships
- **P (Party Mode)**: Bring multiple perspectives to identify missing knowledge connections
- **C (Continue)**: Save the current chunks and proceed to next category

## PROTOCOL INTEGRATION:

- When 'A' selected: Execute {project-root}/_bmad/core/workflows/advanced-elicitation/workflow.xml
- When 'P' selected: Execute {project-root}/_bmad/core/workflows/party-mode/workflow.md
- PROTOCOLS always return to display this step's A/P/C menu after the A or P have completed
- User accepts/rejects protocol changes before proceeding

## CONTEXT BOUNDARIES:

- Discovery catalog from step-1 is available
- All artifact paths and classifications are identified
- Focus on creating chunks optimized for embedding and retrieval
- Each chunk must carry enough context to be useful in isolation

## YOUR TASK:

Index each discovered artifact into self-contained knowledge chunks with metadata tags, source tracing, and retrieval-optimized formatting.

## CHUNKING PRINCIPLES:

### Chunk Design Rules

1. **Self-Contained**: Each chunk must be understandable without reading the source document
2. **Tagged**: Every chunk has category, priority, source path, and semantic tags
3. **Atomic**: One concept or decision per chunk - no compound knowledge
4. **Traceable**: Every chunk links back to its source artifact and section
5. **Contextual**: Include enough surrounding context for accurate retrieval
6. **Deduplicated**: Avoid redundant chunks across different source artifacts

### Chunk Format

Each chunk follows this standard format:

```markdown
### [CHUNK-ID] Chunk Title

- **Source:** `{relative_path_to_source_file}`
- **Category:** architecture | requirements | implementation | domain | operations | quality
- **Priority:** critical | high | standard | reference
- **Tags:** comma-separated semantic tags for retrieval matching

**Context:** One-line description of when this knowledge is relevant.

**Content:**
The actual knowledge content - specific, actionable, self-contained.
```

## INDEXING SEQUENCE:

### 1. Index Critical-Priority Artifacts

Process all artifacts marked as `critical` priority first:

**For each critical artifact:**

- Read the complete source file
- Identify distinct knowledge units (decisions, rules, constraints)
- Create one chunk per knowledge unit
- Apply semantic tags for retrieval matching
- Present chunks to user for validation

**Present results:**
"I've created {{chunk_count}} critical-priority chunks from {{source_count}} sources:

{{list_of_chunk_titles_with_tags}}

These chunks will be prioritized in every retrieval query.

[A] Advanced Elicitation - Explore deeper knowledge connections
[P] Party Mode - Review from multiple implementation perspectives
[C] Continue - Save these chunks and proceed"

### 2. Index High-Priority Artifacts

Process all `high` priority artifacts:

**For each high-priority artifact:**

- Read source file and identify knowledge units
- Create chunks with appropriate tags
- Cross-reference with critical chunks for consistency
- Identify any overlaps and deduplicate

### 3. Index Standard-Priority Artifacts

Process `standard` priority artifacts:

**For each standard artifact:**

- Read source file for domain-specific knowledge
- Create chunks focused on contextual information
- Tag for specific retrieval scenarios

### 4. Index Reference-Priority Artifacts

Process `reference` priority artifacts:

**For each reference artifact:**

- Extract background context and terminology
- Create lighter-weight chunks for supplementary retrieval
- Tag for broad topic matching

### 5. Cross-Reference and Deduplicate

After all categories are indexed:

**Deduplication Analysis:**

- Identify chunks with overlapping content across sources
- Merge or consolidate redundant chunks
- Ensure cross-references between related chunks are tagged
- Present deduplication summary to user

**Relationship Mapping:**

- Identify chunks that frequently co-occur in implementation contexts
- Tag related chunks for retrieval grouping
- Create chunk clusters for common query patterns

### 6. Generate Knowledge Index Document

Compile all validated chunks into the knowledge index file:

**Document Structure:**

```markdown
# Knowledge Index for {{project_name}}

_RAG-optimized knowledge base for AI agent retrieval. Each chunk is self-contained and tagged for semantic search._

---

## Index Summary

- **Total Chunks:** {{total_count}}
- **Critical:** {{critical_count}} | **High:** {{high_count}} | **Standard:** {{standard_count}} | **Reference:** {{ref_count}}
- **Sources Indexed:** {{source_count}}
- **Last Synced:** {{date}}

---

## Critical Knowledge

{{critical_chunks}}

## Architecture Knowledge

{{architecture_chunks}}

## Requirements Knowledge

{{requirements_chunks}}

## Implementation Knowledge

{{implementation_chunks}}

## Domain Knowledge

{{domain_chunks}}

## Operations Knowledge

{{operations_chunks}}

## Quality Knowledge

{{quality_chunks}}
```

### 7. Present Indexing Summary

"Knowledge indexing complete for {{project_name}}!

**Chunks Created:**

| Category | Critical | High | Standard | Reference | Total |
|---|---|---|---|---|---|
| Architecture | {{n}} | {{n}} | {{n}} | {{n}} | {{n}} |
| Requirements | {{n}} | {{n}} | {{n}} | {{n}} | {{n}} |
| Implementation | {{n}} | {{n}} | {{n}} | {{n}} | {{n}} |
| Domain | {{n}} | {{n}} | {{n}} | {{n}} | {{n}} |
| Operations | {{n}} | {{n}} | {{n}} | {{n}} | {{n}} |
| Quality | {{n}} | {{n}} | {{n}} | {{n}} | {{n}} |

**Deduplication:** Removed {{removed_count}} redundant chunks
**Cross-References:** {{xref_count}} chunk relationships mapped

[C] Continue to optimization"

## SUCCESS METRICS:

✅ All discovered artifacts indexed into self-contained chunks
✅ Each chunk has proper metadata tags and source tracing
✅ No redundant or overlapping chunks remain
✅ Cross-references between related chunks are mapped
✅ A/P/C menu presented and handled correctly for each category
✅ Knowledge index document properly structured

## FAILURE MODES:

❌ Creating chunks that require reading the full source document
❌ Missing semantic tags that prevent accurate retrieval
❌ Not deduplicating overlapping chunks from different sources
❌ Not cross-referencing related knowledge units
❌ Not getting user validation for each category
❌ Creating overly large chunks that reduce retrieval precision

## NEXT STEP:

After completing all categories and user selects [C], load `{project-root}/_bmad/bmm/workflows/4-implementation/genai-knowledge-sync/steps/step-03-optimize.md` to optimize the knowledge base for retrieval quality.

Remember: Do NOT proceed to step-03 until all categories are indexed and user explicitly selects [C]!