# OCR to Excel - Workflow Instructions
The workflow execution engine is governed by: {project-root}/bmad/core/tasks/workflow.xml
You MUST have already loaded and processed: {installed_path}/workflow.yaml
Welcome the user to the OCR Data Extraction workflow
Explain that this workflow will process documents using OCR and extract data to Excel
Validate that the OPENROUTER_API_KEY environment variable is set
Set api_key_configured = true
**⚠️ OpenRouter API Key Required**
This workflow uses the OpenRouter API for Mistral OCR processing.
Please set your API key as an environment variable:
```bash
export OPENROUTER_API_KEY="your-api-key-here"
```
Options:
1. I've set the API key, continue
2. Exit and set it up first
What would you like to do?
If user chooses option 1 → Verify key exists, continue
If user chooses option 2 → HALT with setup instructions
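The key check above could be sketched as follows. This is a minimal illustration, not the workflow engine's own code: the function name `api_key_configured` and the optional `env` parameter are assumptions added for testability.

```python
import os


def api_key_configured(env=None):
    """Return True when OPENROUTER_API_KEY is present and non-empty.

    `env` is a hypothetical parameter that lets callers pass a mapping
    other than the real process environment (useful in tests).
    """
    env = os.environ if env is None else env
    # Treat a whitespace-only value the same as a missing key.
    return bool(env.get("OPENROUTER_API_KEY", "").strip())
```

The result can be stored in `api_key_configured` before the workflow continues to the configuration steps.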
Let's set up your data extraction configuration.
Do you have an existing configuration file, or should we create one?
1. Use existing configuration file
2. Create new configuration with wizard
3. Use default configuration
Which option?
If user chooses option 1 → Ask for configuration file path, then load and validate configuration
If user chooses option 2 → Run configuration wizard (steps 2-5)
If user chooses option 3 → Load default configuration from config-template.yaml
extraction_config
**File Paths Configuration**
Please provide the following paths:
1. **Source Documents Folder:** Where are your files to process?
Example: ./source-documents or /path/to/files
2. **Master Excel File:** Where should extracted data be saved?
Example: ./master-data.xlsx
3. **Processed Files Folder:** Where to move processed files?
Example: ./processed/done
4. **Backup Folder:** Where to store Excel backups?
Example: ./backups
Validate that source folder exists
Validate that master Excel file exists or can be created
Create processed and backup folders if they don't exist
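One way to express the three path checks above is sketched here. The function name `prepare_paths` and the returned dictionary are hypothetical; "can be created" is interpreted, as an assumption, to mean that the master file's parent folder already exists.

```python
from pathlib import Path


def prepare_paths(source_folder, master_file, processed_folder, backup_folder):
    """Validate the source folder and master file, create the output folders."""
    source = Path(source_folder)
    if not source.is_dir():
        raise FileNotFoundError(f"Source folder not found: {source}")

    master = Path(master_file)
    # Assumption: the master file "can be created" if its parent folder exists.
    if not master.exists() and not master.parent.is_dir():
        raise FileNotFoundError(f"Cannot create master file in: {master.parent}")

    # Create the processed and backup folders if they do not exist yet.
    for folder in (processed_folder, backup_folder):
        Path(folder).mkdir(parents=True, exist_ok=True)

    return {
        "source_folder": source,
        "master_file": master,
        "processed_folder": Path(processed_folder),
        "backup_folder": Path(backup_folder),
    }
```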
source_folder
master_file
processed_folder
backup_folder
**Data Fields Configuration**
What fields should I extract from your documents?
Here are common fields for sales reports. Select all that apply:
- [ ] Date (sales report date)
- [ ] Store/Tenant Name
- [ ] Sales Amount
- [ ] Part Timer/Employee Name
- [ ] Shift Hours
- [ ] Invoice Number
- [ ] Custom field (specify)
Which fields do you need?
Create extraction_fields configuration based on selections
For each custom field, ask for name, type, and whether it's required
extraction_fields
**Processing Settings**
Configure how files should be processed:
1. **Batch Size:** How many files to process before saving? (default: 10)
2. **Parallel Processing:** Number of concurrent API calls (default: 3, max: 5)
3. **Confidence Threshold:** Auto-approve extractions with confidence >= ? (default: 85%)
4. **File Types:** Which file types to process? (pdf, xlsx, xls, msg)
Use defaults or customize?
batch_size
parallel_limit
confidence_threshold
file_types
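The processing settings and their stated defaults could be captured in a small validated structure. The class name `ProcessingSettings` is an assumption; only the defaults and the maximum of 5 concurrent calls come from the text above, and the other range checks are illustrative.

```python
from dataclasses import dataclass


@dataclass
class ProcessingSettings:
    batch_size: int = 10            # files processed before saving
    parallel_limit: int = 3         # concurrent API calls, max 5
    confidence_threshold: int = 85  # percent; auto-approve at or above this
    file_types: tuple = ("pdf", "xlsx", "xls", "msg")

    def __post_init__(self):
        if self.batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        if not 1 <= self.parallel_limit <= 5:
            raise ValueError("parallel_limit must be between 1 and 5")
        if not 0 <= self.confidence_threshold <= 100:
            raise ValueError("confidence_threshold must be a percentage")
```

Calling `ProcessingSettings()` with no arguments corresponds to the "use defaults" answer.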
Generate complete configuration file
Save to {project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel/extraction-config.yaml
config_file_path
Scan the source folder recursively for supported file types
Filter out already-processed files (check processing log)
Build processing queue with file metadata
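The scan, filter, and queue-building steps above might look like the sketch below. The processing log format is not specified here, so it is represented, as an assumption, by a set of already-processed relative paths; `build_queue` and the metadata keys are hypothetical names.

```python
from pathlib import Path


def build_queue(source_folder, file_types, processed_paths):
    """Recursively collect supported files that are not yet processed."""
    # Normalize "pdf" and ".PDF" alike to ".pdf" for comparison.
    extensions = {"." + t.lower().lstrip(".") for t in file_types}
    queue = []
    for path in sorted(Path(source_folder).rglob("*")):
        if not path.is_file() or path.suffix.lower() not in extensions:
            continue
        relative = path.relative_to(source_folder).as_posix()
        if relative in processed_paths:
            continue  # already recorded in the processing log
        queue.append({
            "path": path,
            "relative_path": relative,
            "size_bytes": path.stat().st_size,
        })
    return queue
```

`len(build_queue(...))` would supply the `{{queue_length}}` value shown in the prompt that follows.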
Ready to start processing {{queue_length}} files?
1. Yes, start batch processing
2. Show me the file list first
3. Cancel
What would you like to do?
If user chooses option 2 → Display paginated file list with details
If user chooses option 3 → HALT
processing_queue
Select next file from queue
Display progress: "Processing {{current_file_number}}/{{total_files}}: {{filename}}"
Prepare file for OCR:
- If PDF: Convert pages to images
- If Excel: Read and prepare content
- If MSG: Extract and prepare content
Call Mistral OCR API via OpenRouter
Implement retry logic with exponential backoff (max 3 retries)
Extract OCR text from response
Parse text into structured data using field mappings
Calculate confidence score for extraction
Store extracted data with metadata
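The retry step above can be sketched independently of the actual API call. This example wraps any callable, so it makes no claims about the OpenRouter request format; `call_with_retries`, `base_delay`, and the injectable `sleep` are assumptions. "Max 3 retries" is read as one initial attempt plus up to three retries.

```python
import time


def call_with_retries(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run `call`, retrying on failure with exponentially growing delays."""
    attempt = 0
    while True:
        try:
            return call()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: let the caller log and handle it
            # Delay doubles each time: base_delay, 2x, 4x, ...
            sleep(base_delay * (2 ** attempt))
            attempt += 1
```

In practice it is usually better to catch only the error types that signal a transient failure (timeouts, rate limits) rather than every exception.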
If the API call still fails after all retries → Log error details
Add file to failed_files list
API call failed for {{filename}}: {{error_message}}
Options:
1. Retry this file
2. Skip and continue
3. Pause processing
What would you like to do?
extraction_results
Review all extraction results
Separate into high_confidence and low_confidence groups
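The grouping step above is a simple partition on the confidence score. In this sketch each result is assumed to be a dictionary with a `confidence` key holding a percentage; that shape is hypothetical.

```python
def split_by_confidence(results, threshold=85):
    """Partition results into (high_confidence, low_confidence) lists."""
    high_confidence, low_confidence = [], []
    for result in results:
        # Scores at or above the threshold are auto-approved.
        if result["confidence"] >= threshold:
            high_confidence.append(result)
        else:
            low_confidence.append(result)
    return high_confidence, low_confidence
```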
**⚠️ Low Confidence Extractions Detected**
{{low_confidence_count}} files have confidence scores below {{confidence_threshold}}%.
These require human review before saving to Excel.
Options:
1. Review and correct each one
2. Auto-approve all (risky)
3. Skip low-confidence files for now
What would you like to do?
If user chooses option 1 → Present each low-confidence extraction for review
If user chooses option 2 → Mark all as approved with warning
If user chooses option 3 → Move to failed_files list
Please review and correct if needed:
1. Approve as-is
2. Edit field values
3. Skip this file
4. View raw OCR text
What would you like to do?
If user chooses option 2 → Prompt for corrected values for each field, then update extraction_results with corrected data
If user chooses option 3 → Move to skipped_files list
validated_extractions
Check if master Excel file exists
Create timestamped backup: master-file-YYYYMMDD-HHMMSS.xlsx
Save backup to backup_folder
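A possible sketch of the backup step follows. It assumes that "master-file" in the naming pattern stands for the master file's own name without its extension; the function name `backup_master` and the `now` parameter are hypothetical.

```python
import shutil
from datetime import datetime
from pathlib import Path


def backup_master(master_file, backup_folder, now=None):
    """Copy the master file to backup_folder with a timestamp in its name."""
    master = Path(master_file)
    # YYYYMMDD-HHMMSS, matching the naming pattern above.
    stamp = (now or datetime.now()).strftime("%Y%m%d-%H%M%S")
    target = Path(backup_folder) / f"{master.stem}-{stamp}{master.suffix}"
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(master, target)  # copy2 also preserves file timestamps
    return target
```

The returned path is what would be recorded as `backup_file_path`.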
Master Excel file does not exist: {{master_file}}
Would you like to:
1. Create new Excel file
2. Specify different file
3. Cancel
What would you like to do?
If user chooses option 1 → Create new Excel file with proper headers
If user chooses option 2 → Ask for new file path, validate, continue
If user chooses option 3 → HALT
backup_file_path
Load master Excel file
Validate that sheet exists (create if configured to do so)
Validate that headers match configured fields
For each validated extraction:
- Append new row with extracted data
- Add metadata (source_file, processed_date, confidence)
- Format cells according to field types
Save Excel file with atomic write operation
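An atomic write is commonly done by saving to a temporary file in the same folder and then renaming it over the target. The sketch below shows that pattern; it is an assumption about how the step could be implemented, and `atomic_save` and its `save` callback (any function that writes the workbook to a given path) are hypothetical names.

```python
import os
import tempfile
from pathlib import Path


def atomic_save(target, save):
    """Write via `save(temp_path)`, then atomically replace `target`."""
    target = Path(target)
    # The temp file must be on the same filesystem for the rename to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=".tmp-", suffix=target.suffix
    )
    os.close(fd)
    try:
        save(tmp_name)
        os.replace(tmp_name, target)  # readers see the old or new file, never a partial one
    except BaseException:
        # On any failure, remove the partial temp file and keep the original.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
```

If `save` raises partway through, the existing master file is left untouched, which keeps the "restore from backup" option as a last resort rather than the normal recovery path.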
If the write operation fails, offer these options:
1. Restore from backup and retry
2. Save extraction results to JSON file
3. Cancel
What would you like to do?
If user chooses option 1 → Restore backup, retry write operation
If user chooses option 2 → Save results to {{output_folder}}/extraction-results-{{date}}.json
excel_write_log
For each successfully processed file:
- Determine target path in processed_folder
- Maintain original folder structure if configured
- Move file to processed_folder
- Log file movement
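The file movement steps above could be sketched as follows. The function name `move_processed`, the `keep_structure` flag, and the shape of the returned log entry are assumptions.

```python
import shutil
from pathlib import Path


def move_processed(file_path, source_folder, processed_folder, keep_structure=True):
    """Move one processed file and return a log entry describing the move."""
    src = Path(file_path)
    # Keep the path relative to the source folder, or flatten to the file name.
    relative = src.relative_to(source_folder) if keep_structure else Path(src.name)
    target = Path(processed_folder) / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(src), str(target))
    return {"source": str(src), "target": str(target)}
```

Collecting the returned entries in a list gives a simple `file_movement_log`.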
file_movement_log
Compile processing statistics
Generate extraction report using template.md
Save report to {{output_folder}}/extraction-results-{{date}}.md
Create processing log with detailed audit trail
Save log to {{log_folder}}/processing-log-{{date}}.json
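A minimal sketch of compiling statistics and saving the processing log is shown below. The log schema, the function name `write_processing_log`, and the ISO format used for `{{date}}` are all assumptions, since the text above does not define them.

```python
import json
from datetime import date
from pathlib import Path


def write_processing_log(log_folder, processed, failed, skipped, today=None):
    """Write processing-log-<date>.json and return (path, statistics)."""
    stats = {
        "processed": len(processed),
        "failed": len(failed),
        "skipped": len(skipped),
    }
    stats["total"] = stats["processed"] + stats["failed"] + stats["skipped"]
    day = (today or date.today()).isoformat()  # assumed format: YYYY-MM-DD
    path = Path(log_folder) / f"processing-log-{day}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "date": day,
        "statistics": stats,
        "processed_files": list(processed),
        "failed_files": list(failed),
        "skipped_files": list(skipped),
    }
    path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return path, stats
```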
processing_report
processing_log
Would you like to:
1. View the extraction report
2. Process failed files
3. Start another batch
4. Exit
What would you like to do?
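If user chooses option 1 → Display report content
If user chooses option 2 → Restart workflow with failed_files as queue
If user chooses option 3 → Restart workflow from step 6
If user chooses option 4 → HALT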