# OCR to Excel - Workflow Instructions

The workflow execution engine is governed by: {project-root}/bmad/core/tasks/workflow.xml

You MUST have already loaded and processed: {installed_path}/workflow.yaml

- Welcome the user to the OCR Data Extraction workflow
- Explain that this workflow will process documents using OCR and extract data to Excel
- Validate that the API key is set
  - If the key is set → Set `api_key_configured = true`
  - If the key is not set → Show the prompt below

**⚠️ OpenRouter API Key Required**

This workflow uses the OpenRouter API for Mistral OCR processing.

Please set your API key as an environment variable:

```bash
export OPENROUTER_API_KEY="your-api-key-here"
```

Options:

1. I've set the API key, continue
2. Exit and set it up first

What would you like to do?

- If user chooses option 1 → Verify key exists, continue
- If user chooses option 2 → HALT with setup instructions

Let's set up your data extraction configuration.

Do you have an existing configuration file, or should we create one?

1. Use existing configuration file
2. Create new configuration with wizard
3. Use default configuration

Which option?

- If user chooses option 1 → Ask for configuration file path, then load and validate configuration
- If user chooses option 2 → Run configuration wizard (steps 2-5)
- If user chooses option 3 → Load default configuration from config-template.yaml

Output: `extraction_config`

**File Paths Configuration**

Please provide the following paths:

1. **Source Documents Folder:** Where are your files to process?
   Example: ./source-documents or /path/to/files
2. **Master Excel File:** Where should extracted data be saved?
   Example: ./master-data.xlsx
3. **Processed Files Folder:** Where to move processed files?
   Example: ./processed/done
4. **Backup Folder:** Where to store Excel backups?
   Example: ./backups

- Validate that source folder exists
- Validate that master Excel file exists or can be created
- Create processed and backup folders if they don't exist

Outputs: `source_folder`, `master_file`, `processed_folder`, `backup_folder`

**Data Fields Configuration**

What fields should I extract from your documents?

Here are common fields for sales reports.
Select all that apply:

- [ ] Date (sales report date)
- [ ] Store/Tenant Name
- [ ] Sales Amount
- [ ] Part Timer/Employee Name
- [ ] Shift Hours
- [ ] Invoice Number
- [ ] Custom field (specify)

Which fields do you need?

- Create extraction_fields configuration based on selections
- For each custom field, ask for name, type, and whether it's required

Output: `extraction_fields`

**Processing Settings**

Configure how files should be processed:

1. **Batch Size:** How many files to process before saving? (default: 10)
2. **Parallel Processing:** Number of concurrent API calls (default: 3, max: 5)
3. **Confidence Threshold:** Auto-approve extractions with confidence >= ? (default: 85%)
4. **File Types:** Which file types to process? (pdf, xlsx, xls, msg)

Use defaults or customize?

Outputs: `batch_size`, `parallel_limit`, `confidence_threshold`, `file_types`

- Generate complete configuration file
- Save to {project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel/extraction-config.yaml

**✅ Configuration Saved**

Your extraction configuration has been saved and is ready to use.

You can edit it anytime at: extraction-config.yaml

Output: `config_file_path`

- Scan the source folder recursively for supported file types
- Filter out already-processed files (check processing log)
- Build processing queue with file metadata

**📁 Document Scan Results**

Found {{total_files_found}} files:

- PDF: {{pdf_count}}
- Excel: {{excel_count}}
- MSG: {{msg_count}}

Already processed: {{already_processed_count}}

Ready to process: {{queue_length}}

Ready to start processing {{queue_length}} files?

1. Yes, start batch processing
2. Show me the file list first
3. Cancel

What would you like to do?
- If user chooses option 2 → Display paginated file list with details
- If user chooses option 3 → HALT

Output: `processing_queue`

- Select next file from queue
- Display progress: "Processing {{current_file_number}}/{{total_files}}: {{filename}}"
- Prepare file for OCR:
  - If PDF: Convert pages to images
  - If Excel: Read and prepare content
  - If MSG: Extract and prepare content
- Call Mistral OCR API via OpenRouter
- Implement retry logic with exponential backoff (max 3 retries)
- Extract OCR text from response
- Parse text into structured data using field mappings
- Calculate confidence score for extraction
- Store extracted data with metadata

If the API call fails:

- Log error details
- Add file to failed_files list

API call failed for {{filename}}: {{error_message}}

Options:

1. Retry this file
2. Skip and continue
3. Pause processing

What would you like to do?

Output: `extraction_results`

- Review all extraction results
- Separate into high_confidence and low_confidence groups

**🎯 Extraction Complete**

Successfully extracted: {{successful_count}} files

- High confidence (>= {{confidence_threshold}}%): {{high_confidence_count}}
- Low confidence (< {{confidence_threshold}}%): {{low_confidence_count}}
- Failed: {{failed_count}}

**⚠️ Low Confidence Extractions Detected**

{{low_confidence_count}} files have confidence scores below {{confidence_threshold}}%. These require human review before saving to Excel.

Options:

1. Review and correct each one
2. Auto-approve all (risky)
3. Skip low-confidence files for now

What would you like to do?

- If user chooses option 1 → Present each low-confidence extraction for review
- If user chooses option 2 → Mark all as approved with warning
- If user chooses option 3 → Move to failed_files list

**📄 File:** {{filename}}

**Confidence:** {{confidence_score}}%

**Extracted Data:** {{extracted_fields_display}}

**Raw OCR Text:** {{ocr_text}}

Please review and correct if needed:

1. Approve as-is
2. Edit field values
3. Skip this file
4. View raw OCR text

What would you like to do?
- If user chooses option 2 → Prompt for corrected values for each field, then update extraction_results with corrected data
- If user chooses option 3 → Move to skipped_files list

Output: `validated_extractions`

Check if master Excel file exists.

If the file exists:

- Create timestamped backup: master-file-YYYYMMDD-HHMMSS.xlsx
- Save backup to backup_folder

✅ Created backup: {{backup_filename}}

If the file does not exist:

Master Excel file does not exist: {{master_file}}

Would you like to:

1. Create new Excel file
2. Specify different file
3. Cancel

What would you like to do?

- If user chooses option 1 → Create new Excel file with proper headers
- If user chooses option 2 → Ask for new file path, validate, continue
- If user chooses option 3 → HALT

Output: `backup_file_path`

- Load master Excel file
- Validate that sheet exists (create if configured to do so)
- Validate that headers match configured fields
- For each validated extraction:
  - Append new row with extracted data
  - Add metadata (source_file, processed_date, confidence)
  - Format cells according to field types
- Save Excel file with atomic write operation

✅ Successfully wrote {{row_count}} rows to {{master_file}}

⚠️ Failed to write to Excel file: {{error_message}}

Options:

1. Restore from backup and retry
2. Save extraction results to JSON file
3. Cancel

What would you like to do?
- If user chooses option 1 → Restore backup, retry write operation
- If user chooses option 2 → Save results to {{output_folder}}/extraction-results-{{date}}.json

Output: `excel_write_log`

For each successfully processed file:

- Determine target path in processed_folder
- Maintain original folder structure if configured
- Move file to processed_folder
- Log file movement

📦 Moved {{moved_file_count}} files to {{processed_folder}}

Output: `file_movement_log`

- Compile processing statistics
- Generate extraction report using template.md
- Save report to {{output_folder}}/extraction-results-{{date}}.md
- Create processing log with detailed audit trail
- Save log to {{log_folder}}/processing-log-{{date}}.json

Outputs: `processing_report`, `processing_log`

**✅ Batch Processing Complete!**

**Summary:**

- Total files processed: {{total_files_processed}}
- Successfully extracted: {{successful_count}}
- Failed/skipped: {{failed_count}}
- Average confidence: {{average_confidence}}%
- Total processing time: {{total_duration}}

**Files Updated:**

- Master Excel: {{master_file}} ({{row_count}} new rows)
- Backup: {{backup_file_path}}
- Processing Report: {{processing_report}}
- Processing Log: {{processing_log}}

**Next Steps:**

1. Review extraction report: {{processing_report}}
2. Validate data in master Excel file
3. Process failed files manually if needed ({{failed_count}} files)
4. Archive processed files: {{processed_folder}}

**⚠️ Failed Files:**

The following files could not be processed:

{{failed_files_list}}

See processing log for details: {{processing_log}}

Would you like to:

1. View the extraction report
2. Process failed files
3. Start another batch
4. Exit

What would you like to do?

- If user chooses option 1 → Display report content
- If user chooses option 2 → Restart workflow with failed_files as queue
- If user chooses option 3 → Restart workflow from step 6
- If user chooses option 4 → HALT
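
Three of the mechanical steps in this workflow (retrying the OCR call with exponential backoff, splitting results by the confidence threshold, and naming the timestamped backup) could be sketched in Python roughly as follows. This is a minimal sketch under assumptions, not the workflow's actual implementation: the function names, the `base_delay` of one second, the `"confidence"` result key, and the reading of `master-file` as the master file's own name are all hypothetical. Only the limit of 3 retries, the 85% default threshold, and the `YYYYMMDD-HHMMSS` pattern come from the instructions above.

```python
import time
from datetime import datetime
from pathlib import Path


def call_with_backoff(call, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Run call(); on failure, retry up to max_retries times, doubling the wait each time."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except Exception:
            if attempt == max_retries:
                # Retries exhausted: re-raise so the caller can log the error
                # and add the file to failed_files.
                raise
            # Waits base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** attempt)


def split_by_confidence(results, threshold=85):
    """Separate extraction results into (high_confidence, low_confidence) lists."""
    high = [r for r in results if r["confidence"] >= threshold]
    low = [r for r in results if r["confidence"] < threshold]
    return high, low


def backup_filename(master_file, now=None):
    """Build a backup name of the form <master name>-YYYYMMDD-HHMMSS<extension>."""
    now = now or datetime.now()
    path = Path(master_file)
    return f"{path.stem}-{now.strftime('%Y%m%d-%H%M%S')}{path.suffix}"
```

A caller would wrap whatever function performs the OCR request for a single file, for example `call_with_backoff(lambda: run_ocr(path))`, where `run_ocr` is a placeholder for that request. The injectable `sleep` parameter exists only so the delays can be checked without actually waiting.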