# OCR to Excel Workflow - Next Steps for Testing

**Created:** 2025-10-18
**Status:** Ready for testing after PR #764 is merged
**Related PR:** https://github.com/bmad-code-org/BMAD-METHOD/pull/764
**Related Issue:** https://github.com/bmad-code-org/BMAD-METHOD/issues/763

## Current Status

✅ **COMPLETE:**

- Phase 1: Infrastructure (agent, workflow config, templates, docs)
- Phase 2: OCR & File Processing implementation
- Phase 3: Data Parsing & Validation implementation
- Phase 4: Excel Integration (placeholder - needs library)
- Phase 5: Batch Processing implementation
- Code committed and PR created

⏳ **PENDING:**

- Phase 6: Testing & Documentation (this document)
- Real-world testing with actual data

## Prerequisites for Testing

Before starting the test session, ensure:

1. **PR #764 is merged** into the `v6-alpha` branch
2. **OpenRouter API key** is ready

   ```bash
   export OPENROUTER_API_KEY="your-api-key-here"
   ```

3. **Test data available:**
   - Master Excel file: `/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx`
   - Source files: `/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021/` (~2400 files)

## Test Plan: Phase 6 Implementation

### Step 1: Install Dependencies (15 minutes)

```bash
# Navigate to project root
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD

# Install required npm packages
npm install --save xlsx pdf-parse @kenjiuno/msgreader

# Verify installation
npm list xlsx pdf-parse @kenjiuno/msgreader
```

**Expected Output** (exact versions may differ):

```
├── xlsx@0.18.5
├── pdf-parse@1.1.1
└── @kenjiuno/msgreader@2.0.0
```

### Step 2: Create Test Configuration (10 minutes)

Create `test-config.yaml` based on your real data:

```yaml
# OCR to Excel Test Configuration
name: "Daily Sales Report Extraction - Test"
description: "Test configuration for 2021 sales reports"

# API Configuration
api:
  provider: openrouter
  # Verify the exact model ID against the OpenRouter model list before running
  model: 'mistral/pixtral-large-latest'
  api_key: ${OPENROUTER_API_KEY}

# File Paths
paths:
  source_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021'
  master_file: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx'
  processed_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/processed/done'
  backup_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/backups'
  log_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs'

# File Types
file_types:
  - pdf
  - xlsx
  - xls
  - msg

# Extraction Fields (customize based on your actual documents)
extraction_fields:
  - name: date
    type: date
    format: 'YYYY-MM-DD'
    required: true
    description: 'Sales report date'

  - name: store_name
    type: string
    required: true
    description: 'Tenant/store name'

  - name: sales_amount
    type: currency
    required: true
    description: 'Total daily sales'

  - name: part_timer_name
    type: string
    required: false
    description: 'Part timer employee name'

# Processing Configuration
processing:
  batch_size: 10
  parallel_limit: 3
  confidence_threshold: 0.85
  pause_on_low_confidence: true

# Logging
logging:
  level: 'info'
  log_to_console: true
  log_to_file: true
```

### Step 3: Small Batch Test (20 minutes)

Test with a small batch first (5-10 files):

```bash
DATA_DIR="/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown"

# Create a test folder with just a few files
mkdir -p "$DATA_DIR/test-batch"

# Copy only the first 10 PDFs from January 2021
find "$DATA_DIR/2021/01. Jan 2021" -maxdepth 1 -name '*.pdf' | head -10 |
  while IFS= read -r file; do cp "$file" "$DATA_DIR/test-batch/"; done

# Update test-config.yaml to use test-batch folder
# Then run the workflow
```

**Testing Checklist:**

- [ ] API key loads correctly from environment
- [ ] Files are discovered successfully
- [ ] OCR API calls succeed
- [ ] Data extraction produces reasonable results
- [ ] Confidence scoring works
- [ ] Low-confidence items are flagged
- [ ] Validation UI appears for low-confidence items
- [ ] Excel backup is created before writing
- [ ] Data is written to master Excel file
- [ ] Processed files are moved to done folder
- [ ] Processing log is created
- [ ] Report is generated

**Expected Issues:**

1. **Excel library integration** - Currently a placeholder, needs actual `xlsx` integration
2. **MSG parsing** - Placeholder, needs `@kenjiuno/msgreader` integration
3. **Field extraction patterns** - May need tuning based on actual document formats

### Step 4: Fix Excel Integration (30 minutes)

Update `task-excel-writer.js` to use the actual `xlsx` library:

**Current (placeholder):**

```javascript
async function appendToExcel(config, dataRows) {
  const backup = await createBackup(masterFile, backupFolder);
  // TODO: Actual Excel writing implementation
  return { success: true, rowsWritten: dataRows.length };
}
```

**Needs to become:**

```javascript
const XLSX = require('xlsx');

async function appendToExcel(config, dataRows) {
  const { masterFile, backupFolder } = config.paths;

  // Create backup first so a failed write can be rolled back
  const backup = await createBackup(masterFile, backupFolder);

  try {
    // Read existing workbook (first sheet only)
    const workbook = XLSX.readFile(masterFile);
    const sheetName = workbook.SheetNames[0];
    const worksheet = workbook.Sheets[sheetName];

    // Convert worksheet to JSON.
    // Note: this round trip keeps cell values only; formatting and
    // formulas in the original sheet are not preserved.
    const existingData = XLSX.utils.sheet_to_json(worksheet);

    // Append new rows
    const updatedData = [...existingData, ...dataRows];

    // Convert back to worksheet
    const newWorksheet = XLSX.utils.json_to_sheet(updatedData);
    workbook.Sheets[sheetName] = newWorksheet;

    // Write to file
    XLSX.writeFile(workbook, masterFile);

    return {
      success: true,
      rowsWritten: dataRows.length,
      totalRows: updatedData.length,
      backupPath: backup
    };
  } catch (error) {
    // Restore from backup on error
    await restoreBackup(backup, masterFile);
    throw error;
  }
}
```

### Step 5: Tune Field Extraction (45 minutes)

Based on test results, you may need to:

1. **Analyze sample OCR output:**

   ```bash
   # Check processing logs to see what OCR actually returns
   cat /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json | jq '.processedFiles[0].data'
   ```

2. **Adjust regex patterns in `task-data-parser.js`:**
   - Date format patterns
   - Currency extraction patterns
   - Store name patterns

3. **Add custom extraction prompts:**
   - Make prompts more specific to your document format
   - Add examples in the prompt for better accuracy

### Step 6: Medium Batch Test (30 minutes)

Test with ~50-100 files:

**Testing Focus:**

- [ ] Parallel processing works correctly
- [ ] Progress tracking is accurate
- [ ] Memory usage stays stable
- [ ] API rate limits are respected
- [ ] Error recovery works (simulate failures)
- [ ] Batch statistics are correct

### Step 7: Full Batch Test (2-3 hours)

Process all ~2400 files:

**Before running:**

- [ ] Ensure sufficient OpenRouter credits
- [ ] Verify disk space for backups and logs
- [ ] Close Excel file if open
- [ ] Set up monitoring (check CPU/memory periodically)

**Monitoring:**

```bash
# Monitor progress
tail -f /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json

# Check memory usage
# (`watch` is not installed by default on macOS; the [n] keeps grep from matching itself)
watch -n 5 'ps aux | grep "[n]ode"'
```

### Step 8: Data Quality Review (1 hour)

After processing:

1. **Spot-check random samples:**
   - Open master Excel file
   - Compare 20-30 random entries with source documents
   - Verify dates, amounts, store names

2. **Check statistics:**
   - Total files processed vs. expected
   - Success rate
   - Average confidence scores
   - Common error patterns

3. **Review low-confidence items:**
   - Check all items flagged for manual review
   - Identify patterns in low-confidence extractions
   - Adjust confidence threshold if needed

### Step 9: Create Unit Tests (2 hours)

Create Jest tests for each task module:

```bash
# Install Jest
npm install --save-dev jest

# Create test directory structure
mkdir -p src/modules/bmm/tasks/ocr-extraction/__tests__
```

**Test files to create:**

- `task-file-scanner.test.js`
- `task-ocr-process.test.js`
- `task-data-parser.test.js`
- `task-excel-writer.test.js`
- `task-batch-processor.test.js`

**Example test:**

```javascript
const { parseOCRText } = require('../task-data-parser');

describe('task-data-parser', () => {
  describe('parseOCRText', () => {
    it('should extract date from OCR text', () => {
      const ocrText = 'Sales Report Date: 01/15/2021 Store: ABC Mart';
      const fields = [{ name: 'date', type: 'date', required: true }];

      const result = parseOCRText(ocrText, fields);

      expect(result.isValid).toBe(true);
      expect(result.data.date).toBe('2021-01-15');
    });
  });
});
```

### Step 10: Integration Tests (1 hour)

Create integration tests with a mocked API:

```javascript
// Mock OpenRouter API responses
jest.mock('node-fetch', () => jest.fn());

describe('End-to-end workflow', () => {
  it('should process a file from OCR to Excel', async () => {
    // Setup: Create test file, mock API, prepare config
    // Execute: Run batch processor
    // Assert: Verify Excel file updated correctly
  });
});
```

## Success Criteria

✅ **Ready for production when:**

- [ ] All task modules fully implemented (no placeholders)
- [ ] Small batch test (10 files) completes successfully
- [ ] Medium batch test (100 files) completes with a 90%+ success rate
- [ ] Full batch test (2400 files) completes
- [ ] Data quality spot-check shows 95%+ accuracy
- [ ] Unit test coverage >80%
- [ ] Integration tests pass
- [ ] Performance acceptable (<5 sec/file average)
- [ ] Memory usage stable (no leaks)
- [ ] Documentation updated with findings

## Known Issues to Address

1. **Excel Library Integration**
   - Status: Placeholder implementation
   - Priority: High
   - Estimated effort: 30 minutes

2. **MSG File Parsing**
   - Status: Placeholder implementation
   - Priority: Medium
   - Estimated effort: 1 hour

3. **Interactive Validation UI**
   - Status: Placeholder (auto-approves all)
   - Priority: Medium
   - Estimated effort: 1 hour

4. **Field Extraction Tuning**
   - Status: Generic patterns
   - Priority: High
   - Estimated effort: 1-2 hours based on test results

## Resources

- **OpenRouter Docs:** https://openrouter.ai/docs
- **xlsx Library:** https://www.npmjs.com/package/xlsx
- **pdf-parse Library:** https://www.npmjs.com/package/pdf-parse
- **msgreader Library:** https://www.npmjs.com/package/@kenjiuno/msgreader
- **Jest Testing:** https://jestjs.io/docs/getting-started

## Session Checklist

**At the start of next session:**

- [ ] Pull latest changes (if PR merged)
- [ ] Review this document
- [ ] Set OpenRouter API key
- [ ] Check test data is accessible
- [ ] Install dependencies (npm install)
- [ ] Create test configuration file
- [ ] Start with Step 3 (Small Batch Test)

**By end of next session (ideal):**

- [ ] Dependencies installed
- [ ] Excel integration implemented
- [ ] Small batch test passed
- [ ] Medium batch test passed
- [ ] Full batch test started (can run overnight)

**Follow-up session:**

- [ ] Review full batch results
- [ ] Data quality review
- [ ] Unit tests created
- [ ] Integration tests created
- [ ] Documentation updated
- [ ] Mark Phase 6 complete

## Quick Start Commands

```bash
# Navigate to project
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD

# Set API key
export OPENROUTER_API_KEY="your-key-here"

# Install dependencies
npm install

# Run small batch test (after creating test-config.yaml)
# Use the BMAD CLI or agent to trigger the workflow with: /ocr-to-excel

# Monitor progress (logs, backups, and processed files live one level up, in MyTown/)
tail -f ../logs/processing-log-*.json

# Check results
open ../backups/ # View backups
open ../processed/done/ # View processed files
open "../TM - Daily Sales Report DSR by Part Timers_260225.xlsx" # View master file
```

## Notes

- Keep this document updated as you progress through testing
- Document any issues found and their resolutions
- Note any performance bottlenecks
- Record API costs for ~2400 files
- Save sample OCR outputs for future reference

---

**Last Updated:** 2025-10-18
**Next Session Goal:** Complete Steps 1-6 (through Medium Batch Test)
**Estimated Time for Next Session:** 3-4 hours
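
The statistics named in Step 8 and in the Success Criteria (files processed, success rate, average confidence, low-confidence count) could be computed directly from a processing log. The sketch below is one possible way to do that, not the workflow's actual implementation. It assumes the log is a JSON object with a `processedFiles` array, as the `jq` command in Step 5 implies; the per-file `status` and `confidence` fields and the `summarizeLog` name are hypothetical and would need adjusting to whatever the real log contains.

```javascript
// Summarize a parsed processing log against the thresholds used in this plan.
// ASSUMPTION: only `processedFiles` comes from the jq command in Step 5;
// the per-file `status` and `confidence` fields are hypothetical.
function summarizeLog(log, confidenceThreshold = 0.85) {
  const files = Array.isArray(log.processedFiles) ? log.processedFiles : [];

  // Files that were processed without error
  const succeeded = files.filter((f) => f.status === 'success');

  // Only successful files with a numeric score count toward the average
  const scored = succeeded.filter((f) => typeof f.confidence === 'number');
  const lowConfidence = scored.filter((f) => f.confidence < confidenceThreshold);
  const confidenceSum = scored.reduce((sum, f) => sum + f.confidence, 0);

  return {
    total: files.length,
    succeeded: succeeded.length,
    successRate: files.length ? succeeded.length / files.length : 0,
    averageConfidence: scored.length ? confidenceSum / scored.length : null,
    lowConfidenceCount: lowConfidence.length,
  };
}

// Usage with an in-memory sample (made-up entries for illustration only)
const sample = {
  processedFiles: [
    { file: 'a.pdf', status: 'success', confidence: 1.0 },
    { file: 'b.pdf', status: 'success', confidence: 0.75 },
    { file: 'c.pdf', status: 'success', confidence: 0.5 },
    { file: 'd.pdf', status: 'failed' },
  ],
};

console.log(summarizeLog(sample));
```

In a real run, the `log` argument would come from `JSON.parse(fs.readFileSync(logPath, 'utf8'))` on one of the files in the log folder, and the resulting `successRate` could be compared against the 90% target for the medium batch test.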