OCR to Excel Workflow - Next Steps for Testing
Created: 2025-10-18
Status: Ready for testing after PR #764 is merged
Related PR: https://github.com/bmad-code-org/BMAD-METHOD/pull/764
Related Issue: https://github.com/bmad-code-org/BMAD-METHOD/issues/763
Current Status
✅ COMPLETE:
- Phase 1: Infrastructure (agent, workflow config, templates, docs)
- Phase 2: OCR & File Processing implementation
- Phase 3: Data Parsing & Validation implementation
- Phase 4: Excel Integration (placeholder - needs library)
- Phase 5: Batch Processing implementation
- Code committed and PR created
⏳ PENDING:
- Phase 6: Testing & Documentation (this document)
- Real-world testing with actual data
Prerequisites for Testing
Before starting the test session, ensure:
- PR #764 is merged to v6-alpha branch
- OpenRouter API key is ready
  export OPENROUTER_API_KEY="your-api-key-here"
- Test data available:
  - Master Excel file:
    /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx
  - Source files:
    /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021/ (~2400 files)
Test Plan: Phase 6 Implementation
Step 1: Install Dependencies (15 minutes)
# Navigate to project root
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD
# Install required npm packages
npm install --save xlsx pdf-parse @kenjiuno/msgreader
# Verify installation
npm list xlsx pdf-parse @kenjiuno/msgreader
Expected Output:
├── xlsx@0.18.5
├── pdf-parse@1.1.1
└── @kenjiuno/msgreader@2.0.0
Step 2: Create Test Configuration (10 minutes)
Create test-config.yaml based on your real data:
# OCR to Excel Test Configuration
name: "Daily Sales Report Extraction - Test"
description: "Test configuration for 2021 sales reports"
# API Configuration
api:
provider: openrouter
model: 'mistral/pixtral-large-latest'
api_key: ${OPENROUTER_API_KEY}
# File Paths
paths:
source_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021'
master_file: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx'
processed_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/processed/done'
backup_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/backups'
log_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs'
# File Types
file_types:
- pdf
- xlsx
- xls
- msg
# Extraction Fields (customize based on your actual documents)
extraction_fields:
- name: date
type: date
format: 'YYYY-MM-DD'
required: true
description: 'Sales report date'
- name: store_name
type: string
required: true
description: 'Tenant/store name'
- name: sales_amount
type: currency
required: true
description: 'Total daily sales'
- name: part_timer_name
type: string
required: false
description: 'Part timer employee name'
# Processing Configuration
processing:
batch_size: 10
parallel_limit: 3
confidence_threshold: 0.85
pause_on_low_confidence: true
# Logging
logging:
level: 'info'
log_to_console: true
log_to_file: true
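Note that YAML does not expand `${OPENROUTER_API_KEY}` on its own; whatever loads this config must substitute environment variables after parsing. A minimal sketch of that substitution step (the function name is illustrative, not an existing helper in the repo):

```javascript
// Replace ${VAR_NAME} placeholders in a config string with values from
// process.env. Throws if a referenced variable is unset, so a missing
// API key fails fast instead of producing silent 401s later.
function substituteEnvVars(value) {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (match, name) => {
    const resolved = process.env[name];
    if (resolved === undefined) {
      throw new Error(`Environment variable ${name} is not set`);
    }
    return resolved;
  });
}
```

Applied to the parsed config, `substituteEnvVars(config.api.api_key)` would resolve the placeholder above to the exported key.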
Step 3: Small Batch Test (20 minutes)
Test with a small batch first (5-10 files):
# Create a test folder with just a few files
mkdir -p /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/test-batch
# Note: piping cp through head does NOT limit the copy; list the files first, then copy only the first 10
ls /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021/01.\ Jan\ 2021/*.pdf | head -10 | xargs -I {} cp {} /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/test-batch/
# Update test-config.yaml to use test-batch folder
# Then run the workflow
Testing Checklist:
- API key loads correctly from environment
- Files are discovered successfully
- OCR API calls succeed
- Data extraction produces reasonable results
- Confidence scoring works
- Low-confidence items are flagged
- Validation UI appears for low-confidence items
- Excel backup is created before writing
- Data is written to master Excel file
- Processed files are moved to done folder
- Processing log is created
- Report is generated
Expected Issues:
- Excel library integration - Currently placeholder, needs actual xlsx integration
- MSG parsing - Placeholder, needs @kenjiuno/msgreader integration
- Field extraction patterns - May need tuning based on actual document formats
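The confidence checks in the checklist above can be exercised in isolation with a small helper. This is a sketch only; the `{ value, confidence }` field shape is an assumption about what the parser returns, not the confirmed task-module output:

```javascript
// Flag an extracted record for manual review when any field's confidence
// falls below the configured threshold (0.85 in test-config.yaml).
// The { value, confidence } per-field shape is an assumed parser format.
function flagLowConfidence(fields, threshold = 0.85) {
  const lowFields = Object.entries(fields)
    .filter(([, field]) => field.confidence < threshold)
    .map(([name]) => name);
  return { needsReview: lowFields.length > 0, lowFields };
}
```

During the small batch test, running flagged records through a helper like this makes it easy to verify that "Low-confidence items are flagged" behaves as expected before the interactive validation UI exists.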
Step 4: Fix Excel Integration (30 minutes)
Update task-excel-writer.js to use actual xlsx library:
Current (placeholder):
async function appendToExcel(config, dataRows) {
const backup = await createBackup(masterFile, backupFolder);
// TODO: Actual Excel writing implementation
return { success: true, rowsWritten: dataRows.length };
}
Needs to become:
const XLSX = require('xlsx');
async function appendToExcel(config, dataRows) {
const { masterFile, backupFolder } = config.paths;
// Create backup first
const backup = await createBackup(masterFile, backupFolder);
try {
// Read existing workbook
const workbook = XLSX.readFile(masterFile);
const sheetName = workbook.SheetNames[0];
const worksheet = workbook.Sheets[sheetName];
// Convert worksheet to JSON
const existingData = XLSX.utils.sheet_to_json(worksheet);
// Append new rows
const updatedData = [...existingData, ...dataRows];
// Convert back to worksheet
const newWorksheet = XLSX.utils.json_to_sheet(updatedData);
workbook.Sheets[sheetName] = newWorksheet;
// Write to file
XLSX.writeFile(workbook, masterFile);
return {
success: true,
rowsWritten: dataRows.length,
totalRows: updatedData.length,
backupPath: backup
};
} catch (error) {
// Restore from backup on error
await restoreBackup(backup, masterFile);
throw error;
}
}
Step 5: Tune Field Extraction (45 minutes)
Based on test results, you may need to:
- Analyze sample OCR output:
  # Check processing logs to see what OCR actually returns
  cat /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json | jq '.processedFiles[0].data'
- Adjust regex patterns in task-data-parser.js:
  - Date format patterns
  - Currency extraction patterns
  - Store name patterns
- Add custom extraction prompts:
  - Make prompts more specific to your document format
  - Add examples in the prompt for better accuracy
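As a starting point for the pattern adjustments, something like the following could live in task-data-parser.js. The formats assumed here ("01/15/2021" dates, "RM 1,234.50" amounts) are guesses; replace them with whatever the sample OCR output in the logs actually shows:

```javascript
// Assumed input formats -- tune against real OCR output from the logs.
const DATE_MMDDYYYY = /\b(\d{2})\/(\d{2})\/(\d{4})\b/;
const CURRENCY = /(?:RM|MYR|\$)\s*([\d,]+(?:\.\d{2})?)/;

// Normalize a matched MM/DD/YYYY date to the configured YYYY-MM-DD format.
function normalizeDate(text) {
  const m = text.match(DATE_MMDDYYYY);
  return m ? `${m[3]}-${m[1]}-${m[2]}` : null;
}

// Strip thousands separators and parse the matched amount as a number.
function parseCurrency(text) {
  const m = text.match(CURRENCY);
  return m ? parseFloat(m[1].replace(/,/g, '')) : null;
}
```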
Step 6: Medium Batch Test (30 minutes)
Test with ~50-100 files:
Testing Focus:
- Parallel processing works correctly
- Progress tracking is accurate
- Memory usage stays stable
- API rate limits are respected
- Error recovery works (simulate failures)
- Batch statistics are correct
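To sanity-check the parallel-processing behaviour in isolation, the `parallel_limit` semantics can be sketched as a small concurrency limiter. This is a minimal illustration; the real batch processor may use a library such as p-limit instead:

```javascript
// Run async jobs with at most `limit` in flight at once, mirroring the
// parallel_limit: 3 setting from test-config.yaml. Results are returned
// in the original job order regardless of completion order.
async function runWithLimit(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0;
  async function worker() {
    while (next < jobs.length) {
      const i = next++; // safe: no await between check and increment
      results[i] = await jobs[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, jobs.length) }, worker);
  await Promise.all(workers);
  return results;
}
```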
Step 7: Full Batch Test (2-3 hours)
Process all ~2400 files:
Before running:
- Ensure sufficient OpenRouter credits
- Verify disk space for backups and logs
- Close Excel file if open
- Set up monitoring (check CPU/memory periodically)
Monitoring:
# Monitor progress
tail -f /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json
# Check memory usage
watch -n 5 'ps aux | grep node'
Step 8: Data Quality Review (1 hour)
After processing:
- Spot-check random samples:
  - Open master Excel file
  - Compare 20-30 random entries with source documents
  - Verify dates, amounts, store names
- Check statistics:
  - Total files processed vs. expected
  - Success rate
  - Average confidence scores
  - Common error patterns
- Review low-confidence items:
  - Check all items flagged for manual review
  - Identify patterns in low-confidence extractions
  - Adjust confidence threshold if needed
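Picking the 20-30 rows for the spot-check can be automated with a small sampler (a sketch; a partial Fisher-Yates shuffle, chosen here because it guarantees distinct indices):

```javascript
// Pick `count` distinct random row indices out of `total` rows,
// e.g. 25 of the ~2400 newly appended rows for manual comparison.
function sampleIndices(total, count, rng = Math.random) {
  const idx = Array.from({ length: total }, (_, i) => i);
  const n = Math.min(count, total);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rng() * (total - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, n);
}
```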
Step 9: Create Unit Tests (2 hours)
Create Jest tests for each task module:
# Install Jest
npm install --save-dev jest
# Create test directory structure
mkdir -p src/modules/bmm/tasks/ocr-extraction/__tests__
Test files to create:
- task-file-scanner.test.js
- task-ocr-process.test.js
- task-data-parser.test.js
- task-excel-writer.test.js
- task-batch-processor.test.js
Example test:
const { parseOCRText } = require('../task-data-parser');
describe('task-data-parser', () => {
describe('parseOCRText', () => {
it('should extract date from OCR text', () => {
const ocrText = 'Sales Report Date: 01/15/2021 Store: ABC Mart';
const fields = [{ name: 'date', type: 'date', required: true }];
const result = parseOCRText(ocrText, fields);
expect(result.isValid).toBe(true);
expect(result.data.date).toBe('2021-01-15');
});
});
});
Step 10: Integration Tests (1 hour)
Create integration tests with mock API:
// Mock OpenRouter API responses
jest.mock('node-fetch', () => jest.fn());
describe('End-to-end workflow', () => {
it('should process a file from OCR to Excel', async () => {
// Setup: Create test file, mock API, prepare config
// Execute: Run batch processor
// Assert: Verify Excel file updated correctly
});
});
Success Criteria
✅ Ready for production when:
- All task modules fully implemented (no placeholders)
- Small batch test (10 files) completes successfully
- Medium batch test (100 files) with 90%+ success rate
- Full batch test (2400 files) completes
- Data quality spot-check shows 95%+ accuracy
- Unit test coverage >80%
- Integration tests pass
- Performance acceptable (<5 sec/file average)
- Memory usage stable (no leaks)
- Documentation updated with findings
Known Issues to Address
- Excel Library Integration
  - Status: Placeholder implementation
  - Priority: High
  - Estimated effort: 30 minutes
- MSG File Parsing
  - Status: Placeholder implementation
  - Priority: Medium
  - Estimated effort: 1 hour
- Interactive Validation UI
  - Status: Placeholder (auto-approves all)
  - Priority: Medium
  - Estimated effort: 1 hour
- Field Extraction Tuning
  - Status: Generic patterns
  - Priority: High
  - Estimated effort: 1-2 hours based on test results
Resources
- OpenRouter Docs: https://openrouter.ai/docs
- xlsx Library: https://www.npmjs.com/package/xlsx
- pdf-parse Library: https://www.npmjs.com/package/pdf-parse
- msgreader Library: https://www.npmjs.com/package/@kenjiuno/msgreader
- Jest Testing: https://jestjs.io/docs/getting-started
Session Checklist
At the start of next session:
- Pull latest changes (if PR merged)
- Review this document
- Set OpenRouter API key
- Check test data is accessible
- Install dependencies (npm install)
- Create test configuration file
- Start with Step 3 (Small Batch Test)
By end of next session (ideal):
- Dependencies installed
- Excel integration implemented
- Small batch test passed
- Medium batch test passed
- Full batch test started (can run overnight)
Follow-up session:
- Review full batch results
- Data quality review
- Unit tests created
- Integration tests created
- Documentation updated
- Mark Phase 6 complete
Quick Start Commands
# Navigate to project
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD
# Set API key
export OPENROUTER_API_KEY="your-key-here"
# Install dependencies
npm install
# Run small batch test (after creating test-config.yaml)
# Use BMAD CLI or agent to trigger workflow
/ocr-to-excel
# Monitor progress
tail -f logs/processing-log-*.json
# Check results
open backups/ # View backups
open processed/done/ # View processed files
open "TM - Daily Sales Report DSR by Part Timers_260225.xlsx" # View master file
Notes
- Keep this document updated as you progress through testing
- Document any issues found and their resolutions
- Note any performance bottlenecks
- Record API costs for ~2400 files
- Save sample OCR outputs for future reference
Last Updated: 2025-10-18
Next Session Goal: Complete Steps 1-6 (through Medium Batch Test)
Estimated Time for Next Session: 3-4 hours