docs: add comprehensive testing guide for next session

Added NEXT_STEPS.md with a detailed testing plan covering:

- Dependency installation
- Test configuration setup
- Small/medium/full batch testing strategy
- Excel library integration implementation
- Field extraction tuning
- Unit and integration test creation
- Success criteria and known issues

Ready for Phase 6 testing in next session.

Related: #763

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Parent: 45c1ce454b
Commit: 24ba3a696d
# OCR to Excel Workflow - Next Steps for Testing

**Created:** 2025-10-18
**Status:** Ready for testing after PR #764 is merged
**Related PR:** https://github.com/bmad-code-org/BMAD-METHOD/pull/764
**Related Issue:** https://github.com/bmad-code-org/BMAD-METHOD/issues/763

## Current Status

✅ **COMPLETE:**

- Phase 1: Infrastructure (agent, workflow config, templates, docs)
- Phase 2: OCR & File Processing implementation
- Phase 3: Data Parsing & Validation implementation
- Phase 4: Excel Integration (placeholder - needs library)
- Phase 5: Batch Processing implementation
- Code committed and PR created

⏳ **PENDING:**

- Phase 6: Testing & Documentation (this document)
- Real-world testing with actual data

## Prerequisites for Testing

Before starting the test session, ensure:

1. **PR #764 is merged** to the v6-alpha branch
2. **OpenRouter API key** is ready:

   ```bash
   export OPENROUTER_API_KEY="your-api-key-here"
   ```

3. **Test data is available:**
   - Master Excel file: `/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx`
   - Source files: `/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021/` (~2400 files)

## Test Plan: Phase 6 Implementation
### Step 1: Install Dependencies (15 minutes)

```bash
# Navigate to project root
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD

# Install required npm packages
npm install --save xlsx pdf-parse @kenjiuno/msgreader

# Verify installation
npm list xlsx pdf-parse @kenjiuno/msgreader
```

**Expected Output:**

```
├── xlsx@0.18.5
├── pdf-parse@1.1.1
└── @kenjiuno/msgreader@2.0.0
```
### Step 2: Create Test Configuration (10 minutes)

Create `test-config.yaml` based on your real data:

```yaml
# OCR to Excel Test Configuration
name: "Daily Sales Report Extraction - Test"
description: "Test configuration for 2021 sales reports"

# API Configuration
api:
  provider: openrouter
  model: 'mistral/pixtral-large-latest'
  api_key: ${OPENROUTER_API_KEY}

# File Paths
paths:
  source_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021'
  master_file: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx'
  processed_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/processed/done'
  backup_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/backups'
  log_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs'

# File Types
file_types:
  - pdf
  - xlsx
  - xls
  - msg

# Extraction Fields (customize based on your actual documents)
extraction_fields:
  - name: date
    type: date
    format: 'YYYY-MM-DD'
    required: true
    description: 'Sales report date'

  - name: store_name
    type: string
    required: true
    description: 'Tenant/store name'

  - name: sales_amount
    type: currency
    required: true
    description: 'Total daily sales'

  - name: part_timer_name
    type: string
    required: false
    description: 'Part timer employee name'

# Processing Configuration
processing:
  batch_size: 10
  parallel_limit: 3
  confidence_threshold: 0.85
  pause_on_low_confidence: true

# Logging
logging:
  level: 'info'
  log_to_console: true
  log_to_file: true
```
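Note that YAML itself does not expand `${OPENROUTER_API_KEY}`; whatever loads this config has to substitute environment variables before parsing. A minimal sketch of that step, assuming a hypothetical `expandEnvVars` helper (not part of the committed workflow):

```javascript
// Sketch of environment-variable substitution for the config file.
// expandEnvVars is an illustrative helper name, not the workflow's API.
function expandEnvVars(text, env = process.env) {
  return text.replace(/\$\{(\w+)\}/g, (match, name) => {
    if (env[name] === undefined) {
      throw new Error(`Missing environment variable: ${name}`);
    }
    return env[name];
  });
}

// Example: expand the raw YAML text before handing it to a YAML parser.
const raw = 'api_key: ${OPENROUTER_API_KEY}';
const expanded = expandEnvVars(raw, { OPENROUTER_API_KEY: 'sk-test' });
// expanded === 'api_key: sk-test'
```

Failing fast on a missing variable means a forgotten `export OPENROUTER_API_KEY` surfaces at startup rather than as a 401 mid-batch.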

### Step 3: Small Batch Test (20 minutes)

Test with a small batch first (5-10 files):

```bash
# Create a test folder with just a few files
mkdir -p /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/test-batch

# Copy the first 10 PDFs. Piping cp into head would not limit the copy,
# so limit the file list before copying instead.
find "/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021/01. Jan 2021" -name '*.pdf' | head -10 | while IFS= read -r f; do
  cp "$f" /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/test-batch/
done

# Update test-config.yaml to use the test-batch folder,
# then run the workflow
```

**Testing Checklist:**

- [ ] API key loads correctly from environment
- [ ] Files are discovered successfully
- [ ] OCR API calls succeed
- [ ] Data extraction produces reasonable results
- [ ] Confidence scoring works
- [ ] Low-confidence items are flagged
- [ ] Validation UI appears for low-confidence items
- [ ] Excel backup is created before writing
- [ ] Data is written to master Excel file
- [ ] Processed files are moved to done folder
- [ ] Processing log is created
- [ ] Report is generated
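The two confidence items in the checklist boil down to a simple partition. A sketch of that step, where the row shape and function name are assumptions for illustration and the default threshold mirrors `confidence_threshold` in `test-config.yaml`:

```javascript
// Illustrative sketch of the low-confidence flagging step.
// The { file, confidence } row shape is an assumption, not the
// workflow's actual data structure.
function flagLowConfidence(rows, threshold = 0.85) {
  const approved = [];
  const needsReview = [];
  for (const row of rows) {
    (row.confidence >= threshold ? approved : needsReview).push(row);
  }
  return { approved, needsReview };
}

const { approved, needsReview } = flagLowConfidence([
  { file: 'a.pdf', confidence: 0.95 },
  { file: 'b.pdf', confidence: 0.62 },
]);
// approved contains a.pdf; needsReview contains b.pdf
```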

**Expected Issues:**

1. **Excel library integration** - currently a placeholder; needs actual `xlsx` integration
2. **MSG parsing** - placeholder; needs `@kenjiuno/msgreader` integration
3. **Field extraction patterns** - may need tuning based on actual document formats

### Step 4: Fix Excel Integration (30 minutes)

Update `task-excel-writer.js` to use the actual `xlsx` library.

**Current (placeholder):**

```javascript
async function appendToExcel(config, dataRows) {
  const backup = await createBackup(masterFile, backupFolder);
  // TODO: Actual Excel writing implementation
  return { success: true, rowsWritten: dataRows.length };
}
```

**Needs to become:**

```javascript
const XLSX = require('xlsx');

async function appendToExcel(config, dataRows) {
  const { masterFile, backupFolder } = config.paths;

  // Create backup first
  const backup = await createBackup(masterFile, backupFolder);

  try {
    // Read existing workbook
    const workbook = XLSX.readFile(masterFile);
    const sheetName = workbook.SheetNames[0];
    const worksheet = workbook.Sheets[sheetName];

    // Convert worksheet to JSON
    const existingData = XLSX.utils.sheet_to_json(worksheet);

    // Append new rows
    const updatedData = [...existingData, ...dataRows];

    // Convert back to a worksheet
    const newWorksheet = XLSX.utils.json_to_sheet(updatedData);
    workbook.Sheets[sheetName] = newWorksheet;

    // Write to file
    XLSX.writeFile(workbook, masterFile);

    return {
      success: true,
      rowsWritten: dataRows.length,
      totalRows: updatedData.length,
      backupPath: backup,
    };
  } catch (error) {
    // Restore from backup on error
    await restoreBackup(backup, masterFile);
    throw error;
  }
}
```

Caveat: `json_to_sheet` rebuilds the sheet from plain values, so cell formatting and formulas in the master file will not survive the round trip; confirm this is acceptable during the small batch test.

### Step 5: Tune Field Extraction (45 minutes)

Based on test results, you may need to:

1. **Analyze sample OCR output:**

   ```bash
   # Check processing logs to see what OCR actually returns
   cat /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json | jq '.processedFiles[0].data'
   ```

2. **Adjust regex patterns in `task-data-parser.js`:**
   - Date format patterns
   - Currency extraction patterns
   - Store name patterns

3. **Add custom extraction prompts:**
   - Make prompts more specific to your document format
   - Add examples in the prompt for better accuracy
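The date and currency patterns above can be sketched as follows. The function names are assumptions, the MM/DD/YYYY source format matches the `01/15/2021` sample used in the unit-test example later in this document, and the optional `RM`/`$` prefix is a guess at the documents' currency markers:

```javascript
// Illustrative regex sketches for the date and currency patterns.
function normalizeDate(text) {
  // MM/DD/YYYY -> YYYY-MM-DD, as the config's date format requires
  const m = text.match(/\b(\d{2})\/(\d{2})\/(\d{4})\b/);
  if (!m) return null;
  const [, month, day, year] = m;
  return `${year}-${month}-${day}`;
}

function extractCurrency(text) {
  // Matches amounts like "RM 1,234.50" or "$99.00"; the prefix is optional
  const m = text.match(/(?:RM|\$)?\s*([\d,]+\.\d{2})\b/);
  return m ? parseFloat(m[1].replace(/,/g, '')) : null;
}

const sample = 'Sales Report Date: 01/15/2021 Total: RM 1,234.50';
// normalizeDate(sample) === '2021-01-15'
// extractCurrency(sample) === 1234.5
```

Checking these against real OCR output from step 1 above is exactly what this tuning pass is for; expect to widen or add patterns.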

### Step 6: Medium Batch Test (30 minutes)

Test with ~50-100 files:

**Testing Focus:**

- [ ] Parallel processing works correctly
- [ ] Progress tracking is accurate
- [ ] Memory usage stays stable
- [ ] API rate limits are respected
- [ ] Error recovery works (simulate failures)
- [ ] Batch statistics are correct
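For the rate-limit item, a common approach is exponential backoff between retries of a rate-limited API call. The workflow does not currently implement this; the schedule below is a sketch with illustrative parameters (500 ms base, 30 s cap, 5 attempts):

```javascript
// Sketch of an exponential-backoff delay schedule for retrying
// rate-limited OCR API calls. All parameters are assumptions.
function backoffDelaysMs(attempts = 5, baseMs = 500, capMs = 30000) {
  const delays = [];
  for (let i = 0; i < attempts; i++) {
    // Double the delay each attempt, capped so waits stay bounded
    delays.push(Math.min(baseMs * 2 ** i, capMs));
  }
  return delays;
}

// backoffDelaysMs() -> [500, 1000, 2000, 4000, 8000]
```

Adding a small random jitter to each delay is a standard refinement so that `parallel_limit` concurrent workers do not retry in lockstep.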

### Step 7: Full Batch Test (2-3 hours)

Process all ~2400 files.

**Before running:**

- [ ] Ensure sufficient OpenRouter credits
- [ ] Verify disk space for backups and logs
- [ ] Close the master Excel file if it is open
- [ ] Set up monitoring (check CPU/memory periodically)

**Monitoring:**

```bash
# Monitor progress
tail -f /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json

# Check memory usage (the [n]ode pattern keeps grep from matching itself)
watch -n 5 'ps aux | grep [n]ode'
```

### Step 8: Data Quality Review (1 hour)

After processing:

1. **Spot-check random samples:**
   - Open the master Excel file
   - Compare 20-30 random entries with source documents
   - Verify dates, amounts, store names

2. **Check statistics:**
   - Total files processed vs. expected
   - Success rate
   - Average confidence scores
   - Common error patterns
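Those statistics could be computed from the processing log roughly like this; the `{ success, confidence }` record shape is an assumption based on the fields this guide mentions, not the log's documented schema:

```javascript
// Sketch of the batch-statistics computation for the quality review.
function batchStats(records) {
  const succeeded = records.filter((r) => r.success);
  const avgConfidence =
    succeeded.length === 0
      ? 0
      : succeeded.reduce((sum, r) => sum + r.confidence, 0) / succeeded.length;
  return {
    total: records.length,
    succeeded: succeeded.length,
    successRate: records.length === 0 ? 0 : succeeded.length / records.length,
    avgConfidence,
  };
}

const stats = batchStats([
  { success: true, confidence: 0.9 },
  { success: true, confidence: 0.8 },
  { success: false, confidence: 0.3 },
]);
// stats.successRate is 2/3; stats.avgConfidence averages the successes only
```

Averaging confidence over successes only is a deliberate choice here; averaging over all records would let failed extractions drag the score down twice.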

3. **Review low-confidence items:**
   - Check all items flagged for manual review
   - Identify patterns in low-confidence extractions
   - Adjust confidence threshold if needed

### Step 9: Create Unit Tests (2 hours)

Create Jest tests for each task module:

```bash
# Install Jest
npm install --save-dev jest

# Create test directory structure
mkdir -p src/modules/bmm/tasks/ocr-extraction/__tests__
```

**Test files to create:**

- `task-file-scanner.test.js`
- `task-ocr-process.test.js`
- `task-data-parser.test.js`
- `task-excel-writer.test.js`
- `task-batch-processor.test.js`

**Example test:**

```javascript
const { parseOCRText } = require('../task-data-parser');

describe('task-data-parser', () => {
  describe('parseOCRText', () => {
    it('should extract date from OCR text', () => {
      const ocrText = 'Sales Report Date: 01/15/2021 Store: ABC Mart';
      const fields = [{ name: 'date', type: 'date', required: true }];

      const result = parseOCRText(ocrText, fields);

      expect(result.isValid).toBe(true);
      expect(result.data.date).toBe('2021-01-15');
    });
  });
});
```

### Step 10: Integration Tests (1 hour)

Create integration tests with a mocked API:

```javascript
// Mock OpenRouter API responses
jest.mock('node-fetch', () => jest.fn());

describe('End-to-end workflow', () => {
  it('should process a file from OCR to Excel', async () => {
    // Setup: create a test file, mock the API, prepare config
    // Execute: run the batch processor
    // Assert: verify the Excel file was updated correctly
  });
});
```

## Success Criteria

✅ **Ready for production when:**

- [ ] All task modules fully implemented (no placeholders)
- [ ] Small batch test (10 files) completes successfully
- [ ] Medium batch test (100 files) passes with a 90%+ success rate
- [ ] Full batch test (2400 files) completes
- [ ] Data quality spot-check shows 95%+ accuracy
- [ ] Unit test coverage >80%
- [ ] Integration tests pass
- [ ] Performance acceptable (<5 sec/file average)
- [ ] Memory usage stable (no leaks)
- [ ] Documentation updated with findings

## Known Issues to Address

1. **Excel Library Integration**
   - Status: Placeholder implementation
   - Priority: High
   - Estimated effort: 30 minutes

2. **MSG File Parsing**
   - Status: Placeholder implementation
   - Priority: Medium
   - Estimated effort: 1 hour

3. **Interactive Validation UI**
   - Status: Placeholder (auto-approves all)
   - Priority: Medium
   - Estimated effort: 1 hour

4. **Field Extraction Tuning**
   - Status: Generic patterns
   - Priority: High
   - Estimated effort: 1-2 hours, depending on test results

## Resources

- **OpenRouter Docs:** https://openrouter.ai/docs
- **xlsx Library:** https://www.npmjs.com/package/xlsx
- **pdf-parse Library:** https://www.npmjs.com/package/pdf-parse
- **msgreader Library:** https://www.npmjs.com/package/@kenjiuno/msgreader
- **Jest Testing:** https://jestjs.io/docs/getting-started

## Session Checklist

**At the start of the next session:**

- [ ] Pull latest changes (if the PR is merged)
- [ ] Review this document
- [ ] Set the OpenRouter API key
- [ ] Check that the test data is accessible
- [ ] Install dependencies (`npm install`)
- [ ] Create the test configuration file
- [ ] Start with Step 3 (Small Batch Test)

**By the end of the next session (ideal):**

- [ ] Dependencies installed
- [ ] Excel integration implemented
- [ ] Small batch test passed
- [ ] Medium batch test passed
- [ ] Full batch test started (can run overnight)

**Follow-up session:**

- [ ] Review full batch results
- [ ] Data quality review
- [ ] Unit tests created
- [ ] Integration tests created
- [ ] Documentation updated
- [ ] Mark Phase 6 complete

## Quick Start Commands

```bash
# Navigate to project
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD

# Set API key
export OPENROUTER_API_KEY="your-key-here"

# Install dependencies
npm install

# Run the small batch test (after creating test-config.yaml)
# Use the BMAD CLI or agent to trigger the workflow
/ocr-to-excel

# Monitor progress
tail -f logs/processing-log-*.json

# Check results
open backups/                                                   # View backups
open processed/done/                                            # View processed files
open "TM - Daily Sales Report DSR by Part Timers_260225.xlsx"   # View master file
```

## Notes

- Keep this document updated as you progress through testing
- Document any issues found and their resolutions
- Note any performance bottlenecks
- Record API costs for the ~2400 files
- Save sample OCR outputs for future reference

---

**Last Updated:** 2025-10-18
**Next Session Goal:** Complete Steps 1-6 (through the Medium Batch Test)
**Estimated Time for Next Session:** 3-4 hours