OCR to Excel Workflow - Next Steps for Testing
Created: 2025-10-18
Status: Ready for testing after PR #764 is merged
Related PR: https://github.com/bmad-code-org/BMAD-METHOD/pull/764
Related Issue: https://github.com/bmad-code-org/BMAD-METHOD/issues/763
Current Status
✅ COMPLETE:
- Phase 1: Infrastructure (agent, workflow config, templates, docs)
- Phase 2: OCR & File Processing implementation
- Phase 3: Data Parsing & Validation implementation
- Phase 4: Excel Integration (placeholder - needs library)
- Phase 5: Batch Processing implementation
- Code committed and PR created
⏳ PENDING:
- Phase 6: Testing & Documentation (this document)
- Real-world testing with actual data
Prerequisites for Testing
Before starting the test session, ensure:
- PR #764 is merged to v6-alpha branch
- OpenRouter API key is ready
  export OPENROUTER_API_KEY="your-api-key-here"
- Test data available:
  - Master Excel file:
    /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx
  - Source files:
    /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021/ (~2400 files)
Test Plan: Phase 6 Implementation
Step 1: Install Dependencies (15 minutes)
# Navigate to project root
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD
# Install required npm packages
npm install --save xlsx pdf-parse @kenjiuno/msgreader
# Verify installation
npm list xlsx pdf-parse @kenjiuno/msgreader
Expected Output:
├── xlsx@0.18.5
├── pdf-parse@1.1.1
└── @kenjiuno/msgreader@2.0.0
Step 2: Create Test Configuration (10 minutes)
Create test-config.yaml based on your real data:
# OCR to Excel Test Configuration
name: "Daily Sales Report Extraction - Test"
description: "Test configuration for 2021 sales reports"
# API Configuration
api:
provider: openrouter
model: 'mistral/pixtral-large-latest'
api_key: ${OPENROUTER_API_KEY}
# File Paths
paths:
source_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021'
master_file: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/TM - Daily Sales Report DSR by Part Timers_260225.xlsx'
processed_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/processed/done'
backup_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/backups'
log_folder: '/Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs'
# File Types
file_types:
- pdf
- xlsx
- xls
- msg
# Extraction Fields (customize based on your actual documents)
extraction_fields:
- name: date
type: date
format: 'YYYY-MM-DD'
required: true
description: 'Sales report date'
- name: store_name
type: string
required: true
description: 'Tenant/store name'
- name: sales_amount
type: currency
required: true
description: 'Total daily sales'
- name: part_timer_name
type: string
required: false
description: 'Part timer employee name'
# Processing Configuration
processing:
batch_size: 10
parallel_limit: 3
confidence_threshold: 0.85
pause_on_low_confidence: true
# Logging
logging:
level: 'info'
log_to_console: true
log_to_file: true
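Note that YAML does not expand `${OPENROUTER_API_KEY}` on its own; whatever loads this config must substitute environment variables after parsing. A minimal sketch of that substitution step (the function name is illustrative, not an existing helper in the repo):

```javascript
// Replace ${VAR_NAME} placeholders in a config string with values from
// process.env. Throws if a referenced variable is unset, so a missing
// API key fails fast instead of producing silent 401s later.
function substituteEnvVars(value) {
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (match, name) => {
    const resolved = process.env[name];
    if (resolved === undefined) {
      throw new Error(`Environment variable ${name} is not set`);
    }
    return resolved;
  });
}
```

Applied to the parsed config, `substituteEnvVars(config.api.api_key)` would resolve the placeholder above to the exported key.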
Step 3: Small Batch Test (20 minutes)
Test with a small batch first (5-10 files):
# Create a test folder with just a few files
mkdir -p /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/test-batch
# Note: piping cp through head does NOT limit the copy; list the files first, then copy only the first 10
ls /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/2021/01.\ Jan\ 2021/*.pdf | head -10 | xargs -I {} cp {} /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/test-batch/
# Update test-config.yaml to use test-batch folder
# Then run the workflow
Testing Checklist:
- API key loads correctly from environment
- Files are discovered successfully
- OCR API calls succeed
- Data extraction produces reasonable results
- Confidence scoring works
- Low-confidence items are flagged
- Validation UI appears for low-confidence items
- Excel backup is created before writing
- Data is written to master Excel file
- Processed files are moved to done folder
- Processing log is created
- Report is generated
Expected Issues:
- Excel library integration - Currently placeholder, needs actual xlsx integration
- MSG parsing - Placeholder, needs @kenjiuno/msgreader integration
- Field extraction patterns - May need tuning based on actual document formats
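The confidence checks in the checklist above can be exercised in isolation with a small helper. This is a sketch only; the `{ value, confidence }` field shape is an assumption about what the parser returns, not the confirmed task-module output:

```javascript
// Flag an extracted record for manual review when any field's confidence
// falls below the configured threshold (0.85 in test-config.yaml).
// The { value, confidence } per-field shape is an assumed parser format.
function flagLowConfidence(fields, threshold = 0.85) {
  const lowFields = Object.entries(fields)
    .filter(([, field]) => field.confidence < threshold)
    .map(([name]) => name);
  return { needsReview: lowFields.length > 0, lowFields };
}
```

During the small batch test, running flagged records through a helper like this makes it easy to verify that "Low-confidence items are flagged" behaves as expected before the interactive validation UI exists.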
Step 4: Fix Excel Integration (30 minutes)
Update task-excel-writer.js to use actual xlsx library:
Current (placeholder):
async function appendToExcel(config, dataRows) {
const backup = await createBackup(masterFile, backupFolder);
// TODO: Actual Excel writing implementation
return { success: true, rowsWritten: dataRows.length };
}
Needs to become:
const XLSX = require('xlsx');
async function appendToExcel(config, dataRows) {
const { masterFile, backupFolder } = config.paths;
// Create backup first
const backup = await createBackup(masterFile, backupFolder);
try {
// Read existing workbook
const workbook = XLSX.readFile(masterFile);
const sheetName = workbook.SheetNames[0];
const worksheet = workbook.Sheets[sheetName];
// Convert worksheet to JSON
const existingData = XLSX.utils.sheet_to_json(worksheet);
// Append new rows
const updatedData = [...existingData, ...dataRows];
// Convert back to worksheet
const newWorksheet = XLSX.utils.json_to_sheet(updatedData);
workbook.Sheets[sheetName] = newWorksheet;
// Write to file
XLSX.writeFile(workbook, masterFile);
return {
success: true,
rowsWritten: dataRows.length,
totalRows: updatedData.length,
backupPath: backup
};
} catch (error) {
// Restore from backup on error
await restoreBackup(backup, masterFile);
throw error;
}
}
Step 5: Tune Field Extraction (45 minutes)
Based on test results, you may need to:
- Analyze sample OCR output:
  # Check processing logs to see what OCR actually returns
  cat /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json | jq '.processedFiles[0].data'
- Adjust regex patterns in task-data-parser.js:
  - Date format patterns
  - Currency extraction patterns
  - Store name patterns
- Add custom extraction prompts:
  - Make prompts more specific to your document format
  - Add examples in the prompt for better accuracy
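As a starting point for the pattern adjustments, something like the following could live in task-data-parser.js. The formats assumed here ("01/15/2021" dates, "RM 1,234.50" amounts) are guesses; replace them with whatever the sample OCR output in the logs actually shows:

```javascript
// Assumed input formats -- tune against real OCR output from the logs.
const DATE_MMDDYYYY = /\b(\d{2})\/(\d{2})\/(\d{4})\b/;
const CURRENCY = /(?:RM|MYR|\$)\s*([\d,]+(?:\.\d{2})?)/;

// Normalize a matched MM/DD/YYYY date to the configured YYYY-MM-DD format.
function normalizeDate(text) {
  const m = text.match(DATE_MMDDYYYY);
  return m ? `${m[3]}-${m[1]}-${m[2]}` : null;
}

// Strip thousands separators and parse the matched amount as a number.
function parseCurrency(text) {
  const m = text.match(CURRENCY);
  return m ? parseFloat(m[1].replace(/,/g, '')) : null;
}
```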
Step 6: Medium Batch Test (30 minutes)
Test with ~50-100 files:
Testing Focus:
- Parallel processing works correctly
- Progress tracking is accurate
- Memory usage stays stable
- API rate limits are respected
- Error recovery works (simulate failures)
- Batch statistics are correct
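To sanity-check the parallel-processing behaviour in isolation, the `parallel_limit` semantics can be sketched as a small concurrency limiter. This is a minimal illustration; the real batch processor may use a library such as p-limit instead:

```javascript
// Run async jobs with at most `limit` in flight at once, mirroring the
// parallel_limit: 3 setting from test-config.yaml. Results are returned
// in the original job order regardless of completion order.
async function runWithLimit(jobs, limit) {
  const results = new Array(jobs.length);
  let next = 0;
  async function worker() {
    while (next < jobs.length) {
      const i = next++; // safe: no await between check and increment
      results[i] = await jobs[i]();
    }
  }
  const workers = Array.from({ length: Math.min(limit, jobs.length) }, worker);
  await Promise.all(workers);
  return results;
}
```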
Step 7: Full Batch Test (2-3 hours)
Process all ~2400 files:
Before running:
- Ensure sufficient OpenRouter credits
- Verify disk space for backups and logs
- Close Excel file if open
- Set up monitoring (check CPU/memory periodically)
Monitoring:
# Monitor progress
tail -f /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/logs/processing-log-*.json
# Check memory usage
watch -n 5 'ps aux | grep node'
Step 8: Data Quality Review (1 hour)
After processing:
- Spot-check random samples:
  - Open master Excel file
  - Compare 20-30 random entries with source documents
  - Verify dates, amounts, store names
- Check statistics:
  - Total files processed vs. expected
  - Success rate
  - Average confidence scores
  - Common error patterns
- Review low-confidence items:
  - Check all items flagged for manual review
  - Identify patterns in low-confidence extractions
  - Adjust confidence threshold if needed
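Picking the 20-30 rows for the spot-check can be automated with a small sampler (a sketch; a partial Fisher-Yates shuffle, chosen here because it guarantees distinct indices):

```javascript
// Pick `count` distinct random row indices out of `total` rows,
// e.g. 25 of the ~2400 newly appended rows for manual comparison.
function sampleIndices(total, count, rng = Math.random) {
  const idx = Array.from({ length: total }, (_, i) => i);
  const n = Math.min(count, total);
  for (let i = 0; i < n; i++) {
    const j = i + Math.floor(rng() * (total - i));
    [idx[i], idx[j]] = [idx[j], idx[i]];
  }
  return idx.slice(0, n);
}
```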
Step 9: Create Unit Tests (2 hours)
Create Jest tests for each task module:
# Install Jest
npm install --save-dev jest
# Create test directory structure
mkdir -p src/modules/bmm/tasks/ocr-extraction/__tests__
Test files to create:
- task-file-scanner.test.js
- task-ocr-process.test.js
- task-data-parser.test.js
- task-excel-writer.test.js
- task-batch-processor.test.js
Example test:
const { parseOCRText } = require('../task-data-parser');
describe('task-data-parser', () => {
describe('parseOCRText', () => {
it('should extract date from OCR text', () => {
const ocrText = 'Sales Report Date: 01/15/2021 Store: ABC Mart';
const fields = [{ name: 'date', type: 'date', required: true }];
const result = parseOCRText(ocrText, fields);
expect(result.isValid).toBe(true);
expect(result.data.date).toBe('2021-01-15');
});
});
});
Step 10: Integration Tests (1 hour)
Create integration tests with mock API:
// Mock OpenRouter API responses
jest.mock('node-fetch', () => jest.fn());
describe('End-to-end workflow', () => {
it('should process a file from OCR to Excel', async () => {
// Setup: Create test file, mock API, prepare config
// Execute: Run batch processor
// Assert: Verify Excel file updated correctly
});
});
Success Criteria
✅ Ready for production when:
- All task modules fully implemented (no placeholders)
- Small batch test (10 files) completes successfully
- Medium batch test (100 files) with 90%+ success rate
- Full batch test (2400 files) completes
- Data quality spot-check shows 95%+ accuracy
- Unit test coverage >80%
- Integration tests pass
- Performance acceptable (<5 sec/file average)
- Memory usage stable (no leaks)
- Documentation updated with findings
Known Issues to Address
- Excel Library Integration
  - Status: Placeholder implementation
  - Priority: High
  - Estimated effort: 30 minutes
- MSG File Parsing
  - Status: Placeholder implementation
  - Priority: Medium
  - Estimated effort: 1 hour
- Interactive Validation UI
  - Status: Placeholder (auto-approves all)
  - Priority: Medium
  - Estimated effort: 1 hour
- Field Extraction Tuning
  - Status: Generic patterns
  - Priority: High
  - Estimated effort: 1-2 hours based on test results
Resources
- OpenRouter Docs: https://openrouter.ai/docs
- xlsx Library: https://www.npmjs.com/package/xlsx
- pdf-parse Library: https://www.npmjs.com/package/pdf-parse
- msgreader Library: https://www.npmjs.com/package/@kenjiuno/msgreader
- Jest Testing: https://jestjs.io/docs/getting-started
Session Checklist
At the start of next session:
- Pull latest changes (if PR merged)
- Review this document
- Set OpenRouter API key
- Check test data is accessible
- Install dependencies (npm install)
- Create test configuration file
- Start with Step 3 (Small Batch Test)
By end of next session (ideal):
- Dependencies installed
- Excel integration implemented
- Small batch test passed
- Medium batch test passed
- Full batch test started (can run overnight)
Follow-up session:
- Review full batch results
- Data quality review
- Unit tests created
- Integration tests created
- Documentation updated
- Mark Phase 6 complete
Quick Start Commands
# Navigate to project
cd /Users/baito.kevin/Downloads/dev/BMAD-METHOD/MyTown/BMAD-METHOD
# Set API key
export OPENROUTER_API_KEY="your-key-here"
# Install dependencies
npm install
# Run small batch test (after creating test-config.yaml)
# Use BMAD CLI or agent to trigger workflow
/ocr-to-excel
# Monitor progress
tail -f logs/processing-log-*.json
# Check results
open backups/ # View backups
open processed/done/ # View processed files
open "TM - Daily Sales Report DSR by Part Timers_260225.xlsx" # View master file
Notes
- Keep this document updated as you progress through testing
- Document any issues found and their resolutions
- Note any performance bottlenecks
- Record API costs for ~2400 files
- Save sample OCR outputs for future reference
Last Updated: 2025-10-18
Next Session Goal: Complete Steps 1-6 (through Medium Batch Test)
Estimated Time for Next Session: 3-4 hours