feat: add OCR to Excel data extraction workflow (Phase 1 - Infrastructure)

Add new BMM workflow for automated document processing using Mistral OCR
via OpenRouter API. This workflow extracts structured data from PDFs, Excel
files, and Outlook messages, consolidating results into a master Excel file.

This commit completes Phase 1 (Core Infrastructure) of the implementation
plan outlined in issue #763. Future phases will add the actual processing
tasks (OCR, parsing, Excel writing, etc.).

**New Components:**

- Data Extraction Specialist agent with OCR/parsing persona
- OCR to Excel workflow with 14-step interactive process
- Comprehensive configuration template
- Processing report template
- Validation checklist
- Complete documentation

**Features:**

- Multi-format support (PDF, XLSX, XLS, MSG)
- Confidence-based extraction validation
- Human-AI collaboration design
- Batch processing configuration
- Automatic backup system
- Audit trail and logging

**Files Added:**

- src/modules/bmm/agents/data-extraction.agent.yaml
- src/modules/bmm/workflows/data-extraction/ocr-to-excel/workflow.yaml
- src/modules/bmm/workflows/data-extraction/ocr-to-excel/config-template.yaml
- src/modules/bmm/workflows/data-extraction/ocr-to-excel/template.md
- src/modules/bmm/workflows/data-extraction/ocr-to-excel/instructions.md
- src/modules/bmm/workflows/data-extraction/ocr-to-excel/checklist.md
- src/modules/bmm/workflows/data-extraction/ocr-to-excel/README.md

**Next Steps:**

- Phase 2: Implement OCR & file processing tasks
- Phase 3: Implement data parsing & validation tasks
- Phase 4: Implement Excel integration tasks
- Phase 5: Implement batch processing & cleanup tasks
- Phase 6: Add tests and finalize documentation

Related to #763

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Kevin Reuben Lee 2025-10-18 18:13:11 +08:00
parent 4b6f34dff8
commit 4a50ad8b31
7 changed files with 1362 additions and 0 deletions


@@ -0,0 +1,31 @@
# Data Extraction Agent Definition
agent:
  metadata:
    id: bmad/bmm/agents/data-extraction.md
    name: Data Extraction Specialist
    title: OCR & Data Processing Expert
    icon: 📄
    module: bmm
  persona:
    role: OCR Processing Specialist + Data Extraction Expert
    identity: Senior data processing specialist with deep expertise in optical character recognition, document parsing, and automated data extraction. Specializes in transforming unstructured documents into structured data with high accuracy and reliability. Background in machine learning, document processing pipelines, and data quality assurance.
    communication_style: Systematic and detail-oriented in approach - focuses on accuracy and validation. Presents extraction results with confidence scores and highlights areas needing human review. Structures workflows to balance automation with human oversight. Uses clear, precise language when describing data transformations and validation criteria.
    principles:
      - I believe that automated data extraction should empower humans through confidence-based handoff - high-confidence extractions proceed automatically while uncertain cases receive human attention.
      - My approach centers on data integrity and audit trails, ensuring every extraction is traceable, reversible, and validated before committing to permanent storage.
      - I operate as a reliable partner in document processing workflows, maintaining comprehensive logs, graceful error handling, and clear communication about what succeeded and what needs attention.
  menu:
    - trigger: ocr-to-excel
      workflow: "{project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel/workflow.yaml"
      description: Extract data from documents and consolidate to Excel
    - trigger: batch-extract
      workflow: "{project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel/workflow.yaml"
      description: Process multiple documents in batch mode
    - trigger: configure-extraction
      workflow: "{project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel/workflow.yaml"
      description: Configure field mappings and extraction settings


@@ -0,0 +1,381 @@
# OCR to Excel Data Extraction Workflow
Automated document processing workflow that uses Mistral OCR via OpenRouter API to extract structured data from PDFs, Excel files, and Outlook messages, consolidating results into a master Excel spreadsheet.
## Overview
This workflow demonstrates BMAD-METHOD's capability for human-AI collaboration in document processing tasks. It intelligently balances automation with human oversight through confidence-based decision making.
**Key Features:**
- Multi-format document support (PDF, XLSX, XLS, MSG)
- Automated OCR using Mistral Vision API via OpenRouter
- Confidence-based extraction validation
- Batch processing with progress tracking
- Automatic Excel backup before writing
- Comprehensive audit trail and logging
- File management with folder structure preservation
## Status: Phase 1 - Foundation Complete
**Current Implementation:**
- ✅ Agent definition and persona
- ✅ Workflow configuration
- ✅ Configuration templates
- ✅ Workflow instructions (14 steps)
- ✅ Output templates
- ✅ Validation checklist
- ✅ Documentation
**Next Phases (Not Yet Implemented):**
- ⏳ Phase 2: OCR & File Processing tasks
- ⏳ Phase 3: Data Parsing & Validation tasks
- ⏳ Phase 4: Excel Integration tasks
- ⏳ Phase 5: Batch Processing & Cleanup tasks
- ⏳ Phase 6: Testing & Documentation
> **Note:** The design and documentation for this workflow are complete, but the implementation tasks (JavaScript modules for OCR processing, Excel writing, etc.) have not been written yet and are planned for later phases. See [Implementation Plan](#implementation-plan) below.
## Use Case
Organizations often receive hundreds of documents (sales reports, invoices, forms) in various formats that require manual data entry into spreadsheets. This workflow is designed to automate that process:
**Example (projected):** Daily sales reports from multiple tenants:
- **Current Process:** ~40 hours/month of manual data entry
- **With This Workflow:** <2 hours/month for validation
- **Time Savings:** about 95% less manual effort
## Quick Start
### Prerequisites
1. **OpenRouter API Key:** Sign up at https://openrouter.ai and get your API key
2. **Node.js:** Version 20.0.0 or higher
3. **BMAD-METHOD:** v6-alpha or later
### Installation
```bash
# Install the BMM module (if not already installed)
bmad install module bmm
# The OCR workflow is included in the BMM module
# Access via the Data Extraction Specialist agent
```
### Setup
1. **Set your API key:**
```bash
export OPENROUTER_API_KEY="your-api-key-here"
```
2. **Prepare your files:**
```
project/
├── source-documents/
│   └── 2021/
│       ├── 01. Jan 2021/
│       └── 02. Feb 2021/
└── master-file.xlsx
```
3. **Run the workflow:**
Trigger via your BMad-enabled IDE using the Data Extraction Specialist agent:
```
/ocr-to-excel
```
## Configuration
The workflow uses a YAML configuration file. Copy `config-template.yaml` to your project and customize:
```yaml
# API Configuration
api:
  provider: openrouter
  model: "mistral/pixtral-large-latest"
  api_key: ${OPENROUTER_API_KEY}
# File Paths
paths:
  source_folder: "./source-documents"
  master_file: "./master-file.xlsx"
  processed_folder: "./processed/done"
# Extraction Fields
extraction_fields:
  - name: date
    type: date
    required: true
  - name: store_name
    type: string
    required: true
  - name: sales_amount
    type: number
    required: true
```
See `config-template.yaml` for complete configuration options.
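The `${OPENROUTER_API_KEY}` placeholder implies that environment variables are substituted into the configuration after it is loaded. The commit does not include that logic, so the following is only a minimal sketch of how it might work. The helper names `resolveEnvPlaceholders` and `resolveConfig` are hypothetical, and the sketch assumes the YAML has already been parsed into a plain object.

```javascript
// Hypothetical helper: replace ${VAR} placeholders in a string with values
// from an environment object (process.env by default).
function resolveEnvPlaceholders(value, env = process.env) {
  if (typeof value !== "string") return value;
  return value.replace(/\$\{([A-Z0-9_]+)\}/g, (match, name) => {
    if (env[name] === undefined || env[name] === "") {
      throw new Error(`Environment variable ${name} is not set`);
    }
    return env[name];
  });
}

// Hypothetical helper: walk a parsed config object and resolve placeholders
// in every string value, leaving numbers and booleans untouched.
function resolveConfig(config, env = process.env) {
  if (Array.isArray(config)) {
    return config.map((item) => resolveConfig(item, env));
  }
  if (config !== null && typeof config === "object") {
    return Object.fromEntries(
      Object.entries(config).map(([key, val]) => [key, resolveConfig(val, env)])
    );
  }
  return resolveEnvPlaceholders(config, env);
}
```

Failing fast on a missing variable would let the environment validation step report the problem before any document is sent to the API.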
## Workflow Process
The workflow follows a 14-step process:
1. **Environment Validation:** Check API key and dependencies
2. **Configuration Setup:** Load or create extraction configuration
3. **File Discovery:** Scan source folders for supported files
4. **OCR Processing:** Send documents to Mistral OCR API
5. **Data Extraction:** Parse OCR results into structured data
6. **Confidence Scoring:** Calculate extraction confidence (shown as 0-100%; stored as 0-1 in the configuration)
7. **Human Validation:** Review low-confidence extractions
8. **Excel Backup:** Create timestamped backup of master file
9. **Data Writing:** Append validated data to Excel
10. **File Management:** Move processed files to done folder
11. **Audit Logging:** Record all operations with timestamps
12. **Report Generation:** Create comprehensive processing report
13. **Error Handling:** Log failures and provide retry options
14. **Summary:** Display statistics and next steps
## Human-AI Collaboration
The workflow implements confidence-based decision making:
- **High Confidence (≥85%):** Automatically approved
- **Low Confidence (<85%):** Flagged for human review
- **Failed OCR:** User prompted with options (retry/skip/manual)
This ensures efficiency without sacrificing accuracy.
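The routing rule above could be sketched as follows. This is an illustration only, not the planned implementation: the function name `routeExtractions` and the `confidence` and `error` properties are assumptions, and the threshold is expressed on the 0-1 scale used by `confidence_threshold` in the configuration template.

```javascript
// Hypothetical sketch: split extraction results into three groups based on
// a confidence threshold (0-1 scale, default 0.85).
function routeExtractions(extractions, threshold = 0.85) {
  const approved = [];
  const needsReview = [];
  const failed = [];
  for (const item of extractions) {
    if (item.error || typeof item.confidence !== "number") {
      // OCR failed or produced no score: the user chooses retry/skip/manual.
      failed.push(item);
    } else if (item.confidence >= threshold) {
      approved.push(item);
    } else {
      needsReview.push(item);
    }
  }
  return { approved, needsReview, failed };
}
```

A result scoring exactly 0.85 lands in `approved`, matching the "≥85%" wording above.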
## Output Files
After processing, you'll have:
- **Master Excel File:** Updated with new extracted data
- **Backup:** `backups/master-file-YYYYMMDD-HHMMSS.xlsx`
- **Processing Report:** `output/extraction-results-YYYYMMDD.md`
- **Processing Log:** `logs/processing-log-YYYYMMDD.json`
- **Processed Files:** Moved to `processed/done/` folder
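As an illustration of the backup naming pattern listed above, here is a minimal sketch that derives `master-file-YYYYMMDD-HHMMSS.xlsx` from the master file path. The function name `backupFileName` is hypothetical, and the use of local time is an assumption, since the commit does not say which time zone the timestamp uses.

```javascript
// Hypothetical sketch: build a timestamped backup file name from the master
// file path, using local time.
function backupFileName(masterFile, date = new Date()) {
  const pad = (n) => String(n).padStart(2, "0");
  const stamp =
    `${date.getFullYear()}${pad(date.getMonth() + 1)}${pad(date.getDate())}` +
    `-${pad(date.getHours())}${pad(date.getMinutes())}${pad(date.getSeconds())}`;
  // Strip any directory part, then split the name from its extension.
  const base = masterFile.split(/[\\/]/).pop();
  const dot = base.lastIndexOf(".");
  const name = dot > 0 ? base.slice(0, dot) : base;
  const ext = dot > 0 ? base.slice(dot) : "";
  return `${name}-${stamp}${ext}`;
}
```

For `./master-file.xlsx` and a local time of 18 October 2025, 18:13:11, this returns `master-file-20251018-181311.xlsx`.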
## Examples
### Basic Usage
```bash
# Process all files in source-documents/
/ocr-to-excel
# The workflow will:
# 1. Ask for configuration (use existing or create new)
# 2. Scan for files
# 3. Process each file with OCR
# 4. Present low-confidence items for review
# 5. Write data to Excel
# 6. Move processed files
# 7. Generate report
```
### Batch Processing
For large batches (100+ files):
1. Configure parallel processing limit (default: 3)
2. Set confidence threshold (default: 85%)
3. Enable pause/resume capability
4. Monitor progress bar
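One way the parallel processing limit might be enforced is a small worker pool, sketched below. The name `mapWithLimit` and the `worker` callback are hypothetical, and the batch orchestration planned for Phase 5 may work differently.

```javascript
// Hypothetical sketch: run an async worker over items with at most `limit`
// calls in flight at any time, preserving result order.
async function mapWithLimit(items, limit, worker) {
  const results = new Array(items.length);
  let next = 0;
  // Each runner pulls the next unclaimed index until the queue is empty.
  async function run() {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }
  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => run()));
  return results;
}
```

In this sketch a rejected worker call rejects the whole batch, so a real implementation would probably catch errors per file and record them in the failed-files list instead.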
### Field Mapping Example
Extract sales data from PDF reports:
```yaml
extraction_fields:
  - name: date
    type: date
    format: "YYYY-MM-DD"
    description: "Sales report date"
  - name: store_name
    type: string
    description: "Tenant/store name"
  - name: sales_amount
    type: number
    format: "currency"
    description: "Total sales"
```
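Field definitions like these could drive a simple per-record check before anything is written to Excel. The sketch below is illustrative only: `validateRecord` and `isIsoDate` are hypothetical names, and it assumes extracted values arrive as strings or numbers and that each field may carry the `required` flag shown in the Configuration section.

```javascript
// Hypothetical helper: true only for a real calendar date in YYYY-MM-DD form.
function isIsoDate(value) {
  const text = String(value);
  if (!/^\d{4}-\d{2}-\d{2}$/.test(text)) return false;
  const parsed = new Date(`${text}T00:00:00Z`);
  // Reject dates such as 2021-02-30 that are invalid or roll over.
  return !Number.isNaN(parsed.getTime()) && parsed.toISOString().slice(0, 10) === text;
}

// Hypothetical sketch: check one extracted record against the configured
// extraction_fields and return a list of human-readable problems.
function validateRecord(record, fields) {
  const errors = [];
  for (const field of fields) {
    const value = record[field.name];
    const missing = value === undefined || value === null || value === "";
    if (missing) {
      if (field.required) errors.push(`${field.name}: required field is missing`);
      continue;
    }
    if (field.type === "number" && !Number.isFinite(Number(value))) {
      errors.push(`${field.name}: expected a number`);
    }
    if (field.type === "date" && !isIsoDate(value)) {
      errors.push(`${field.name}: expected a date in YYYY-MM-DD format`);
    }
  }
  return errors;
}
```

An empty array means the record passed. Currency strings with symbols or thousands separators would need to be normalized first, which this sketch does not attempt.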
## Implementation Plan
This workflow is part of a larger implementation plan outlined in GitHub Issue #763.
### Phase 1: Core Infrastructure ✅ **COMPLETE**
- Agent definition
- Workflow configuration
- Templates and documentation
### Phase 2: OCR & File Processing ⏳ **PLANNED**
**Tasks:**
- Implement `task-ocr-process.js` - OpenRouter API integration
- Implement `task-file-scanner.js` - Recursive file discovery
- Add support for PDF, XLSX, XLS, MSG formats
- Retry logic and error handling
- Progress tracking
**Estimated Effort:** 1 week
### Phase 3: Data Parsing & Validation ⏳ **PLANNED**
**Tasks:**
- Implement `task-data-parser.js` - Parse OCR to structured data
- Create validation prompts (using inquirer)
- Confidence scoring algorithm
- Data correction UI
- Field mapping logic
**Estimated Effort:** 1 week
### Phase 4: Excel Integration ⏳ **PLANNED**
**Tasks:**
- Implement `task-excel-writer.js` - Write to Excel with xlsx library
- Automatic backup system
- Atomic write operations
- Rollback capability
- Excel structure analyzer
**Estimated Effort:** 1 week
### Phase 5: Batch Processing ⏳ **PLANNED**
**Tasks:**
- Implement `task-file-mover.js` - File management
- Batch orchestration
- Pause/resume functionality
- Processing queue with state persistence
- Summary statistics
**Estimated Effort:** 1 week
### Phase 6: Testing & Documentation ⏳ **PLANNED**
**Tasks:**
- Unit tests (Jest)
- Integration tests with mock API
- Comprehensive README
- Troubleshooting guide
- Real-world validation
**Estimated Effort:** 1 week
## Contributing
This workflow is part of issue #763. Contributions are welcome!
**To contribute:**
1. Check issue #763 for current status
2. Follow BMAD-METHOD contributing guidelines
3. Target the `v6-alpha` branch
4. Keep PRs small (200-400 lines ideal)
5. Include tests for new functionality
## Troubleshooting
### Common Issues
**API Key Not Found:**
```bash
export OPENROUTER_API_KEY="your-api-key"
```
**OCR Quality Poor:**
- Ensure source documents are high quality (not scanned at low DPI)
- Check that PDFs are not password-protected
- Verify file format is supported
**Excel Write Failures:**
- Check file permissions
- Ensure Excel file is not open in another application
- Verify backup folder exists and is writable
**Low Confidence Scores:**
- Review OCR text quality
- Adjust field extraction prompts
- Consider manual extraction for complex layouts
## Technical Details
### Dependencies
- **xlsx:** Excel file reading/writing (to be added in Phase 4)
- **pdf-parse:** PDF text extraction (to be added in Phase 2)
- **node-fetch:** API calls to OpenRouter (to be added in Phase 2; Node.js 20 also provides a built-in global `fetch`)
- **fs-extra:** File operations (already in project)
- **glob:** File discovery (already in project)
### API Usage
The workflow uses OpenRouter's Mistral Pixtral Large model for OCR:
```javascript
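// Example API call (implementation planned for Phase 2).
// Assumes `apiKey` holds the OpenRouter key and `base64Image` holds the
// base64-encoded bytes of a PNG page image.
const response = await fetch("https://openrouter.ai/api/v1/chat/completions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${apiKey}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    // Confirm the exact model ID against OpenRouter's model list.
    model: "mistral/pixtral-large-latest",
    messages: [
      {
        role: "user",
        content: [
          // Inline images are passed as a data URL rather than bare base64.
          { type: "image_url", image_url: { url: `data:image/png;base64,${base64Image}` } },
          { type: "text", text: "Extract: date, store name, amount..." },
        ],
      },
    ],
  }),
});
if (!response.ok) {
  throw new Error(`OpenRouter request failed with status ${response.status}`);
}
const result = await response.json();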
```
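Retry logic is listed as a Phase 2 task, and the configuration template defines `max_retries` and `retry_delay`. A call like the one above might be wrapped as in the sketch below. The name `withRetry` and its option names are hypothetical, and doubling the delay on each attempt is an assumption about how exponential backoff would be applied to the configured base delay.

```javascript
// Default sleep helper based on setTimeout.
const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical sketch: retry an async operation with exponential backoff.
// With maxRetries = 3 and baseDelayMs = 2000, it waits 2 s, 4 s, then 8 s
// between the four attempts.
async function withRetry(operation, options = {}) {
  const { maxRetries = 3, baseDelayMs = 2000, sleep = defaultSleep } = options;
  let lastError;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      if (attempt === maxRetries) break;
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
  throw lastError;
}
```

The `sleep` option exists only so the delay can be replaced in tests. A fuller version would likely retry only on transient failures such as timeouts or rate-limit responses.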
### Security Considerations
- API keys stored in environment variables (never committed)
- Document contents not logged (only filenames)
- Backups created before any write operations
- User warned about sending documents to external API
## License
MIT - See main BMAD-METHOD repository
## Related Resources
- **Issue #763:** https://github.com/bmad-code-org/BMAD-METHOD/issues/763
- **OpenRouter Docs:** https://openrouter.ai/docs
- **BMAD-METHOD v6:** https://github.com/bmad-code-org/BMAD-METHOD
## Support
- **GitHub Issues:** https://github.com/bmad-code-org/BMAD-METHOD/issues
- **Discord:** Join the BMAD-METHOD community
- **Documentation:** https://github.com/bmad-code-org/BMAD-METHOD/wiki


@@ -0,0 +1,243 @@
# OCR to Excel Extraction Validation Checklist
## Pre-Processing Checklist
### Environment Setup
- [ ] OpenRouter API key is set as environment variable (OPENROUTER_API_KEY)
- [ ] API key has sufficient credits for batch processing
- [ ] Node.js version >= 20.0.0 is installed
- [ ] Required npm dependencies are installed (xlsx, pdf-parse, node-fetch, fs-extra, glob)
### Configuration Validation
- [ ] Configuration file exists and is valid YAML
- [ ] Source documents folder path is correct and accessible
- [ ] Master Excel file path is correct (or will be created)
- [ ] Processed files folder path is correct (or will be created)
- [ ] Backup folder path is correct (or will be created)
- [ ] All extraction fields are properly defined with types
- [ ] Required fields are marked correctly
- [ ] File types list matches actual source files
### Source Documents
- [ ] Source folder contains files to process
- [ ] File types match configured supported types (pdf, xlsx, xls, msg)
- [ ] Folder structure is organized (e.g., by year/month)
- [ ] No password-protected or corrupted files (or willing to skip them)
- [ ] Sample files tested manually for OCR quality
## During Processing Checklist
### File Discovery
- [ ] Correct number of files discovered in source folder
- [ ] Already-processed files are filtered out correctly
- [ ] File paths are displayed correctly without encoding issues
- [ ] File metadata (size, type, date) is captured
### OCR Processing
- [ ] API calls are successful (check first 5 files)
- [ ] OCR text quality is acceptable for extraction
- [ ] Retry logic works for failed API calls
- [ ] Error messages are clear and actionable
- [ ] Progress indicator shows accurate file counts
### Data Extraction
- [ ] Extracted data matches expected format
- [ ] Date formats are parsed correctly (YYYY-MM-DD)
- [ ] Number formats are parsed correctly (currency, integers)
- [ ] Store/tenant names are extracted consistently
- [ ] Required fields are present in all extractions
- [ ] Optional fields handle missing data gracefully
### Confidence Scoring
- [ ] Confidence scores are reasonable (not all 0% or 100%)
- [ ] Low-confidence extractions are flagged for review
- [ ] High-confidence threshold (default 85%) is appropriate
- [ ] User is prompted to review low-confidence items
### Human Validation
- [ ] Low-confidence extractions are displayed clearly
- [ ] Raw OCR text is available for reference
- [ ] User can edit/correct extracted values
- [ ] Corrected values are saved properly
- [ ] Skip option works for invalid files
## Excel Integration Checklist
### Backup Operations
- [ ] Master Excel file is backed up before writing
- [ ] Backup filename includes timestamp
- [ ] Backup is saved to correct location
- [ ] Backup file is readable and not corrupted
### Excel Writing
- [ ] Sheet name matches configuration
- [ ] Headers match configured extraction fields
- [ ] New rows are appended (not overwriting existing data)
- [ ] Cell formatting is correct (dates, numbers, text)
- [ ] Metadata fields are populated (source_file, processed_date, confidence)
- [ ] No data loss or corruption in master file
- [ ] Excel file can be opened after writing
### Data Integrity
- [ ] All validated extractions are written to Excel
- [ ] No duplicate entries created
- [ ] Row count increased by expected number
- [ ] No blank rows inserted
- [ ] Special characters handled correctly (no encoding issues)
## File Management Checklist
### File Movement
- [ ] Processed files are moved to correct location
- [ ] Original folder structure is preserved (if configured)
- [ ] Files are moved (not copied) to avoid duplicates
- [ ] File names remain unchanged after moving
- [ ] No files lost during movement operation
### Audit Trail
- [ ] Processing log file is created
- [ ] All operations are logged with timestamps
- [ ] Success and failure statuses are recorded
- [ ] Error messages include actionable details
- [ ] File paths in logs are accurate
## Post-Processing Checklist
### Results Validation
- [ ] Extraction report is generated successfully
- [ ] Report statistics match actual processing results
- [ ] Total files processed count is correct
- [ ] Success/failure counts are accurate
- [ ] Average confidence score is calculated correctly
### Data Quality
- [ ] Spot-check 10 random extracted records in Excel
- [ ] Verify dates are correct and consistent
- [ ] Verify sales amounts match source documents
- [ ] Verify store names are spelled correctly
- [ ] Check for any obviously incorrect data
### Error Handling
- [ ] Failed files list is complete and accurate
- [ ] Error messages are logged with sufficient detail
- [ ] Failed files remain in source folder (not moved)
- [ ] Retry mechanism works for transient failures
- [ ] Critical errors halt processing appropriately
## Final Validation
### Completeness
- [ ] All expected files were discovered
- [ ] All discovered files were processed or logged as failed
- [ ] No files processed more than once
- [ ] Processing queue was fully cleared
- [ ] No unexpected errors occurred
### Output Quality
- [ ] Master Excel file opens without errors
- [ ] Data is readable and formatted correctly
- [ ] No formula errors or #REF! cells
- [ ] Filtering and sorting work correctly
- [ ] File size is reasonable (not corrupted)
### Documentation
- [ ] Processing report is comprehensive and clear
- [ ] Processing log contains all necessary details
- [ ] Failed files are documented with reasons
- [ ] Next steps are clearly stated
- [ ] Audit trail is complete for compliance
## Performance Checklist
### Processing Time
- [ ] Average time per file is reasonable (< 3 seconds excluding API wait)
- [ ] Total processing time meets expectations
- [ ] No unexpected delays or hangs
- [ ] Parallel processing worked correctly (if configured)
- [ ] API rate limits were not exceeded
### Resource Usage
- [ ] Memory usage remained stable (no memory leaks)
- [ ] Disk space is sufficient for processed files and backups
- [ ] CPU usage was reasonable
- [ ] Network usage matches expected API call volume
## Security & Privacy Checklist
### API Security
- [ ] API key is not logged or exposed in reports
- [ ] API key is not committed to git
- [ ] API calls use HTTPS
- [ ] API responses do not contain sensitive keys
### Data Privacy
- [ ] Sensitive document contents are not logged in plain text
- [ ] Only necessary data is sent to external OCR API
- [ ] User was warned about sending documents to external API
- [ ] Audit logs don't expose sensitive information
- [ ] Backup files are stored securely
### File Permissions
- [ ] Master Excel file has appropriate read/write permissions
- [ ] Backup folder is only accessible by authorized users
- [ ] Processing logs are protected from unauthorized access
## Troubleshooting Checklist
If issues occur, verify:
- [ ] Error messages provide actionable guidance
- [ ] Failed files can be retried individually
- [ ] Backup can be restored if needed
- [ ] Processing can be paused and resumed
- [ ] Logs contain sufficient debug information
## Sign-Off
### Pre-Production
- [ ] All checklist items passed
- [ ] Sample batch (10-20 files) processed successfully
- [ ] Data quality spot-checked and verified
- [ ] Error handling tested with intentionally bad files
- [ ] Ready for full batch processing
### Post-Production
- [ ] Full batch processed successfully
- [ ] Master Excel file validated
- [ ] All files accounted for (processed or failed)
- [ ] Backups verified
- [ ] Processing documentation complete
---
**Processed By:** _______________
**Date:** _______________
**Batch Size:** _______________
**Issues Found:** _______________
**Resolution:** _______________


@@ -0,0 +1,110 @@
# OCR to Excel Extraction Configuration Template
# Copy this file to your project and customize the settings
# API Configuration
api:
  provider: openrouter
  model: "mistral/pixtral-large-latest"
  api_key: ${OPENROUTER_API_KEY} # Set this environment variable
  endpoint: "https://openrouter.ai/api/v1/chat/completions"
  timeout: 60000 # milliseconds
  max_retries: 3
  retry_delay: 2000 # milliseconds
# File Paths
paths:
  source_folder: "./source-documents" # Folder containing files to process
  processed_folder: "./processed/done" # Where to move processed files
  master_file: "./master-file.xlsx" # Master Excel file to append data
  backup_folder: "./backups" # Automatic backups location
  log_folder: "./logs" # Processing logs location
# Extraction Fields
# Define the fields to extract from documents
extraction_fields:
  - name: date
    type: date
    format: "YYYY-MM-DD"
    required: true
    description: "Sales report date"
  - name: store_name
    type: string
    required: true
    description: "Tenant/store name"
  - name: sales_amount
    type: number
    format: "currency"
    required: true
    description: "Total sales amount"
  - name: part_timer_name
    type: string
    required: false
    description: "Employee name"
  - name: shift_hours
    type: number
    required: false
    description: "Hours worked"
# Metadata Fields (automatically added)
metadata_fields:
  - name: source_file
    description: "Original filename for audit trail"
  - name: processed_date
    description: "Date when file was processed"
  - name: confidence
    description: "OCR confidence score (0-1)"
# Processing Settings
processing:
  batch_size: 10 # Process N files before saving to Excel
  parallel_limit: 3 # Number of concurrent API calls
  confidence_threshold: 0.85 # Auto-approve if >= this score
  pause_on_low_confidence: true # Stop for human review on low confidence
  skip_duplicates: true # Skip files already processed
# File Type Settings
file_types:
  - pdf
  - xlsx
  - xls
  - msg
# Excel Configuration
excel:
  sheet_name: "Extracted Data" # Sheet to append data
  start_row: 2 # First data row (after headers)
  create_sheet_if_missing: true
  backup_before_write: true
  validate_headers: true # Ensure headers match field names
# Logging
logging:
  level: "info" # debug, info, warn, error
  log_to_file: true
  log_to_console: true
  include_timestamps: true
  include_confidence_scores: true
# Validation Rules
validation:
  required_fields: ["date", "store_name", "sales_amount"]
  date_range:
    min: "2020-01-01"
    max: "2030-12-31"
  amount_range:
    min: 0
    max: 1000000
# Error Handling
error_handling:
  on_api_failure: "retry" # retry, skip, halt
  on_parsing_error: "prompt" # prompt, skip, halt
  on_validation_error: "prompt" # prompt, skip, halt
  save_failed_files_list: true
  failed_files_log: "./logs/failed-files.json"


@@ -0,0 +1,402 @@
# OCR to Excel - Workflow Instructions
<critical>The workflow execution engine is governed by: {project-root}/bmad/core/tasks/workflow.xml</critical>
<critical>You MUST have already loaded and processed: {installed_path}/workflow.yaml</critical>
<workflow>
<step n="0" goal="Initialize and validate environment">
<action>Welcome the user to the OCR Data Extraction workflow</action>
<action>Explain that this workflow will process documents using OCR and extract data to Excel</action>
<check if="OPENROUTER_API_KEY environment variable exists">
<action>Validate API key is set</action>
<action>Set api_key_configured = true</action>
</check>
<check if="OPENROUTER_API_KEY not set">
<ask>**⚠️ OpenRouter API Key Required**
This workflow uses the OpenRouter API for Mistral OCR processing.
Please set your API key as an environment variable:
\`\`\`bash
export OPENROUTER_API_KEY="your-api-key-here"
\`\`\`
Options:
1. I've set the API key, continue
2. Exit and set it up first
What would you like to do?</ask>
<action>If user chooses option 1 → Verify key exists, continue</action>
<action>If user chooses option 2 → HALT with setup instructions</action>
</check>
</step>
<step n="1" goal="Configure extraction settings">
<ask>Let's set up your data extraction configuration.
Do you have an existing configuration file, or should we create one?
1. Use existing configuration file
2. Create new configuration with wizard
3. Use default configuration
Which option?</ask>
<action if="option 1">Ask for configuration file path</action>
<action if="option 1">Load and validate configuration</action>
<action if="option 2">Run configuration wizard (steps 2-5)</action>
<action if="option 3">Load default configuration from config-template.yaml</action>
<template-output>extraction_config</template-output>
</step>
<step n="2" goal="Configure source and destination paths" if="create_new_config">
<ask>**File Paths Configuration**
Please provide the following paths:
1. **Source Documents Folder:** Where are your files to process?
Example: ./source-documents or /path/to/files
2. **Master Excel File:** Where should extracted data be saved?
Example: ./master-data.xlsx
3. **Processed Files Folder:** Where to move processed files?
Example: ./processed/done
4. **Backup Folder:** Where to store Excel backups?
Example: ./backups</ask>
<action>Validate that source folder exists</action>
<action>Validate that master Excel file exists or can be created</action>
<action>Create processed and backup folders if they don't exist</action>
<template-output>source_folder</template-output>
<template-output>master_file</template-output>
<template-output>processed_folder</template-output>
<template-output>backup_folder</template-output>
</step>
<step n="3" goal="Configure extraction fields" if="create_new_config">
<ask>**Data Fields Configuration**
What fields should I extract from your documents?
Here are common fields for sales reports. Select all that apply:
- [ ] Date (sales report date)
- [ ] Store/Tenant Name
- [ ] Sales Amount
- [ ] Part Timer/Employee Name
- [ ] Shift Hours
- [ ] Invoice Number
- [ ] Custom field (specify)
Which fields do you need?</ask>
<action>Create extraction_fields configuration based on selections</action>
<action>For each custom field, ask for name, type, and whether it's required</action>
<template-output>extraction_fields</template-output>
</step>
<step n="4" goal="Configure processing settings" if="create_new_config">
<ask>**Processing Settings**
Configure how files should be processed:
1. **Batch Size:** How many files to process before saving? (default: 10)
2. **Parallel Processing:** Number of concurrent API calls (default: 3, max: 5)
3. **Confidence Threshold:** Auto-approve extractions with confidence >= ? (default: 85%)
4. **File Types:** Which file types to process? (pdf, xlsx, xls, msg)
Use defaults or customize?</ask>
<template-output>batch_size</template-output>
<template-output>parallel_limit</template-output>
<template-output>confidence_threshold</template-output>
<template-output>file_types</template-output>
</step>
<step n="5" goal="Save configuration" if="create_new_config">
<action>Generate complete configuration file</action>
<action>Save to {project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel/extraction-config.yaml</action>
<output>**✅ Configuration Saved**
Your extraction configuration has been saved and is ready to use.
You can edit it anytime at: extraction-config.yaml</output>
<template-output>config_file_path</template-output>
</step>
<step n="6" goal="Scan source documents">
<action>Scan the source folder recursively for supported file types</action>
<action>Filter out already-processed files (check processing log)</action>
<action>Build processing queue with file metadata</action>
<output>**📁 Document Scan Results**
Found {{total_files_found}} files:
- PDF: {{pdf_count}}
- Excel: {{excel_count}}
- MSG: {{msg_count}}
Already processed: {{already_processed_count}}
Ready to process: {{queue_length}}</output>
<ask>Ready to start processing {{queue_length}} files?
1. Yes, start batch processing
2. Show me the file list first
3. Cancel
What would you like to do?</ask>
<action if="option 2">Display paginated file list with details</action>
<action if="option 3">HALT</action>
<template-output>processing_queue</template-output>
</step>
<step n="7" goal="Process documents with OCR" repeat="for-each-file-in-queue">
<action>Select next file from queue</action>
<action>Display progress: "Processing {{current_file_number}}/{{total_files}}: {{filename}}"</action>
<action>Prepare file for OCR:
- If PDF: Convert pages to images
- If Excel: Read and prepare content
- If MSG: Extract and prepare content</action>
<action>Call Mistral OCR API via OpenRouter</action>
<action>Implement retry logic with exponential backoff (max 3 retries)</action>
<check if="API call successful">
<action>Extract OCR text from response</action>
<action>Parse text into structured data using field mappings</action>
<action>Calculate confidence score for extraction</action>
<action>Store extracted data with metadata</action>
</check>
<check if="API call failed">
<action>Log error details</action>
<action>Add file to failed_files list</action>
<ask>API call failed for {{filename}}: {{error_message}}
Options:
1. Retry this file
2. Skip and continue
3. Pause processing
What would you like to do?</ask>
</check>
<template-output>extraction_results</template-output>
</step>
<step n="8" goal="Validate extracted data">
<action>Review all extraction results</action>
<action>Separate into high_confidence and low_confidence groups</action>
<output>**🎯 Extraction Complete**
Successfully extracted: {{successful_count}} files
- High confidence (>= {{confidence_threshold}}%): {{high_confidence_count}}
- Low confidence (< {{confidence_threshold}}%): {{low_confidence_count}}
- Failed: {{failed_count}}</output>
<check if="low_confidence_count > 0">
<ask>**⚠️ Low Confidence Extractions Detected**
{{low_confidence_count}} files have confidence scores below {{confidence_threshold}}%.
These require human review before saving to Excel.
Options:
1. Review and correct each one
2. Auto-approve all (risky)
3. Skip low-confidence files for now
What would you like to do?</ask>
<action if="option 1">Present each low-confidence extraction for review</action>
<action if="option 2">Mark all as approved with warning</action>
<action if="option 3">Move to failed_files list</action>
</check>
</step>
<step n="9" goal="Review and correct low-confidence extractions" if="review_required" repeat="for-each-low-confidence">
<output>**📄 File:** {{filename}}
**Confidence:** {{confidence_score}}%
**Extracted Data:**
{{extracted_fields_display}}
**Raw OCR Text:**
{{ocr_text}}</output>
<ask>Please review and correct if needed:
1. Approve as-is
2. Edit field values
3. Skip this file
4. View raw OCR text
What would you like to do?</ask>
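<action if="option 1">Mark extraction as approved</action>
<action if="option 2">Prompt for corrected values for each field</action>
<action if="option 2">Update extraction_results with corrected data</action>
<action if="option 3">Move to skipped_files list</action>
<action if="option 4">Display the full raw OCR text, then repeat this prompt</action>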
<template-output>validated_extractions</template-output>
</step>
<step n="10" goal="Backup master Excel file">
<action>Check if master Excel file exists</action>
<check if="file exists">
<action>Create timestamped backup: master-file-YYYYMMDD-HHMMSS.xlsx</action>
<action>Save backup to backup_folder</action>
<output>✅ Created backup: {{backup_filename}}</output>
</check>
<check if="file not exists">
<ask>Master Excel file does not exist: {{master_file}}
Would you like to:
1. Create new Excel file
2. Specify different file
3. Cancel
What would you like to do?</ask>
<action if="option 1">Create new Excel file with proper headers</action>
<action if="option 2">Ask for new file path, validate, continue</action>
<action if="option 3">HALT</action>
</check>
<template-output>backup_file_path</template-output>
</step>
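One way the timestamped backup in the step above might look is sketched below. The function name and the optional `now` parameter are hypothetical, and deriving the backup name from the master file's own stem is an assumption; the step only specifies the `master-file-YYYYMMDD-HHMMSS.xlsx` pattern.

```python
import shutil
from datetime import datetime
from pathlib import Path
from typing import Optional


def backup_master_file(
    master_file: Path, backup_folder: Path, now: Optional[datetime] = None
) -> Path:
    """Copy the master file into backup_folder with a timestamp suffix."""
    now = now or datetime.now()
    backup_folder.mkdir(parents=True, exist_ok=True)
    # e.g. master-file.xlsx -> master-file-20251018-181311.xlsx
    backup_name = f"{master_file.stem}-{now:%Y%m%d-%H%M%S}{master_file.suffix}"
    backup_path = backup_folder / backup_name
    shutil.copy2(master_file, backup_path)  # copy2 also preserves timestamps
    return backup_path
```

The returned path corresponds to what the step stores as `backup_file_path`.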
<step n="11" goal="Write data to master Excel file">
<action>Load master Excel file</action>
<action>Validate that sheet exists (create if configured to do so)</action>
<action>Validate that headers match configured fields</action>
<action>For each validated extraction:
- Append new row with extracted data
- Add metadata (source_file, processed_date, confidence)
- Format cells according to field types</action>
<action>Save Excel file with atomic write operation</action>
<check if="write successful">
<output>✅ Successfully wrote {{row_count}} rows to {{master_file}}</output>
</check>
<check if="write failed">
<output>⚠️ Failed to write to Excel file: {{error_message}}</output>
<ask>Options:
1. Restore from backup and retry
2. Save extraction results to JSON file
3. Cancel
What would you like to do?</ask>
<action if="option 1">Restore backup, retry write operation</action>
<action if="option 2">Save results to {{output_folder}}/extraction-results-{{date}}.json</action>
<action if="option 3">HALT</action>
</check>
<template-output>excel_write_log</template-output>
</step>
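The "atomic write operation" mentioned in the step above is commonly approximated by writing to a temporary file in the same directory and then renaming it over the target. The sketch below shows that pattern under stated assumptions: `atomic_save` and its `save` callback are hypothetical names, and the callback stands in for whatever actually serialises the workbook (for example a call to a spreadsheet library's save method).

```python
import os
import tempfile
from pathlib import Path
from typing import Callable


def atomic_save(target: Path, save: Callable[[Path], None]) -> None:
    """Write via `save` to a temp file beside `target`, then swap it into place."""
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, suffix=target.suffix)
    os.close(fd)  # `save` reopens the path itself
    tmp_path = Path(tmp_name)
    try:
        save(tmp_path)
        # os.replace is atomic when both paths are on the same filesystem,
        # which is why the temp file lives next to the target.
        os.replace(tmp_path, target)
    except BaseException:
        tmp_path.unlink(missing_ok=True)  # leave no partial file behind
        raise
```

If `save` fails partway, the existing master file is left untouched, which complements the backup taken in step 10 rather than replacing it.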
<step n="12" goal="Move processed files">
<action>For each successfully processed file:
- Determine target path in processed_folder
- Maintain original folder structure if configured
- Move file to processed_folder
- Log file movement</action>
<output>📦 Moved {{moved_file_count}} files to {{processed_folder}}</output>
<template-output>file_movement_log</template-output>
</step>
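The file movement in the step above, including the optional preservation of folder structure, might be sketched like this. The function and parameter names are illustrative only; `maintain_structure` mirrors the `maintain_folder_structure` setting in the workflow configuration.

```python
import shutil
from pathlib import Path


def move_processed(
    file_path: Path,
    source_folder: Path,
    processed_folder: Path,
    maintain_structure: bool = True,
) -> Path:
    """Move one processed file and return its new location."""
    if maintain_structure:
        # Keep the path relative to the source folder, e.g. 2025/10/report.pdf
        relative = file_path.relative_to(source_folder)
    else:
        relative = Path(file_path.name)
    target = processed_folder / relative
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.move(str(file_path), str(target))
    return target
```

This sketch does not handle name collisions in the target folder; a real task would need a policy for that (rename, overwrite, or skip) and would log each move as the step requires.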
<step n="13" goal="Generate processing report">
<action>Compile processing statistics</action>
<action>Generate extraction report using template.md</action>
<action>Save report to {{output_folder}}/extraction-results-{{date}}.md</action>
<action>Create processing log with detailed audit trail</action>
<action>Save log to {{log_folder}}/processing-log-{{date}}.json</action>
<template-output>processing_report</template-output>
<template-output>processing_log</template-output>
</step>
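As an illustration of "Compile processing statistics" in the step above, the sketch below buckets confidence scores into the bands used by the report template (90-100%, 80-89%, 70-79%, below 70%) and computes the average. It assumes scores are percentages from 0 to 100; the function name is hypothetical, and the dictionary keys simply reuse the template's placeholder names.

```python
from typing import Dict, List


def confidence_stats(scores: List[float]) -> Dict[str, float]:
    """Summarise percentage scores (0-100) into the report's confidence bands."""
    stats: Dict[str, float] = {
        "count_90_100": 0,
        "count_80_89": 0,
        "count_70_79": 0,
        "count_below_70": 0,
    }
    for score in scores:
        if score >= 90:
            stats["count_90_100"] += 1
        elif score >= 80:
            stats["count_80_89"] += 1
        elif score >= 70:
            stats["count_70_79"] += 1
        else:
            stats["count_below_70"] += 1
    # Guard against an empty batch to avoid dividing by zero.
    stats["average_confidence"] = round(sum(scores) / len(scores), 1) if scores else 0.0
    return stats
```

If the extraction tasks store confidence as a fraction (as `confidence_threshold: 0.85` suggests), the scores would need multiplying by 100 before being passed in.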
<step n="14" goal="Final summary and next steps">
<output>**✅ Batch Processing Complete!**
**Summary:**
- Total files processed: {{total_files_processed}}
- Successfully extracted: {{successful_count}}
- Failed/skipped: {{failed_count}}
- Average confidence: {{average_confidence}}%
- Total processing time: {{total_duration}}
**Files Updated:**
- Master Excel: {{master_file}} ({{row_count}} new rows)
- Backup: {{backup_file_path}}
- Processing Report: {{processing_report}}
- Processing Log: {{processing_log}}
**Next Steps:**
1. Review extraction report: {{processing_report}}
2. Validate data in master Excel file
3. Process failed files manually if needed ({{failed_count}} files)
4. Archive processed files: {{processed_folder}}</output>
<check if="failed_count > 0">
<output>
**⚠️ Failed Files:**
The following files could not be processed:
{{failed_files_list}}
See processing log for details: {{processing_log}}</output>
</check>
<ask>Would you like to:
1. View the extraction report
2. Process failed files
3. Start another batch
4. Exit
What would you like to do?</ask>
<action if="option 1">Display report content</action>
<action if="option 2">Restart workflow with failed_files as queue</action>
<action if="option 3">Restart workflow from step 6</action>
<action if="option 4">HALT</action>
</step>
</workflow>

@ -0,0 +1,128 @@
# OCR Data Extraction Results: {{project_name}}
**Date:** {{date}}
**Processed By:** {{user_name}}
**Status:** {{processing_status}}
---
## Extraction Summary
**Total Files Processed:** {{total_files_processed}}
**Successfully Extracted:** {{successful_extractions}}
**Failed/Skipped:** {{failed_extractions}}
**Average Confidence Score:** {{average_confidence}}%
---
## Processing Configuration
**Source Folder:** {{source_folder}}
**Master Excel File:** {{master_file}}
**Extraction Fields:** {{extraction_fields_list}}
---
## Extraction Results
### High Confidence Extractions (>= {{confidence_threshold}}%)
{{high_confidence_results}}
### Low Confidence Extractions (Requires Review)
{{low_confidence_results}}
### Failed Extractions
{{failed_extractions_list}}
---
## Extracted Data Sample
| Date | Store Name | Part Timer Name | Shift Hours | Sales Amount | Source File | Confidence |
| ---- | ---------- | --------------- | ----------- | ------------ | ----------- | ---------- |
{{extraction_table_rows}}
---
## Processing Statistics
### Files by Type
- **PDF Files:** {{pdf_count}}
- **Excel Files:** {{excel_count}}
- **MSG Files:** {{msg_count}}
### Processing Time
- **Start Time:** {{start_time}}
- **End Time:** {{end_time}}
- **Total Duration:** {{total_duration}}
- **Average Time per File:** {{avg_time_per_file}}
### API Usage
- **Total API Calls:** {{total_api_calls}}
- **Successful Calls:** {{successful_api_calls}}
- **Failed Calls:** {{failed_api_calls}}
- **Retry Count:** {{retry_count}}
---
## Data Quality Metrics
### Confidence Distribution
- **90-100%:** {{count_90_100}} files
- **80-89%:** {{count_80_89}} files
- **70-79%:** {{count_70_79}} files
- **Below 70%:** {{count_below_70}} files
### Field Extraction Success Rates
- **Date:** {{date_success_rate}}%
- **Store Name:** {{store_name_success_rate}}%
- **Sales Amount:** {{sales_amount_success_rate}}%
- **Part Timer Name:** {{part_timer_name_success_rate}}%
- **Shift Hours:** {{shift_hours_success_rate}}%
---
## Files Requiring Human Review
{{files_needing_review}}
---
## Next Steps
- [ ] Review low-confidence extractions
- [ ] Manually process failed files
- [ ] Validate data in master Excel file
- [ ] Move processed files to archive
- [ ] Update extraction configuration if needed
---
## Audit Trail
### Files Processed
{{audit_trail_files}}
### Excel Write Operations
{{excel_write_log}}
### Errors and Warnings
{{errors_and_warnings}}
---
_This report was generated automatically by the OCR to Excel workflow._
_Master Excel File: {{master_file}}_
_Processing Log: {{processing_log_file}}_

@ -0,0 +1,67 @@
# OCR to Excel - Data Extraction Workflow Configuration
name: ocr-to-excel
description: "Automated document processing workflow using Mistral OCR to extract structured data from PDFs, Excel files, and Outlook messages, consolidating results into a master Excel spreadsheet"
author: "BMad"
# Critical variables
config_source: "{project-root}/bmad/bmm/config.yaml"
output_folder: "{config_source}:output_folder"
user_name: "{config_source}:user_name"
date: system-generated
# Data extraction configuration
extraction_config_file: "{project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel/extraction-config.yaml"
# Recommended input documents (required unless marked optional)
recommended_inputs:
- source_documents_folder: "Folder containing files to process (required)"
- master_excel_file: "Master Excel file to append data (required)"
- field_mapping_config: "Field mapping configuration (optional - will use defaults)"
# Module path and component files
installed_path: "{project-root}/bmad/bmm/workflows/data-extraction/ocr-to-excel"
template: "{installed_path}/template.md"
instructions: "{installed_path}/instructions.md"
validation: "{installed_path}/checklist.md"
config_template: "{installed_path}/config-template.yaml"
# Output configuration
default_output_file: "{output_folder}/extraction-results-{{date}}.md"
processing_log_file: "{output_folder}/logs/processing-log-{{date}}.json"
# Workflow settings
autonomous: false # Requires human validation of extracted data
batch_processing: true # Supports processing multiple files
parallel_processing: 3 # Process N files concurrently (default: 3)
confidence_threshold: 0.85 # Fraction from 0 to 1; extractions scoring >= 0.85 (85%) are auto-approved
# API configuration
api_provider: openrouter
api_model: "mistral/pixtral-large-latest"
api_endpoint: "https://openrouter.ai/api/v1/chat/completions"
# Processing configuration
supported_file_types:
- pdf
- xlsx
- xls
- msg
# File management
create_backups: true # Backup master Excel before each write
move_processed_files: true # Move processed files to the processed folder
maintain_folder_structure: true # Preserve original folder structure in the processed folder
web_bundle:
  name: "ocr-to-excel"
  description: "Automated document processing workflow using Mistral OCR to extract structured data from PDFs, Excel files, and Outlook messages"
  author: "BMad"
  instructions: "bmad/bmm/workflows/data-extraction/ocr-to-excel/instructions.md"
  validation: "bmad/bmm/workflows/data-extraction/ocr-to-excel/checklist.md"
  template: "bmad/bmm/workflows/data-extraction/ocr-to-excel/template.md"
  use_advanced_elicitation: true
  web_bundle_files:
    - "bmad/bmm/workflows/data-extraction/ocr-to-excel/template.md"
    - "bmad/bmm/workflows/data-extraction/ocr-to-excel/instructions.md"
    - "bmad/bmm/workflows/data-extraction/ocr-to-excel/checklist.md"
    - "bmad/bmm/workflows/data-extraction/ocr-to-excel/config-template.yaml"