function main_v4
Main entry point function for an invoice processing system that monitors an inbound directory for PDF invoices, processes them using LLM extraction, generates Excel outputs, and moves processed files to a processed directory.
/tf/active/vicechatdev/invoice_extraction/main.py
193 - 270
complex
Purpose
This function is the command-line interface and orchestrator for an automated invoice processing pipeline. It parses arguments, loads configuration, creates the required directories, discovers PDF files, processes each one through the InvoiceProcessor class, moves or deletes the files afterwards, and handles errors and logging. It is meant to be run as a standalone script: each run makes a single pass over the inbound directory rather than watching it continuously.
Source Code
def main():
    """Main entry point for the invoice processing system."""
    parser = argparse.ArgumentParser(description='Process invoices using LLM extraction')
    parser.add_argument('--config', help='Path to config file')
    parser.add_argument('--output', help='Output directory for Excel files')
    parser.add_argument('--log-level', choices=['DEBUG', 'INFO', 'WARNING', 'ERROR'], help='Logging level')
    args = parser.parse_args()

    # Load configuration
    config = load_config(args.config)

    # Override config with command line arguments
    if args.output:
        config.set("storage.path", args.output)
    if args.log_level:
        config.set("logging.log_level", args.log_level)

    # Create output directory if it doesn't exist
    output_dir = config.get("storage.path", "output")
    os.makedirs(output_dir, exist_ok=True)

    # Create processed directory if it doesn't exist
    inbound_dir = "inbound"
    processed_dir = os.path.join(inbound_dir, "processed")
    os.makedirs(processed_dir, exist_ok=True)

    # Initialize processor
    processor = InvoiceProcessor(config)
    logger.info(f"Monitoring {inbound_dir} for invoices...")

    try:
        # Check for PDF files in the inbound directory
        # Using a case-insensitive glob to catch both .pdf and .PDF files
        pdf_files = []
        pdf_files.extend(Path(inbound_dir).glob("*.pdf"))
        pdf_files.extend(Path(inbound_dir).glob("*.PDF"))

        if not pdf_files:
            logger.info(f"No PDF files found in {inbound_dir}")
            return 1

        logger.info(f"Found {len(pdf_files)} PDF files to process")

        for pdf_file in pdf_files:
            logger.info(f"Processing {pdf_file}")
            result = processor.process_invoice(str(pdf_file))

            if result['status'] == 'success':
                logger.info(f"Successfully processed invoice: {pdf_file}")
                logger.info(f"Output saved to: {result['excel_path']}")
                # Move the file to processed directory instead of deleting it
                destination = os.path.join(processed_dir, pdf_file.name)
                os.rename(str(pdf_file), destination)
                logger.info(f"Moved {pdf_file} to {destination}")
            else:
                logger.error(f"Failed to process invoice: {pdf_file}")
                logger.error(f"Error: {result.get('error', 'Unknown error')}")
                # Still delete failed files or potentially move to a different location
                os.remove(pdf_file)

        # Count how many files were processed
        processed = sum(1 for f in pdf_files if not f.exists())
        failed = len(pdf_files) - processed
        logger.info(f"Processed {len(pdf_files)} invoices: {processed} successful, {failed} failed")
        return 1

    except KeyboardInterrupt:
        logger.info("Processing interrupted by user")
        return 130
    except Exception as e:
        logger.error(f"Unhandled error: {str(e)}")
        logger.debug(traceback.format_exc())
        return 1
Return Value
Returns an integer exit code: 1 after a completed run (whether individual invoices succeeded or failed), 1 when no PDF files are found, 130 on keyboard interrupt (SIGINT), and 1 for unhandled exceptions. The function never returns 0, which is non-standard: shells and calling scripts conventionally treat 0 as success and any non-zero value as failure.
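A conventional mapping can be sketched as follows. This is an illustration under stated assumptions, not the behaviour of the documented function: the helper name exit_code_for is hypothetical, as is the choice of 1 for a batch with at least one failure.

```python
def exit_code_for(succeeded: int, failed: int) -> int:
    """Map batch results to a process exit code (hypothetical helper).

    0 -> every invoice was processed, or there was nothing to process
    1 -> at least one invoice failed
    """
    if succeeded < 0 or failed < 0:
        raise ValueError("counts must be non-negative")
    return 1 if failed else 0
```

Exit code 130 for a keyboard interrupt would stay as it is, since the function already follows that convention.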
Dependencies
argparse, os, pathlib, sys, traceback, typing, time
Required Imports
import os
import argparse
import logging
from pathlib import Path
import sys
import traceback
from typing import Dict, Any, List, Optional
import time
from config import load_config, Config
from utils.logging_utils import get_logger, PerformanceLogger
from utils.llm_client import LLMClient
from extractors.uk_extractor import UKExtractor
from extractors.be_extractor import BEExtractor
from extractors.au_extractor import AUExtractor
from validators.uk_validator import UKValidator
from validators.be_validator import BEValidator
from validators.au_validator import AUValidator
from core.document_processor import DocumentProcessor
from core.entity_classifier import EntityClassifier
from core.language_detector import LanguageDetector
from core.excel_generator import ExcelGenerator
Usage Example
# Run from command line:
# python main.py --config config.yaml --output ./results --log-level INFO
# Or call directly in code:
if __name__ == '__main__':
    exit_code = main()
    sys.exit(exit_code)
# Directory structure (created automatically if missing):
# inbound/ <- Place PDF invoices here
# inbound/processed/ <- Processed files moved here
# output/ <- Excel outputs saved here
# The function will:
# 1. Parse command-line arguments
# 2. Load configuration
# 3. Create necessary directories
# 4. Find all PDF files in 'inbound' directory
# 5. Process each invoice using InvoiceProcessor
# 6. Generate Excel output for each invoice
# 7. Move successfully processed PDFs to 'inbound/processed'
# 8. Delete failed PDFs
# 9. Log summary statistics
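Step 4 (finding the PDF files) could also be written as a single pass that compares the lowercased file suffix. The sketch below is an alternative to the two glob calls in the source, not the code the function uses; the helper name find_pdfs is hypothetical.

```python
from pathlib import Path
from typing import List


def find_pdfs(inbound_dir: str) -> List[Path]:
    """Return the PDF files directly inside inbound_dir.

    The suffix is compared in lowercase, so .pdf, .PDF and .Pdf all match,
    and each file appears exactly once. Subdirectories are skipped.
    """
    return sorted(
        p for p in Path(inbound_dir).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Because only regular files are returned, a subdirectory such as inbound/processed is never picked up as an invoice.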
Best Practices
- The function returns 1 for both success and failure cases, which is non-standard. Consider returning 0 for success and non-zero for errors.
- Failed invoice PDFs are deleted (os.remove), which may result in data loss. Consider moving them to a 'failed' directory instead.
- The function processes files synchronously in a loop. For large batches, consider implementing parallel processing.
- InvoiceProcessor is constructed inside the function from the loaded configuration, so make sure the configuration supplies everything the processor needs before running.
- The 'inbound' directory path is hardcoded. Consider making it configurable via command-line argument or config file.
- The function catches KeyboardInterrupt separately, returning exit code 130 (standard for SIGINT), which is good practice.
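- The summary count is unreliable. It treats a file as processed when it no longer exists at its original path (f.exists() is false), but successful files are moved and failed files are deleted, so both disappear. Every file is counted as successful and the failed count is always 0. Track outcomes with explicit counters instead.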
- Ensure proper logging configuration is set up before calling this function, as it relies on a pre-configured logger object.
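- The function creates directories with exist_ok=True, so repeated or overlapping runs do not fail on directory creation. The file handling itself is not protected against two runs picking up the same PDF.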
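- PDF discovery globs for *.pdf and *.PDF only, which is not truly case-insensitive: mixed-case extensions such as .Pdf are skipped, and on platforms where glob matching ignores case (for example Windows) both patterns can match the same file and list it twice. Comparing the lowercased suffix avoids both problems.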
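Two of the suggestions above, keeping failed files and counting outcomes explicitly, can be combined in one loop. The following is a minimal sketch under stated assumptions: the function name process_batch and the failed_dir parameter are hypothetical, and process stands in for processor.process_invoice, which the source shows returning a dict with a 'status' key.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Iterable


def process_batch(
    pdf_files: Iterable[Path],
    process: Callable[[str], Dict],
    processed_dir: Path,
    failed_dir: Path,
) -> Dict[str, int]:
    """Process each file, move it according to the outcome, and count outcomes."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    failed_dir.mkdir(parents=True, exist_ok=True)
    counts = {"succeeded": 0, "failed": 0}

    for pdf_file in pdf_files:
        result = process(str(pdf_file))
        ok = result.get("status") == "success"

        # Failed files are kept for inspection instead of being deleted.
        target_dir = processed_dir if ok else failed_dir
        # shutil.move also works when source and target are on different filesystems.
        shutil.move(str(pdf_file), str(target_dir / pdf_file.name))

        counts["succeeded" if ok else "failed"] += 1

    return counts
```

The returned counts do not depend on whether a file still exists at its original path, so the summary log line can report them directly.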
Similar Components
AI-powered semantic similarity - components with related functionality:
- class InvoiceProcessor 75.2% similar
- function main_v51 64.3% similar
- function main_v6 61.5% similar
- function main_v44 60.3% similar
- function main 60.2% similar