function main_v4

Maturity: 51

Main entry point for an invoice processing system. Each run scans the inbound directory once for PDF invoices, processes them using LLM extraction, generates Excel outputs, and moves successfully processed files to a processed subdirectory. Files that fail processing are deleted.

File: /tf/active/vicechatdev/invoice_extraction/main.py
Lines: 193 - 270
Complexity: complex

Purpose

This function is the command-line interface and orchestrator for an automated invoice processing pipeline. It handles argument parsing, configuration loading, directory setup, PDF file discovery, invoice processing via the InvoiceProcessor class, file management (moving successfully processed files, deleting failed ones), error handling, and logging. It is designed to be run as a standalone script that batch processes whatever invoices are in the inbound directory at the time of the run. Despite the "Monitoring" log message, it scans the directory once and exits rather than watching it continuously.
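
The override step (command-line values replacing values from the config file, with a default as the last fallback) can be sketched as follows. The project's real Config class is not shown in this page, so DottedConfig and apply_overrides are hypothetical stand-ins that only mimic the get/set calls with dotted keys seen in the source.

```python
from typing import Any, Optional


class DottedConfig:
    """Hypothetical stand-in for the project's Config class (real API not shown)."""

    def __init__(self, data: Optional[dict] = None) -> None:
        self._data = data or {}

    def get(self, key: str, default: Any = None) -> Any:
        # Walk the nested dict one dotted segment at a time.
        node: Any = self._data
        for part in key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node

    def set(self, key: str, value: Any) -> None:
        # Create intermediate dicts as needed, then assign the leaf.
        *parents, leaf = key.split(".")
        node = self._data
        for part in parents:
            node = node.setdefault(part, {})
        node[leaf] = value


def apply_overrides(config: DottedConfig, output: Optional[str], log_level: Optional[str]) -> None:
    """Command-line values win over file values; None means 'not given'."""
    if output:
        config.set("storage.path", output)
    if log_level:
        config.set("logging.log_level", log_level)


config = DottedConfig({"storage": {"path": "from_file"}})
apply_overrides(config, output="./results", log_level=None)
print(config.get("storage.path", "output"))     # ./results
print(config.get("logging.log_level", "INFO"))  # INFO (fallback default)
```

The precedence that results is: command-line argument, then config file value, then the default passed to get.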

Source Code

def main():
    """Main entry point for the invoice processing system."""
    parser = argparse.ArgumentParser(description='Process invoices using LLM extraction')
    parser.add_argument('--config', help='Path to config file')
    parser.add_argument('--output', help='Output directory for Excel files')
    parser.add_argument('--log-level', choices=['DEBUG', 'INFO', 'WARNING', 'ERROR'], help='Logging level')
    args = parser.parse_args()
    
    # Load configuration
    config = load_config(args.config)
    
    # Override config with command line arguments
    if args.output:
        config.set("storage.path", args.output)
    
    if args.log_level:
        config.set("logging.log_level", args.log_level)
    
    # Create output directory if it doesn't exist
    output_dir = config.get("storage.path", "output")
    os.makedirs(output_dir, exist_ok=True)
    
    # Create processed directory if it doesn't exist
    inbound_dir = "inbound"
    processed_dir = os.path.join(inbound_dir, "processed")
    os.makedirs(processed_dir, exist_ok=True)
    
    # Initialize processor
    processor = InvoiceProcessor(config)
    
    logger.info(f"Monitoring {inbound_dir} for invoices...")
    
    try:
        # Check for PDF files in the inbound directory
        # Using a case-insensitive glob to catch both .pdf and .PDF files
        pdf_files = []
        pdf_files.extend(Path(inbound_dir).glob("*.pdf"))
        pdf_files.extend(Path(inbound_dir).glob("*.PDF"))
        
        if not pdf_files:
            logger.info(f"No PDF files found in {inbound_dir}")
            return 1
            
        logger.info(f"Found {len(pdf_files)} PDF files to process")
        
        for pdf_file in pdf_files:
            logger.info(f"Processing {pdf_file}")
            result = processor.process_invoice(str(pdf_file))
            
            if result['status'] == 'success':
                logger.info(f"Successfully processed invoice: {pdf_file}")
                logger.info(f"Output saved to: {result['excel_path']}")
                
                # Move the file to processed directory instead of deleting it
                destination = os.path.join(processed_dir, pdf_file.name)
                os.rename(str(pdf_file), destination)
                logger.info(f"Moved {pdf_file} to {destination}")
            else:
                logger.error(f"Failed to process invoice: {pdf_file}")
                logger.error(f"Error: {result.get('error', 'Unknown error')}")
                # Still delete failed files or potentially move to a different location
                os.remove(pdf_file)
        
        # Count how many files were processed
        processed = sum(1 for f in pdf_files if not f.exists())
        failed = len(pdf_files) - processed
        
        logger.info(f"Processed {len(pdf_files)} invoices: {processed} successful, {failed} failed")
        
        return 1
        
    except KeyboardInterrupt:
        logger.info("Processing interrupted by user")
        return 130
    except Exception as e:
        logger.error(f"Unhandled error: {str(e)}")
        logger.debug(traceback.format_exc())
        return 1
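
The two glob calls in the source match only the exact spellings .pdf and .PDF. A possible alternative, shown here as a sketch rather than as the module's own code, compares the lowercased suffix so that each file is returned once regardless of extension case. The helper name find_pdfs is an assumption for illustration.

```python
from pathlib import Path
from typing import List


def find_pdfs(directory: str) -> List[Path]:
    """Return each PDF directly inside `directory` once, whatever the case of its extension."""
    return sorted(
        p for p in Path(directory).iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

Because it filters on is_file(), the processed subdirectory inside the inbound directory is skipped automatically.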

Return Value

Returns an integer exit code: 1 when the run completes (whether or not individual invoices failed), 1 when no PDF files are found, 1 for unhandled exceptions, and 130 for a keyboard interrupt (the shell convention for SIGINT). Because every path except the interrupt returns 1, a caller cannot distinguish success from failure. This is non-standard: conventionally 0 indicates success.
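
A minimal sketch of a more conventional mapping, assuming the batch step reports a (succeeded, failed) pair. The constant names and the run helper are illustrative and not part of the original module.

```python
from typing import Callable, Tuple

EXIT_OK = 0             # everything processed, or nothing to do
EXIT_FAILURE = 1        # at least one invoice failed, or an unexpected error
EXIT_INTERRUPTED = 130  # 128 + 2 (SIGINT), the shell convention for Ctrl-C


def run(batch: Callable[[], Tuple[int, int]]) -> int:
    """Run `batch` and translate its outcome into a process exit code."""
    try:
        succeeded, failed = batch()
    except KeyboardInterrupt:
        return EXIT_INTERRUPTED
    except Exception:
        return EXIT_FAILURE
    return EXIT_FAILURE if failed else EXIT_OK
```

With this mapping a scheduler or shell script can test the exit status directly instead of parsing the log.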

Dependencies

  • argparse
  • logging
  • os
  • pathlib
  • sys
  • traceback
  • typing
  • time

Required Imports

import os
import argparse
import logging
from pathlib import Path
import sys
import traceback
from typing import Dict, Any, List, Optional
import time
from config import load_config, Config
from utils.logging_utils import get_logger, PerformanceLogger
from utils.llm_client import LLMClient
from extractors.uk_extractor import UKExtractor
from extractors.be_extractor import BEExtractor
from extractors.au_extractor import AUExtractor
from validators.uk_validator import UKValidator
from validators.be_validator import BEValidator
from validators.au_validator import AUValidator
from core.document_processor import DocumentProcessor
from core.entity_classifier import EntityClassifier
from core.language_detector import LanguageDetector
from core.excel_generator import ExcelGenerator

Usage Example

# Run from command line:
# python main.py --config config.yaml --output ./results --log-level INFO

# Or call directly in code:
if __name__ == '__main__':
    exit_code = main()
    sys.exit(exit_code)

# Directory layout (created automatically if missing, relative to the working directory):
# inbound/               <- Place PDF invoices here
# inbound/processed/     <- Processed files moved here
# output/                <- Excel outputs saved here

# The function will:
# 1. Parse command-line arguments
# 2. Load configuration
# 3. Create necessary directories
# 4. Find all PDF files in 'inbound' directory
# 5. Process each invoice using InvoiceProcessor
# 6. Generate Excel output for each invoice
# 7. Move successfully processed PDFs to 'inbound/processed'
# 8. Delete failed PDFs
# 9. Log summary statistics

Best Practices

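  • The function returns 1 for both success and failure cases, including the case where no PDF files are found, so a calling shell or scheduler cannot tell a clean run from a failed one. Consider returning 0 for success and non-zero for errors.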
  • Failed invoice PDFs are deleted (os.remove), which may result in data loss. Consider moving them to a 'failed' directory instead.
  • The function processes files synchronously in a loop. For large batches, consider implementing parallel processing.
  • main() constructs InvoiceProcessor(config) itself, so make sure the loaded configuration supplies everything the processor requires before running the script.
  • The 'inbound' directory path is hardcoded. Consider making it configurable via command-line argument or config file.
  • The function catches KeyboardInterrupt separately, returning exit code 130 (standard for SIGINT), which is good practice.
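  • The summary counts are unreliable. The check `not f.exists()` is true for both moved (successful) and deleted (failed) files, so every file is counted as successful and the failed count is always 0. Track outcomes with explicit counters inside the loop instead.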
  • Ensure proper logging configuration is set up before calling this function, as it relies on a pre-configured logger object.
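  • The function creates directories with exist_ok=True, which makes directory setup idempotent: repeated runs do not fail on directories that already exist.
  • PDF discovery globs for *.pdf and *.PDF only. Mixed-case extensions such as .Pdf are skipped, and on Windows, where pathlib globbing ignores case, both patterns can return the same file, so it is listed twice. Comparing the lowercased suffix avoids both problems.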
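
Two of the points above (keeping failed PDFs and counting outcomes explicitly) can be combined in one loop. The following is a sketch under stated assumptions, not the module's implementation: process_batch, its parameters, and the failed directory are hypothetical, and process stands in for a callable like processor.process_invoice that returns a dict with a 'status' key.

```python
import shutil
from pathlib import Path
from typing import Callable, Dict, Iterable, Tuple


def process_batch(
    pdf_files: Iterable[Path],
    process: Callable[[str], Dict],
    processed_dir: Path,
    failed_dir: Path,
) -> Tuple[int, int]:
    """Process each file, archive it by outcome, and return (succeeded, failed)."""
    processed_dir.mkdir(parents=True, exist_ok=True)
    failed_dir.mkdir(parents=True, exist_ok=True)
    succeeded = failed = 0
    for pdf_file in pdf_files:
        try:
            result = process(str(pdf_file))
            ok = result.get("status") == "success"
        except Exception:
            ok = False  # one bad file should not abort the whole batch
        if ok:
            succeeded += 1
        else:
            failed += 1
        # Keep the original either way; failures go to a separate directory.
        target_dir = processed_dir if ok else failed_dir
        shutil.move(str(pdf_file), str(target_dir / pdf_file.name))
    return succeeded, failed
```

The counters are incremented where the outcome is known, so the summary no longer depends on checking whether the source file still exists. shutil.move is used instead of os.rename because it also works when the destination is on a different filesystem.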

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class InvoiceProcessor 75.2% similar

    Main orchestrator class that coordinates the complete invoice processing pipeline from PDF extraction through validation to Excel generation.

    From: /tf/active/vicechatdev/invoice_extraction/main.py
  • function main_v51 64.3% similar

    Entry point function that demonstrates document processing workflow by creating an audited, watermarked, and protected PDF/A document from a DOCX file with audit trail data.

    From: /tf/active/vicechatdev/document_auditor/main.py
  • function main_v6 61.5% similar

    Main entry point function for the Contract Validity Analyzer application that orchestrates configuration loading, logging setup, FileCloud connection, and contract analysis execution.

    From: /tf/active/vicechatdev/contract_validity_analyzer/main.py
  • function main_v44 60.3% similar

    Entry point function that parses command-line arguments and orchestrates the FileCloud email processing workflow to find, download, and convert .msg files.

    From: /tf/active/vicechatdev/msg_to_eml.py
  • function main 60.2% similar

    Main entry point function for a Legal Contract Data Extractor application that processes contracts from FileCloud, extracts data, and exports results to multiple formats (CSV, Excel, JSON).

    From: /tf/active/vicechatdev/contract_validity_analyzer/extractor.py