function main_v6
Command-line interface function that orchestrates PDF document analysis using OCR and LLM processing, with configurable input/output paths and processing limits.
/tf/active/vicechatdev/mailsearch/document_analyzer.py
563 - 617
moderate
Purpose
Serves as the main entry point for a document analysis application that reads PDF files from a download register, processes them using a DocumentAnalyzer class (which performs OCR and LLM analysis), and saves structured results. Designed for batch processing of PDF documents with progress tracking and error handling.
Source Code
def main():
    """Main execution function"""
    import argparse
    parser = argparse.ArgumentParser(description="Analyze downloaded PDF documents")
    parser.add_argument(
        '--register',
        default='./output/download_register.csv',
        help='Path to download register CSV'
    )
    parser.add_argument(
        '--limit',
        type=int,
        default=None,
        help='Limit number of documents to process (for testing)'
    )
    parser.add_argument(
        '--output-dir',
        default='./output',
        help='Output directory for results'
    )
    args = parser.parse_args()
    print(f"\n{'='*80}")
    print("Document Analyzer - PDF Analysis with OCR and LLM")
    print(f"{'='*80}\n")
    try:
        # Initialize analyzer
        analyzer = DocumentAnalyzer(output_dir=args.output_dir)
        # Process documents
        results = analyzer.process_documents_from_register(
            register_path=args.register,
            limit=args.limit
        )
        # Save results
        analyzer.save_results(results)
        # Summary
        successful = sum(1 for r in results if r['success'])
        failed = len(results) - successful
        print(f"\n{'='*80}")
        print("Processing Complete!")
        print(f" Total documents: {len(results)}")
        print(f" Successful: {successful}")
        print(f" Failed: {failed}")
        print(f"{'='*80}\n")
    except Exception as e:
        logger.error(f"Fatal error: {e}")
        raise
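To see how the three options resolve without running the analyzer, the parser can be rebuilt on its own. This is a minimal sketch: build_parser() is a hypothetical helper that does not exist in the original module, and the path /tmp/results is only an illustration. The argument names, types, and defaults are copied from the source above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Rebuild the parser that main() defines inline (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Analyze downloaded PDF documents")
    parser.add_argument('--register', default='./output/download_register.csv',
                        help='Path to download register CSV')
    parser.add_argument('--limit', type=int, default=None,
                        help='Limit number of documents to process (for testing)')
    parser.add_argument('--output-dir', default='./output',
                        help='Output directory for results')
    return parser


# Passing an explicit list keeps argparse from reading sys.argv.
defaults = build_parser().parse_args([])
custom = build_parser().parse_args(['--limit', '10', '--output-dir', '/tmp/results'])

print(defaults.register, defaults.limit, defaults.output_dir)
print(custom.limit, custom.output_dir)
```

Note that argparse exposes --output-dir as the attribute output_dir, which is why main() reads args.output_dir.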
Return Value
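Returns None. The function works through side effects: it prints status messages to stdout, processes documents through DocumentAnalyzer, and saves results via analyzer.save_results(). Any exception raised during processing is logged with logger.error and then re-raised. Invalid command-line arguments cause argparse to exit with SystemExit before processing starts.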
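The summary printed at the end depends only on a 'success' key in each result. The sketch below isolates that counting logic, assuming each result is a dict with a boolean 'success' field. The summarize() helper and the 'file' key are hypothetical and used only for illustration.

```python
from typing import Any, Dict, List, Tuple


def summarize(results: List[Dict[str, Any]]) -> Tuple[int, int, int]:
    """Return (total, successful, failed), counted the same way main() does."""
    successful = sum(1 for r in results if r['success'])
    failed = len(results) - successful
    return len(results), successful, failed


# Hypothetical result records; only the 'success' key is confirmed by the source.
sample = [
    {'file': 'a.pdf', 'success': True},
    {'file': 'b.pdf', 'success': False},
    {'file': 'c.pdf', 'success': True},
]

total, ok, bad = summarize(sample)
print(f"Total documents: {total}, Successful: {ok}, Failed: {bad}")
```

A result that lacks the 'success' key would raise KeyError in this expression, which main() would then log as a fatal error.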
Dependencies
- argparse
- logging
- csv
- json
- pathlib
- datetime
- typing
- numpy
- pdf2image
- pytesseract
- easyocr
- PIL
- openai
Required Imports
import argparse
import logging
import csv
import json
from pathlib import Path
from datetime import datetime
from typing import Dict, List, Optional, Any
import numpy as np
from pdf2image import convert_from_path
import pytesseract
import easyocr
from PIL import Image
from openai import OpenAI
Conditional/Optional Imports
These imports are only needed under specific conditions:
import argparse
Condition: imported inside the function body, only when main() is called
Required (conditional)
Usage Example
# Run from command line with default settings:
# python script.py
# Run with custom register path and limit:
# python script.py --register /path/to/register.csv --limit 10 --output-dir /path/to/output
# In Python code (if calling directly):
if __name__ == '__main__':
    main()
# The function expects to be run as a script entry point and will:
# 1. Parse command-line arguments
# 2. Initialize DocumentAnalyzer with output directory
# 3. Process documents from the register CSV
# 4. Save results and print summary statistics
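When the entry point is invoked from another script, it can help to configure logging first and turn exceptions into exit codes. The following is a sketch under stated assumptions: run() is a hypothetical wrapper, not part of the original module, and the exit codes 1 and 130 are conventions chosen here rather than behavior of main().

```python
import logging

logger = logging.getLogger(__name__)


def run(entry_point) -> int:
    """Call entry_point() and map the outcome to a process exit code."""
    # Configure logging before the entry point runs so its messages are captured.
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(levelname)s %(message)s')
    try:
        entry_point()
    except KeyboardInterrupt:
        # KeyboardInterrupt is not a subclass of Exception, so it needs its own clause.
        logger.warning("Interrupted by user")
        return 130
    except Exception:
        # main() already logs the fatal error before re-raising; do not log it twice.
        return 1
    return 0


# Stand-in entry point for demonstration; in the real script this would be main.
exit_code = run(lambda: None)
print(exit_code)

# Typical use in the script itself (assumes main is defined in the same module):
# if __name__ == '__main__':
#     import sys
#     sys.exit(run(main))
```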
Best Practices
- This function should be called as the main entry point of the script using if __name__ == '__main__': main()
- Ensure DocumentAnalyzer class is properly defined before calling this function
- Configure logging before calling main() to capture all log messages
- The function expects a CSV register file with specific format - ensure compatibility
- Use --limit parameter during testing to avoid processing large document sets
- Ensure sufficient disk space in output directory for results
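- Handle keyboard interrupts gracefully when processing large batches; the except Exception block in main() does not catch KeyboardInterrupt, so an interrupt propagates to the caller
- The function logs and re-raises exceptions on fatal errors, so wrap the call in try-except if calling programmatically; note that main() always parses sys.argv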
- Verify all system dependencies (Tesseract, poppler) are installed before running
- Set appropriate OpenAI API credentials before execution
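The dependency and credential checks in the list above can be sketched as a pre-flight function. Several things here are assumptions: preflight() is a hypothetical helper, pdftoppm is checked because it is one of the poppler binaries that pdf2image calls, and OPENAI_API_KEY is the environment variable the openai client reads by default. Whether DocumentAnalyzer obtains its credentials that way is not confirmed by the source.

```python
import os
import shutil
from typing import List, Mapping, Optional


def preflight(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems: List[str] = []

    # System binaries: Tesseract for pytesseract, pdftoppm (poppler) for pdf2image.
    for tool in ('tesseract', 'pdftoppm'):
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' not found on PATH")

    # Assumption: the analyzer relies on the openai client's default credential lookup.
    if not env.get('OPENAI_API_KEY'):
        problems.append("OPENAI_API_KEY is not set")

    return problems


for problem in preflight():
    print(f"Pre-flight check failed: {problem}")
```

Returning a list instead of raising on the first failure lets the caller report every missing prerequisite in one pass.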
Similar Components
AI-powered semantic similarity - components with related functionality:
- function main_v5 73.7% similar
- function main_v75 67.9% similar
- function main_v91 67.4% similar
- class DocumentAnalyzer 65.4% similar
- function main_v8 65.3% similar