function process_multi_page_pdf
A convenience wrapper function that processes multi-page PDF files and extracts analysis data from each page along with document metadata.
/tf/active/vicechatdev/e-ink-llm/multi_page_processor.py
374 - 386
simple
Purpose
This function provides a simplified interface for processing PDF documents with multiple pages. It instantiates a MultiPagePDFProcessor, extracts content and analysis from all pages (up to a specified maximum), and returns structured data about each page along with overall document metadata. It's designed for use cases requiring automated PDF content extraction, document analysis pipelines, or batch processing of PDF files.
Source Code
def process_multi_page_pdf(pdf_path: str, max_pages: int = 50) -> Tuple[List[PageAnalysis], Dict[str, Any]]:
    """
    Convenience function to process multi-page PDF

    Args:
        pdf_path: Path to PDF file
        max_pages: Maximum pages to process

    Returns:
        Tuple of (page analyses, document metadata)
    """
    processor = MultiPagePDFProcessor(max_pages=max_pages)
    return processor.extract_all_pages(Path(pdf_path))
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| pdf_path | str | - | positional_or_keyword |
| max_pages | int | 50 | positional_or_keyword |
Parameter Details
pdf_path: String giving the file system path to the PDF file to process. Can be an absolute or a relative path. The file must exist and be a valid PDF.
max_pages: Integer specifying the maximum number of pages to process from the PDF. Defaults to 50. This parameter helps control processing time and resource usage for large documents. If the PDF has fewer pages than max_pages, all pages will be processed.
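The cap described for max_pages amounts to taking the smaller of the document's page count and the limit. A minimal sketch of that arithmetic follows; `pages_to_process` is a hypothetical helper written for illustration, not part of the module, and the actual processor may apply the limit differently.

```python
def pages_to_process(total_pages: int, max_pages: int = 50) -> int:
    """Hypothetical helper: number of pages handled under the max_pages cap."""
    # Assumption: the processor reads pages from the start until the cap is hit.
    return min(total_pages, max_pages)


print(pages_to_process(12))   # 12: document is shorter than the cap
print(pages_to_process(120))  # 50: capped at max_pages
```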
Return Value
Type: Tuple[List[PageAnalysis], Dict[str, Any]]
Returns a tuple containing two elements: (1) A list of PageAnalysis objects, where each object contains analysis data for a single page including extracted text, images, layout information, and other page-specific metadata. (2) A dictionary containing document-level metadata such as total page count, file information, processing statistics, and other document-wide properties. The exact structure of PageAnalysis and metadata dictionary depends on the MultiPagePDFProcessor implementation.
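Because the result is a plain tuple, callers can unpack it and compare the number of returned analyses with the document's reported page count to detect when max_pages cut the document short. The sketch below assumes the metadata dictionary exposes a `total_pages` key, as the usage example on this page suggests; `was_truncated` is a hypothetical helper, and the key may be named differently or be absent in the real implementation.

```python
from typing import Any, Dict, List, Tuple


def was_truncated(result: Tuple[List[Any], Dict[str, Any]]) -> bool:
    """Hypothetical helper: True if fewer pages were analysed than the document reports."""
    page_analyses, metadata = result
    total = metadata.get("total_pages")  # assumed key; returns None if missing
    return isinstance(total, int) and total > len(page_analyses)
```

It could be called as `was_truncated(process_multi_page_pdf('document.pdf', max_pages=10))`.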
Dependencies
fitz, PyMuPDF, Pillow, pathlib, typing, dataclasses, logging, base64, io, sys
Required Imports
from pathlib import Path
from typing import List, Dict, Any, Tuple
Usage Example
from pathlib import Path

# Assumes multi_page_processor.py is importable from the current environment
from multi_page_processor import process_multi_page_pdf

# Process a PDF with default settings (max 50 pages)
page_analyses, metadata = process_multi_page_pdf('document.pdf')

# Access results
print(f"Processed {len(page_analyses)} pages")
print(f"Total pages in document: {metadata.get('total_pages')}")

# Iterate through page analyses
for i, page_analysis in enumerate(page_analyses):
    print(f"Page {i+1}: {page_analysis}")

# Process with custom max_pages limit
page_analyses, metadata = process_multi_page_pdf(
    pdf_path='/path/to/large_document.pdf',
    max_pages=10
)

# Using Path object
pdf_file = Path('reports/annual_report.pdf')
page_analyses, metadata = process_multi_page_pdf(str(pdf_file), max_pages=100)
Best Practices
- Ensure the PDF file exists and is readable before calling this function to avoid file not found errors
- Set an appropriate max_pages value based on your memory constraints and processing requirements
- Handle potential exceptions from PDF processing (corrupted files, permission issues, etc.)
- This function accepts only max_pages, not a start page or page range, so it cannot process a document in batches on its own; for very large PDFs, lower max_pages or split the file beforehand
- The function returns all data in memory, so be cautious with very large documents
- The wrapper creates a new MultiPagePDFProcessor on every call, so no prior initialization is needed; instantiate MultiPagePDFProcessor directly if you need to reuse or configure the processor
- Consider logging or error handling around this function call in production environments
- The pdf_path parameter is annotated as str, but the function passes it to Path(), so a Path object also works at runtime; convert with str() if you want to satisfy type checkers
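Several of the practices above (checking that the file exists, handling processing errors, logging) can be combined in a small defensive wrapper. The following is a sketch under stated assumptions: `safe_process_pdf` is a hypothetical helper, the processing function is passed in as an argument so the sketch does not depend on the module being importable, and because the exceptions raised by the processor are not documented here, it catches broadly.

```python
import logging
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple

logger = logging.getLogger(__name__)

# Shape of the processing callable; in real use this would be process_multi_page_pdf.
Processor = Callable[[str, int], Tuple[List[Any], Dict[str, Any]]]


def safe_process_pdf(
    process: Processor,
    pdf_path: str,
    max_pages: int = 50,
) -> Optional[Tuple[List[Any], Dict[str, Any]]]:
    """Hypothetical helper: validate inputs, then call `process` defensively."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    path = Path(pdf_path)
    if not path.is_file():
        logger.error("PDF not found: %s", path)
        return None
    try:
        return process(str(path), max_pages)
    except Exception:
        # Assumption: the processor's exception types are unknown,
        # so catch broadly and log the traceback instead of crashing.
        logger.exception("Failed to process %s", path)
        return None
```

A call would look like `safe_process_pdf(process_multi_page_pdf, 'document.pdf', max_pages=10)`; a return value of None signals that the file was missing or processing failed.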
Similar Components
AI-powered semantic similarity - components with related functionality:
- class MultiPagePDFProcessor 68.7% similar
- function process_single_file 58.3% similar
- class MultiPageAnalysisResult 57.2% similar
- class MultiPageLLMHandler 55.4% similar
- function test_enhanced_pdf_processing 54.7% similar