
function extract_metadata_pdf

Maturity: 55

Extracts metadata from PDF files, including title, author, creation date, page count, and other document properties, using the PyPDF2 library.

File:
/tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines:
220 - 274
Complexity:
moderate

Purpose

This function reads PDF files and extracts embedded metadata such as title, author, creator, producer, subject, keywords, creation/modification dates, and page count. It handles various edge cases including missing PyPDF2 library, malformed dates, byte-encoded titles, and missing metadata fields. When metadata extraction fails or PyPDF2 is unavailable, it falls back to using the filename as the title.
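The filename fallback described above can be illustrated in isolation. This is a minimal sketch; `fallback_title` is a hypothetical helper name used only for illustration, wrapping the same `os.path` expression that appears in the source.

```python
import os


def fallback_title(file_path: str) -> str:
    """Return the filename without its directory or final extension."""
    return os.path.splitext(os.path.basename(file_path))[0]


# Only the last extension is stripped, so inner dots are preserved.
print(fallback_title("/path/to/Quarterly Report.pdf"))  # Quarterly Report
print(fallback_title("archive/report.v2.pdf"))          # report.v2
```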

Source Code

def extract_metadata_pdf(file_path: str) -> Dict[str, Any]:
    """
    Extract metadata from a PDF file.
    
    Args:
        file_path: Path to PDF file
        
    Returns:
        Dictionary with extracted metadata
    """
    if not PYPDF2_AVAILABLE:
        logger.warning("PyPDF2 library not available. Cannot extract PDF metadata.")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}
        
    try:
        with open(file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            # reader.metadata is None when the PDF has no document info dictionary
            info = reader.metadata or {}
            
            # PyPDF2 returns metadata as a dictionary-like object
            metadata = {
                'title': info.get('/Title', os.path.splitext(os.path.basename(file_path))[0]),
                'author': info.get('/Author', ''),
                'creator': info.get('/Creator', ''),
                'producer': info.get('/Producer', ''),
                'subject': info.get('/Subject', ''),
                'keywords': info.get('/Keywords', ''),
                'created': info.get('/CreationDate', ''),
                'modified': info.get('/ModDate', ''),
                'pageCount': len(reader.pages)
            }
            
            # Clean up PDF date format if present
            for date_field in ['created', 'modified']:
                if isinstance(metadata[date_field], str) and metadata[date_field].startswith('D:'):
                    try:
                        # PDF dates are in format D:YYYYMMDDHHmmSSOHH'mm'
                        # Keep YYYYMMDDHHmm only; seconds and the UTC offset are dropped
                        date_str = metadata[date_field][2:14]
                        metadata[date_field] = datetime.strptime(date_str, '%Y%m%d%H%M')
                    except ValueError:
                        metadata[date_field] = ''
                        
            # Decode the title if it was returned as bytes
            if isinstance(metadata['title'], bytes):
                metadata['title'] = metadata['title'].decode('utf-8', errors='ignore')
                
            # If still no title, use filename
            if not metadata['title']:
                metadata['title'] = os.path.splitext(os.path.basename(file_path))[0]
                
            return metadata
            
    except Exception as e:
        logger.error(f"Error extracting PDF metadata: {e}")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}

Parameters

Name       Type  Default  Kind
file_path  str   -        positional_or_keyword

Parameter Details

file_path: String path to the PDF file to extract metadata from. Must be a valid path to an existing PDF file that can be opened in binary read mode. The path can be absolute or relative.
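A caller that wants to confirm the path before invoking the function could use a check along these lines. This is a sketch under assumptions: `looks_like_pdf` is a hypothetical helper that is not part of the documented module, and it relies on the common case of a PDF beginning with the bytes `%PDF-`.

```python
import os
import tempfile


def looks_like_pdf(file_path: str) -> bool:
    """Return True if the path is a readable file that starts with the PDF header."""
    if not os.path.isfile(file_path):
        return False
    try:
        with open(file_path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        return False


# Demonstration on a throwaway file that contains only a PDF header line.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.pdf")
    with open(sample, "wb") as f:
        f.write(b"%PDF-1.7\n")
    exists_ok = looks_like_pdf(sample)
    missing_ok = looks_like_pdf(os.path.join(tmp, "missing.pdf"))

print(exists_ok, missing_ok)  # True False
```

The check is deliberately shallow: it does not prove that the file is a well-formed PDF, only that it exists, is readable, and carries the expected header.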

Return Value

Type: Dict[str, Any]

Returns a dictionary containing PDF metadata with the keys 'title' (str), 'author' (str), 'creator' (str), 'producer' (str), 'subject' (str), 'keywords' (str), 'created' (datetime or str), 'modified' (datetime or str), and 'pageCount' (int). If extraction fails or PyPDF2 is unavailable, it returns a minimal dictionary whose only key, 'title', is set to the filename without its extension. Date values in PDF format (D:YYYYMMDDHHmmSS...) are converted to naive datetime objects at minute precision; the seconds and the timezone offset are discarded. A date that starts with 'D:' but cannot be parsed becomes an empty string, and a value that does not start with 'D:' is returned unchanged.
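The date handling can be shown on its own. The sketch below mirrors the clean-up loop in the source; `parse_pdf_date` is a hypothetical standalone name introduced here for illustration, not a function of the module.

```python
from datetime import datetime
from typing import Union


def parse_pdf_date(raw: str) -> Union[datetime, str]:
    """Convert 'D:YYYYMMDDHHmm...' to a naive datetime, as the source loop does."""
    if not (isinstance(raw, str) and raw.startswith("D:")):
        return raw  # not in PDF date format, so it is left unchanged
    try:
        # Characters 2..13 hold YYYYMMDDHHmm; seconds and the offset are ignored.
        return datetime.strptime(raw[2:14], "%Y%m%d%H%M")
    except ValueError:
        return ""  # malformed or shortened date


print(parse_pdf_date("D:20230415103045+02'00'"))  # 2023-04-15 10:30:00
print(repr(parse_pdf_date("D:2023")))              # ''
print(parse_pdf_date("15 April 2023"))             # 15 April 2023
```

Because the offset is dropped, two documents created at the same instant in different time zones yield different `datetime` values; callers that compare dates across sources may want to parse the offset themselves.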

Dependencies

  • PyPDF2
  • logging
  • os
  • datetime

Required Imports

import logging
import os
from typing import Dict, Any
from datetime import datetime

Conditional/Optional Imports

These imports are only needed under specific conditions:

import PyPDF2

Condition: Required for PDF metadata extraction. If not available, function returns fallback metadata with only filename as title. The code checks PYPDF2_AVAILABLE flag before attempting to use PyPDF2.
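The source excerpt does not show how `PYPDF2_AVAILABLE` is set. A common pattern, given here as an assumption rather than the module's actual code, is a guarded import at module level:

```python
import logging

logger = logging.getLogger(__name__)

# Assumed pattern: try the optional import once and record the outcome in a flag.
try:
    import PyPDF2  # optional dependency
    PYPDF2_AVAILABLE = True
except ImportError:
    PyPDF2 = None
    PYPDF2_AVAILABLE = False
    logger.info("PyPDF2 not installed; PDF metadata will fall back to the filename.")
```

With this arrangement the module still imports cleanly when PyPDF2 is missing, and every function that needs the library can test the flag first.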

Optional

Usage Example

import os
import logging
from typing import Dict, Any
from datetime import datetime
import PyPDF2

# Setup logger
logger = logging.getLogger(__name__)
PYPDF2_AVAILABLE = True

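# extract_metadata_pdf must be defined exactly as shown under Source Code
# (or imported from the module that contains it) before running this example.
# A stub that only contains `pass` would return None and break the calls below.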

# Example usage
pdf_path = '/path/to/document.pdf'
metadata = extract_metadata_pdf(pdf_path)

print(f"Title: {metadata.get('title')}")
print(f"Author: {metadata.get('author')}")
print(f"Page Count: {metadata.get('pageCount')}")
print(f"Created: {metadata.get('created')}")

# A missing 'pageCount' key means PyPDF2 was unavailable or extraction failed
if 'pageCount' not in metadata:
    print("Full metadata extraction unavailable, using filename only")

Best Practices

  • The function checks the PYPDF2_AVAILABLE flag itself, so callers need not guard the call; verify that PyPDF2 is installed in production environments where full metadata is required
  • Handle the case where only 'title' key is returned (when PyPDF2 is unavailable or extraction fails)
  • Ensure the file_path points to a valid, readable PDF file before calling
  • Be aware that date fields may be datetime objects or empty strings depending on PDF metadata format
  • The function gracefully degrades to filename-based title when metadata extraction fails
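  • Extraction errors are caught and logged inside the function rather than raised, so detect failure by checking for a missing 'pageCount' key or by inspecting the log; a try-except block around the call is a useful extra safeguard in critical applications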
  • Note that metadata quality depends on the PDF creator - some PDFs may have minimal or no embedded metadata
  • The function opens files in binary mode and properly closes them using context manager
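One way to handle the title-only result mentioned in the list above is to normalize every return value before use. This is a sketch; `normalize_pdf_metadata`, `DEFAULT_FIELDS`, and the 'is_complete' key are hypothetical names, and the default values are illustrative choices rather than anything the module defines.

```python
from typing import Any, Dict

# Defaults for keys that are absent from a title-only result (illustrative choices).
DEFAULT_FIELDS: Dict[str, Any] = {
    "author": "", "creator": "", "producer": "", "subject": "",
    "keywords": "", "created": "", "modified": "", "pageCount": None,
}


def normalize_pdf_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with every expected key present and an 'is_complete' flag."""
    normalized = {**DEFAULT_FIELDS, **metadata}
    # 'pageCount' is only set when extraction succeeded.
    normalized["is_complete"] = "pageCount" in metadata
    return normalized


partial = normalize_pdf_metadata({"title": "document"})
print(partial["is_complete"], partial["pageCount"])  # False None
```

Downstream code can then read any key without branching and use 'is_complete' to decide whether to retry or flag the document.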

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_metadata_docx 68.4% similar

    Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_excel_metadata 63.1% similar

    Extracts comprehensive metadata from Excel files including cell comments, merged regions, named ranges, document properties, and sheet-level information that standard pandas operations miss.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function extract_metadata 62.7% similar

    Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function process_document 60.6% similar

    Processes a document file (DOCX, DOC, or PDF) and extracts comprehensive metadata including file information, content metadata, and cryptographic hash.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function read_excel_file 58.0% similar

    Reads Excel files and returns either metadata for all sheets or detailed data for a specific sheet, including format validation, European decimal conversion, and rich metadata extraction.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py