extract_metadata_pdf - Code Extractor

function extract_metadata_pdf

Maturity: 55

Extracts metadata from PDF files, including title, author, creation date, page count, and other document properties, using the PyPDF2 library.

File:
/tf/active/vicechatdev/CDocs/utils/document_processor.py

Lines:
220 - 274

Complexity:
moderate

Purpose

This function reads a PDF file and extracts its embedded metadata: title, author, creator, producer, subject, keywords, creation and modification dates, and page count. It handles several edge cases: a missing PyPDF2 library, malformed dates, byte-encoded titles, and missing metadata fields. When PyPDF2 is unavailable or extraction raises an exception, it returns a dictionary containing only a title derived from the filename. When the PDF has no usable title, the filename without its extension is used as the title.
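The filename fallback described above reduces to two `os.path` calls. A minimal sketch follows; `fallback_title` is a hypothetical helper name used only for illustration and does not appear in the original module.

```python
import os


def fallback_title(file_path: str) -> str:
    """Return the file name without its directory or final extension."""
    # basename drops the directory part; splitext drops only the last extension.
    return os.path.splitext(os.path.basename(file_path))[0]


print(fallback_title("/path/to/document.pdf"))   # document
print(fallback_title("reports/q3.summary.pdf"))  # q3.summary
```

Only the last extension is removed, so a name such as `q3.summary.pdf` keeps its inner dot.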

Source Code

def extract_metadata_pdf(file_path: str) -> Dict[str, Any]:
    """
    Extract metadata from a PDF file.
    
    Args:
        file_path: Path to PDF file
        
    Returns:
        Dictionary with extracted metadata
    """
    if not PYPDF2_AVAILABLE:
        logger.warning("PyPDF2 library not available. Cannot extract PDF metadata.")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}
        
    try:
        with open(file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            info = reader.metadata
            
            # PyPDF2 returns metadata as a dictionary-like object
            metadata = {
                'title': info.get('/Title', os.path.splitext(os.path.basename(file_path))[0]),
                'author': info.get('/Author', ''),
                'creator': info.get('/Creator', ''),
                'producer': info.get('/Producer', ''),
                'subject': info.get('/Subject', ''),
                'keywords': info.get('/Keywords', ''),
                'created': info.get('/CreationDate', ''),
                'modified': info.get('/ModDate', ''),
                'pageCount': len(reader.pages)
            }
            
            # Clean up PDF date format if present
            for date_field in ['created', 'modified']:
                if isinstance(metadata[date_field], str) and metadata[date_field].startswith('D:'):
                    try:
                        # PDF dates use the format D:YYYYMMDDHHmmSSOHH'mm'
                        date_str = metadata[date_field][2:14]  # Extract YYYYMMDDHHmm (seconds and UTC offset are dropped)
                        metadata[date_field] = datetime.strptime(date_str, '%Y%m%d%H%M')
                    except ValueError:
                        metadata[date_field] = ''
                        
            # Decode the title to str if PyPDF2 returned it as bytes
            if isinstance(metadata['title'], bytes):
                metadata['title'] = metadata['title'].decode('utf-8', errors='ignore')
                
            # If still no title, use filename
            if not metadata['title']:
                metadata['title'] = os.path.splitext(os.path.basename(file_path))[0]
                
            return metadata
            
    except Exception as e:
        logger.error(f"Error extracting PDF metadata: {e}")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}

Parameters

Name	Type	Default	Kind
`file_path`	str	-	positional_or_keyword

Parameter Details

file_path: String path to the PDF file to extract metadata from. Must be a valid path to an existing PDF file that can be opened in binary read mode. The path can be absolute or relative.
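A caller that wants to check the path before calling the function could use a cheap pre-check like the one below. This is a sketch under stated assumptions: `looks_like_pdf` is a hypothetical helper, not part of the original module, and it only tests that the file exists and begins with the `%PDF-` header bytes. It does not prove the file is a well-formed PDF.

```python
import os
import tempfile


def looks_like_pdf(file_path: str) -> bool:
    """Return True if the path is a readable file that starts with b'%PDF-'."""
    if not os.path.isfile(file_path):
        return False
    try:
        with open(file_path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Unreadable file (permissions, I/O error): treat as not usable.
        return False


# Demonstration with a throwaway file that carries only a PDF header.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.pdf")
    with open(sample, "wb") as f:
        f.write(b"%PDF-1.7\n")
    print(looks_like_pdf(sample))                            # True
    print(looks_like_pdf(os.path.join(tmp, "missing.pdf")))  # False
```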

Return Value

Type: Dict[str, Any]

Returns a dictionary with the keys 'title' (str), 'author' (str), 'creator' (str), 'producer' (str), 'subject' (str), 'keywords' (str), 'created' (datetime or str), 'modified' (datetime or str), and 'pageCount' (int). If PyPDF2 is unavailable or extraction fails, the dictionary contains only 'title', set to the filename without its extension. Date values that start with 'D:' (PDF format D:YYYYMMDDHHmmSS) are converted to datetime objects with minute precision; the seconds and any UTC offset are discarded. A 'D:' value that cannot be parsed becomes an empty string, and a value without the 'D:' prefix is returned unchanged.
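The date handling can be reproduced in isolation to see what a caller receives. The sketch below mirrors the slice and format string from the source code; the helper name `parse_pdf_date_minutes` is hypothetical and introduced only for this illustration.

```python
from datetime import datetime


def parse_pdf_date_minutes(raw):
    """Mirror the function's date clean-up: keep YYYYMMDDHHmm, drop the rest."""
    if not (isinstance(raw, str) and raw.startswith("D:")):
        return raw  # values without the 'D:' prefix pass through unchanged
    try:
        # Characters 2..13 hold YYYYMMDDHHmm; seconds and UTC offset are ignored.
        return datetime.strptime(raw[2:14], "%Y%m%d%H%M")
    except ValueError:
        return ""  # malformed 'D:' dates become an empty string


print(parse_pdf_date_minutes("D:20230415093012+02'00'"))  # 2023-04-15 09:30:00
print(repr(parse_pdf_date_minutes("D:2023")))             # ''
print(parse_pdf_date_minutes("2023-04-15"))               # 2023-04-15
```

The resulting datetime is naive (no timezone), so callers comparing dates across documents may want to treat it as local to the PDF producer rather than as UTC.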

Dependencies

PyPDF2
logging
os
datetime

Required Imports

import logging
import os
from typing import Dict, Any
from datetime import datetime
import PyPDF2

Conditional/Optional Imports

These imports are only needed under specific conditions:

import PyPDF2

Condition: Needed only for PDF metadata extraction. If PyPDF2 is not available, the function returns fallback metadata containing only the filename-derived title. The code checks the PYPDF2_AVAILABLE flag before it uses PyPDF2.

Optional
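This page does not show how the `PYPDF2_AVAILABLE` flag is defined. A common pattern, assumed here rather than taken from the original module, is a guarded import at module level:

```python
import logging

logger = logging.getLogger(__name__)

# Assumed definition of the flag; the real module may set it differently.
try:
    import PyPDF2  # optional dependency
    PYPDF2_AVAILABLE = True
except ImportError:
    PyPDF2 = None
    PYPDF2_AVAILABLE = False
    logger.warning("PyPDF2 is not installed; PDF metadata will be limited to the filename.")

print(f"PyPDF2 available: {PYPDF2_AVAILABLE}")
```

With this pattern the module imports cleanly whether or not PyPDF2 is installed, and the function's early return handles the missing-library case.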

Usage Example

import logging

# Assumption: the CDocs package root is on sys.path, so the module at
# CDocs/utils/document_processor.py can be imported directly.
from CDocs.utils.document_processor import extract_metadata_pdf

# Show the warnings and errors that the function logs
logging.basicConfig(level=logging.INFO)

# Example usage
pdf_path = '/path/to/document.pdf'
metadata = extract_metadata_pdf(pdf_path)

print(f"Title: {metadata.get('title')}")
print(f"Author: {metadata.get('author')}")
print(f"Page Count: {metadata.get('pageCount')}")
print(f"Created: {metadata.get('created')}")

# Only 'title' is returned when PyPDF2 is unavailable or extraction fails
if 'pageCount' not in metadata:
    print("Full metadata extraction unavailable, using filename only")

Best Practices

Confirm that PyPDF2 is installed in production environments; without it the function logs a warning and returns only the filename-based title
Handle the case where only the 'title' key is returned (PyPDF2 is unavailable or extraction failed)
Ensure that file_path points to a valid, readable PDF file before calling
Be aware that date fields may be datetime objects, empty strings, or unparsed strings, depending on the PDF metadata format
The function degrades gracefully to a filename-based title when metadata extraction fails
The function catches exceptions internally and logs them, so check the returned keys or the log output to detect failures in critical applications
Note that metadata quality depends on the PDF creator; some PDFs have minimal or no embedded metadata
The function opens the file in binary mode and closes it through a context manager
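Because the result may be either the full dictionary or the title-only fallback, downstream code often benefits from a uniform shape. The following is one possible sketch; `normalize_metadata` and its choice of defaults are hypothetical and not part of the original module.

```python
from datetime import datetime
from typing import Any, Dict

TEXT_KEYS = ("title", "author", "creator", "producer",
             "subject", "keywords", "created", "modified")


def normalize_metadata(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy with every expected key present and dates as strings."""
    normalized: Dict[str, Any] = {key: "" for key in TEXT_KEYS}
    normalized["pageCount"] = None  # None marks "page count unknown"
    normalized.update(metadata)
    # Date fields may be datetime objects or strings; serialize datetimes.
    for field in ("created", "modified"):
        value = normalized[field]
        if isinstance(value, datetime):
            normalized[field] = value.isoformat()
    return normalized


# Title-only fallback, as returned when PyPDF2 is missing or extraction fails.
print(normalize_metadata({"title": "document"})["pageCount"])  # None

# Full result with a parsed creation date.
full = {"title": "Report", "created": datetime(2023, 4, 15, 9, 30), "pageCount": 12}
print(normalize_metadata(full)["created"])  # 2023-04-15T09:30:00
```

Using `None` for an unknown page count keeps it distinguishable from a real count, which matters if callers sum or compare page counts.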

Similar Components

AI-powered semantic similarity - components with related functionality:

function extract_metadata_docx 68.4% similar

Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function extract_excel_metadata 63.1% similar

Extracts comprehensive metadata from Excel files including cell comments, merged regions, named ranges, document properties, and sheet-level information that standard pandas operations miss.
From: /tf/active/vicechatdev/vice_ai/smartstat_service.py

function extract_metadata 62.7% similar

Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function process_document 60.6% similar

Processes a document file (DOCX, DOC, or PDF) and extracts comprehensive metadata including file information, content metadata, and cryptographic hash.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function read_excel_file 58.0% similar

Reads Excel files and returns either metadata for all sheets or detailed data for a specific sheet, including format validation, European decimal conversion, and rich metadata extraction.
From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
