function extract_metadata_pdf
Extracts metadata from PDF files, including title, author, creation date, page count, and other document properties, using the PyPDF2 library.
/tf/active/vicechatdev/CDocs/utils/document_processor.py
220 - 274
moderate
Purpose
This function reads PDF files and extracts embedded metadata such as title, author, creator, producer, subject, keywords, creation/modification dates, and page count. It handles various edge cases including missing PyPDF2 library, malformed dates, byte-encoded titles, and missing metadata fields. When metadata extraction fails or PyPDF2 is unavailable, it falls back to using the filename as the title.
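The filename fallback described above can be illustrated in isolation. This is a minimal sketch; the helper name `filename_title` is hypothetical and does not exist in the source module, which inlines the same `os.path` expression.

```python
import os


def filename_title(file_path: str) -> str:
    """Hypothetical helper: file name without directory or extension."""
    # Same expression the function uses for its fallback title.
    return os.path.splitext(os.path.basename(file_path))[0]


print(filename_title("/reports/2023/annual_report.pdf"))  # annual_report
print(filename_title("archive.tar.pdf"))                  # archive.tar
```

Only the last extension is stripped, so a name such as `archive.tar.pdf` keeps its inner `.tar` suffix.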
Source Code
def extract_metadata_pdf(file_path: str) -> Dict[str, Any]:
    """
    Extract metadata from a PDF file.

    Args:
        file_path: Path to PDF file

    Returns:
        Dictionary with extracted metadata
    """
    if not PYPDF2_AVAILABLE:
        logger.warning("PyPDF2 library not available. Cannot extract PDF metadata.")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}

    try:
        with open(file_path, 'rb') as f:
            reader = PyPDF2.PdfReader(f)
            info = reader.metadata

            # PyPDF2 returns metadata as a dictionary-like object
            metadata = {
                'title': info.get('/Title', os.path.splitext(os.path.basename(file_path))[0]),
                'author': info.get('/Author', ''),
                'creator': info.get('/Creator', ''),
                'producer': info.get('/Producer', ''),
                'subject': info.get('/Subject', ''),
                'keywords': info.get('/Keywords', ''),
                'created': info.get('/CreationDate', ''),
                'modified': info.get('/ModDate', ''),
                'pageCount': len(reader.pages)
            }

            # Clean up PDF date format if present
            for date_field in ['created', 'modified']:
                if isinstance(metadata[date_field], str) and metadata[date_field].startswith('D:'):
                    try:
                        # PDF dates are in format D:YYYYMMDDHHmmSSOHH'mm'
                        date_str = metadata[date_field][2:14]  # Extract YYYYMMDDHHMM
                        metadata[date_field] = datetime.strptime(date_str, '%Y%m%d%H%M')
                    except:
                        metadata[date_field] = ''

            # Convert string title to proper string if it's bytes
            if isinstance(metadata['title'], bytes):
                metadata['title'] = metadata['title'].decode('utf-8', errors='ignore')

            # If still no title, use filename
            if not metadata['title']:
                metadata['title'] = os.path.splitext(os.path.basename(file_path))[0]

            return metadata

    except Exception as e:
        logger.error(f"Error extracting PDF metadata: {e}")
        return {'title': os.path.splitext(os.path.basename(file_path))[0]}
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_path | str | - | positional_or_keyword |
Parameter Details
file_path: String path to the PDF file to extract metadata from. Must be a valid path to an existing PDF file that can be opened in binary read mode. The path can be absolute or relative.
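A caller that wants to confirm the path before invoking the function could use a cheap pre-check like the one below. This is a sketch under assumptions: `looks_like_pdf` is a hypothetical helper, not part of the source module, and it only tests for the `%PDF-` marker at the very start of the file, so it may reject unusual PDFs that some readers would still open.

```python
import os


def looks_like_pdf(file_path: str) -> bool:
    """Hypothetical pre-check: a regular file that starts with b'%PDF-'."""
    if not os.path.isfile(file_path):
        return False
    try:
        # Binary mode, matching how the function itself opens the file.
        with open(file_path, "rb") as f:
            return f.read(5) == b"%PDF-"
    except OSError:
        # Unreadable file (permissions, I/O error): treat as not usable.
        return False
```

The check does not validate the document structure; it only filters out missing paths and files that are obviously not PDFs.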
Return Value
Type: Dict[str, Any]
Returns a dictionary with the keys 'title' (str), 'author' (str), 'creator' (str), 'producer' (str), 'subject' (str), 'keywords' (str), 'created' (datetime or str), 'modified' (datetime or str), and 'pageCount' (int). If extraction fails or PyPDF2 is unavailable, it returns a minimal dictionary whose only key, 'title', is set to the filename without its extension. Date values that start with 'D:' (PDF format D:YYYYMMDDHHmmSS) are converted to naive datetime objects truncated to the minute, because only the YYYYMMDDHHMM portion is parsed; seconds and any timezone offset are discarded. A 'D:' value that cannot be parsed becomes an empty string, and a value without the 'D:' prefix is returned unchanged.
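The date handling can be tried on its own with a small stand-alone sketch. The helper name `parse_pdf_date` is hypothetical. It mirrors the slicing and format string used in the source, except that it catches `ValueError` explicitly where the source uses a bare `except`.

```python
from datetime import datetime


def parse_pdf_date(raw):
    """Hypothetical helper mirroring the function's date clean-up step."""
    if isinstance(raw, str) and raw.startswith("D:"):
        try:
            # Characters 2..13 hold YYYYMMDDHHMM; seconds and offset are dropped.
            return datetime.strptime(raw[2:14], "%Y%m%d%H%M")
        except ValueError:
            return ""
    # Values without the 'D:' prefix pass through unchanged.
    return raw


print(parse_pdf_date("D:20230115143022+01'00'"))  # 2023-01-15 14:30:00
print(repr(parse_pdf_date("D:2023")))             # ''
print(parse_pdf_date("2023-01-15"))               # 2023-01-15
```

The resulting datetime carries no timezone information, so values from PDFs written in different timezones are not directly comparable.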
Dependencies
- PyPDF2
- logging
- os
- datetime
Required Imports
import logging
import os
from typing import Dict, Any
from datetime import datetime
import PyPDF2
Conditional/Optional Imports
These imports are only needed under specific conditions:
import PyPDF2
Condition: Required for PDF metadata extraction. If not available, function returns fallback metadata with only filename as title. The code checks PYPDF2_AVAILABLE flag before attempting to use PyPDF2.
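The source excerpt does not show how `PYPDF2_AVAILABLE` is defined. A common pattern for this kind of flag, shown here as an assumption rather than as the module's actual code, is a guarded import at module level:

```python
# Assumed pattern: the real module may define the flag differently.
try:
    import PyPDF2  # optional third-party dependency
    PYPDF2_AVAILABLE = True
except ImportError:
    PyPDF2 = None
    PYPDF2_AVAILABLE = False

print(f"PyPDF2 available: {PYPDF2_AVAILABLE}")
```

With this arrangement the module still imports cleanly when PyPDF2 is missing, and the function's early return handles the degraded case.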
Usage Example
import os
import logging
from typing import Dict, Any
from datetime import datetime
import PyPDF2

# Setup logger
logger = logging.getLogger(__name__)
PYPDF2_AVAILABLE = True

def extract_metadata_pdf(file_path: str) -> Dict[str, Any]:
    # ... (function code as provided; this stub must be replaced with the full body)
    pass

# Example usage
pdf_path = '/path/to/document.pdf'
metadata = extract_metadata_pdf(pdf_path)

print(f"Title: {metadata.get('title')}")
print(f"Author: {metadata.get('author')}")
print(f"Page Count: {metadata.get('pageCount')}")
print(f"Created: {metadata.get('created')}")

# Only 'title' is returned when PyPDF2 is unavailable or extraction fails
if 'pageCount' not in metadata:
    print("Full metadata extraction unavailable, using filename only")
Best Practices
- Verify that PyPDF2 is installed in production environments; without it the function logs a warning and returns only a filename-based title
- Handle the case where only the 'title' key is returned (when PyPDF2 is unavailable or extraction fails)
- Ensure the file_path points to a valid, readable PDF file before calling
- Be aware that date fields may be datetime objects, empty strings, or the raw metadata string when the value lacks the 'D:' prefix
- The function gracefully degrades to a filename-based title when metadata extraction fails
- Consider wrapping calls in try-except blocks for additional error handling in critical applications
- Note that metadata quality depends on the PDF creator; some PDFs may have minimal or no embedded metadata
- The function opens files in binary mode and closes them properly using a context manager
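The points above about the reduced result and the mixed date types can be handled in one place on the caller's side. The following is a minimal sketch; `describe_metadata` is a hypothetical helper that accepts a dictionary shaped like the function's return value.

```python
from datetime import datetime
from typing import Any, Dict


def describe_metadata(metadata: Dict[str, Any]) -> str:
    """Hypothetical helper: one-line summary that tolerates the fallback result."""
    title = metadata.get("title", "")
    # A missing 'pageCount' signals the title-only fallback dictionary.
    if "pageCount" not in metadata:
        return f"{title} (filename only; full metadata unavailable)"
    created = metadata.get("created")
    if isinstance(created, datetime):
        created_text = created.strftime("%Y-%m-%d %H:%M")
    else:
        # Empty string or raw, unparsed value.
        created_text = created or "unknown"
    return f"{title}, {metadata['pageCount']} pages, created {created_text}"


print(describe_metadata({"title": "report"}))
print(describe_metadata(
    {"title": "Report", "pageCount": 3, "created": datetime(2023, 1, 15, 14, 30)}
))
```

Checking for 'pageCount' rather than for PyPDF2 itself covers both failure paths, since an extraction error yields the same reduced dictionary as a missing library.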
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_metadata_docx 68.4% similar
- function extract_excel_metadata 63.1% similar
- function extract_metadata 62.7% similar
- function process_document 60.6% similar
- function read_excel_file 58.0% similar