function extract_metadata
Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.
/tf/active/vicechatdev/CDocs/utils/document_processor.py
103 - 165
moderate
Purpose
This function processes binary file content to extract metadata. It writes the content to a temporary file, delegates to a specialized extractor based on the file extension (extract_metadata_docx for .docx and .doc, extract_metadata_pdf for .pdf), computes a SHA-256 hash for integrity checking, and returns a metadata dictionary. If any step raises an exception, the error is logged and a reduced dictionary (file name, size, hash, and processing date) is returned instead.
Source Code
def extract_metadata(file_content: bytes, file_name: str) -> Dict[str, Any]:
    """
    Extract metadata from file content.

    Args:
        file_content: Binary content of the file
        file_name: Name of the file

    Returns:
        Dictionary with extracted metadata
    """
    try:
        # Get file extension
        _, ext = os.path.splitext(file_name)
        ext = ext.lower()

        # Create temporary file
        with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name

        # Extract metadata based on file type
        metadata = {}
        try:
            if ext in ['.docx', '.doc']:
                metadata = extract_metadata_docx(temp_path)
            elif ext == '.pdf':
                metadata = extract_metadata_pdf(temp_path)
            else:
                metadata = {'title': os.path.splitext(os.path.basename(file_name))[0]}
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except:
                pass

        # Calculate file hash
        file_hash = hashlib.sha256(file_content).hexdigest()

        # Add file metadata
        file_size = len(file_content)
        file_info = {
            'fileName': file_name,
            'fileSize': file_size,
            'fileType': ext[1:],  # Remove leading dot
            'fileHash': file_hash,
            'processedDate': datetime.now()
        }
        metadata.update(file_info)

        return metadata

    except Exception as e:
        logger.error(f"Error extracting metadata: {e}")
        return {
            'fileName': file_name,
            'fileSize': len(file_content),
            'fileHash': hashlib.sha256(file_content).hexdigest(),
            'processedDate': datetime.now()
        }
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_content | bytes | - | positional_or_keyword |
| file_name | str | - | positional_or_keyword |
Parameter Details
file_content: Binary content of the file as bytes. This is the raw file data that will be analyzed for metadata extraction and used to compute the file hash and size.
file_name: String representing the name of the file including its extension. Used to determine file type and extract the base name. The extension is case-insensitive and drives the metadata extraction strategy.
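The way file_name drives the extraction strategy can be sketched in isolation. The helper below, classify_extension, is hypothetical (it is not part of the module); it only mirrors the os.path.splitext and lower() steps from the source and names the branch that would be taken.

```python
import os
from typing import Tuple


def classify_extension(file_name: str) -> Tuple[str, str]:
    """Hypothetical helper: mirror the extension handling in extract_metadata."""
    # splitext only splits off the last suffix, so 'a.tar.gz' yields '.gz'
    _, ext = os.path.splitext(file_name)
    ext = ext.lower()

    if ext in ('.docx', '.doc'):
        branch = 'docx'
    elif ext == '.pdf':
        branch = 'pdf'
    else:
        branch = 'fallback'

    # ext[1:] drops the leading dot; a missing extension stays an empty string
    return ext[1:], branch


print(classify_extension('Report.PDF'))
print(classify_extension('notes.txt'))
print(classify_extension('README'))
```

A name without an extension, such as README, falls through to the generic branch and would produce an empty fileType.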
Return Value
Type: Dict[str, Any]
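Returns a dictionary (Dict[str, Any]) containing extracted metadata. On the normal path it includes 'fileName' (str), 'fileSize' (int, in bytes), 'fileType' (str, lowercased extension without the dot; empty if the name has no extension), 'fileHash' (str, SHA-256 hexdigest), and 'processedDate' (naive datetime object from datetime.now()). For DOCX/DOC/PDF files it also carries whatever extract_metadata_docx or extract_metadata_pdf returns, such as 'title'; for other types, 'title' is the file's base name without its extension. Because the file fields are merged in with update(), they overwrite any same-named keys returned by the extractors. On error, the fallback dictionary contains only 'fileName', 'fileSize', 'fileHash', and 'processedDate'; 'fileType' and all type-specific fields are omitted.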
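Since the error fallback in the source returns fewer keys than the normal path, callers may want to read the optional keys defensively. The following is a minimal sketch; summarize_metadata is a hypothetical helper and the sample dictionaries use made-up values shaped like the two return paths.

```python
import hashlib
from datetime import datetime
from typing import Any, Dict


def summarize_metadata(metadata: Dict[str, Any]) -> str:
    """Hypothetical helper: one-line summary that tolerates the error fallback."""
    # 'fileType' and 'title' are absent from the error-path dictionary
    file_type = metadata.get('fileType') or 'unknown'
    title = metadata.get('title', metadata['fileName'])
    short_hash = metadata['fileHash'][:12]
    return f"{title} [{file_type}] {metadata['fileSize']} bytes sha256={short_hash}"


content = b'hello'
common = {
    'fileName': 'document.pdf',
    'fileSize': len(content),
    'fileHash': hashlib.sha256(content).hexdigest(),
    'processedDate': datetime.now(),
}

# Shaped like the normal return path (values are illustrative only)
full = dict(common, title='Sample PDF', fileType='pdf')
# Shaped like the error fallback: no 'fileType', no 'title'
fallback = dict(common)

print(summarize_metadata(full))
print(summarize_metadata(fallback))
```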
Dependencies
- os
- hashlib
- tempfile
- typing
- datetime
- logging
- docx
- PyPDF2
Required Imports
import os
import hashlib
import tempfile
from typing import Dict, Any
from datetime import datetime
import logging
Conditional/Optional Imports
These imports are only needed under specific conditions:
import docx
Condition: Required if processing .docx or .doc files and extract_metadata_docx function is called
Required (conditional)
import PyPDF2
Condition: Required if processing .pdf files and extract_metadata_pdf function is called
Required (conditional)
Usage Example
# Assumes extract_metadata itself is defined or imported in this scope,
# along with the helper functions and logger it relies on
import os
import hashlib
import tempfile
from typing import Dict, Any
from datetime import datetime
import logging

logger = logging.getLogger(__name__)

# Define or import helper functions (stubs shown here for illustration)
def extract_metadata_docx(file_path):
    return {'title': 'Sample Document', 'author': 'John Doe'}

def extract_metadata_pdf(file_path):
    return {'title': 'Sample PDF', 'pages': 10}

# Read file content
with open('document.pdf', 'rb') as f:
    file_content = f.read()

# Extract metadata
metadata = extract_metadata(file_content, 'document.pdf')

print(f"File Name: {metadata['fileName']}")
print(f"File Size: {metadata['fileSize']} bytes")
print(f"File Hash: {metadata['fileHash']}")
# fileType is missing from the error fallback, so read it with .get()
print(f"File Type: {metadata.get('fileType', 'unknown')}")
print(f"Processed: {metadata['processedDate']}")
if 'title' in metadata:
    print(f"Title: {metadata['title']}")
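One practical consequence of the return shape is that 'processedDate' is a datetime object, which json.dumps cannot serialize by default. The sketch below shows one way to handle that; metadata_to_json is a hypothetical helper, and the sample dictionary stands in for a real result.

```python
import json
from datetime import datetime


def metadata_to_json(metadata: dict) -> str:
    """Hypothetical helper: serialize metadata, converting datetimes to ISO 8601."""
    def encode(value):
        # json.dumps calls this only for objects it cannot serialize itself
        if isinstance(value, datetime):
            return value.isoformat()
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(metadata, default=encode, sort_keys=True)


# Stand-in for a dictionary returned by extract_metadata (values are made up)
sample = {
    'fileName': 'document.pdf',
    'fileSize': 3,
    'processedDate': datetime(2024, 1, 2, 3, 4, 5),
}

print(metadata_to_json(sample))
```

Extractor-specific values of other non-JSON types would still raise TypeError here, which is deliberate: it surfaces unexpected fields instead of silently stringifying them.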
Best Practices
- Ensure the helper functions extract_metadata_docx and extract_metadata_pdf are properly implemented before using this function
- The function creates temporary files on disk, so ensure adequate permissions and disk space
- Temporary files are automatically cleaned up in the finally block, but orphaned files may remain if the process crashes
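- The function returns basic metadata even on error, making it safe to use in pipelines; the error fallback omits 'fileType' and any type-specific fields, so read those keys with .get()
- File hash computation uses SHA-256, which is suitable for integrity checking; because the whole file is passed in as bytes and also written to a temporary file, very large files cost both memory and time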
- The processedDate uses datetime.now() without timezone information; consider using timezone-aware datetime for production
- File extensions are case-insensitive, but only .docx, .doc, and .pdf have specialized extraction
- For unsupported file types, only the filename (without extension) is used as the title
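Two of the points above (timezone-aware timestamps and hashing large files) can be sketched as follows. Both helpers are hypothetical and live outside extract_metadata, which always receives the full content as bytes; sha256_of_file is only relevant to callers who want to verify a file on disk against a stored 'fileHash' without loading it whole.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Hypothetical helper: hash a file on disk in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, 'rb') as handle:
        # iter() with a sentinel stops when read() returns empty bytes
        for chunk in iter(lambda: handle.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def utc_now() -> datetime:
    """Hypothetical helper: timezone-aware alternative to datetime.now()."""
    return datetime.now(timezone.utc)


# Demonstration: the chunked digest matches the one-shot digest of the same bytes
payload = b'example payload' * 1000
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(payload)
    tmp_path = tmp.name
try:
    chunked = sha256_of_file(tmp_path, chunk_size=4096)
finally:
    os.unlink(tmp_path)

one_shot = hashlib.sha256(payload).hexdigest()
print(chunked == one_shot)
print(utc_now().isoformat())
```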
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_metadata_docx 68.8% similar
- function process_document 66.6% similar
- function extract_document_sections 66.2% similar
- function extract_metadata_pdf 62.7% similar
- function extract_metadata_from_filecloud 59.9% similar