
function extract_metadata

Maturity: 58

Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.

File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 103 - 165
Complexity: moderate

Purpose

This function extracts metadata from binary file content. It writes the content to a temporary file, delegates DOCX, DOC, and PDF files to specialized extraction functions (other types get only a title derived from the filename), deletes the temporary file, computes a SHA-256 hash for file integrity, and returns a standardized metadata dictionary. If any step raises, the error is logged and a reduced dictionary with the file name, size, hash, and processing date is returned instead.
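The write-then-clean-up pattern described above can be sketched in isolation. This is a minimal illustration, not the module's code: `with_temp_copy` and its `handler` parameter are hypothetical names introduced here, and the sketch narrows the cleanup handler to `OSError` where the original uses a bare `except`.

```python
import os
import tempfile
from typing import Callable, TypeVar

T = TypeVar("T")


def with_temp_copy(content: bytes, suffix: str, handler: Callable[[str], T]) -> T:
    """Hypothetical helper: write content to a temp file, run handler(path), delete the file."""
    # delete=False keeps the file on disk after the with-block closes it,
    # so a path-based reader can open it again.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
        temp_file.write(content)
        temp_path = temp_file.name
    try:
        return handler(temp_path)
    finally:
        # Best-effort cleanup: a failed delete is ignored, as in the original.
        try:
            os.unlink(temp_path)
        except OSError:
            pass


# os.path.getsize stands in for a real extractor such as extract_metadata_pdf.
print(with_temp_copy(b"hello", ".txt", os.path.getsize))  # 5
```

Passing the extractor as a callable keeps the cleanup in one place, which is the same guarantee the `finally` block gives in `extract_metadata`.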

Source Code

def extract_metadata(file_content: bytes, file_name: str) -> Dict[str, Any]:
    """
    Extract metadata from file content.
    
    Args:
        file_content: Binary content of the file
        file_name: Name of the file
        
    Returns:
        Dictionary with extracted metadata
    """
    try:
        # Get file extension
        _, ext = os.path.splitext(file_name)
        ext = ext.lower()
        
        # Create temporary file
        with tempfile.NamedTemporaryFile(suffix=ext, delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name
            
        # Extract metadata based on file type
        metadata = {}
        try:
            if ext in ['.docx', '.doc']:
                metadata = extract_metadata_docx(temp_path)
            elif ext == '.pdf':
                metadata = extract_metadata_pdf(temp_path)
            else:
                metadata = {'title': os.path.splitext(os.path.basename(file_name))[0]}
                
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except:
                pass
                
        # Calculate file hash
        file_hash = hashlib.sha256(file_content).hexdigest()
            
        # Add file metadata
        file_size = len(file_content)
        file_info = {
            'fileName': file_name,
            'fileSize': file_size,
            'fileType': ext[1:],  # Remove leading dot
            'fileHash': file_hash,
            'processedDate': datetime.now()
        }
        
        metadata.update(file_info)
        
        return metadata
        
    except Exception as e:
        logger.error(f"Error extracting metadata: {e}")
        return {
            'fileName': file_name,
            'fileSize': len(file_content),
            'fileHash': hashlib.sha256(file_content).hexdigest(),
            'processedDate': datetime.now()
        }

Parameters

Name          Type   Default  Kind
file_content  bytes  -        positional_or_keyword
file_name     str    -        positional_or_keyword

Parameter Details

file_content: Binary content of the file as bytes. This is the raw file data that will be analyzed for metadata extraction and used to compute the file hash and size.

file_name: String representing the name of the file including its extension. Used to determine file type and extract the base name. The extension is case-insensitive and drives the metadata extraction strategy.
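How the extension is derived from file_name can be shown with a short sketch. `file_type_from_name` is a hypothetical helper written for this illustration; it only mirrors the `os.path.splitext`, `lower()`, and `ext[1:]` steps visible in the source.

```python
import os


def file_type_from_name(file_name: str) -> str:
    """Hypothetical helper mirroring how extract_metadata derives 'fileType'."""
    _, ext = os.path.splitext(file_name)
    # Lowercase the extension, then drop the leading dot.
    return ext.lower()[1:]


print(file_type_from_name("Report.PDF"))      # pdf
print(file_type_from_name("archive.tar.gz"))  # gz
print(repr(file_type_from_name("README")))    # ''
```

Only the last extension counts, and a name without an extension yields an empty string rather than an error.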

Return Value

Type: Dict[str, Any]

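Returns a dictionary containing the extracted metadata. On success it always includes 'fileName' (str), 'fileSize' (int, in bytes), 'fileType' (str, lowercased extension without the dot; empty if the name has no extension), 'fileHash' (str, SHA-256 hexdigest), and 'processedDate' (naive datetime object). For DOCX/DOC/PDF files it may also include 'title' and other document properties returned by the specialized functions; for other types it includes 'title' set to the filename without its extension. These file-level fields are applied last, so they overwrite any same-named keys returned by the helpers. On error, it returns only 'fileName', 'fileSize', 'fileHash', and 'processedDate'; 'fileType' and 'title' are absent.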
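Because 'processedDate' is a datetime object, the returned dictionary cannot be passed to `json.dumps` unchanged. The following sketch shows one way to serialize it; `metadata_to_json` is a hypothetical helper, and the sample values are illustrative only, not output from a real run.

```python
import json
from datetime import datetime
from typing import Any, Dict


def metadata_to_json(metadata: Dict[str, Any]) -> str:
    """Hypothetical helper: serialize a metadata dict, encoding datetimes as ISO 8601."""
    def encode(value: Any) -> str:
        if isinstance(value, datetime):
            return value.isoformat()
        # Anything else json cannot handle is reported rather than guessed at.
        raise TypeError(f"Cannot serialize {type(value).__name__}")

    return json.dumps(metadata, default=encode, sort_keys=True)


# Illustrative values shaped like the success-path result.
sample = {
    'fileName': 'notes.txt',
    'fileSize': 5,
    'fileType': 'txt',
    'fileHash': '0' * 64,
    'processedDate': datetime(2024, 1, 2, 3, 4, 5),
    'title': 'notes',
}
print(metadata_to_json(sample))
```

The specialized helpers may return other non-JSON types; in that case this sketch raises `TypeError` instead of silently converting them.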

Dependencies

  • os
  • hashlib
  • tempfile
  • typing
  • datetime
  • logging
  • docx
  • PyPDF2

Required Imports

import os
import hashlib
import tempfile
from typing import Dict, Any
from datetime import datetime
import logging

Conditional/Optional Imports

These imports are only needed under specific conditions:

import docx

Condition: Required (conditional). Needed only when processing .docx or .doc files, where extract_metadata_docx is called.

import PyPDF2

Condition: Required (conditional). Needed only when processing .pdf files, where extract_metadata_pdf is called.

Usage Example

# Assumes extract_metadata (see Source Code) is defined in this same module.
# If it is imported from document_processor instead, the real helpers are used
# and the stubs below are not needed.
import os
import hashlib
import tempfile
from typing import Dict, Any
from datetime import datetime
import logging

logger = logging.getLogger(__name__)

# Stub helpers for illustration only; the real ones parse the file at file_path
def extract_metadata_docx(file_path):
    return {'title': 'Sample Document', 'author': 'John Doe'}

def extract_metadata_pdf(file_path):
    return {'title': 'Sample PDF', 'pages': 10}

# Read file content
with open('document.pdf', 'rb') as f:
    file_content = f.read()

# Extract metadata
metadata = extract_metadata(file_content, 'document.pdf')

print(f"File Name: {metadata['fileName']}")
print(f"File Size: {metadata['fileSize']} bytes")
print(f"File Hash: {metadata['fileHash']}")
print(f"File Type: {metadata['fileType']}")
print(f"Processed: {metadata['processedDate']}")
if 'title' in metadata:
    print(f"Title: {metadata['title']}")

Best Practices

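  • Make sure the helper functions extract_metadata_docx and extract_metadata_pdf are available before calling this function; they do the type-specific work
  • The function writes a temporary file to disk on every call, including for unsupported file types, so it needs write permission and enough free space in the temp directory
  • The temporary file is removed in the finally block, but a failed os.unlink is silently ignored and a process crash can leave orphaned files behind
  • On error the function logs the exception and returns fallback metadata without 'fileType' or any type-specific keys, so read those with dict.get() in pipelines. The fallback itself calls len() and hashlib.sha256() on file_content, so a non-bytes argument such as None still raises
  • File hash computation uses SHA-256, which is suitable for integrity checking; hashing time grows linearly with file size, and the whole file is already held in memory as file_content
  • The processedDate uses datetime.now() without timezone information; consider using timezone-aware datetime for production
  • File extensions are matched case-insensitively, but only .docx, .doc, and .pdf have specialized extraction. Note that .doc is routed to extract_metadata_docx; python-docx reads only the .docx format, so legacy .doc input may end up on the error path unless that helper handles it
  • For unsupported file types, the only extracted field is 'title', set to the filename without its extension; a name with no extension yields an empty 'fileType'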
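
The timezone point above can be handled by the caller without changing the function. This is a minimal sketch under that assumption; `with_utc_timestamp` is a hypothetical helper, not part of document_processor.py.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, Dict


def with_utc_timestamp(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy whose 'processedDate' is timezone-aware UTC."""
    result = dict(metadata)
    stamp = result.get('processedDate')
    if isinstance(stamp, datetime):
        # astimezone() treats a naive datetime as local time before converting.
        result['processedDate'] = stamp.astimezone(timezone.utc)
    return result


# Stand-in for a dictionary returned by extract_metadata.
naive = {'fileName': 'notes.txt', 'processedDate': datetime.now()}
aware = with_utc_timestamp(naive)
print(aware['processedDate'].utcoffset() == timedelta(0))  # True
```

The conversion is only correct if it runs on a machine with the same local timezone as the one that produced the timestamp; generating the timestamp with `datetime.now(timezone.utc)` at the source would avoid that dependency.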

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_metadata_docx 68.8% similar

    Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function process_document 66.6% similar

    Processes a document file (DOCX, DOC, or PDF) and extracts comprehensive metadata including file information, content metadata, and cryptographic hash.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_document_sections 66.2% similar

    Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_metadata_pdf 62.7% similar

    Extracts metadata from PDF files including title, author, creation date, page count, and other document properties using PyPDF2 library.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_metadata_from_filecloud 59.9% similar

    Extracts and normalizes metadata from FileCloud for document creation, providing default values and generating document numbers when needed.

    From: /tf/active/vicechatdev/CDocs/FC_sync.py