extract_excel_metadata - Code Extractor

function extract_excel_metadata

Maturity: 68

Extracts comprehensive metadata from Excel files including cell comments, merged regions, named ranges, document properties, and sheet-level information that standard pandas operations miss.

File:
/tf/active/vicechatdev/vice_ai/smartstat_service.py

Lines:
338 - 453

Complexity:
moderate

Purpose

This function provides deep introspection of Excel files beyond raw data extraction. It captures contextual information like author comments, document metadata, named ranges, merged cell regions, and sheet properties. This is valuable for understanding the structure and context of Excel files before processing data, preserving important annotations, or auditing spreadsheet contents. It complements pandas.read_excel() by extracting metadata that would otherwise be lost.

Source Code

def extract_excel_metadata(file_path: str, sheet_name: str = None) -> Dict[str, Any]:
    """
    Extract rich metadata from Excel files including:
    - Cell comments and notes
    - Merged cell regions
    - Named ranges
    - Document properties (author, title, subject, keywords, etc.)
    - Sheet-level information
    
    This captures information that pandas.read_excel() misses.
    
    Args:
        file_path: Path to Excel file
        sheet_name: Specific sheet to extract metadata from (if None, extracts workbook-level metadata)
    
    Returns:
        Dict with metadata including comments, document properties, named ranges, etc.
    """
    metadata = {
        'file_path': file_path,
        'extraction_timestamp': datetime.now().isoformat(),
        'document_properties': {},
        'comments': [],
        'merged_cells': [],
        'named_ranges': [],
        'sheet_info': {}
    }
    
    try:
        # Load workbook with openpyxl to access rich metadata
        wb = openpyxl.load_workbook(file_path, data_only=False, keep_links=False)
        
        # Extract document properties
        props = wb.properties
        if props:
            metadata['document_properties'] = {
                'title': props.title or '',
                'author': props.creator or '',
                'subject': props.subject or '',
                'description': props.description or '',
                'keywords': props.keywords or '',
                'category': props.category or '',
                'comments': props.comments or '',
                'created': props.created.isoformat() if props.created else None,
                'modified': props.modified.isoformat() if props.modified else None,
                'last_modified_by': props.lastModifiedBy or ''
            }
            logger.info(f"Extracted document properties: {metadata['document_properties']}")
        
        # Extract named ranges (these often contain important context)
        if wb.defined_names:
            for name, defn in wb.defined_names.items():
                try:
                    metadata['named_ranges'].append({
                        'name': name,
                        'definition': str(defn.value) if defn.value else '',
                        'scope': defn.localSheetId if hasattr(defn, 'localSheetId') else 'workbook'
                    })
                except Exception as e:
                    logger.warning(f"Could not extract named range '{name}': {e}")
        
        # If specific sheet requested, extract sheet-level metadata
        if sheet_name and sheet_name in wb.sheetnames:
            ws = wb[sheet_name]
            
            # Extract comments/notes from cells
            for row in ws.iter_rows():
                for cell in row:
                    if cell.comment:
                        comment_text = cell.comment.text if hasattr(cell.comment, 'text') else str(cell.comment)
                        metadata['comments'].append({
                            'cell': cell.coordinate,
                            'value': str(cell.value) if cell.value else '',
                            'comment': comment_text,
                            'author': cell.comment.author if hasattr(cell.comment, 'author') else 'Unknown'
                        })
            
            # Extract merged cell regions (often used for headers/titles)
            if ws.merged_cells:
                for merged_range in ws.merged_cells.ranges:
                    # Get the value from the top-left cell of merged region
                    top_left = ws[merged_range.min_row][merged_range.min_col - 1]
                    metadata['merged_cells'].append({
                        'range': str(merged_range),
                        'value': str(top_left.value) if top_left.value else ''
                    })
            
            # Extract sheet-level info
            metadata['sheet_info'] = {
                'name': sheet_name,
                'hidden': ws.sheet_state == 'hidden',
                'max_row': ws.max_row,
                'max_column': ws.max_column,
                'has_auto_filter': ws.auto_filter is not None,
                'has_conditional_formatting': len(ws.conditional_formatting) > 0 if hasattr(ws, 'conditional_formatting') else False
            }
            
            logger.info(f"Sheet '{sheet_name}': Found {len(metadata['comments'])} comments, "
                       f"{len(metadata['merged_cells'])} merged regions")
        
        # Extract all sheet names and basic info
        metadata['all_sheets'] = []
        for ws_name in wb.sheetnames:
            ws = wb[ws_name]
            metadata['all_sheets'].append({
                'name': ws_name,
                'hidden': ws.sheet_state == 'hidden',
                'dimensions': f"{ws.max_row}×{ws.max_column}"
            })
        
        wb.close()
        return metadata
        
    except Exception as e:
        logger.error(f"Error extracting Excel metadata from {file_path}: {e}")
        return metadata  # Return partial metadata even if error occurs

Parameters

Name	Type	Default	Kind
`file_path`	str	-	positional_or_keyword
`sheet_name`	str	None	positional_or_keyword

Parameter Details

file_path: String path to the Excel file (.xlsx format) to extract metadata from. Must be a valid path to an existing Excel file that openpyxl can read.

sheet_name: Optional string specifying the name of a specific sheet to extract detailed metadata from. If None, only workbook-level metadata (document properties, named ranges, all sheet names) is extracted. If provided, also extracts sheet-specific metadata like comments and merged cells.

Return Value

Type: Dict[str, Any]

Returns a dictionary containing structured metadata with keys: 'file_path' (input path), 'extraction_timestamp' (ISO format datetime), 'document_properties' (dict with title, author, subject, description, keywords, category, comments, created, modified, last_modified_by), 'comments' (list of dicts with cell, value, comment, author), 'merged_cells' (list of dicts with range and value), 'named_ranges' (list of dicts with name, definition, scope), 'sheet_info' (dict with name, hidden status, dimensions, filters, formatting), and 'all_sheets' (list of dicts with name, hidden status, dimensions for all sheets). Returns partial metadata if errors occur during extraction.

Dependencies

openpyxl
datetime
typing

Required Imports

import openpyxl
from datetime import datetime
from typing import Dict, Any

Usage Example

import openpyxl
from datetime import datetime
from typing import Dict, Any
import logging

# Setup logger
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Extract workbook-level metadata only
metadata = extract_excel_metadata('data/report.xlsx')
print(f"Author: {metadata['document_properties']['author']}")
print(f"Named ranges: {len(metadata['named_ranges'])}")
print(f"All sheets: {[s['name'] for s in metadata['all_sheets']]}")

# Extract detailed metadata for specific sheet
sheet_metadata = extract_excel_metadata('data/report.xlsx', sheet_name='Summary')
print(f"Comments found: {len(sheet_metadata['comments'])}")
for comment in sheet_metadata['comments']:
    print(f"Cell {comment['cell']}: {comment['comment']}")
print(f"Merged cells: {len(sheet_metadata['merged_cells'])}")
for merged in sheet_metadata['merged_cells']:
    print(f"Range {merged['range']}: {merged['value']}")

Best Practices

Always check if the returned metadata contains the expected keys before accessing nested values, as partial metadata may be returned on errors
Use sheet_name parameter when you need detailed cell-level metadata; omit it for faster workbook-level extraction
The function loads the entire workbook into memory, so be cautious with very large Excel files
Comments and merged cells are only extracted when sheet_name is specified
The function uses data_only=False to preserve formulas, but this means calculated values may not be available
Named ranges with complex definitions may fail to parse; check the logs for warnings
Ensure a logger is configured before calling this function to capture extraction details and errors
The function gracefully handles errors and returns partial metadata, so always check the logs for warnings
Close the workbook after extraction (handled automatically by the function)
For production use, consider adding file existence and format validation before calling this function

Similar Components

AI-powered semantic similarity - components with related functionality:

function read_excel_file 77.2% similar

Reads Excel files and returns either metadata for all sheets or detailed data for a specific sheet, including format validation, European decimal conversion, and rich metadata extraction.
From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
function extract_metadata_docx 63.6% similar

Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
function extract_metadata_pdf 63.1% similar

Extracts metadata from PDF files including title, author, creation date, page count, and other document properties using PyPDF2 library.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
function extract_sheet_context 59.2% similar

Extracts comprehensive text context from Excel DataFrame sheets that contain mixed structured and unstructured content, converting them into markdown-formatted text while preserving table structures, key-value pairs, and section headers.
From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
function extract_metadata 54.8% similar

Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

← Back to Browse

Assistant

Hi! I can help improve this code. Tell me what you'd like to enhance (e.g., "add error handling", "optimize performance", "improve readability", "add type hints").

Code Comparison

Original Code

                            def extract_excel_metadata(file_path: str, sheet_name: str = None) -> Dict[str, Any]:
    """
    Extract rich metadata from Excel files including:
    - Cell comments and notes
    - Merged cell regions
    - Named ranges
    - Document properties (author, title, subject, keywords, etc.)
    - Sheet-level information
    
    This captures information that pandas.read_excel() misses.
    
    Args:
        file_path: Path to Excel file
        sheet_name: Specific sheet to extract metadata from (if None, extracts workbook-level metadata)
    
    Returns:
        Dict with metadata including comments, document properties, named ranges, etc.
    """
    metadata = {
        'file_path': file_path,
        'extraction_timestamp': datetime.now().isoformat(),
        'document_properties': {},
        'comments': [],
        'merged_cells': [],
        'named_ranges': [],
        'sheet_info': {}
    }
    
    try:
        # Load workbook with openpyxl to access rich metadata
        wb = openpyxl.load_workbook(file_path, data_only=False, keep_links=False)
        
        # Extract document properties
        props = wb.properties
        if props:
            metadata['document_properties'] = {
                'title': props.title or '',
                'author': props.creator or '',
                'subject': props.subject or '',
                'description': props.description or '',
                'keywords': props.keywords or '',
                'category': props.category or '',
                'comments': props.comments or '',
                'created': props.created.isoformat() if props.created else None,
                'modified': props.modified.isoformat() if props.modified else None,
                'last_modified_by': props.lastModifiedBy or ''
            }
            logger.info(f"Extracted document properties: {metadata['document_properties']}")
        
        # Extract named ranges (these often contain important context)
        if wb.defined_names:
            for name, defn in wb.defined_names.items():
                try:
                    metadata['named_ranges'].append({
                        'name': name,
                        'definition': str(defn.value) if defn.value else '',
                        'scope': defn.localSheetId if hasattr(defn, 'localSheetId') else 'workbook'
                    })
                except Exception as e:
                    logger.warning(f"Could not extract named range '{name}': {e}")
        
        # If specific sheet requested, extract sheet-level metadata
        if sheet_name and sheet_name in wb.sheetnames:
            ws = wb[sheet_name]
            
            # Extract comments/notes from cells
            for row in ws.iter_rows():
                for cell in row:
                    if cell.comment:
                        comment_text = cell.comment.text if hasattr(cell.comment, 'text') else str(cell.comment)
                        metadata['comments'].append({
                            'cell': cell.coordinate,
                            'value': str(cell.value) if cell.value else '',
                            'comment': comment_text,
                            'author': cell.comment.author if hasattr(cell.comment, 'author') else 'Unknown'
                        })
            
            # Extract merged cell regions (often used for headers/titles)
            if ws.merged_cells:
                for merged_range in ws.merged_cells.ranges:
                    # Get the value from the top-left cell of merged region
                    top_left = ws[merged_range.min_row][merged_range.min_col - 1]
                    metadata['merged_cells'].append({
                        'range': str(merged_range),
                        'value': str(top_left.value) if top_left.value else ''
                    })
            
            # Extract sheet-level info
            metadata['sheet_info'] = {
                'name': sheet_name,
                'hidden': ws.sheet_state == 'hidden',
                'max_row': ws.max_row,
                'max_column': ws.max_column,
                'has_auto_filter': ws.auto_filter is not None,
                'has_conditional_formatting': len(ws.conditional_formatting) > 0 if hasattr(ws, 'conditional_formatting') else False
            }
            
            logger.info(f"Sheet '{sheet_name}': Found {len(metadata['comments'])} comments, "
                       f"{len(metadata['merged_cells'])} merged regions")
        
        # Extract all sheet names and basic info
        metadata['all_sheets'] = []
        for ws_name in wb.sheetnames:
            ws = wb[ws_name]
            metadata['all_sheets'].append({
                'name': ws_name,
                'hidden': ws.sheet_state == 'hidden',
                'dimensions': f"{ws.max_row}×{ws.max_column}"
            })
        
        wb.close()
        return metadata
        
    except Exception as e:
        logger.error(f"Error extracting Excel metadata from {file_path}: {e}")
        return metadata  # Return partial metadata even if error occurs
                        

Improved Code

🔍 Code Extractor

function extract_excel_metadata

Purpose

Source Code

Parameters

Parameter Details

Return Value

Dependencies

Required Imports

Usage Example

Best Practices

Tags

Similar Components

function read_excel_file 77.2% similar

function extract_metadata_docx 63.6% similar

function extract_metadata_pdf 63.1% similar

function extract_sheet_context 59.2% similar

function extract_metadata 54.8% similar

function extract_excel_metadata

Purpose

Source Code

Parameters

Parameter Details

Return Value

Dependencies

Required Imports

Usage Example

Best Practices

Tags

Similar Components

function read_excel_file 77.2% similar

function extract_metadata_docx 63.6% similar

function extract_metadata_pdf 63.1% similar

function extract_sheet_context 59.2% similar

function extract_metadata 54.8% similar

✨ Improve Code: extract_excel_metadata

Code Comparison