function extract_excel_metadata
Extracts comprehensive metadata from Excel files including cell comments, merged regions, named ranges, document properties, and sheet-level information that standard pandas operations miss.
/tf/active/vicechatdev/vice_ai/smartstat_service.py
338 - 453
moderate
Purpose
This function provides deep introspection of Excel files beyond raw data extraction. It captures contextual information like author comments, document metadata, named ranges, merged cell regions, and sheet properties. This is valuable for understanding the structure and context of Excel files before processing data, preserving important annotations, or auditing spreadsheet contents. It complements pandas.read_excel() by extracting metadata that would otherwise be lost.
Source Code
def extract_excel_metadata(file_path: str, sheet_name: str = None) -> Dict[str, Any]:
"""
Extract rich metadata from Excel files including:
- Cell comments and notes
- Merged cell regions
- Named ranges
- Document properties (author, title, subject, keywords, etc.)
- Sheet-level information
This captures information that pandas.read_excel() misses.
Args:
file_path: Path to Excel file
sheet_name: Specific sheet to extract metadata from (if None, extracts workbook-level metadata)
Returns:
Dict with metadata including comments, document properties, named ranges, etc.
"""
metadata = {
'file_path': file_path,
'extraction_timestamp': datetime.now().isoformat(),
'document_properties': {},
'comments': [],
'merged_cells': [],
'named_ranges': [],
'sheet_info': {}
}
try:
# Load workbook with openpyxl to access rich metadata
wb = openpyxl.load_workbook(file_path, data_only=False, keep_links=False)
# Extract document properties
props = wb.properties
if props:
metadata['document_properties'] = {
'title': props.title or '',
'author': props.creator or '',
'subject': props.subject or '',
'description': props.description or '',
'keywords': props.keywords or '',
'category': props.category or '',
'comments': props.comments or '',
'created': props.created.isoformat() if props.created else None,
'modified': props.modified.isoformat() if props.modified else None,
'last_modified_by': props.lastModifiedBy or ''
}
logger.info(f"Extracted document properties: {metadata['document_properties']}")
# Extract named ranges (these often contain important context)
if wb.defined_names:
for name, defn in wb.defined_names.items():
try:
metadata['named_ranges'].append({
'name': name,
'definition': str(defn.value) if defn.value else '',
'scope': defn.localSheetId if hasattr(defn, 'localSheetId') else 'workbook'
})
except Exception as e:
logger.warning(f"Could not extract named range '{name}': {e}")
# If specific sheet requested, extract sheet-level metadata
if sheet_name and sheet_name in wb.sheetnames:
ws = wb[sheet_name]
# Extract comments/notes from cells
for row in ws.iter_rows():
for cell in row:
if cell.comment:
comment_text = cell.comment.text if hasattr(cell.comment, 'text') else str(cell.comment)
metadata['comments'].append({
'cell': cell.coordinate,
'value': str(cell.value) if cell.value else '',
'comment': comment_text,
'author': cell.comment.author if hasattr(cell.comment, 'author') else 'Unknown'
})
# Extract merged cell regions (often used for headers/titles)
if ws.merged_cells:
for merged_range in ws.merged_cells.ranges:
# Get the value from the top-left cell of merged region
top_left = ws[merged_range.min_row][merged_range.min_col - 1]
metadata['merged_cells'].append({
'range': str(merged_range),
'value': str(top_left.value) if top_left.value else ''
})
# Extract sheet-level info
metadata['sheet_info'] = {
'name': sheet_name,
'hidden': ws.sheet_state == 'hidden',
'max_row': ws.max_row,
'max_column': ws.max_column,
'has_auto_filter': ws.auto_filter is not None,
'has_conditional_formatting': len(ws.conditional_formatting) > 0 if hasattr(ws, 'conditional_formatting') else False
}
logger.info(f"Sheet '{sheet_name}': Found {len(metadata['comments'])} comments, "
f"{len(metadata['merged_cells'])} merged regions")
# Extract all sheet names and basic info
metadata['all_sheets'] = []
for ws_name in wb.sheetnames:
ws = wb[ws_name]
metadata['all_sheets'].append({
'name': ws_name,
'hidden': ws.sheet_state == 'hidden',
'dimensions': f"{ws.max_row}×{ws.max_column}"
})
wb.close()
return metadata
except Exception as e:
logger.error(f"Error extracting Excel metadata from {file_path}: {e}")
return metadata # Return partial metadata even if error occurs
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
file_path |
str | - | positional_or_keyword |
sheet_name |
str | None | positional_or_keyword |
Parameter Details
file_path: String path to the Excel file (.xlsx format) to extract metadata from. Must be a valid path to an existing Excel file that openpyxl can read.
sheet_name: Optional string specifying the name of a specific sheet to extract detailed metadata from. If None, only workbook-level metadata (document properties, named ranges, all sheet names) is extracted. If provided, also extracts sheet-specific metadata like comments and merged cells.
Return Value
Type: Dict[str, Any]
Returns a dictionary containing structured metadata with keys: 'file_path' (input path), 'extraction_timestamp' (ISO format datetime), 'document_properties' (dict with title, author, subject, description, keywords, category, comments, created, modified, last_modified_by), 'comments' (list of dicts with cell, value, comment, author), 'merged_cells' (list of dicts with range and value), 'named_ranges' (list of dicts with name, definition, scope), 'sheet_info' (dict with name, hidden status, dimensions, filters, formatting), and 'all_sheets' (list of dicts with name, hidden status, dimensions for all sheets). Returns partial metadata if errors occur during extraction.
Dependencies
openpyxldatetimetyping
Required Imports
import openpyxl
from datetime import datetime
from typing import Dict, Any
Usage Example
import openpyxl
from datetime import datetime
from typing import Dict, Any
import logging
# Setup logger
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Extract workbook-level metadata only
metadata = extract_excel_metadata('data/report.xlsx')
print(f"Author: {metadata['document_properties']['author']}")
print(f"Named ranges: {len(metadata['named_ranges'])}")
print(f"All sheets: {[s['name'] for s in metadata['all_sheets']]}")
# Extract detailed metadata for specific sheet
sheet_metadata = extract_excel_metadata('data/report.xlsx', sheet_name='Summary')
print(f"Comments found: {len(sheet_metadata['comments'])}")
for comment in sheet_metadata['comments']:
print(f"Cell {comment['cell']}: {comment['comment']}")
print(f"Merged cells: {len(sheet_metadata['merged_cells'])}")
for merged in sheet_metadata['merged_cells']:
print(f"Range {merged['range']}: {merged['value']}")
Best Practices
- Always check if the returned metadata contains the expected keys before accessing nested values, as partial metadata may be returned on errors
- Use sheet_name parameter when you need detailed cell-level metadata; omit it for faster workbook-level extraction
- The function loads the entire workbook into memory, so be cautious with very large Excel files
- Comments and merged cells are only extracted when sheet_name is specified
- The function uses data_only=False to preserve formulas, but this means calculated values may not be available
- Named ranges with complex definitions may fail to parse; check the logs for warnings
- Ensure a logger is configured before calling this function to capture extraction details and errors
- The function gracefully handles errors and returns partial metadata, so always check the logs for warnings
- Close the workbook after extraction (handled automatically by the function)
- For production use, consider adding file existence and format validation before calling this function
Tags
Similar Components
AI-powered semantic similarity - components with related functionality:
-
function read_excel_file 77.2% similar
-
function extract_metadata_docx 63.6% similar
-
function extract_metadata_pdf 63.1% similar
-
function extract_sheet_context 59.2% similar
-
function extract_metadata 54.8% similar