function extract_document_sections
Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.
/tf/active/vicechatdev/CDocs/utils/document_processor.py
366 - 428
moderate
Purpose
This function processes Microsoft Word (.docx) files to extract document structure based on heading styles. It reads binary file content, creates a temporary file for parsing, identifies paragraphs with heading styles, and organizes non-heading content under their respective headings. This is useful for document analysis, content extraction, and structured data retrieval from Word documents.
Source Code
def extract_document_sections(file_content: bytes, file_name: str) -> Dict[str, str]:
    """
    Extract document sections based on headings.

    Args:
        file_content: Binary content of the file
        file_name: Name of the file

    Returns:
        Dictionary mapping section headings to section content
    """
    sections = {}
    try:
        # Get file extension
        _, ext = os.path.splitext(file_name)
        ext = ext.lower()
        if ext != '.docx' or not DOCX_AVAILABLE:
            return sections
        # Create temporary file
        with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name
        try:
            # Parse document
            doc = DocxDocument(temp_path)
            # Process paragraphs
            current_heading = "Document"
            current_content = []
            for para in doc.paragraphs:
                if para.style and 'heading' in para.style.name.lower():
                    # Save previous section
                    if current_content:
                        sections[current_heading] = '\n'.join(current_content)
                        current_content = []
                    # Start new section
                    current_heading = para.text
                else:
                    # Add paragraph to current section
                    if para.text.strip():
                        current_content.append(para.text)
            # Save last section
            if current_content:
                sections[current_heading] = '\n'.join(current_content)
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except:
                pass
    except Exception as e:
        logger.error(f"Error extracting document sections: {e}")
    return sections
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_content | bytes | - | positional_or_keyword |
| file_name | str | - | positional_or_keyword |
Parameter Details
file_content: Binary content of the document file (bytes). This should be the raw bytes read from a .docx file, typically obtained from file.read() or similar file reading operations.
file_name: Name of the file including extension (string). Used to determine file type through extension checking. Must include the .docx extension for processing to occur.
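The extension check that gates processing can be illustrated on its own. This is a minimal sketch, assuming a hypothetical helper name `has_docx_extension` that does not exist in the module; it mirrors the `os.path.splitext` and `lower()` calls in the source.

```python
import os


def has_docx_extension(file_name: str) -> bool:
    """Hypothetical helper: True when the final extension is .docx, ignoring case."""
    _, ext = os.path.splitext(file_name)
    return ext.lower() == ".docx"


# Only the last extension counts, and the comparison is case-insensitive.
names = ["report.docx", "REPORT.DOCX", "report.doc", "report.docx.bak", "report"]
checks = {name: has_docx_extension(name) for name in names}
```

The check looks only at the name, not at the bytes. A file that is named .docx but contains something else is still handed to the parser and ends up in the error-handling path.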
Return Value
Type: Dict[str, str]
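Returns a dictionary whose keys are heading texts and whose values are the non-empty paragraphs under each heading, joined with newlines. If the file name does not end in .docx or the DOCX_AVAILABLE flag is false, an empty dictionary is returned and nothing is logged. If an exception is raised during processing, it is logged and the function returns whatever sections were collected up to that point, which is usually an empty dictionary. Content before the first heading is stored under the default key 'Document'. A heading with no content beneath it produces no entry, and sections that share the same heading text overwrite one another, so only the last one is kept.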
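The shape of the returned dictionary follows from the grouping loop in the source. The sketch below reproduces that loop on plain `(style_name, text)` tuples so it runs without python-docx. The name `group_by_heading` and the sample data are illustrative assumptions, not part of the module.

```python
from typing import Dict, Iterable, List, Tuple


def group_by_heading(paragraphs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Hypothetical stand-in for the loop over doc.paragraphs in the source."""
    sections: Dict[str, str] = {}
    current_heading = "Document"  # default key for text before the first heading
    current_content: List[str] = []
    for style_name, text in paragraphs:
        if style_name and "heading" in style_name.lower():
            # A heading closes the previous section, but only if it had content.
            if current_content:
                sections[current_heading] = "\n".join(current_content)
                current_content = []
            current_heading = text
        elif text.strip():
            # Whitespace-only paragraphs are skipped; others are kept unchanged.
            current_content.append(text)
    if current_content:
        sections[current_heading] = "\n".join(current_content)
    return sections


sample = [
    ("Normal", "Preamble text."),
    ("Heading 1", "Introduction"),
    ("Normal", "First paragraph."),
    ("Normal", "   "),
    ("Normal", "Second paragraph."),
    ("Heading 2", "Empty Section"),
    ("Heading 1", "Conclusion"),
    ("Normal", "Final thoughts."),
]
result = group_by_heading(sample)
```

Here `result` has the keys 'Document', 'Introduction', and 'Conclusion'. 'Empty Section' is absent because no content followed it before the next heading.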
Dependencies
- os
- tempfile
- logging
- typing
- python-docx
Required Imports
import os
import tempfile
import logging
from typing import Dict
from docx import Document as DocxDocument
Conditional/Optional Imports
These imports are only needed under specific conditions:
from docx import Document as DocxDocument
Condition: Required for parsing .docx files. The function checks DOCX_AVAILABLE flag before attempting to use this import
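The source does not show how DOCX_AVAILABLE is set. A common pattern, given here only as an assumption about what the module might do, is to guard the import and record the outcome in the flag:

```python
# Assumed pattern, not taken from the module: guard the optional import
# and record whether python-docx could be loaded.
try:
    from docx import Document as DocxDocument
    DOCX_AVAILABLE = True
except ImportError:
    DocxDocument = None  # placeholder so the name exists either way
    DOCX_AVAILABLE = False
```

With a guard like this, the function can return an empty dictionary on systems without python-docx instead of failing at import time.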
Required (conditional)
Usage Example
# Assuming DOCX_AVAILABLE and logger are defined in module scope
import logging
from typing import Dict
import os
import tempfile
from docx import Document as DocxDocument

logger = logging.getLogger(__name__)
DOCX_AVAILABLE = True

# Read a Word document
with open('example.docx', 'rb') as f:
    file_content = f.read()

# Extract sections
sections = extract_document_sections(file_content, 'example.docx')

# Process extracted sections
for heading, content in sections.items():
    print(f"Section: {heading}")
    print(f"Content: {content[:100]}...")  # Print first 100 chars
    print("-" * 50)

# Example output structure:
# sections = {
#     'Introduction': 'This is the introduction text...',
#     'Chapter 1': 'Content of chapter 1...',
#     'Conclusion': 'Final thoughts...'
# }
Best Practices
- Ensure the DOCX_AVAILABLE flag is properly set in the module scope before calling this function
- The function creates temporary files that are cleaned up automatically, but ensure sufficient disk space is available
- Only .docx files are processed; other formats return an empty dictionary without error
- The function uses paragraph styles to detect headings, so documents must have proper heading styles applied (Heading 1, Heading 2, etc.)
- Error handling is built-in with logging, but check the returned dictionary for emptiness to detect processing failures
- Content before the first heading is stored under the default 'Document' key
- Paragraphs that are empty or contain only whitespace are skipped; the text of retained paragraphs is stored unchanged, including any leading or trailing whitespace
- Temporary files are deleted in a finally block, but in rare cases of system crashes, manual cleanup may be needed
- For large documents, consider memory usage as the entire file content is loaded into memory
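The temporary-file handling mentioned in the list above can be sketched in isolation. The helper name `with_temp_copy` and the `record_size` callback are hypothetical and exist only for illustration. The pattern matches the source: write the bytes with `delete=False`, close the file, process it by path, and remove it in a `finally` block.

```python
import os
import tempfile
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def with_temp_copy(file_content: bytes, suffix: str, process: Callable[[str], T]) -> T:
    """Hypothetical helper: run process(path) on a temp copy and always delete it."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
        temp_file.write(file_content)
        temp_path = temp_file.name
    try:
        # The file is closed here, so it can be reopened by path on any platform.
        return process(temp_path)
    finally:
        try:
            os.unlink(temp_path)
        except OSError:
            pass  # best-effort cleanup, as in the source


seen: Dict[str, str] = {}


def record_size(path: str) -> int:
    """Illustrative callback: remember the path and report the file size."""
    seen["path"] = path
    return os.path.getsize(path)


size = with_temp_copy(b"example bytes", ".docx", record_size)
```

Closing the file before processing is why `delete=False` is needed: with the default setting the file would be removed as soon as the `with` block ends.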
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_metadata 66.2% similar
- function validate_document_structure 65.1% similar
- function export_to_docx_v1 64.0% similar
- function extract_metadata_docx 62.8% similar
- function export_to_docx 60.8% similar