extract_document_sections - Code Extractor

function extract_document_sections

Maturity: 61

Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.

File:
/tf/active/vicechatdev/CDocs/utils/document_processor.py

Lines:
366 - 428

Complexity:
moderate

Purpose

This function processes Microsoft Word (.docx) files to extract document structure based on heading styles. It reads binary file content, creates a temporary file for parsing, identifies paragraphs with heading styles, and organizes non-heading content under their respective headings. This is useful for document analysis, content extraction, and structured data retrieval from Word documents.
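The grouping step can be illustrated without python-docx. The sketch below mirrors the loop in the source code but operates on plain (style name, text) pairs. The function name group_by_heading and the sample data are hypothetical, chosen for illustration only; they are not part of document_processor.py.

```python
from typing import Dict, List, Tuple


def group_by_heading(paragraphs: List[Tuple[str, str]]) -> Dict[str, str]:
    """Group paragraph texts under the most recent heading (illustrative only)."""
    sections: Dict[str, str] = {}
    current_heading = "Document"  # default key for text before the first heading
    current_content: List[str] = []

    for style_name, text in paragraphs:
        if "heading" in style_name.lower():
            # A heading closes the previous section, if it collected any text
            if current_content:
                sections[current_heading] = "\n".join(current_content)
                current_content = []
            current_heading = text
        elif text.strip():
            # Whitespace-only paragraphs are skipped; others are kept unmodified
            current_content.append(text)

    # Flush the final section
    if current_content:
        sections[current_heading] = "\n".join(current_content)
    return sections


# Hypothetical sample input: (style name, paragraph text)
sample = [
    ("Normal", "Preface text."),
    ("Heading 1", "Introduction"),
    ("Normal", "First paragraph."),
    ("Normal", "   "),
    ("Normal", "Second paragraph."),
    ("Heading 2", "Empty section"),
    ("Heading 1", "Conclusion"),
    ("Normal", "Final thoughts."),
]
result = group_by_heading(sample)
print(result)
```

In this sample the heading "Empty section" has no text beneath it, so it does not appear in the result, and the text before the first heading is stored under 'Document'.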

Source Code

def extract_document_sections(file_content: bytes, file_name: str) -> Dict[str, str]:
    """
    Extract document sections based on headings.
    
    Args:
        file_content: Binary content of the file
        file_name: Name of the file
        
    Returns:
        Dictionary mapping section headings to section content
    """
    sections = {}
    
    try:
        # Get file extension
        _, ext = os.path.splitext(file_name)
        ext = ext.lower()
        
        if ext != '.docx' or not DOCX_AVAILABLE:
            return sections
            
        # Create temporary file
        with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name
            
        try:
            # Parse document
            doc = DocxDocument(temp_path)
            
            # Process paragraphs
            current_heading = "Document"
            current_content = []
            
            for para in doc.paragraphs:
                if para.style and 'heading' in para.style.name.lower():
                    # Save previous section
                    if current_content:
                        sections[current_heading] = '\n'.join(current_content)
                        current_content = []
                        
                    # Start new section
                    current_heading = para.text
                else:
                    # Add paragraph to current section
                    if para.text.strip():
                        current_content.append(para.text)
            
            # Save last section
            if current_content:
                sections[current_heading] = '\n'.join(current_content)
                
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except OSError:
                pass  # Best-effort cleanup; ignore if the file is already gone
                
    except Exception as e:
        logger.error(f"Error extracting document sections: {e}")
        
    return sections

Parameters

Name	Type	Default	Kind
`file_content`	bytes	-	positional_or_keyword
`file_name`	str	-	positional_or_keyword

Parameter Details

file_content: Binary content of the document file (bytes). This should be the raw bytes read from a .docx file, typically obtained from file.read() or similar file reading operations.

file_name: Name of the file including extension (string). Used only to determine the file type: the extension is compared case-insensitively against .docx, and any other extension causes the function to return an empty dictionary.
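The extension check can be tried in isolation. This is a minimal sketch of the same os.path.splitext test used in the source; the helper name has_docx_extension is hypothetical.

```python
import os


def has_docx_extension(file_name: str) -> bool:
    """Return True when the final extension is .docx, ignoring case."""
    _, ext = os.path.splitext(file_name)
    return ext.lower() == ".docx"


# Only the last extension counts, so "report.docx.bak" is rejected
for name in ["report.docx", "REPORT.DOCX", "report.doc", "report.docx.bak", "report"]:
    print(name, has_docx_extension(name))
```

Note that the check looks only at the name, not at the bytes, so a mislabeled file passes this step and fails later during parsing.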

Return Value

Type: Dict[str, str]

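Returns a dictionary (Dict[str, str]) where keys are section heading texts and values are the non-empty paragraphs under each heading, joined with newlines. Content before the first heading is stored under the default key 'Document'. A heading with no text beneath it produces no entry, and if two headings share the same text, the later section overwrites the earlier one. The function returns an empty dictionary if the file is not a .docx file, if python-docx is unavailable (DOCX_AVAILABLE is false), or if an exception occurs while opening the document; the exception is logged rather than raised.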
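Because an empty dictionary is returned for several different conditions, a caller may want to inspect the input when the result is empty. The helper below is a hypothetical sketch, not part of the module. It relies on the fact that a .docx file is a ZIP package, so bytes that are not a ZIP archive cannot be a valid document. Passing the check does not guarantee that the file is a well-formed Word document.

```python
import io
import os
import zipfile
from typing import Optional


def explain_empty_result(file_content: bytes, file_name: str) -> Optional[str]:
    """Return a likely reason for an empty result, or None if the input looks plausible."""
    _, ext = os.path.splitext(file_name)
    if ext.lower() != ".docx":
        return "unsupported extension"
    if not file_content:
        return "empty file content"
    # A .docx file is a ZIP package, so non-ZIP bytes cannot be parsed
    if not zipfile.is_zipfile(io.BytesIO(file_content)):
        return "not a ZIP container"
    return None


# Build a small ZIP archive in memory to stand in for plausible input
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("placeholder.txt", "not a real document")
zip_bytes = buffer.getvalue()

print(explain_empty_result(b"plain text", "notes.txt"))
print(explain_empty_result(b"plain text", "notes.docx"))
print(explain_empty_result(zip_bytes, "notes.docx"))
```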

Dependencies

os
tempfile
logging
typing
python-docx

Required Imports

import os
import tempfile
import logging
from typing import Dict
from docx import Document as DocxDocument

Conditional/Optional Imports

These imports are only needed under specific conditions:

from docx import Document as DocxDocument

Condition: Required for parsing .docx files. The function checks the DOCX_AVAILABLE flag before using this import and returns an empty dictionary when the flag is false.

Required (conditional)
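The page does not show how DOCX_AVAILABLE is set. A common pattern, given here only as an assumption about how the module might do it, is a guarded import at module level:

```python
# Hypothetical module-level guard; the real module may set the flag differently.
try:
    from docx import Document as DocxDocument
    DOCX_AVAILABLE = True
except ImportError:
    DocxDocument = None  # placeholder so the name exists either way
    DOCX_AVAILABLE = False

print(f"python-docx available: {DOCX_AVAILABLE}")
```

With a guard like this, importing the module succeeds whether or not python-docx is installed, and extract_document_sections returns an empty dictionary when the library is missing.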

Usage Example

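# Assumes the extract_document_sections definition from Source Code is in this module;
# it relies on the module-level names DOCX_AVAILABLE and logger defined below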
import logging
from typing import Dict
import os
import tempfile
from docx import Document as DocxDocument

logger = logging.getLogger(__name__)
DOCX_AVAILABLE = True

# Read a Word document
with open('example.docx', 'rb') as f:
    file_content = f.read()

# Extract sections
sections = extract_document_sections(file_content, 'example.docx')

# Process extracted sections
for heading, content in sections.items():
    print(f"Section: {heading}")
    print(f"Content: {content[:100]}...")  # Print first 100 chars
    print("-" * 50)

# Example output structure:
# sections = {
#     'Introduction': 'This is the introduction text...',
#     'Chapter 1': 'Content of chapter 1...',
#     'Conclusion': 'Final thoughts...'
# }

Best Practices

Ensure the DOCX_AVAILABLE flag is set at module scope before calling this function; when it is false, the function returns an empty dictionary without logging anything
The function writes file_content to a temporary file and deletes it in a finally block; ensure sufficient disk space is available, and note that a system crash between those steps can leave the file behind for manual cleanup
Only .docx files are processed; other formats return an empty dictionary without error
Headings are detected by paragraph style: any style whose name contains 'heading' (case-insensitive) starts a new section, so documents must use heading styles (Heading 1, Heading 2, etc.) rather than manually formatted text
Errors are caught and logged, not raised; an empty dictionary can mean an unsupported file, a missing dependency, a parsing failure, or a document with no body text, and only the parsing failure is logged
Content before the first heading is stored under the default 'Document' key
Whitespace-only paragraphs are skipped; all other paragraphs are stored with their text unchanged
Heading levels are not preserved, headings with no content beneath them are omitted, and repeated heading text overwrites the earlier section
For large documents, consider memory usage: the entire file content is held in memory and is also written to disk
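The temporary-file handling can be sketched with the standard library alone. In the sketch below, the function name read_via_temp_file is hypothetical, and reading the file back stands in for DocxDocument(temp_path). The file is created with delete=False and closed before it is reopened by path, presumably because reopening a still-open named temporary file is not possible on every platform.

```python
import os
import tempfile
from typing import Tuple


def read_via_temp_file(file_content: bytes, suffix: str = ".docx") -> Tuple[int, str]:
    """Write bytes to a temp file, reopen it by path, and always delete it."""
    # delete=False so the file survives the with-block and can be reopened by path
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
        temp_file.write(file_content)
        temp_path = temp_file.name

    try:
        # Stand-in for DocxDocument(temp_path): read the file back by its path
        with open(temp_path, "rb") as handle:
            size = len(handle.read())
        return size, temp_path
    finally:
        try:
            os.unlink(temp_path)
        except OSError:
            pass  # file already gone or not removable; nothing more to do


size, used_path = read_via_temp_file(b"example bytes")
print(size, os.path.exists(used_path))
```

The finally block runs even when parsing raises, which is why the cleanup sits there instead of after the parsing step.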

Similar Components

AI-powered semantic similarity - components with related functionality:

function extract_metadata 66.2% similar

Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function validate_document_structure 65.1% similar

Validates the structural integrity of a DOCX document by checking if it contains all required sections specified in the document type template configuration.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function export_to_docx_v1 64.0% similar

Exports a document object to Microsoft Word DOCX format, converting sections, content, and references into a formatted Word document with proper styling and structure.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function extract_metadata_docx 62.8% similar

Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function export_to_docx 60.8% similar

Exports a document with text and data sections to Microsoft Word DOCX format, preserving formatting, structure, and metadata.
From: /tf/active/vicechatdev/vice_ai/new_app.py
