function extract_document_sections

Maturity: 61

Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.

File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 366 - 428
Complexity: moderate

Purpose

This function processes Microsoft Word (.docx) files to extract document structure based on heading styles. It takes the file's raw bytes, writes them to a temporary file for parsing, identifies paragraphs that use a heading style, and collects the non-heading paragraphs that follow under that heading. Heading levels are not preserved: every heading, whatever its level, starts a new top-level entry. This is useful for document analysis, content extraction, and structured data retrieval from Word documents.
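The grouping step described above can be tried without python-docx. The following is a minimal sketch, not the module's code: `group_by_heading` and the `(style_name, text)` tuples are hypothetical stand-ins for the paragraph objects that python-docx would supply.

```python
from typing import Dict, Iterable, List, Tuple


def group_by_heading(paragraphs: Iterable[Tuple[str, str]]) -> Dict[str, str]:
    """Group (style_name, text) pairs under the most recent heading.

    Hypothetical helper that mirrors the loop in extract_document_sections.
    """
    sections: Dict[str, str] = {}
    current_heading = "Document"  # default key for text before the first heading
    current_content: List[str] = []

    for style_name, text in paragraphs:
        if "heading" in style_name.lower():
            # A heading closes the previous section, if it collected any text
            if current_content:
                sections[current_heading] = "\n".join(current_content)
                current_content = []
            current_heading = text
        elif text.strip():
            # Whitespace-only paragraphs are skipped; others are kept verbatim
            current_content.append(text)

    # Flush the final section
    if current_content:
        sections[current_heading] = "\n".join(current_content)
    return sections


sample = [
    ("Normal", "Preamble text."),
    ("Heading 1", "Introduction"),
    ("Normal", "First paragraph."),
    ("Normal", "   "),
    ("Normal", "Second paragraph."),
    ("Heading 2", "Empty Section"),
    ("Heading 1", "Conclusion"),
    ("Normal", "Final thoughts."),
]
result = group_by_heading(sample)
print(result)
```

In this sample the heading "Empty Section" produces no entry, because a section is saved only when at least one non-blank paragraph follows its heading.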

Source Code

def extract_document_sections(file_content: bytes, file_name: str) -> Dict[str, str]:
    """
    Extract document sections based on headings.
    
    Args:
        file_content: Binary content of the file
        file_name: Name of the file
        
    Returns:
        Dictionary mapping section headings to section content
    """
    sections = {}
    
    try:
        # Get file extension
        _, ext = os.path.splitext(file_name)
        ext = ext.lower()
        
        if ext != '.docx' or not DOCX_AVAILABLE:
            return sections
            
        # Create temporary file
        with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name
            
        try:
            # Parse document
            doc = DocxDocument(temp_path)
            
            # Process paragraphs
            current_heading = "Document"
            current_content = []
            
            for para in doc.paragraphs:
                if para.style and 'heading' in para.style.name.lower():
                    # Save previous section
                    if current_content:
                        sections[current_heading] = '\n'.join(current_content)
                        current_content = []
                        
                    # Start new section
                    current_heading = para.text
                else:
                    # Add paragraph to current section
                    if para.text.strip():
                        current_content.append(para.text)
            
            # Save last section
            if current_content:
                sections[current_heading] = '\n'.join(current_content)
                
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except OSError:
                # Ignore cleanup failures (e.g. file already removed)
                pass
                
    except Exception as e:
        logger.error(f"Error extracting document sections: {e}")
        
    return sections

Parameters

Name          Type   Default  Kind
file_content  bytes  -        positional_or_keyword
file_name     str    -        positional_or_keyword

Parameter Details

file_content: Binary content of the document file (bytes). This should be the raw bytes read from a .docx file, typically obtained from file.read() or similar file reading operations.

file_name: Name of the file including extension (string). Used to determine file type through extension checking. Must include the .docx extension for processing to occur.
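The extension check on file_name can be illustrated in isolation. This is a small sketch of the same `os.path.splitext` test the function performs; the helper name `has_docx_extension` is hypothetical and does not exist in the module.

```python
import os


def has_docx_extension(file_name: str) -> bool:
    """Return True when the name ends in .docx, ignoring case."""
    _, ext = os.path.splitext(file_name)
    return ext.lower() == ".docx"


names = ("report.docx", "REPORT.DOCX", "report.doc", "archive.docx.bak", "report")
checks = {name: has_docx_extension(name) for name in names}
for name, accepted in checks.items():
    print(f"{name}: {accepted}")
```

Only the final extension counts, so a name such as archive.docx.bak is rejected even though it contains ".docx". The check looks at the name only and does not inspect the bytes in file_content.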

Return Value

Type: Dict[str, str]

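Returns a dictionary whose keys are heading texts and whose values are the non-blank paragraphs under each heading, joined with newlines. Content before the first heading is stored under the default key 'Document'. A heading with no content beneath it is omitted, and if two headings share the same text, the later section replaces the earlier one. An empty dictionary is returned if the file name does not end in .docx or if the docx library is unavailable (DOCX_AVAILABLE is false). If an exception occurs, it is logged and the function returns whatever sections were collected up to that point, which is usually none.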

Dependencies

  • os
  • tempfile
  • logging
  • typing
  • python-docx

Required Imports

import os
import tempfile
import logging
from typing import Dict
from docx import Document as DocxDocument

Conditional/Optional Imports

These imports are only needed under specific conditions:

from docx import Document as DocxDocument

Condition: required only for parsing .docx files. The function checks the DOCX_AVAILABLE flag before using this import.
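The page does not show how DOCX_AVAILABLE is set. A common pattern for an optional dependency, shown here as an assumption and not as the module's actual code, is to attempt the import once at module load and record the outcome:

```python
# Assumed pattern for an optional dependency; the real module may differ.
try:
    from docx import Document as DocxDocument
    DOCX_AVAILABLE = True
except ImportError:
    # python-docx is not installed; keep the name defined so references do not fail
    DocxDocument = None
    DOCX_AVAILABLE = False

print(f"python-docx available: {DOCX_AVAILABLE}")
```

With a guard like this, a missing library makes extract_document_sections return an empty dictionary instead of raising an error at import time.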

Usage Example

# Import path inferred from the file location listed above (an assumption);
# adjust it to your project layout.
# DOCX_AVAILABLE and logger live in that module's scope, not in the caller's.
from CDocs.utils.document_processor import extract_document_sections

# Read a Word document
with open('example.docx', 'rb') as f:
    file_content = f.read()

# Extract sections
sections = extract_document_sections(file_content, 'example.docx')

# Process extracted sections
for heading, content in sections.items():
    print(f"Section: {heading}")
    print(f"Content: {content[:100]}...")  # Print first 100 chars
    print("-" * 50)

# Example output structure:
# sections = {
#     'Introduction': 'This is the introduction text...',
#     'Chapter 1': 'Content of chapter 1...',
#     'Conclusion': 'Final thoughts...'
# }

Best Practices

  • DOCX_AVAILABLE is read from the module that defines the function, so confirm that python-docx is installed and the flag is true there before relying on the results
  • Each call writes the bytes to a temporary file and deletes it afterwards, so the temporary directory needs enough free space for one copy of the document
  • Only file names ending in .docx (case-insensitive) are processed; any other name returns an empty dictionary and logs no error
  • Headings are detected by style name: any paragraph whose style name contains "heading" (case-insensitive) starts a new section, so documents must use heading styles (Heading 1, Heading 2, etc.), not manually formatted text
  • Exceptions are caught and logged, not raised. An empty dictionary can therefore mean an unsupported file name, a missing library, a parsing failure, or a document with no text; check the log to tell these cases apart
  • Content before the first heading is stored under the default 'Document' key
  • Whitespace-only paragraphs are skipped; all other paragraphs are stored unmodified, including any leading or trailing whitespace
  • The temporary file is deleted in a finally block, but if the process is killed before that block runs, the file stays behind and needs manual cleanup
  • For large documents, consider resource usage: the entire file is held in memory as bytes and is also written to disk before parsing
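The temporary-file handling noted in the list above follows a write, close, parse, delete sequence. Below is a standard-library sketch of that sequence; `with_temp_copy` and `read_back` are hypothetical names used for illustration, and the parse step is a plain file read standing in for `DocxDocument(temp_path)`.

```python
import os
import tempfile
from typing import Callable, Dict, TypeVar

T = TypeVar("T")


def with_temp_copy(data: bytes, suffix: str, parse: Callable[[str], T]) -> T:
    """Write data to a temporary file, call parse(path), then delete the file."""
    # delete=False keeps the file on disk after the with-block closes it,
    # so it can be reopened by name on any platform.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as temp_file:
        temp_file.write(data)
        temp_path = temp_file.name
    try:
        return parse(temp_path)
    finally:
        # Runs on success and on error; cleanup failures are ignored
        try:
            os.unlink(temp_path)
        except OSError:
            pass


seen: Dict[str, str] = {}


def read_back(path: str) -> bytes:
    """Stand-in parser: remember the path and return the file's bytes."""
    seen["path"] = path
    with open(path, "rb") as handle:
        return handle.read()


payload = with_temp_copy(b"hello", ".docx", read_back)
print(payload, os.path.exists(seen["path"]))
```

Closing the file before parsing matters because, on Windows, a named temporary file cannot be opened a second time while it is still open.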

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_metadata 66.2% similar

    Extracts metadata from file content by analyzing the file type and computing file properties including hash, size, and type-specific metadata.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function validate_document_structure 65.1% similar

    Validates the structural integrity of a DOCX document by checking if it contains all required sections specified in the document type template configuration.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function export_to_docx_v1 64.0% similar

    Exports a document object to Microsoft Word DOCX format, converting sections, content, and references into a formatted Word document with proper styling and structure.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function extract_metadata_docx 62.8% similar

    Extracts comprehensive metadata from Microsoft Word DOCX files, including document properties, statistics, and fallback title extraction from content or filename.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function export_to_docx 60.8% similar

    Exports a document with text and data sections to Microsoft Word DOCX format, preserving formatting, structure, and metadata.

    From: /tf/active/vicechatdev/vice_ai/new_app.py