function validate_document_structure

Maturity: 58

Validates the structure of a DOCX document by checking that its headings cover every required section listed in the template configuration for its document type.

File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 310 - 364
Complexity: moderate

Purpose

This function ensures that uploaded documents conform to predefined structural templates based on their document type. It extracts headings from a DOCX file and verifies that all required sections (as defined in the document type configuration) are present. This is useful for enforcing document standards, ensuring compliance with organizational templates, and providing feedback to users about missing sections before document submission or processing.
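The core check described above can be sketched without any DOCX parsing. This is a minimal illustration, not the module's code: `find_missing_sections` is a hypothetical helper name, and the heading and section lists are made-up sample data.

```python
from typing import Iterable, List


def find_missing_sections(headings: Iterable[str], required_sections: Iterable[str]) -> List[str]:
    """Return the required sections that no heading contains (case-insensitive substring match)."""
    lowered = [heading.lower() for heading in headings]
    return [
        section
        for section in required_sections
        if not any(section.lower() in heading for heading in lowered)
    ]


# Sample data (hypothetical): headings as they might be extracted from a document
headings = ["1. Introduction", "2. METHODOLOGY", "3. Results"]
required = ["Introduction", "Methodology", "Results", "Conclusion"]

missing = find_missing_sections(headings, required)
print(missing)  # ['Conclusion']
```

The real function reports each missing section as a message string rather than returning the section names themselves.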

Source Code

def validate_document_structure(file_content: bytes, doc_type: str) -> List[str]:
    """
    Validate document structure based on document type.
    
    Args:
        file_content: Binary content of the file
        doc_type: Document type code
        
    Returns:
        List of validation issues
    """
    issues = []
    
    try:
        # Get document type template requirements
        doc_type_info = settings.get_document_type(doc_type)
        if not doc_type_info or 'template_sections' not in doc_type_info:
            return issues  # No template requirements for this type
            
        required_sections = doc_type_info['template_sections']
        
        # Create temporary file
        with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name
            
        try:
            # Parse document
            doc = DocxDocument(temp_path)
            
            # Extract headings
            headings = []
            for para in doc.paragraphs:
                if para.style and 'heading' in para.style.name.lower():
                    # Style names look like 'Heading 1'; strip the space before testing for digits
                    level_text = para.style.name.lower().replace('heading', '').strip()
                    heading_level = int(level_text) if level_text.isdigit() else 0
                    headings.append((para.text, heading_level))
            
            # Check required sections
            for section in required_sections:
                # Simple text matching for basic validation
                if not any(section.lower() in heading.lower() for heading, _ in headings):
                    issues.append(f"Required section '{section}' not found in document")
                    
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except OSError:
                # Best-effort cleanup; a leftover temp file is not a validation issue
                pass
                
    except Exception as e:
        logger.error(f"Error validating document structure: {e}")
        issues.append(f"Could not validate document structure: {e}")
        
    return issues
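The temporary file in the listing is not strictly required, because python-docx's `Document()` also accepts a file-like object. The sketch below shows one way the heading extraction might be done in memory. The names `heading_level` and `extract_headings` are hypothetical and not part of the module, and the regular expression assumes the built-in English style names such as "Heading 1".

```python
import io
import re
from typing import List, Tuple


def heading_level(style_name: str) -> int:
    """Return N for a style named 'Heading N' (space optional), else 0."""
    match = re.fullmatch(r"heading\s*(\d+)", style_name.strip(), flags=re.IGNORECASE)
    return int(match.group(1)) if match else 0


def extract_headings(file_content: bytes) -> List[Tuple[str, int]]:
    """Collect (text, level) for every heading-styled paragraph, without a temp file."""
    # Imported lazily so heading_level() stays usable when python-docx is not installed.
    from docx import Document as DocxDocument

    doc = DocxDocument(io.BytesIO(file_content))
    headings = []
    for para in doc.paragraphs:
        # A paragraph's style or style name can be missing; treat that as "not a heading"
        name = (para.style.name or "") if para.style is not None else ""
        if "heading" in name.lower():
            headings.append((para.text, heading_level(name)))
    return headings
```

Parsing from memory avoids the write-permission and cleanup concerns of a temporary file, at the cost of holding the whole upload in memory, which the function already does through `file_content`.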

Parameters

Name          Type   Default  Kind
file_content  bytes  -        positional_or_keyword
doc_type      str    -        positional_or_keyword

Parameter Details

file_content: Binary content of the DOCX file to be validated. This should be the raw bytes read from a file upload or file system (e.g., result of file.read() or open(path, 'rb').read()). The function will write this to a temporary file for processing.

doc_type: A string code identifying the document type (e.g., 'contract', 'report', 'proposal'). This code is used to look up the corresponding template requirements from the settings configuration, which defines what sections must be present in documents of this type.

Return Value

Type: List[str]

Returns a List[str] of validation issue messages. Each missing section produces one message of the form "Required section 'Introduction' not found in document". An empty list means either that every required section was found or that the document type defines no template_sections, so the two cases cannot be distinguished from the return value alone. If an exception occurs during validation, the error is logged and the list contains a message of the form "Could not validate document structure: <error>".
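Because the result is a flat list of strings, a caller that wants to treat missing sections differently from processing failures has to inspect the message text. The following is a rough sketch of that idea. `summarize_issues` and the two prefix constants are hypothetical names, and the prefixes are copied from the message formats in the source listing, so they would break if those messages change.

```python
from typing import Dict, List

# Prefixes taken from the f-strings in validate_document_structure
MISSING_PREFIX = "Required section "
FAILURE_PREFIX = "Could not validate document structure:"


def summarize_issues(issues: List[str]) -> Dict[str, List[str]]:
    """Group issue messages by kind, based on their leading text."""
    summary: Dict[str, List[str]] = {"missing_sections": [], "failures": [], "other": []}
    for issue in issues:
        if issue.startswith(FAILURE_PREFIX):
            summary["failures"].append(issue)
        elif issue.startswith(MISSING_PREFIX):
            summary["missing_sections"].append(issue)
        else:
            summary["other"].append(issue)
    return summary


# Sample messages in the formats the function produces
sample = [
    "Required section 'Conclusion' not found in document",
    "Could not validate document structure: file is not a zip file",
]
summary = summarize_issues(sample)
print(len(summary["missing_sections"]), len(summary["failures"]))  # 1 1
```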

Dependencies

  • logging
  • os
  • tempfile
  • typing
  • docx (provided by the python-docx package)
  • CDocs

Required Imports

import logging
import os
import tempfile
from typing import List
from docx import Document as DocxDocument
from CDocs.config import settings

Usage Example

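# Assumes CDocs settings are configured and the CDocs package is importable.
# The import path below is inferred from the file location listed above.
import logging

from CDocs.utils.document_processor import validate_document_structure

# Configure logging so that errors logged by the function are visible
logging.basicConfig(level=logging.INFO)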

# Read a DOCX file
with open('document.docx', 'rb') as f:
    file_content = f.read()

# Validate the document structure
doc_type = 'technical_report'  # Document type code
issues = validate_document_structure(file_content, doc_type)

# Check validation results
if issues:
    print('Validation failed with the following issues:')
    for issue in issues:
        print(f'  - {issue}')
else:
    print('Document structure is valid!')

# Example with settings configuration:
# settings.get_document_type('technical_report') should return:
# {
#     'template_sections': ['Introduction', 'Methodology', 'Results', 'Conclusion']
# }

Best Practices

  • Ensure the settings.get_document_type() method is properly configured with template_sections for each document type you want to validate
  • The function creates temporary files on disk - ensure the application has write permissions to the system's temp directory
  • File cleanup is handled in a finally block, but in rare cases of system crashes, temporary files may persist in the temp directory
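  • The function only parses DOCX files; for any other format the parser raises an error, which is returned as a single 'Could not validate document structure: ...' issue (provided the document type defines template_sections)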
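  • Section matching is case-insensitive and uses substring matching, so 'Introduction' will match '1. Introduction' or 'INTRODUCTION'; it will also match any heading that merely contains the text, so 'Scope' matches 'Microscope Setup'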
  • The function catches every Exception raised during validation, logs it, and returns it as a validation issue rather than raising it, making it safe to use in validation pipelines
  • For large documents, the heading extraction process may take time as it iterates through all paragraphs
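  • The function logs errors through a module-level logger variable, so that variable must be defined in the module containing the function, and the application should configure logging handlers so the messages are captured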
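Where substring matching is too permissive, a stricter comparison is one option. The sketch below is an alternative technique, not what the function does: it strips leading numbering such as "1." or "2.1" from each heading and then requires an exact, case-insensitive match. `normalize_heading` and `find_missing_sections_exact` are hypothetical names, and the numbering pattern is an assumption about how headings are written.

```python
import re
from typing import Iterable, List

# Leading numbering such as "1. ", "2.1 " or "3) " (assumed heading convention)
_NUMBERING = re.compile(r"^\s*\d+(?:\.\d+)*[.)]?\s+")


def normalize_heading(text: str) -> str:
    """Drop leading numbering and surrounding whitespace, then case-fold."""
    return _NUMBERING.sub("", text).strip().casefold()


def find_missing_sections_exact(headings: Iterable[str], required_sections: Iterable[str]) -> List[str]:
    """Return required sections that do not equal any normalized heading."""
    normalized = {normalize_heading(heading) for heading in headings}
    return [
        section
        for section in required_sections
        if section.strip().casefold() not in normalized
    ]


# Sample data (hypothetical)
headings = ["1. Introduction", "2.1 Out of Scope", "Microscope Setup"]
missing = find_missing_sections_exact(headings, ["Introduction", "Scope"])
print(missing)  # ['Scope']
```

With substring matching, 'Scope' would count as present here because it appears inside two unrelated headings. The trade-off is that exact matching rejects legitimate variants such as 'Introduction and Background', so which behavior is preferable depends on how strictly templates are followed.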

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function validate_document 80.5% similar

    Validates document files by checking file size, extension, and optionally performing type-specific structural validation for supported document formats.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function extract_document_sections 65.1% similar

    Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function test_docx_file 63.8% similar

    Tests the ability to open and read a Microsoft Word (.docx) document file, validating file existence, size, and content extraction capabilities.

    From: /tf/active/vicechatdev/docchat/test_problematic_files.py
  • function process_document 56.6% similar

    Processes a document file (DOCX, DOC, or PDF) and extracts comprehensive metadata including file information, content metadata, and cryptographic hash.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function is_valid_document_file 56.4% similar

    Validates whether a given filename has an extension corresponding to a supported document type by checking against a predefined list of valid document extensions.

    From: /tf/active/vicechatdev/CDocs/utils/__init__.py