validate_document_structure

function validate_document_structure

Maturity: 58

Validates the structure of a DOCX document by checking whether its headings cover all required sections specified in the document type template configuration.

File:
/tf/active/vicechatdev/CDocs/utils/document_processor.py

Lines:
310 - 364

Complexity:
moderate

Purpose

This function ensures that uploaded documents conform to predefined structural templates based on their document type. It extracts headings from a DOCX file and verifies that all required sections (as defined in the document type configuration) are present. This is useful for enforcing document standards, ensuring compliance with organizational templates, and providing feedback to users about missing sections before document submission or processing.
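
The matching step described above can be sketched in isolation. This is a minimal illustration rather than the module's own code: find_missing_sections is a hypothetical helper, and the heading texts are passed in as plain strings instead of being extracted from a DOCX file.

```python
from typing import List


def find_missing_sections(required_sections: List[str], headings: List[str]) -> List[str]:
    """Return the required sections that no heading contains (case-insensitive)."""
    lowered = [heading.lower() for heading in headings]
    return [
        section
        for section in required_sections
        if not any(section.lower() in heading for heading in lowered)
    ]


# Heading texts as they might appear in a document (illustrative values)
headings = ["1. Introduction", "2. METHODOLOGY", "3. Results"]
required = ["Introduction", "Methodology", "Results", "Conclusion"]
print(find_missing_sections(required, headings))  # ['Conclusion']
```

Because the comparison is a substring test, numbering prefixes and differences in case do not cause false failures, at the cost of accepting loosely related headings.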

Source Code

def validate_document_structure(file_content: bytes, doc_type: str) -> List[str]:
    """
    Validate document structure based on document type.
    
    Args:
        file_content: Binary content of the file
        doc_type: Document type code
        
    Returns:
        List of validation issues
    """
    issues = []
    
    try:
        # Get document type template requirements
        doc_type_info = settings.get_document_type(doc_type)
        if not doc_type_info or 'template_sections' not in doc_type_info:
            return issues  # No template requirements for this type
            
        required_sections = doc_type_info['template_sections']
        
        # Create temporary file
        with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name
            
        try:
            # Parse document
            doc = DocxDocument(temp_path)
            
            # Extract headings
            headings = []
            for para in doc.paragraphs:
                if para.style and 'heading' in para.style.name.lower():
                    heading_level = int(para.style.name.replace('Heading', '')) if para.style.name.replace('Heading', '').isdigit() else 0
                    headings.append((para.text, heading_level))
            
            # Check required sections
            for section in required_sections:
                # Simple text matching for basic validation
                if not any(section.lower() in heading.lower() for heading, _ in headings):
                    issues.append(f"Required section '{section}' not found in document")
                    
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except:
                pass
                
    except Exception as e:
        logger.error(f"Error validating document structure: {e}")
        issues.append(f"Could not validate document structure: {e}")
        
    return issues
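
One detail of the listing above is worth checking. Word's built-in heading styles are typically named with a space ('Heading 1'), so replace('Heading', '') leaves ' 1', for which str.isdigit() is False, and heading_level falls back to 0. The level is not used in the section check, so validation results are unaffected. If the level were needed, a more tolerant parse might look like the sketch below; parse_heading_level is a hypothetical helper, not part of the module.

```python
import re


def parse_heading_level(style_name: str) -> int:
    """Return the numeric level from a style name like 'Heading 2', or 0 if none."""
    match = re.fullmatch(r"\s*heading\s*(\d+)\s*", style_name, flags=re.IGNORECASE)
    return int(match.group(1)) if match else 0


def original_level(style_name: str) -> int:
    """The expression used in the listing, reproduced here for comparison."""
    stripped = style_name.replace('Heading', '')
    return int(stripped) if stripped.isdigit() else 0


# Compare both approaches on a few style names
for name in ("Heading 1", "Heading2", "heading 3", "Title"):
    print(name, original_level(name), parse_heading_level(name))
```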

Parameters

Name	Type	Default	Kind
`file_content`	bytes	-	positional_or_keyword
`doc_type`	str	-	positional_or_keyword

Parameter Details

file_content: Binary content of the DOCX file to be validated. This should be the raw bytes read from a file upload or file system (e.g., result of file.read() or open(path, 'rb').read()). The function will write this to a temporary file for processing.

doc_type: A string code identifying the document type (e.g., 'contract', 'report', 'proposal'). This code is used to look up the corresponding template requirements from the settings configuration, which defines what sections must be present in documents of this type.

Return Value

Type: List[str]

Returns a list of validation issue messages. Each string describes one problem found during validation, such as a missing required section (e.g., "Required section 'Introduction' not found in document"). An empty list means either that every required section was found or that the document type defines no template_sections, in which case nothing is checked. If an exception occurs during validation (for example, the bytes cannot be parsed as a DOCX file), the error is logged and a message of the form "Could not validate document structure: <error>" is appended to the list.
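
A caller may want to treat missing sections differently from a failed validation run. The sketch below relies only on the two message formats visible in the source listing; split_issues and the two prefix constants are hypothetical names introduced for illustration, and matching on message text is brittle if the wording ever changes.

```python
from typing import List, Tuple

# Prefixes taken from the f-strings in the source listing
MISSING_PREFIX = "Required section '"
ERROR_PREFIX = "Could not validate document structure:"


def split_issues(issues: List[str]) -> Tuple[List[str], List[str]]:
    """Separate missing-section messages from validation-failure messages."""
    missing = [msg for msg in issues if msg.startswith(MISSING_PREFIX)]
    errors = [msg for msg in issues if msg.startswith(ERROR_PREFIX)]
    return missing, errors


# Illustrative messages; the error text after the colon is made up
sample = [
    "Required section 'Conclusion' not found in document",
    "Could not validate document structure: example error text",
]
missing, errors = split_issues(sample)
print(len(missing), len(errors))  # 1 1
```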

Dependencies

logging
os
tempfile
typing
docx
CDocs

Required Imports

import logging
import os
import tempfile
from typing import List
from docx import Document as DocxDocument
from CDocs.config import settings

Usage Example

# Assumes the CDocs package is importable and its settings are configured.
# The import path is inferred from the file location listed above.
from CDocs.utils.document_processor import validate_document_structure

# Read a DOCX file as raw bytes
with open('document.docx', 'rb') as f:
    file_content = f.read()

# Validate the document structure
doc_type = 'technical_report'  # Document type code
issues = validate_document_structure(file_content, doc_type)

# Check validation results
if issues:
    print('Validation failed with the following issues:')
    for issue in issues:
        print(f'  - {issue}')
else:
    print('Document structure is valid!')

# Example with settings configuration:
# settings.get_document_type('technical_report') should return:
# {
#     'template_sections': ['Introduction', 'Methodology', 'Results', 'Conclusion']
# }

Best Practices

Ensure settings.get_document_type() returns a 'template_sections' list for each document type you want to validate; types without one are not checked and always yield an empty list.
The function creates a temporary file on disk, so the application needs write permission to the system's temp directory.
The temporary file is removed in a finally block, but if the process is killed before that block runs, the file may persist in the temp directory.
The function parses DOCX files only; for a document type with template_sections, content in any other format produces a "Could not validate document structure" issue.
Only paragraphs whose style name contains 'heading' are treated as headings, so section titles formatted as bold body text are not detected.
Section matching is case-insensitive and uses substring matching against heading text, so 'Introduction' will match '1. Introduction' or 'INTRODUCTION'. By the same rule, 'Results' is also satisfied by a heading such as 'Preliminary Results'.
The function catches all exceptions and returns them as validation issues rather than raising them, making it safe to use in validation pipelines.
Heading extraction iterates over every paragraph in the document, so runtime grows with document length.
Errors are reported through a module-level name called logger; configure logging in the application so these messages are captured.
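
Since a DOCX file is a ZIP archive whose main part is conventionally stored at word/document.xml, a cheap pre-check can reject obviously wrong uploads before the validator writes anything to disk. The sketch below uses only the standard library; looks_like_docx and make_fake_docx are hypothetical helpers, and passing the check does not prove the file is a valid Word document.

```python
import io
import zipfile


def looks_like_docx(file_content: bytes) -> bool:
    """Heuristic: a ZIP archive that contains the main WordprocessingML part."""
    buffer = io.BytesIO(file_content)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return "word/document.xml" in archive.namelist()


def make_fake_docx() -> bytes:
    """Build a minimal in-memory archive for demonstration (not a valid document)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", "<w:document/>")
    return buffer.getvalue()


print(looks_like_docx(make_fake_docx()))        # True
print(looks_like_docx(b"%PDF-1.7 not a docx"))  # False
```

A check like this keeps non-DOCX uploads from surfacing as generic "Could not validate document structure" messages and lets the caller return a more specific error instead.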

Similar Components

AI-powered semantic similarity - components with related functionality:

function validate_document 80.5% similar

Validates document files by checking file size, extension, and optionally performing type-specific structural validation for supported document formats.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function extract_document_sections 65.1% similar

Extracts structured sections from a DOCX document by parsing headings and organizing content under each heading into a dictionary.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function test_docx_file 63.8% similar

Tests the ability to open and read a Microsoft Word (.docx) document file, validating file existence, size, and content extraction capabilities.
From: /tf/active/vicechatdev/docchat/test_problematic_files.py

function process_document 56.6% similar

Processes a document file (DOCX, DOC, or PDF) and extracts comprehensive metadata including file information, content metadata, and cryptographic hash.
From: /tf/active/vicechatdev/CDocs/utils/document_processor.py

function is_valid_document_file 56.4% similar

Validates whether a given filename has an extension corresponding to a supported document type by checking against a predefined list of valid document extensions.
From: /tf/active/vicechatdev/CDocs/utils/__init__.py
