function validate_document_structure
Validates the structural integrity of a DOCX document by checking if it contains all required sections specified in the document type template configuration.
/tf/active/vicechatdev/CDocs/utils/document_processor.py
310 - 364
moderate
Purpose
This function ensures that uploaded documents conform to predefined structural templates based on their document type. It extracts headings from a DOCX file and verifies that all required sections (as defined in the document type configuration) are present. This is useful for enforcing document standards, ensuring compliance with organizational templates, and providing feedback to users about missing sections before document submission or processing.
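The matching step described here can be sketched on its own, without any DOCX parsing. This is a minimal illustration assuming headings are already available as `(text, level)` pairs; `find_missing_sections` is a hypothetical helper name, not part of the CDocs module.

```python
from typing import List, Tuple


def find_missing_sections(
    headings: List[Tuple[str, int]], required_sections: List[str]
) -> List[str]:
    """Return the required sections that no heading text contains."""
    # Lowercase once so the comparison is case-insensitive
    heading_texts = [text.lower() for text, _level in headings]
    return [
        section
        for section in required_sections
        if not any(section.lower() in text for text in heading_texts)
    ]


# Hypothetical headings, as (text, level) pairs
headings = [("1. Introduction", 1), ("METHODOLOGY", 1), ("Results", 2)]
required = ["Introduction", "Methodology", "Results", "Conclusion"]

missing = find_missing_sections(headings, required)
print(missing)
```

Because the check is a substring test, a heading such as "Preliminary Results" would also satisfy a required "Results" section.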
Source Code
def validate_document_structure(file_content: bytes, doc_type: str) -> List[str]:
    """
    Validate document structure based on document type.

    Args:
        file_content: Binary content of the file
        doc_type: Document type code

    Returns:
        List of validation issues
    """
    issues = []
    try:
        # Get document type template requirements
        doc_type_info = settings.get_document_type(doc_type)
        if not doc_type_info or 'template_sections' not in doc_type_info:
            return issues  # No template requirements for this type
        required_sections = doc_type_info['template_sections']

        # Create temporary file (delete=False so it can be reopened by path)
        with tempfile.NamedTemporaryFile(suffix='.docx', delete=False) as temp_file:
            temp_file.write(file_content)
            temp_path = temp_file.name

        try:
            # Parse document
            doc = DocxDocument(temp_path)

            # Extract headings as (text, level) tuples
            headings = []
            for para in doc.paragraphs:
                if para.style and 'heading' in para.style.name.lower():
                    # 'Heading 1' -> '1'; strip the space so isdigit() can succeed
                    level_text = para.style.name.replace('Heading', '').strip()
                    heading_level = int(level_text) if level_text.isdigit() else 0
                    headings.append((para.text, heading_level))

            # Check required sections
            for section in required_sections:
                # Simple text matching for basic validation
                if not any(section.lower() in heading.lower() for heading, _ in headings):
                    issues.append(f"Required section '{section}' not found in document")
        finally:
            # Clean up temporary file
            try:
                os.unlink(temp_path)
            except OSError:
                pass
    except Exception as e:
        logger.error(f"Error validating document structure: {e}")
        issues.append(f"Could not validate document structure: {e}")
    return issues
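Built-in Word heading styles are named like 'Heading 1', with a space before the digit, so whatever remains after removing 'Heading' has to be stripped of whitespace before a digit check can succeed. The following is a minimal standalone sketch of that parsing, using a regular expression instead of string replacement. `parse_heading_level` is a hypothetical helper, not part of the CDocs module.

```python
import re

# Matches 'Heading 1', 'heading 2', 'Heading3'; anything else has no level
_HEADING_STYLE = re.compile(r"^heading\s*(\d+)$", re.IGNORECASE)


def parse_heading_level(style_name: str) -> int:
    """Return the numeric level of a 'Heading N' style name, or 0 if there is none."""
    match = _HEADING_STYLE.match(style_name.strip())
    return int(match.group(1)) if match else 0


# The unstripped remainder keeps its leading space, so isdigit() is False
remainder = "Heading 1".replace("Heading", "")
print(repr(remainder), remainder.isdigit())
print(parse_heading_level("Heading 1"))
```

Style names that merely contain the word, such as 'TOC Heading', fall back to level 0 in this sketch, which mirrors the fallback in the listing.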
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_content | bytes | - | positional_or_keyword |
| doc_type | str | - | positional_or_keyword |
Parameter Details
file_content: Binary content of the DOCX file to be validated. This should be the raw bytes read from a file upload or file system (e.g., result of file.read() or open(path, 'rb').read()). The function will write this to a temporary file for processing.
doc_type: A string code identifying the document type (e.g., 'contract', 'report', 'proposal'). This code is used to look up the corresponding template requirements from the settings configuration, which defines what sections must be present in documents of this type.
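To make the lookup concrete, here is a small sketch of how a document type code might resolve to its required sections. The registry below is a hypothetical in-memory stand-in for `settings.get_document_type()`; the real configuration lives in CDocs settings and may be shaped differently. The 'memo' type is invented for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical configuration; real codes and sections come from CDocs settings
DOCUMENT_TYPES: Dict[str, dict] = {
    "technical_report": {
        "template_sections": ["Introduction", "Methodology", "Results", "Conclusion"],
    },
    "memo": {},  # a type with no template requirements
}


def get_document_type(code: str) -> Optional[dict]:
    """Stand-in for settings.get_document_type(): None for unknown codes."""
    return DOCUMENT_TYPES.get(code)


def required_sections_for(doc_type: str) -> List[str]:
    """Return the required sections, or an empty list when none are configured."""
    info = get_document_type(doc_type)
    if not info or "template_sections" not in info:
        return []
    return list(info["template_sections"])


print(required_sections_for("technical_report"))
print(required_sections_for("memo"))
print(required_sections_for("unknown_code"))
```

An unknown code and a type without `template_sections` both yield no requirements, which corresponds to the early return with an empty issue list in the function.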
Return Value
Type: List[str]
Returns a List[str] containing validation issue messages. Each string describes one problem found during validation. A missing section produces a message of the form "Required section 'Introduction' not found in document". An empty list means either that every required section was found or that the document type has no template requirements (unknown type, or no 'template_sections' key). If parsing or any other step raises an exception, the error is logged and a single message of the form "Could not validate document structure: <error>" is appended instead of the exception being raised.
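Callers often need to tell missing-section messages apart from processing errors. One way to do that, sketched here under the assumption that the message formats stay exactly as in the source listing, is to match on the message prefixes. `split_issues` is a hypothetical helper, not part of the CDocs module.

```python
from typing import List, Tuple

# Message fragments copied from the function's f-strings
MISSING_PREFIX = "Required section '"
MISSING_SUFFIX = "' not found in document"
ERROR_PREFIX = "Could not validate document structure:"


def split_issues(issues: List[str]) -> Tuple[List[str], List[str]]:
    """Split issues into (missing section names, processing error messages)."""
    missing: List[str] = []
    errors: List[str] = []
    for issue in issues:
        if issue.startswith(MISSING_PREFIX) and issue.endswith(MISSING_SUFFIX):
            # Slice out the section name between the fixed prefix and suffix
            missing.append(issue[len(MISSING_PREFIX):-len(MISSING_SUFFIX)])
        elif issue.startswith(ERROR_PREFIX):
            errors.append(issue)
    return missing, errors


# Hypothetical return value from validate_document_structure
sample = [
    "Required section 'Conclusion' not found in document",
    "Could not validate document structure: File is not a zip file",
]
missing, errors = split_issues(sample)
print(missing)
print(errors)
```

Parsing human-readable messages is brittle; if the wording in the function changes, the constants here would need to change with it.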
Dependencies
logging, os, tempfile, typing, docx, CDocs
Required Imports
import logging
import os
import tempfile
from typing import List
from docx import Document as DocxDocument
from CDocs.config import settings
Usage Example
# Assuming CDocs settings are properly configured
import logging

# Import path inferred from the file location; adjust if your package layout differs
from CDocs.utils.document_processor import validate_document_structure

# Make error messages logged by the function visible
logging.basicConfig(level=logging.INFO)

# Read a DOCX file
with open('document.docx', 'rb') as f:
    file_content = f.read()

# Validate the document structure
doc_type = 'technical_report'  # Document type code
issues = validate_document_structure(file_content, doc_type)

# Check validation results
if issues:
    print('Validation failed with the following issues:')
    for issue in issues:
        print(f'  - {issue}')
else:
    print('Document structure is valid!')

# Example with settings configuration:
# settings.get_document_type('technical_report') should return:
# {
#     'template_sections': ['Introduction', 'Methodology', 'Results', 'Conclusion']
# }
Best Practices
- Ensure the settings.get_document_type() method is properly configured with template_sections for each document type you want to validate
- The function creates temporary files on disk - ensure the application has write permissions to the system's temp directory
- File cleanup is handled in a finally block, but in rare cases of system crashes, temporary files may persist in the temp directory
- The function only validates DOCX files; content in any other format fails to parse and comes back as a single "Could not validate document structure: ..." issue rather than an exception
- Section matching is case-insensitive and uses substring matching, so 'Introduction' will match '1. Introduction' or 'INTRODUCTION'
- The function catches all exceptions and returns them as validation issues rather than raising them, making it safe to use in validation pipelines
- Heading extraction iterates over every paragraph in the document body, so validation time grows with document length
- Ensure a logger is configured at module level before calling this function to capture error messages properly
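Since non-DOCX input only surfaces as a generic validation issue, a cheap pre-check before calling the function can give users a clearer message. The sketch below relies on the fact that a DOCX file is a ZIP archive and assumes the main document part is stored at `word/document.xml`, which is the usual location but is treated here as a heuristic. `looks_like_docx` is a hypothetical helper, not part of the CDocs module.

```python
import io
import zipfile


def looks_like_docx(file_content: bytes) -> bool:
    """Heuristic check: a ZIP archive that contains word/document.xml."""
    buffer = io.BytesIO(file_content)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)  # rewind before reopening the same buffer
    with zipfile.ZipFile(buffer) as archive:
        return "word/document.xml" in archive.namelist()


# Build a tiny in-memory archive that only mimics the DOCX layout
fake_docx = io.BytesIO()
with zipfile.ZipFile(fake_docx, "w") as archive:
    archive.writestr("word/document.xml", "<document/>")

print(looks_like_docx(fake_docx.getvalue()))
print(looks_like_docx(b"plain text, not an archive"))
```

Passing this check does not guarantee the file parses; it only filters out input that is clearly not a Word document.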
Similar Components
AI-powered semantic similarity - components with related functionality:
- function validate_document 80.5% similar
- function extract_document_sections 65.1% similar
- function test_docx_file 63.8% similar
- function process_document 56.6% similar
- function is_valid_document_file 56.4% similar