function validate_document

Maturity: 59

Validates a document file by checking its size and extension and, when a doc_type is supplied for a .docx file and the docx library is available, running type-specific structural validation.

File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 276 - 308
Complexity: moderate

Purpose

This function is a validation utility for file upload systems. It checks that an uploaded document is within the configured size limit and has an allowed file extension. For .docx files it can also validate the internal document structure, provided a doc_type is supplied and the DOCX_AVAILABLE flag is True. It returns a validity flag together with a list of human-readable issues, which makes it suitable for user-facing error reporting in document management systems. All checks run even after one fails, so the list can contain several issues at once.

Source Code

def validate_document(file_content: bytes, file_name: str, 
                     doc_type: Optional[str] = None) -> Tuple[bool, List[str]]:
    """
    Validate document content and structure.
    
    Args:
        file_content: Binary content of the file
        file_name: Name of the file
        doc_type: Optional document type for type-specific validation
        
    Returns:
        Tuple of (is_valid, list_of_issues)
    """
    issues = []
    
    # Check file size
    file_size = len(file_content)
    max_size = settings.MAX_DOCUMENT_SIZE_MB * 1024 * 1024
    if file_size > max_size:
        issues.append(f"File size ({file_size / (1024*1024):.2f} MB) exceeds maximum allowed size ({settings.MAX_DOCUMENT_SIZE_MB} MB)")
    
    # Check file extension
    _, ext = os.path.splitext(file_name)
    ext = ext.lower()
    if ext not in settings.ALLOWED_DOCUMENT_EXTENSIONS:
        issues.append(f"File type {ext} is not allowed. Allowed types: {', '.join(settings.ALLOWED_DOCUMENT_EXTENSIONS)}")
    
    # Type-specific validation
    if doc_type and ext == '.docx' and DOCX_AVAILABLE:
        type_issues = validate_document_structure(file_content, doc_type)
        issues.extend(type_issues)
    
    return len(issues) == 0, issues

Parameters

Name          Type           Default  Kind
file_content  bytes          -        positional_or_keyword
file_name     str            -        positional_or_keyword
doc_type      Optional[str]  None     positional_or_keyword

Parameter Details

file_content: Binary content of the document file as bytes. This is the raw file data that will be validated for size and potentially parsed for structural validation. Should be the complete file content read in binary mode.

file_name: String containing the name of the file including its extension (e.g., 'report.docx'). Used to extract and validate the file extension against allowed types. The actual path is not required, just the filename with extension.

doc_type: Optional string naming the document type whose validation rules should apply. Structural validation via validate_document_structure runs only when all three conditions hold: doc_type is provided, the file extension is .docx, and the DOCX_AVAILABLE flag is True. If any condition fails, the structural check is skipped silently and no issue is reported for it. Pass None (the default) to skip type-specific validation.
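
The three-part condition that gates structural validation can be sketched in isolation. This is a minimal illustration, not the module's code: the helper name should_run_structure_check and its docx_available parameter are hypothetical stand-ins for the inline condition and the module-level DOCX_AVAILABLE flag.

```python
import os
from typing import Optional


def should_run_structure_check(file_name: str, doc_type: Optional[str],
                               docx_available: bool) -> bool:
    """Mirror the condition that guards type-specific validation."""
    # The extension is lowercased before comparison, as in validate_document.
    _, ext = os.path.splitext(file_name)
    # doc_type is tested for truthiness, so an empty string also skips the check.
    return bool(doc_type) and ext.lower() == '.docx' and docx_available


# A .docx file with a doc_type and the library available: check runs.
print(should_run_structure_check('report.docx', 'technical_report', True))
# Same doc_type but a .pdf file: check is skipped.
print(should_run_structure_check('report.pdf', 'technical_report', True))
# No doc_type: check is skipped even for a .docx file.
print(should_run_structure_check('report.docx', None, True))
```

Note that a skipped check does not add anything to the issues list, so a .docx file can be reported as valid without its structure ever having been inspected.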

Return Value

Type: Tuple[bool, List[str]]

Returns a tuple containing two elements: (1) A boolean indicating whether the document is valid (True if no issues found, False otherwise), and (2) A list of strings where each string describes a specific validation issue found. An empty list indicates no issues. Issues can include file size violations, unsupported file extensions, or type-specific structural problems.
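
One way a caller might turn the returned tuple into a user-facing response is sketched below. The function name build_upload_response and the dictionary keys are assumptions made for illustration; they are not part of the CDocs code.

```python
from typing import Any, Dict, List, Tuple


def build_upload_response(result: Tuple[bool, List[str]]) -> Dict[str, Any]:
    """Convert a (is_valid, issues) tuple into a simple response payload."""
    is_valid, issues = result
    if is_valid:
        # An empty issues list always accompanies a True flag.
        return {'status': 'accepted', 'errors': []}
    # Copy the list so later changes by the caller do not affect the payload.
    return {'status': 'rejected', 'errors': list(issues)}


# Example tuples shaped like the return value of validate_document.
print(build_upload_response((True, [])))
print(build_upload_response((False, ['File type .exe is not allowed.'])))
```

Because the boolean is derived from the list (len(issues) == 0), a caller could rely on the list alone; unpacking both keeps the intent explicit.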

Dependencies

  • os
  • typing
  • docx
  • PyPDF2

Required Imports

import os
from typing import Tuple, List, Optional

Conditional/Optional Imports

These imports are only needed under specific conditions:

from CDocs.config import settings
Condition: Required for accessing MAX_DOCUMENT_SIZE_MB and ALLOWED_DOCUMENT_EXTENSIONS configuration
Required (conditional)

import docx
Condition: Only used if DOCX_AVAILABLE flag is True and doc_type is provided with .docx files
Optional

from docx import Document as DocxDocument
Condition: Only used if DOCX_AVAILABLE flag is True and doc_type is provided with .docx files
Optional
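
The source shown here does not include the definition of DOCX_AVAILABLE. A common pattern for such a flag, given as an assumption rather than the module's actual code, is a guarded import:

```python
# Assumed pattern: set the flag according to whether the optional
# docx library (the python-docx package) can be imported.
try:
    import docx  # noqa: F401  (imported only to test availability)
    DOCX_AVAILABLE = True
except ImportError:
    DOCX_AVAILABLE = False

print(f'DOCX support available: {DOCX_AVAILABLE}')
```

With this pattern the module still imports cleanly when the library is missing, and validate_document simply skips the structural check.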

Usage Example

# Basic usage without type-specific validation
with open('document.pdf', 'rb') as f:
    file_content = f.read()

is_valid, issues = validate_document(
    file_content=file_content,
    file_name='document.pdf'
)

if is_valid:
    print('Document is valid')
else:
    print('Validation issues:')
    for issue in issues:
        print(f'  - {issue}')

# Usage with type-specific validation
with open('report.docx', 'rb') as f:
    file_content = f.read()

is_valid, issues = validate_document(
    file_content=file_content,
    file_name='report.docx',
    doc_type='technical_report'
)

if not is_valid:
    for issue in issues:
        print(f'Error: {issue}')

Best Practices

  • Always read files in binary mode ('rb') when passing file_content to this function
  • Handle the returned issues list to provide meaningful feedback to users about why validation failed
  • Ensure settings.MAX_DOCUMENT_SIZE_MB and settings.ALLOWED_DOCUMENT_EXTENSIONS are properly configured before using this function
  • The DOCX_AVAILABLE flag should be set based on whether the docx library is successfully imported to avoid runtime errors
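  • validate_document_structure must be defined for type-specific validation to work; it is called whenever doc_type is set, the extension is .docx, and DOCX_AVAILABLE is True
  • Extension matching is case-insensitive: the extension is lowercased before comparison, so entries in settings.ALLOWED_DOCUMENT_EXTENSIONS should be lowercase and include the leading dot (e.g., '.docx')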
  • The function does not modify the file_content, making it safe to use the same content after validation
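
The configuration and case-handling points above can be tried out with a small stand-in. In this sketch the settings object is a hypothetical replacement for CDocs.config.settings with illustrative values, and basic_checks is a hypothetical helper that reproduces only the size and extension checks with shortened messages.

```python
import os
from types import SimpleNamespace
from typing import List

# Hypothetical stand-in for CDocs.config.settings; values are illustrative.
settings = SimpleNamespace(
    MAX_DOCUMENT_SIZE_MB=1,
    ALLOWED_DOCUMENT_EXTENSIONS=['.pdf', '.docx'],
)


def basic_checks(file_content: bytes, file_name: str) -> List[str]:
    """Reproduce the size and extension checks from validate_document."""
    issues = []
    max_size = settings.MAX_DOCUMENT_SIZE_MB * 1024 * 1024
    if len(file_content) > max_size:
        issues.append('File size exceeds maximum allowed size')
    # splitext keeps the leading dot; lowercase before comparing.
    _, ext = os.path.splitext(file_name)
    ext = ext.lower()
    if ext not in settings.ALLOWED_DOCUMENT_EXTENSIONS:
        issues.append(f'File type {ext} is not allowed')
    return issues


small = b'x' * 10
large = b'x' * (1024 * 1024 + 1)  # one byte over the 1 MB limit

print(basic_checks(small, 'REPORT.DOCX'))  # uppercase extension is accepted
print(basic_checks(large, 'notes.txt'))    # size and extension both fail
print(basic_checks(small, 'README'))       # no extension: ext is ''
```

A file name without an extension yields an empty string from os.path.splitext, so it is rejected unless '' is explicitly listed among the allowed extensions.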

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function validate_document_structure 80.5% similar

    Validates the structural integrity of a DOCX document by checking if it contains all required sections specified in the document type template configuration.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function is_valid_document_file 67.9% similar

    Validates whether a given filename has an extension corresponding to a supported document type by checking against a predefined list of valid document extensions.

    From: /tf/active/vicechatdev/CDocs/utils/__init__.py
  • function validate_document_number 59.2% similar

    Validates a custom document number by checking its format, length constraints, and uniqueness in the database, returning a dictionary with validation results.

    From: /tf/active/vicechatdev/CDocs/controllers/document_controller.py
  • function test_docx_file 56.9% similar

    Tests the ability to open and read a Microsoft Word (.docx) document file, validating file existence, size, and content extraction capabilities.

    From: /tf/active/vicechatdev/docchat/test_problematic_files.py
  • function validate_and_fix_document_permissions 56.8% similar

    Validates and optionally fixes document sharing permissions for controlled documents in a Neo4j database, processing documents in configurable batches with detailed progress tracking and error handling.

    From: /tf/active/vicechatdev/CDocs/utils/sharing_validator.py