function validate_document
Validates document files by checking file size and extension, and optionally performs type-specific structural validation for .docx files.
/tf/active/vicechatdev/CDocs/utils/document_processor.py
276 - 308
moderate
Purpose
This function validates documents for file upload systems. It checks that an uploaded document is within the configured size limit and has an allowed file extension, and, for .docx files, it can optionally validate the internal document structure. It returns a validation status together with a list of the specific issues found, which makes it suitable for user-facing error reporting in document management systems.
Source Code
def validate_document(file_content: bytes, file_name: str,
                      doc_type: Optional[str] = None) -> Tuple[bool, List[str]]:
    """
    Validate document content and structure.

    Args:
        file_content: Binary content of the file
        file_name: Name of the file
        doc_type: Optional document type for type-specific validation

    Returns:
        Tuple of (is_valid, list_of_issues)
    """
    issues = []

    # Check file size
    file_size = len(file_content)
    max_size = settings.MAX_DOCUMENT_SIZE_MB * 1024 * 1024
    if file_size > max_size:
        issues.append(f"File size ({file_size / (1024*1024):.2f} MB) exceeds maximum allowed size ({settings.MAX_DOCUMENT_SIZE_MB} MB)")

    # Check file extension
    _, ext = os.path.splitext(file_name)
    ext = ext.lower()
    if ext not in settings.ALLOWED_DOCUMENT_EXTENSIONS:
        issues.append(f"File type {ext} is not allowed. Allowed types: {', '.join(settings.ALLOWED_DOCUMENT_EXTENSIONS)}")

    # Type-specific validation
    if doc_type and ext == '.docx' and DOCX_AVAILABLE:
        type_issues = validate_document_structure(file_content, doc_type)
        issues.extend(type_issues)

    return len(issues) == 0, issues
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_content | bytes | - | positional_or_keyword |
| file_name | str | - | positional_or_keyword |
| doc_type | Optional[str] | None | positional_or_keyword |
Parameter Details
file_content: Binary content of the document file as bytes. This is the raw file data that will be validated for size and potentially parsed for structural validation. Should be the complete file content read in binary mode.
file_name: String containing the name of the file including its extension (e.g., 'report.docx'). Used to extract and validate the file extension against allowed types. The actual path is not required, just the filename with extension.
doc_type: Optional string specifying the document type for type-specific validation rules. When it is provided, the file has a .docx extension, and the DOCX_AVAILABLE flag is True, the function runs additional structural validation via validate_document_structure. Pass None to skip type-specific validation.
Return Value
Type: Tuple[bool, List[str]]
Returns a tuple containing two elements: (1) A boolean indicating whether the document is valid (True if no issues found, False otherwise), and (2) A list of strings where each string describes a specific validation issue found. An empty list indicates no issues. Issues can include file size violations, unsupported file extensions, or type-specific structural problems.
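To make the shape of the return value concrete, here is a minimal, self-contained sketch that mirrors only the size and extension checks. It is not the function from document_processor.py: the name check_size_and_extension, the shortened issue messages, and the settings values (a 1 MB limit and two allowed extensions) are hypothetical stand-ins chosen for illustration.

```python
import os
from types import SimpleNamespace
from typing import List, Tuple

# Hypothetical stand-in for CDocs.config.settings; the real values may differ.
settings = SimpleNamespace(
    MAX_DOCUMENT_SIZE_MB=1,
    ALLOWED_DOCUMENT_EXTENSIONS=[".pdf", ".docx"],
)


def check_size_and_extension(file_content: bytes, file_name: str) -> Tuple[bool, List[str]]:
    """Mirror only the size and extension checks of validate_document."""
    issues: List[str] = []

    # Size check: compare the byte length against the limit converted to bytes
    max_size = settings.MAX_DOCUMENT_SIZE_MB * 1024 * 1024
    if len(file_content) > max_size:
        issues.append("File size exceeds maximum allowed size")

    # Extension check: splitext keeps the leading dot, lower() normalizes case
    _, ext = os.path.splitext(file_name)
    ext = ext.lower()
    if ext not in settings.ALLOWED_DOCUMENT_EXTENSIONS:
        issues.append(f"File type {ext} is not allowed")

    return len(issues) == 0, issues


# A small file with an allowed (upper-case) extension passes with no issues
ok_result = check_size_and_extension(b"%PDF-1.4", "Report.PDF")

# A 2 MB file with a disallowed extension collects one issue per failed check
bad_result = check_size_and_extension(b"x" * (2 * 1024 * 1024), "notes.exe")
```

Under these assumed settings, ok_result is (True, []) and bad_result carries False plus two issue strings, one for each failed check. Issues accumulate rather than stopping at the first failure, so a caller can report all problems at once.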
Dependencies
os, typing, docx, PyPDF2
Required Imports
import os
from typing import Tuple, List, Optional
Conditional/Optional Imports
These imports are only needed under specific conditions:
from CDocs.config import settings
Condition: Required for accessing MAX_DOCUMENT_SIZE_MB and ALLOWED_DOCUMENT_EXTENSIONS configuration
Required (conditional)
import docx
Condition: Only used if DOCX_AVAILABLE flag is True and doc_type is provided with .docx files
Optional
from docx import Document as DocxDocument
Condition: Only used if DOCX_AVAILABLE flag is True and doc_type is provided with .docx files
Optional
Usage Example
# Basic usage without type-specific validation
with open('document.pdf', 'rb') as f:
    file_content = f.read()

is_valid, issues = validate_document(
    file_content=file_content,
    file_name='document.pdf'
)

if is_valid:
    print('Document is valid')
else:
    print('Validation issues:')
    for issue in issues:
        print(f' - {issue}')

# Usage with type-specific validation
with open('report.docx', 'rb') as f:
    file_content = f.read()

is_valid, issues = validate_document(
    file_content=file_content,
    file_name='report.docx',
    doc_type='technical_report'
)

if not is_valid:
    for issue in issues:
        print(f'Error: {issue}')
Best Practices
- Always read files in binary mode ('rb') when passing file_content to this function
- Handle the returned issues list to provide meaningful feedback to users about why validation failed
- Ensure settings.MAX_DOCUMENT_SIZE_MB and settings.ALLOWED_DOCUMENT_EXTENSIONS are properly configured before using this function
- The DOCX_AVAILABLE flag should be set based on whether the docx library is successfully imported to avoid runtime errors
- Consider implementing the validate_document_structure function if you need type-specific validation for .docx files
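- File extension matching is case-insensitive: the extension is lowercased before the check, so entries in settings.ALLOWED_DOCUMENT_EXTENSIONS should be lowercase and include the leading dot (e.g., '.docx')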
- The function does not modify the file_content, making it safe to use the same content after validation
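The source does not show how DOCX_AVAILABLE is set. One common way to derive such a flag is a guarded import, sketched below under that assumption. The helper can_run_structure_check is hypothetical and only restates the three-part condition that validate_document uses before calling validate_document_structure.

```python
import os
from typing import Optional

# Guarded import: record whether the docx module (python-docx package) can be
# imported instead of letting a missing dependency fail at module load time.
try:
    import docx  # noqa: F401
    DOCX_AVAILABLE = True
except ImportError:
    DOCX_AVAILABLE = False


def can_run_structure_check(file_name: str, doc_type: Optional[str] = None) -> bool:
    """Hypothetical helper: True only if doc_type is set, the file is .docx,
    and the docx library was imported successfully."""
    _, ext = os.path.splitext(file_name)
    return bool(doc_type) and ext.lower() == ".docx" and DOCX_AVAILABLE
```

With this pattern, a missing library silently skips structural validation rather than raising, so size and extension checks still run on .docx uploads.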
Similar Components
AI-powered semantic similarity - components with related functionality:
- function validate_document_structure 80.5% similar
- function is_valid_document_file 67.9% similar
- function validate_document_number 59.2% similar
- function test_docx_file 56.9% similar
- function validate_and_fix_document_permissions 56.8% similar