function api_upload_document
Flask API endpoint that handles document upload, validates file type and size, processes the document to extract text content, and stores the document metadata in the system.
/tf/active/vicechatdev/vice_ai/app.py
1345 - 1426
complex
Purpose
This endpoint serves as the primary document ingestion point for the application. It accepts file uploads via HTTP POST, validates them against allowed formats and size limits, extracts text content using a document processor, generates unique identifiers, and persists the document information for later retrieval. It's designed for authenticated users to upload business documents (PDF, Office formats, etc.) for processing by the RAG system.
Source Code
def api_upload_document():
    """Upload and process a document"""
    try:
        if 'file' not in request.files:
            return jsonify({'error': 'No file provided'}), 400
        file = request.files['file']
        if file.filename == '':
            return jsonify({'error': 'No file selected'}), 400
        # Validate file type
        allowed_extensions = {'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.rtf', '.odt'}
        file_ext = os.path.splitext(file.filename)[1].lower()
        if file_ext not in allowed_extensions:
            return jsonify({'error': f'File type not supported: {file_ext}'}), 400
        # Validate file size (10MB limit)
        file.seek(0, os.SEEK_END)
        file_size = file.tell()
        file.seek(0)
        if file_size > 10 * 1024 * 1024:  # 10MB
            return jsonify({'error': 'File too large (max 10MB)'}), 400
        # Generate unique document ID and secure filename
        document_id = str(uuid.uuid4())
        filename = secure_filename(file.filename)
        # Create temp file
        temp_dir = tempfile.mkdtemp()
        file_path = os.path.join(temp_dir, f"{document_id}_{filename}")
        # Save file
        file.save(file_path)
        # Process document
        logger.info(f"Processing uploaded document: {filename}")
        result = document_processor.process_document(file_path)
        if 'error' in result:
            # Clean up file on error
            try:
                os.remove(file_path)
                os.rmdir(temp_dir)
            except:
                pass
            return jsonify({'error': result['error']}), 500
        # Get combined text content
        text_content = document_processor.get_combined_text(result)
        if not text_content:
            # Clean up file if no content extracted
            try:
                os.remove(file_path)
                os.rmdir(temp_dir)
            except:
                pass
            return jsonify({'error': 'No text content could be extracted from the document'}), 400
        # Store document information
        user_email = session['user'].get('email', 'unknown')
        metadata = result.get('metadata', {})
        metadata['original_filename'] = file.filename
        store_document(user_email, document_id, file_path, text_content, metadata)
        logger.info(f"✅ Document processed successfully: {filename} ({len(text_content)} characters)")
        return jsonify({
            'document_id': document_id,
            'filename': filename,
            'text_content': text_content[:500] + '...' if len(text_content) > 500 else text_content,  # Preview
            'size': file_size,
            'text_length': len(text_content),
            'metadata': metadata
        })
    except Exception as e:
        logger.error(f"Document upload error: {e}")
        return jsonify({'error': 'Failed to process document'}), 500
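The function removes the temporary file only on its two explicit failure branches, so an exception raised elsewhere (for example while storing the document) would leave the temporary directory behind. One possible way to make cleanup unconditional is a small context manager. The following is a minimal sketch under that assumption, not the application's code; the names upload_workspace, keep, and demo are hypothetical.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def upload_workspace():
    """Yield (temp_dir, state); delete temp_dir on exit unless state['keep'] is set.

    Hypothetical helper for illustration only.
    """
    temp_dir = tempfile.mkdtemp()
    state = {"keep": False}
    try:
        yield temp_dir, state
    finally:
        if not state["keep"]:
            # rmtree removes the directory and anything saved inside it
            shutil.rmtree(temp_dir, ignore_errors=True)


def demo(fail: bool) -> str:
    """Simulate an upload that either fails or is handed off to storage."""
    created = ""
    try:
        with upload_workspace() as (temp_dir, state):
            created = temp_dir
            file_path = os.path.join(temp_dir, "example.pdf")
            with open(file_path, "wb") as fh:
                fh.write(b"%PDF-1.4")
            if fail:
                raise RuntimeError("simulated processing failure")
            # Success: the stored document still refers to this file, so keep it
            state["keep"] = True
    except RuntimeError:
        pass
    return created
```

With this shape, every failure path cleans up the directory, and only the success path opts out because the saved file is still needed after the request completes.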
Return Value
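Returns a Flask JSON response. On success (HTTP 200): {'document_id': str (UUID), 'filename': str (sanitized with secure_filename), 'text_content': str (preview: the first 500 characters followed by '...' when the text is longer, otherwise the full text), 'size': int (bytes), 'text_length': int (full text length), 'metadata': dict}. On error, returns {'error': str (error message)} together with a status code: 400 for validation failures (missing file, empty filename, unsupported extension, file over 10MB, no extractable text) and 500 for processing failures (an error reported by the document processor or any unhandled exception).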
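A caller can separate the success and error shapes described above by checking for the 'error' key. The following is a minimal sketch of such a client-side wrapper; UploadResult, parse_upload_response, and preview_is_truncated are hypothetical names and are not part of the application.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

PREVIEW_LIMIT = 500  # preview length used by the endpoint


@dataclass
class UploadResult:
    """Hypothetical client-side view of the endpoint's JSON response."""
    ok: bool
    status: int
    document_id: Optional[str] = None
    preview: Optional[str] = None
    text_length: Optional[int] = None
    error: Optional[str] = None


def parse_upload_response(status: int, payload: Dict[str, Any]) -> UploadResult:
    """Map the decoded JSON body and HTTP status onto an UploadResult."""
    if "error" in payload:
        return UploadResult(ok=False, status=status, error=str(payload["error"]))
    return UploadResult(
        ok=True,
        status=status,
        document_id=payload.get("document_id"),
        preview=payload.get("text_content"),
        text_length=payload.get("text_length"),
    )


def preview_is_truncated(result: UploadResult) -> bool:
    """True when the full text is longer than the preview limit."""
    return bool(result.ok and result.text_length is not None
                and result.text_length > PREVIEW_LIMIT)
```

Comparing text_length against the preview limit is more reliable than looking for a trailing '...', since a short document may legitimately end with an ellipsis.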
Dependencies
- flask
- werkzeug
- uuid
- os
- tempfile
- logging
Required Imports
from flask import Flask, request, jsonify, session
from werkzeug.utils import secure_filename
import os
import uuid
import tempfile
import logging
Conditional/Optional Imports
These imports are only needed under specific conditions:
from document_processor import DocumentProcessor
Condition: Required for document processing functionality - must be available in the application context as 'document_processor' instance
Required (conditional)
from auth.azure_auth import AzureSSO
Condition: Required for the @require_auth decorator to function - must be configured for authentication
Required (conditional)
from hybrid_rag_engine import OneCo_hybrid_RAG
Condition: May be used by document_processor or store_document function
Optional
Usage Example
// Client-side usage example (JavaScript fetch)
const formData = new FormData();
formData.append('file', fileInput.files[0]);
fetch('/api/upload-document', {
  method: 'POST',
  body: formData,
  credentials: 'include'
})
  .then(response => response.json())
  .then(data => {
    if (data.error) {
      console.error('Upload failed:', data.error);
    } else {
      console.log('Document uploaded:', data.document_id);
      console.log('Preview:', data.text_content);
    }
  });

# Python requests example
# session_cookie is a placeholder for the value of an authenticated session cookie
import requests
with open('document.pdf', 'rb') as f:
    files = {'file': f}
    response = requests.post(
        'http://localhost:5000/api/upload-document',
        files=files,
        cookies={'session': session_cookie}
    )
result = response.json()
print(f"Document ID: {result.get('document_id')}")
Best Practices
- Always send files as multipart/form-data with the key 'file'
- Ensure user is authenticated before calling this endpoint (handled by @require_auth decorator)
- Supported file types: .pdf, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .rtf, .odt
- Maximum file size is 10MB - larger files will be rejected
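- The endpoint removes the temporary file when the document processor reports an error or when no text can be extracted; on success the file is kept and its path is passed to store_document, and the generic exception handler performs no cleanup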
- Document IDs are UUIDs and should be stored for later reference
- Text content preview is limited to 500 characters in response, full content is stored
- Handle both validation errors (400) and processing errors (500) appropriately
- The function requires document_processor and store_document to be properly initialized in the application context
- Temporary files are stored in system temp directory - ensure adequate disk space
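- The original filename is preserved in metadata['original_filename']; the file on disk is named with the document UUID as a prefix followed by the filename sanitized with secure_filename
- Files with no extractable text are rejected with a 400 error; files that the document processor fails on (which may include empty files, depending on the processor) return a 500 error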
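The extension and size rules listed above can also be checked on the client before uploading, which avoids sending a file the endpoint would reject. The following is a minimal sketch that mirrors the server-side checks; precheck_upload and the two constants are hypothetical names, and the server remains the authority on what is accepted.

```python
import os
from typing import Optional

# Values copied from the endpoint's validation logic
ALLOWED_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx",
                      ".ppt", ".pptx", ".rtf", ".odt"}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10MB


def precheck_upload(filename: str, size_bytes: int) -> Optional[str]:
    """Return the error message the endpoint would send, or None if the file passes."""
    if not filename:
        return "No file selected"
    # The extension comparison is case-insensitive, as on the server
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f"File type not supported: {ext}"
    # Only sizes strictly above the limit are rejected
    if size_bytes > MAX_UPLOAD_BYTES:
        return "File too large (max 10MB)"
    return None
```

A file that passes this check can still be rejected by the server, for example when no text can be extracted from it.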
Similar Components
AI-powered semantic similarity - components with related functionality:
- function api_upload_document_v1 (89.1% similar)
- function api_upload (87.7% similar)
- function upload_document (85.7% similar)
- function api_chat_upload_document (78.9% similar)
- function api_delete_chat_uploaded_document (73.7% similar)