api_upload_document - Code Extractor

function api_upload_document

Maturity: 54

Flask API endpoint that handles document upload, validates file type and size, processes the document to extract text content, and stores the document metadata in the system.

File:
/tf/active/vicechatdev/vice_ai/app.py

Lines:
1345 - 1426

Complexity:
complex

Purpose

This endpoint serves as the primary document ingestion point for the application. It accepts file uploads via HTTP POST, validates them against allowed formats and size limits, extracts text content using a document processor, generates unique identifiers, and persists the document information for later retrieval. It's designed for authenticated users to upload business documents (PDF, Office formats, etc.) for processing by the RAG system.
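
The validation step described above can be sketched as a standalone function. This is a minimal illustration, not the application's code: the helper name validate_upload and the constant names are hypothetical, while the extension set, the 10MB limit, and the error messages are copied from the endpoint.

```python
import os

# Values mirror the endpoint's checks; the names are hypothetical.
ALLOWED_EXTENSIONS = {'.pdf', '.doc', '.docx', '.xls', '.xlsx',
                      '.ppt', '.pptx', '.rtf', '.odt'}
MAX_UPLOAD_BYTES = 10 * 1024 * 1024  # 10MB


def validate_upload(filename, size_bytes):
    """Return an error message, or None if the upload passes every check."""
    if not filename:
        return 'No file selected'
    # Compare the lower-cased extension so 'REPORT.PDF' is accepted.
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f'File type not supported: {ext}'
    if size_bytes > MAX_UPLOAD_BYTES:
        return 'File too large (max 10MB)'
    return None


print(validate_upload('report.PDF', 1024))
print(validate_upload('notes.txt', 1024))
print(validate_upload('big.docx', MAX_UPLOAD_BYTES + 1))
```

Running the same checks on the client before sending a request can avoid a wasted upload, but the server-side checks remain the authoritative ones.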

Source Code

def api_upload_document():
    """Upload and process a document"""
    try:
        if 'file' not in request.files:
            return jsonify({'error': 'No file provided'}), 400
        
        file = request.files['file']
        if file.filename == '':
            return jsonify({'error': 'No file selected'}), 400
        
        # Validate file type
        allowed_extensions = {'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.rtf', '.odt'}
        file_ext = os.path.splitext(file.filename)[1].lower()
        
        if file_ext not in allowed_extensions:
            return jsonify({'error': f'File type not supported: {file_ext}'}), 400
        
        # Validate file size (10MB limit)
        file.seek(0, os.SEEK_END)
        file_size = file.tell()
        file.seek(0)
        
        if file_size > 10 * 1024 * 1024:  # 10MB
            return jsonify({'error': 'File too large (max 10MB)'}), 400
        
        # Generate unique document ID and secure filename
        document_id = str(uuid.uuid4())
        filename = secure_filename(file.filename)
        
        # Create temp file
        temp_dir = tempfile.mkdtemp()
        file_path = os.path.join(temp_dir, f"{document_id}_{filename}")
        
        # Save file
        file.save(file_path)
        
        # Process document
        logger.info(f"Processing uploaded document: {filename}")
        result = document_processor.process_document(file_path)
        
        if 'error' in result:
            # Clean up file on error
            try:
                os.remove(file_path)
                os.rmdir(temp_dir)
            except:
                pass
            return jsonify({'error': result['error']}), 500
        
        # Get combined text content
        text_content = document_processor.get_combined_text(result)
        
        if not text_content:
            # Clean up file if no content extracted
            try:
                os.remove(file_path)
                os.rmdir(temp_dir)
            except:
                pass
            return jsonify({'error': 'No text content could be extracted from the document'}), 400
        
        # Store document information
        user_email = session['user'].get('email', 'unknown')
        metadata = result.get('metadata', {})
        metadata['original_filename'] = file.filename
        
        store_document(user_email, document_id, file_path, text_content, metadata)
        
        logger.info(f"✅ Document processed successfully: {filename} ({len(text_content)} characters)")
        
        return jsonify({
            'document_id': document_id,
            'filename': filename,
            'text_content': text_content[:500] + '...' if len(text_content) > 500 else text_content,  # Preview
            'size': file_size,
            'text_length': len(text_content),
            'metadata': metadata
        })
        
    except Exception as e:
        logger.error(f"Document upload error: {e}")
        return jsonify({'error': 'Failed to process document'}), 500

Return Value

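Returns a Flask JSON response. On success the function returns the response alone, so Flask applies the default 200 status: {'document_id': str (UUID), 'filename': str (sanitized by secure_filename), 'text_content': str (preview: the first 500 characters followed by '...' when the text is longer, otherwise the full text), 'size': int (bytes), 'text_length': int (full text length), 'metadata': dict}. On error it returns a (response, status) tuple whose body is {'error': str (error message)}: 400 for validation failures (missing file, empty filename, unsupported extension, file over 10MB, no extractable text) and 500 for processor errors and unexpected exceptions.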
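
The preview rule for 'text_content' can be illustrated in isolation. This is a minimal sketch: the helper name make_preview and its limit parameter are hypothetical, while the expression is the one the endpoint uses.

```python
def make_preview(text, limit=500):
    """Truncate to `limit` characters and append '...' only when text is longer."""
    return text[:limit] + '...' if len(text) > limit else text


exact = make_preview('a' * 500)      # at the limit: returned unchanged
truncated = make_preview('a' * 501)  # over the limit: 500 chars plus '...'
print(len(exact), len(truncated))    # 500 503
```

A truncated preview can therefore be up to 503 characters long, so clients should rely on 'text_length' rather than the preview to learn the size of the full text.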

Dependencies

flask
werkzeug
uuid
os
tempfile
logging

Required Imports

from flask import Flask, request, jsonify, session
from werkzeug.utils import secure_filename
import os
import uuid
import tempfile
import logging

Conditional/Optional Imports

These imports are only needed under specific conditions:

from document_processor import DocumentProcessor

Condition: Required for document processing functionality - must be available in the application context as 'document_processor' instance

Required (conditional)

from auth.azure_auth import AzureSSO

Condition: Required for the @require_auth decorator to function - must be configured for authentication

Required (conditional)

from hybrid_rag_engine import OneCo_hybrid_RAG

Condition: May be used by document_processor or store_document function

Optional

Usage Example

// Client-side usage example (JavaScript fetch)
const formData = new FormData();
formData.append('file', fileInput.files[0]);

fetch('/api/upload-document', {
  method: 'POST',
  body: formData,
  credentials: 'include'
})
.then(response => response.json())
.then(data => {
  if (data.error) {
    console.error('Upload failed:', data.error);
  } else {
    console.log('Document uploaded:', data.document_id);
    console.log('Preview:', data.text_content);
  }
});

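# Python requests example
import requests

# Placeholder: value of the Flask session cookie from an authenticated login
session_cookie = '<session cookie value>'

with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:5000/api/upload-document',
        files={'file': f},  # multipart/form-data with the key 'file'
        cookies={'session': session_cookie},
        timeout=60,
    )

result = response.json()
if response.ok:
    print(f"Document ID: {result['document_id']}")
else:
    print(f"Upload failed ({response.status_code}): {result.get('error')}")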

Best Practices

Always send files as multipart/form-data with the key 'file'
Ensure user is authenticated before calling this endpoint (handled by @require_auth decorator)
Supported file types: .pdf, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .rtf, .odt
Maximum file size is 10MB - larger files will be rejected
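The endpoint removes the temporary file and directory when the processor reports an error or when no text can be extracted; an unexpected exception raised after the file is saved reaches the generic 500 handler without cleanup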
Document IDs are UUIDs and should be stored for later reference
Text content preview is limited to 500 characters in response, full content is stored
Handle both validation errors (400) and processing errors (500) appropriately
The function requires document_processor and store_document to be properly initialized in the application context
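Uploaded files are saved in a directory created by tempfile.mkdtemp() and are kept after a successful upload (the path is passed to store_document), so ensure adequate disk space
The original filename is preserved in metadata, while the stored file is named with the document UUID followed by the secure_filename-sanitized name
Files with no extractable text are rejected with a 400 error; a file the processor cannot handle returns 500 with the processor's error message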
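
In the source code, cleanup runs only on the two handled failure paths, so an exception raised after the file is saved leaves the temporary directory behind. One way to cover every failure path is a context manager that removes the directory whenever its body raises. This is a minimal sketch, not the application's code: upload_workspace and keep_on_success are hypothetical names, and only standard-library calls are used.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def upload_workspace(keep_on_success=True):
    """Yield a fresh temp directory and remove it if the body raises.

    keep_on_success=True mimics the endpoint, which leaves the saved file
    in place after a successful upload so the stored path stays valid.
    """
    temp_dir = tempfile.mkdtemp()
    try:
        yield temp_dir
    except BaseException:
        # Any failure in the body: delete the directory and its contents.
        shutil.rmtree(temp_dir, ignore_errors=True)
        raise
    else:
        if not keep_on_success:
            shutil.rmtree(temp_dir, ignore_errors=True)


# Demonstration: a simulated processing failure triggers cleanup.
try:
    with upload_workspace() as temp_dir:
        path = os.path.join(temp_dir, 'example.pdf')
        with open(path, 'wb') as fh:
            fh.write(b'%PDF-')
        raise RuntimeError('simulated processing failure')
except RuntimeError:
    pass

print(os.path.exists(temp_dir))  # False
```

Using shutil.rmtree instead of the os.remove and os.rmdir pair also handles the case where the processor has written extra files into the directory.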

Similar Components

AI-powered semantic similarity - components with related functionality:

function api_upload_document_v1 89.1% similar

Flask API endpoint that handles document file uploads, validates file type and size, stores the file temporarily, and extracts basic text content for processing.
From: /tf/active/vicechatdev/vice_ai/new_app.py

function api_upload 87.7% similar

Flask API endpoint that handles file uploads, validates file types, saves files to a configured directory structure, and automatically indexes the uploaded document for search/retrieval.
From: /tf/active/vicechatdev/docchat/app.py

function upload_document 85.7% similar

Flask route handler that processes file uploads, saves them securely to disk, and indexes the document content for retrieval-augmented generation (RAG) search.
From: /tf/active/vicechatdev/docchat/blueprint.py

function api_chat_upload_document 78.9% similar

Flask API endpoint that handles document upload for chat context, processes the document to extract text content, and stores it for later retrieval in chat sessions.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function api_delete_chat_uploaded_document 73.7% similar

Flask API endpoint that deletes a user's uploaded document by document ID, requiring authentication and returning success/error responses.
From: /tf/active/vicechatdev/vice_ai/complex_app.py
