function api_upload_document

Maturity: 54

Flask API endpoint that handles document upload, validates file type and size, processes the document to extract text content, and stores the document metadata in the system.

File: /tf/active/vicechatdev/vice_ai/app.py
Lines: 1345 - 1426
Complexity: complex

Purpose

This endpoint serves as the primary document ingestion point for the application. It accepts file uploads via HTTP POST, validates them against allowed formats and size limits, extracts text content using a document processor, generates unique identifiers, and persists the document information for later retrieval. It's designed for authenticated users to upload business documents (PDF, Office formats, etc.) for processing by the RAG system.
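
The validation steps described above can be sketched as a standalone helper. This is a minimal illustration rather than the code in app.py: the function name `validate_upload` and the constant names are hypothetical, while the extension set, the 10 MB limit, and the error messages are taken from the endpoint's source.

```python
import os

# Values copied from the endpoint's source; the names are hypothetical.
ALLOWED_EXTENSIONS = {'.pdf', '.doc', '.docx', '.xls', '.xlsx',
                      '.ppt', '.pptx', '.rtf', '.odt'}
MAX_FILE_SIZE = 10 * 1024 * 1024  # 10 MB


def validate_upload(filename, size_bytes):
    """Return an error message, or None if the upload passes validation."""
    if not filename:
        return 'No file selected'
    # Compare the lowercased extension so 'REPORT.PDF' is accepted.
    ext = os.path.splitext(filename)[1].lower()
    if ext not in ALLOWED_EXTENSIONS:
        return f'File type not supported: {ext}'
    # The limit is inclusive: exactly 10 MB is still allowed.
    if size_bytes > MAX_FILE_SIZE:
        return 'File too large (max 10MB)'
    return None
```

Keeping the checks in a pure function like this would make them testable without a Flask request context; the endpoint itself performs the same checks inline.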

Source Code

def api_upload_document():
    """Upload and process a document"""
    try:
        if 'file' not in request.files:
            return jsonify({'error': 'No file provided'}), 400
        
        file = request.files['file']
        if file.filename == '':
            return jsonify({'error': 'No file selected'}), 400
        
        # Validate file type
        allowed_extensions = {'.pdf', '.doc', '.docx', '.xls', '.xlsx', '.ppt', '.pptx', '.rtf', '.odt'}
        file_ext = os.path.splitext(file.filename)[1].lower()
        
        if file_ext not in allowed_extensions:
            return jsonify({'error': f'File type not supported: {file_ext}'}), 400
        
        # Validate file size (10MB limit)
        file.seek(0, os.SEEK_END)
        file_size = file.tell()
        file.seek(0)
        
        if file_size > 10 * 1024 * 1024:  # 10MB
            return jsonify({'error': 'File too large (max 10MB)'}), 400
        
        # Generate unique document ID and secure filename
        document_id = str(uuid.uuid4())
        filename = secure_filename(file.filename)
        
        # Create temp file
        temp_dir = tempfile.mkdtemp()
        file_path = os.path.join(temp_dir, f"{document_id}_{filename}")
        
        # Save file
        file.save(file_path)
        
        # Process document
        logger.info(f"Processing uploaded document: {filename}")
        result = document_processor.process_document(file_path)
        
        if 'error' in result:
            # Clean up file on error
            try:
                os.remove(file_path)
                os.rmdir(temp_dir)
            except OSError:  # best-effort cleanup; ignore filesystem errors
                pass
            return jsonify({'error': result['error']}), 500
        
        # Get combined text content
        text_content = document_processor.get_combined_text(result)
        
        if not text_content:
            # Clean up file if no content extracted
            try:
                os.remove(file_path)
                os.rmdir(temp_dir)
            except OSError:  # best-effort cleanup; ignore filesystem errors
                pass
            return jsonify({'error': 'No text content could be extracted from the document'}), 400
        
        # Store document information
        user_email = session['user'].get('email', 'unknown')
        metadata = result.get('metadata', {})
        metadata['original_filename'] = file.filename
        
        store_document(user_email, document_id, file_path, text_content, metadata)
        
        logger.info(f"✅ Document processed successfully: {filename} ({len(text_content)} characters)")
        
        return jsonify({
            'document_id': document_id,
            'filename': filename,
            'text_content': text_content[:500] + '...' if len(text_content) > 500 else text_content,  # Preview
            'size': file_size,
            'text_length': len(text_content),
            'metadata': metadata
        })
        
    except Exception as e:
        logger.error(f"Document upload error: {e}")
        return jsonify({'error': 'Failed to process document'}), 500

Return Value

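Returns a Flask JSON response. On success (200, returned without an explicit status code): {'document_id': str (UUID), 'filename': str (sanitized by secure_filename), 'text_content': str (preview: the first 500 characters followed by '...' when the text is longer, otherwise the full text), 'size': int (bytes), 'text_length': int (full text length), 'metadata': dict}. On error, returns a (response, status) tuple: {'error': str (error message)} with status 400 (missing file, unsupported type, file over 10MB, no extractable text) or 500 (processor error or unexpected exception).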
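
The preview rule for 'text_content' can be isolated as a small function. This is a sketch of the same slicing expression used in the endpoint; the names `build_preview` and `PREVIEW_LIMIT` are hypothetical.

```python
PREVIEW_LIMIT = 500  # matches the slice length used in the endpoint


def build_preview(text, limit=PREVIEW_LIMIT):
    """Return text unchanged if short enough, else the first `limit` chars plus '...'."""
    return text[:limit] + '...' if len(text) > limit else text
```

Because the ellipsis is appended after slicing, a truncated preview is 503 characters long, so clients should not assume the field never exceeds 500 characters.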

Dependencies

  • flask
  • werkzeug
  • uuid
  • os
  • tempfile
  • logging

Required Imports

from flask import Flask, request, jsonify, session
from werkzeug.utils import secure_filename
import os
import uuid
import tempfile
import logging

Conditional/Optional Imports

These imports are only needed under specific conditions:

from document_processor import DocumentProcessor

Condition (required): needed for document processing functionality; an instance must be available in the application context as 'document_processor'.

from auth.azure_auth import AzureSSO

Condition (required): needed for the @require_auth decorator to function; authentication must be configured.

from hybrid_rag_engine import OneCo_hybrid_RAG

Condition (optional): may be used by document_processor or the store_document function.

Usage Example

// Client-side usage example (JavaScript fetch)
const formData = new FormData();
formData.append('file', fileInput.files[0]);

fetch('/api/upload-document', {
  method: 'POST',
  body: formData,
  credentials: 'include'
})
.then(response => response.json())
.then(data => {
  if (data.error) {
    console.error('Upload failed:', data.error);
  } else {
    console.log('Document uploaded:', data.document_id);
    console.log('Preview:', data.text_content);
  }
});

# Python requests example
import requests

# Placeholder: supply the session cookie value of an authenticated user
session_cookie = '...'

with open('document.pdf', 'rb') as f:
    response = requests.post(
        'http://localhost:5000/api/upload-document',
        files={'file': f},  # the endpoint expects the form key 'file'
        cookies={'session': session_cookie}
    )
    result = response.json()
    print(f"Document ID: {result.get('document_id')}")
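
A client will usually want to branch on the status code as well as the JSON body. The helper below is a hedged sketch of one way to do that; the name `describe_upload_result` and the message wording are hypothetical, and only the 200/400/500 statuses and the 'error' and 'document_id' keys come from the endpoint.

```python
def describe_upload_result(status_code, payload):
    """Turn an upload response (status code + parsed JSON) into a short message."""
    if status_code == 200 and 'document_id' in payload:
        return f"uploaded {payload['document_id']}"
    message = payload.get('error', 'unknown error')
    if status_code == 400:
        # Validation failure: bad type, oversized file, or no extractable text.
        return f"rejected: {message}"
    # 500 from the endpoint, or any other status (for example from auth middleware).
    return f"failed ({status_code}): {message}"
```

With the requests example above it could be called as describe_upload_result(response.status_code, response.json()).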

Best Practices

  • Always send files as multipart/form-data with the key 'file'
  • Ensure user is authenticated before calling this endpoint (handled by @require_auth decorator)
  • Supported file types: .pdf, .doc, .docx, .xls, .xlsx, .ppt, .pptx, .rtf, .odt
  • Maximum file size is 10MB - larger files will be rejected
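  • Temporary files are removed when the processor returns an error or no text can be extracted; an unexpected exception raised after the file is saved leaves the temporary file in place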
  • Document IDs are UUIDs and should be stored for later reference
  • Text content preview is limited to 500 characters in response, full content is stored
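  • Handle both error classes: 400 for validation failures (missing file, unsupported type, oversized file, no extractable text) and 500 for processing failures; both carry an 'error' message in the JSON body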
  • The function requires document_processor and store_document to be properly initialized in the application context
  • Temporary files are stored in system temp directory - ensure adequate disk space
  • Original filename is preserved in metadata; the stored file is named with the document UUID followed by the secure_filename-sanitized name
  • Empty files or files with no extractable text will be rejected with 400 error
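
The temporary-file handling noted above could be made robust against unexpected exceptions with a pattern like the following. This is a sketch under stated assumptions, not the endpoint's code: `save_and_process` is a hypothetical helper, and `process` stands in for a call such as document_processor.process_document. The file is kept on success, as in the endpoint, because its path is later handed to store_document.

```python
import os
import shutil
import tempfile


def save_and_process(data, filename, process):
    """Write data to a fresh temp dir and run process(path).

    On success the file is kept and (path, result) is returned.
    On any exception the whole temp dir is removed before re-raising.
    """
    temp_dir = tempfile.mkdtemp()
    path = os.path.join(temp_dir, filename)
    try:
        with open(path, 'wb') as fh:
            fh.write(data)
        return path, process(path)
    except Exception:
        # Remove the directory and its contents; ignore cleanup failures.
        shutil.rmtree(temp_dir, ignore_errors=True)
        raise
```

Using shutil.rmtree on the directory avoids the two-step os.remove plus os.rmdir sequence and also handles the case where the file was never written.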

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function api_upload_document_v1 89.1% similar

    Flask API endpoint that handles document file uploads, validates file type and size, stores the file temporarily, and extracts basic text content for processing.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function api_upload 87.7% similar

    Flask API endpoint that handles file uploads, validates file types, saves files to a configured directory structure, and automatically indexes the uploaded document for search/retrieval.

    From: /tf/active/vicechatdev/docchat/app.py
  • function upload_document 85.7% similar

    Flask route handler that processes file uploads, saves them securely to disk, and indexes the document content for retrieval-augmented generation (RAG) search.

    From: /tf/active/vicechatdev/docchat/blueprint.py
  • function api_chat_upload_document 78.9% similar

    Flask API endpoint that handles document upload for chat context, processes the document to extract text content, and stores it for later retrieval in chat sessions.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function api_delete_chat_uploaded_document 73.7% similar

    Flask API endpoint that deletes a user's uploaded document by document ID, requiring authentication and returning success/error responses.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py