🔍 Code Extractor

function debug_download

Maturity: 46

A diagnostic function that downloads a PDF document from FileCloud, analyzes its content to verify it's a valid PDF, and tests text extraction capabilities.

File: /tf/active/vicechatdev/contract_validity_analyzer/debug_download.py
Lines: 16 - 119
Complexity: moderate

Purpose

This debugging utility helps troubleshoot FileCloud document download issues. It: 1) connects to FileCloud and downloads the first PDF returned by the search, 2) checks whether the downloaded bytes decode as UTF-8 text (unexpected text/HTML) or fail to decode (binary, likely a PDF), 3) scans decoded text for common error patterns (HTML markup, "Adobe Reader" or browser messages), 4) checks binary content for the %PDF magic number, and 5) tests the DocumentProcessor's ability to extract text from the downloaded file. It prints detailed console output and writes the content to a temporary file in the current directory. That file is deleted at the end of the run, so inspecting it manually requires pausing before, or removing, the cleanup step.
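
The content checks in steps 2 to 4 can also be expressed as a pure function that returns a report instead of printing. The following is a minimal sketch under that assumption; `classify_download` and the issue labels are hypothetical names, not part of the original script.

```python
def classify_download(content: bytes) -> dict:
    """Classify downloaded bytes using the same heuristics as debug_download.

    Hypothetical helper: returns a report dict instead of printing.
    """
    report = {
        "is_text": False,  # True when the bytes decode cleanly as UTF-8
        "has_pdf_header": content.startswith(b"%PDF"),
        "issues": [],
    }
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        # Undecodable bytes are treated as binary, which is expected for a PDF
        return report

    report["is_text"] = True
    lowered = text.lower()
    if "Adobe Reader" in text:
        report["issues"].append("adobe_reader_message")
    if "browser" in lowered:
        report["issues"].append("mentions_browser")
    if "<html" in lowered or "<!doctype" in lowered:
        report["issues"].append("html")
    return report
```

Unlike the original function, this sketch records the %PDF header check for text content too, which may help when a small PDF happens to decode as UTF-8.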

Source Code

def debug_download():
    """Debug what's actually being downloaded."""
    print("=" * 60)
    print("FileCloud Download Debug")
    print("=" * 60)
    
    # Load config
    config = Config()
    fc_config = config.config['filecloud']
    
    # Connect to FileCloud
    with FileCloudClient(fc_config) as fc_client:
        print("✓ Connected to FileCloud")
        
        # Search for documents
        documents = fc_client.search_documents(
            path=fc_config['base_path'],
            extensions=['.pdf']
        )
        
        if not documents:
            print("No documents found")
            return
        
        # Take the first document for debugging
        doc = documents[0]
        print(f"\nDebugging document: {doc['filename']}")
        print(f"Path: {doc['full_path']}")
        print(f"Size reported: {doc['size']} bytes")
        
        # Download the document
        print("\nDownloading document...")
        content = fc_client.download_document(doc['full_path'])
        
        if content is None:
            print("❌ Download failed - no content returned")
            return
        
        print(f"✓ Downloaded {len(content)} bytes")
        
        # Examine the content
        print("\n" + "=" * 40)
        print("CONTENT ANALYSIS")
        print("=" * 40)
        
        # Check if it's binary (PDF) or text
        try:
            # Try to decode as text
            text_content = content.decode('utf-8')
            print("📄 Content appears to be TEXT (not binary PDF)")
            print("First 500 characters:")
            print("-" * 40)
            print(text_content[:500])
            print("-" * 40)
            
            # Check for common FileCloud error patterns
            if "Adobe Reader" in text_content:
                print("🚨 ISSUE: Content contains Adobe Reader message")
            if "browser" in text_content.lower():
                print("🚨 ISSUE: Content mentions browser")
            if "<html" in text_content.lower() or "<!doctype" in text_content.lower():
                print("🚨 ISSUE: Content appears to be HTML")
                
        except UnicodeDecodeError:
            print("📁 Content appears to be BINARY (likely valid PDF)")
            print(f"First 50 bytes (hex): {content[:50].hex()}")
            
            # Check PDF magic number
            if content.startswith(b'%PDF'):
                print("✓ Valid PDF header found")
            else:
                print("❌ Invalid PDF header")
        
        # Save to temporary file for inspection
        temp_file = f"debug_download_{doc['filename']}"
        with open(temp_file, 'wb') as f:
            f.write(content)
        
        print(f"\n📁 Content saved to: {temp_file}")
        print("You can inspect this file manually")
        
        # Try to process with our document processor
        print("\n" + "=" * 40)
        print("DOCUMENT PROCESSOR TEST")
        print("=" * 40)
        
        from utils.document_processor import DocumentProcessor
        processor = DocumentProcessor({})
        
        result = processor.process_document(temp_file, doc['filename'])
        
        if result['success']:
            text = result['text']
            print("✓ Text extraction successful")
            print(f"Extracted {len(text)} characters")
            print("First 200 characters:")
            print("-" * 40)
            print(text[:200])
            print("-" * 40)
        else:
            print(f"❌ Text extraction failed: {result['error']}")
        
        # Clean up
        os.unlink(temp_file)
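
The listing above writes the download into the current directory and only reaches os.unlink if nothing raises first. Below is a minimal sketch of one alternative, assuming it is acceptable to place the copy in a system temporary directory. `saved_copy` and its `keep` flag are hypothetical names, not part of the original script.

```python
import os
import shutil
import tempfile
from contextlib import contextmanager


@contextmanager
def saved_copy(content: bytes, filename: str, keep: bool = False):
    """Write content to a file in a fresh temp directory and yield its path.

    Hypothetical helper: the directory is removed on exit, even if the
    caller raises, unless keep=True is passed for manual inspection.
    """
    tmp_dir = tempfile.mkdtemp(prefix="debug_download_")
    # basename() guards against path separators in the remote filename
    name = os.path.basename(filename) or "download.bin"
    path = os.path.join(tmp_dir, name)
    try:
        with open(path, "wb") as f:
            f.write(content)
        yield path
    finally:
        if not keep:
            shutil.rmtree(tmp_dir, ignore_errors=True)
```

With such a helper, the processor test could run inside `with saved_copy(content, doc['filename']) as temp_file:`, and passing `keep=True` would leave the file in place for inspection.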

Return Value

This function returns None. It is a diagnostic utility that prints debug information to the console and performs side effects (downloading files, creating temporary files). The function may return early with None if no documents are found or if the download fails.
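
Because the function only prints, a caller that wants to act on the diagnostics has to capture standard output. The sketch below shows one way to do that with the standard library. `run_and_capture`, `issue_lines`, and the stand-in function are hypothetical, and the filter relies on the "ISSUE:" prefix used in the source listing.

```python
import io
from contextlib import redirect_stdout


def run_and_capture(func) -> str:
    """Call func() and return everything it printed to stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def issue_lines(output: str) -> list:
    """Return the printed lines that report a detected issue."""
    return [line for line in output.splitlines() if "ISSUE:" in line]


def _fake_debug_download():
    # Stand-in for debug_download(), which needs a live FileCloud server
    print("✓ Connected to FileCloud")
    print("🚨 ISSUE: Content appears to be HTML")


captured = run_and_capture(_fake_debug_download)
problems = issue_lines(captured)
```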

Dependencies

  • os
  • sys
  • pathlib
  • config.config
  • utils.filecloud_client
  • utils.document_processor

Required Imports

import os
import sys
from pathlib import Path
from config.config import Config
from utils.filecloud_client import FileCloudClient

Conditional/Optional Imports

These imports are only needed under specific conditions:

from utils.document_processor import DocumentProcessor

Condition: imported lazily inside the function when testing document processing capabilities

Required (conditional)
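
Since the import is deferred, a missing utils.document_processor module only surfaces at the final step, after the download and content analysis have already run. The following sketch shows how a deferred import could be guarded so that step is skipped instead of crashing. `load_optional` is a hypothetical helper, not part of the original script.

```python
import importlib


def load_optional(module_name: str, attr: str):
    """Return module_name.attr, or None if the module or attribute is missing."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return None
    return getattr(module, attr, None)


# Example: resolve the processor class lazily and skip the test if unavailable
processor_cls = load_optional("utils.document_processor", "DocumentProcessor")
if processor_cls is None:
    print("DocumentProcessor not available - skipping extraction test")
```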

Usage Example

# Ensure config is set up with FileCloud credentials
# config/config.py or config.yaml should have:
# filecloud:
#   base_path: '/path/to/documents'
#   host: 'filecloud.example.com'
#   username: 'user'
#   password: 'pass'

from debug_download import debug_download

# Run the debug function
# This will:
# - Connect to FileCloud
# - Download the first PDF found
# - Analyze the content
# - Test text extraction
# - Print detailed diagnostic information
debug_download()

# Output will show:
# - Connection status
# - Document details (filename, path, size)
# - Content analysis (binary vs text)
# - PDF validation results
# - Text extraction test results
# - Location of temporary debug file

Best Practices

  • This function is intended for debugging only and should not be used in production code
  • Ensure FileCloud credentials are properly configured before running
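  • The function creates a temporary file in the current directory - ensure write permissions exist
  • The temporary file is deleted at the end of the run; if an exception is raised after the file is written (for example inside process_document), os.unlink is never reached and the file is left behind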
  • Review console output carefully for error patterns like HTML responses or browser messages
  • The function only processes the first PDF found - modify if you need to debug specific documents
  • Check that the PDF magic number (%PDF) is present in downloaded content to verify valid PDFs; the function performs this check only when the content fails to decode as UTF-8
  • If content decodes as UTF-8 text instead of binary, this usually indicates the download is returning HTML or an error page instead of the actual PDF
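
The function prints the reported size and the downloaded byte count but does not compare them, and it checks only the start of the file. The sketch below adds those comparisons. `check_pdf_bytes` is a hypothetical helper, and it assumes the FileCloud size field may arrive as either an integer or a numeric string. The trailer check relies on the PDF convention that a file ends with a %%EOF marker.

```python
def check_pdf_bytes(content: bytes, reported_size=None) -> list:
    """Return a list of problems found in downloaded PDF bytes (empty if none)."""
    problems = []

    # A PDF starts with a header such as %PDF-1.7
    if not content.startswith(b"%PDF-"):
        problems.append("missing %PDF- header")

    # The end-of-file marker is expected near the end of the file
    if b"%%EOF" not in content[-1024:]:
        problems.append("missing %%EOF marker")

    # Compare against the size the server reported, if it is usable
    if reported_size is not None:
        try:
            expected = int(reported_size)
        except (TypeError, ValueError):
            expected = None
        if expected is not None and expected != len(content):
            problems.append(
                f"size mismatch: reported {expected}, got {len(content)}"
            )
    return problems
```

A truncated download would typically fail both the size comparison and the %%EOF check while still passing the header check, which is why the header alone is a weak signal.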

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function explore_documents 76.1% similar

    Explores and tests document accessibility across multiple FileCloud directory paths, attempting to download and validate document content from various locations in a hierarchical search pattern.

    From: /tf/active/vicechatdev/contract_validity_analyzer/explore_documents.py
  • function test_filecloud_connection 65.1% similar

    Tests the connection to a FileCloud server by establishing a client connection and performing a document search operation to verify functionality.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py
  • function _download_current_version 63.4% similar

    Downloads the current version of a document from either FileCloud storage or standard storage, handling different storage types and triggering a browser download.

    From: /tf/active/vicechatdev/document_controller_backup.py
  • function test_extraction_debugging 61.7% similar

    A test function that validates the extraction debugging functionality of a DocumentProcessor by creating test files, simulating document extraction, and verifying debug log creation.

    From: /tf/active/vicechatdev/vice_ai/test_extraction_debug.py
  • function check_filecloud_structure 61.1% similar

    Diagnostic function that checks the FileCloud server structure and verifies accessibility of various paths including root, SHARED, and configured base paths.

    From: /tf/active/vicechatdev/SPFCsync/check_filecloud_structure.py