debug_download - Code Extractor

function debug_download

Maturity: 46

A diagnostic function that downloads a PDF document from FileCloud, analyzes its content to verify it's a valid PDF, and tests text extraction capabilities.

File:
/tf/active/vicechatdev/contract_validity_analyzer/debug_download.py

Lines:
16 - 119

Complexity:
moderate

Purpose

This debugging utility helps troubleshoot FileCloud document download issues. It: 1) connects to FileCloud and downloads the first available PDF; 2) checks whether the downloaded content is binary data or unexpected text/HTML; 3) looks for common error patterns (HTML responses, "Adobe Reader" or browser messages); 4) validates the PDF magic number (%PDF) when the content is binary; 5) tests the DocumentProcessor's ability to extract text from the downloaded file. It prints detailed console output and writes the download to a temporary file in the current directory, which is deleted once the text-extraction test completes.
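The content checks described above can be sketched as a small standalone helper. This is an illustrative variant, not the function's actual logic: `classify_content` is a hypothetical name, and unlike `debug_download` it tests the magic number before attempting a UTF-8 decode, on the assumption that a small ASCII-only PDF could otherwise be reported as text.

```python
def classify_content(content: bytes) -> str:
    """Return a rough label for downloaded bytes: 'pdf', 'html', 'text' or 'binary'."""
    # A PDF file begins with the magic number %PDF. Checking it first avoids
    # mislabelling a PDF whose bytes happen to be valid UTF-8.
    if content.startswith(b"%PDF"):
        return "pdf"

    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        # Not a PDF header and not decodable text: some other binary payload.
        return "binary"

    # Same HTML markers that debug_download looks for.
    lowered = text.lower()
    if "<html" in lowered or "<!doctype" in lowered:
        return "html"
    return "text"
```

A label of `html` or `text` for a file that should be a PDF suggests the server returned an error or viewer page rather than the document itself.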

Source Code

def debug_download():
    """Debug what's actually being downloaded."""
    print("=" * 60)
    print("FileCloud Download Debug")
    print("=" * 60)
    
    # Load config
    config = Config()
    fc_config = config.config['filecloud']
    
    # Connect to FileCloud
    with FileCloudClient(fc_config) as fc_client:
        print("✓ Connected to FileCloud")
        
        # Search for documents
        documents = fc_client.search_documents(
            path=fc_config['base_path'],
            extensions=['.pdf']
        )
        
        if not documents:
            print("No documents found")
            return
        
        # Take the first document for debugging
        doc = documents[0]
        print(f"\nDebugging document: {doc['filename']}")
        print(f"Path: {doc['full_path']}")
        print(f"Size reported: {doc['size']} bytes")
        
        # Download the document
        print("\nDownloading document...")
        content = fc_client.download_document(doc['full_path'])
        
        if content is None:
            print("❌ Download failed - no content returned")
            return
        
        print(f"✓ Downloaded {len(content)} bytes")
        
        # Examine the content
        print("\n" + "=" * 40)
        print("CONTENT ANALYSIS")
        print("=" * 40)
        
        # Check if it's binary (PDF) or text
        try:
            # Try to decode as text
            text_content = content.decode('utf-8')
            print("📄 Content appears to be TEXT (not binary PDF)")
            print(f"First 500 characters:")
            print("-" * 40)
            print(text_content[:500])
            print("-" * 40)
            
            # Check for common FileCloud error patterns
            if "Adobe Reader" in text_content:
                print("🚨 ISSUE: Content contains Adobe Reader message")
            if "browser" in text_content.lower():
                print("🚨 ISSUE: Content mentions browser")
            if "<html" in text_content.lower() or "<!doctype" in text_content.lower():
                print("🚨 ISSUE: Content appears to be HTML")
                
        except UnicodeDecodeError:
            print("📁 Content appears to be BINARY (likely valid PDF)")
            print(f"First 50 bytes (hex): {content[:50].hex()}")
            
            # Check PDF magic number
            if content.startswith(b'%PDF'):
                print("✓ Valid PDF header found")
            else:
                print("❌ Invalid PDF header")
        
        # Save to temporary file for inspection
        temp_file = f"debug_download_{doc['filename']}"
        with open(temp_file, 'wb') as f:
            f.write(content)
        
        print(f"\n📁 Content saved to: {temp_file}")
        print("You can inspect this file manually")
        
        # Try to process with our document processor
        print("\n" + "=" * 40)
        print("DOCUMENT PROCESSOR TEST")
        print("=" * 40)
        
        from utils.document_processor import DocumentProcessor
        processor = DocumentProcessor({})
        
        result = processor.process_document(temp_file, doc['filename'])
        
        if result['success']:
            text = result['text']
            print(f"✓ Text extraction successful")
            print(f"Extracted {len(text)} characters")
            print("First 200 characters:")
            print("-" * 40)
            print(text[:200])
            print("-" * 40)
        else:
            print(f"❌ Text extraction failed: {result['error']}")
        
        # Clean up
        os.unlink(temp_file)

Return Value

This function returns None. It is a diagnostic utility that prints debug information to the console and performs side effects (downloading files, creating temporary files). The function may return early with None if no documents are found or if the download fails.
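Because the function reports only through `print`, a caller that wants to inspect the result programmatically has to capture standard output. A minimal sketch, assuming a hypothetical `capture_output` wrapper and using a stand-in `_demo` function in place of `debug_download` so the example runs without a FileCloud connection:

```python
import contextlib
import io


def capture_output(func, *args, **kwargs) -> str:
    """Call func and return everything it printed to stdout as one string."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def _demo():
    # Stand-in that prints two of the messages debug_download can emit.
    print("✓ Connected to FileCloud")
    print("❌ Download failed - no content returned")


report = capture_output(_demo)
# The failure marker in the captured text is the only signal available,
# since the function itself returns None on every path.
failed = "❌" in report
```

In real use, `capture_output(debug_download)` would replace the `_demo` call.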

Dependencies

os
sys
pathlib
config.config
utils.filecloud_client
utils.document_processor

Required Imports

import os
import sys
from pathlib import Path
from config.config import Config
from utils.filecloud_client import FileCloudClient

Conditional/Optional Imports

These imports are only needed under specific conditions:

from utils.document_processor import DocumentProcessor

Condition: imported lazily inside the function when testing document processing capabilities

Required (conditional)
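A lazy import like this can also be made tolerant of a missing module, so that the download checks still run when the processor is unavailable. The sketch below is one possible approach, not what `debug_download` does; `load_optional` is a hypothetical helper, and the standard-library `json` module stands in for `utils.document_processor`.

```python
import importlib


def load_optional(module_name: str, attr: str):
    """Return module_name.attr, or None if the module or attribute is unavailable."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        # Covers ModuleNotFoundError as well, which subclasses ImportError.
        return None
    return getattr(module, attr, None)


# Stand-in demonstration with a module that exists and one that does not.
JSONDecoder = load_optional("json", "JSONDecoder")
missing = load_optional("module_that_does_not_exist_xyz", "Anything")
```

With the real module, the call would be `load_optional("utils.document_processor", "DocumentProcessor")`, and the extraction test would be skipped when it returns `None`.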

Usage Example

# Ensure config is set up with FileCloud credentials
# config/config.py or config.yaml should have:
# filecloud:
#   base_path: '/path/to/documents'
#   host: 'filecloud.example.com'
#   username: 'user'
#   password: 'pass'

from debug_download import debug_download

# Run the debug function
# This will:
# - Connect to FileCloud
# - Download the first PDF found
# - Analyze the content
# - Test text extraction
# - Print detailed diagnostic information
debug_download()

# Output will show:
# - Connection status
# - Document details (filename, path, size)
# - Content analysis (binary vs text)
# - PDF validation results
# - Text extraction test results
# - Location of temporary debug file (removed when the run finishes)
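Before running the function, it can help to confirm that the FileCloud settings are present. The sketch below assumes the key names listed in the comments above (`base_path`, `host`, `username`, `password`); the real `Config` class may use different names, and `missing_filecloud_keys` is a hypothetical helper.

```python
# Key names taken from the usage-example comments; treat them as an assumption.
REQUIRED_KEYS = ("base_path", "host", "username", "password")


def missing_filecloud_keys(fc_config: dict) -> list:
    """Return the required keys that are absent or empty in the filecloud section."""
    return [key for key in REQUIRED_KEYS if not fc_config.get(key)]


example = {"base_path": "/path/to/documents", "host": "filecloud.example.com"}
problems = missing_filecloud_keys(example)
if problems:
    print(f"Missing FileCloud settings: {', '.join(problems)}")
```

With the real configuration, the argument would be `Config().config['filecloud']`, as in the source code.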

Best Practices

This function is intended for debugging only and should not be used in production code.
Ensure FileCloud credentials are properly configured before running.
The function writes its temporary file to the current directory, so ensure write permissions exist.
The temporary file is deleted at the end of a complete run; if an exception is raised earlier (for example during text extraction), the file is left behind and must be removed manually.
Review console output carefully for error patterns such as HTML responses or browser messages.
The function only processes the first PDF found; modify it if you need to debug a specific document.
Check that the PDF magic number (%PDF) is present at the start of the downloaded content to verify that it is a PDF.
If the content decodes as UTF-8 text instead of binary, the download is most likely returning HTML or an error page instead of the actual PDF.
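The temporary-file handling can be made more robust with `tempfile` and a `try`/`finally` block. This is a sketch of one option rather than the function's current behaviour; `with_temp_copy` and its `action` callback are hypothetical names introduced for illustration.

```python
import os
import tempfile
from pathlib import Path


def with_temp_copy(content: bytes, filename: str, action):
    """Write content to a temp file, call action(path), and always delete the file."""
    # Keep only the final path component so a remote name cannot point
    # outside the temp directory; the extension is preserved for processors
    # that rely on it.
    safe_name = Path(filename).name or "download.bin"
    fd, temp_path = tempfile.mkstemp(prefix="debug_download_", suffix="_" + safe_name)
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(content)
        return action(temp_path)
    finally:
        # Runs whether action succeeds or raises, so no file is left behind.
        os.unlink(temp_path)
```

In `debug_download`, the callback could be something like `lambda path: processor.process_document(path, doc['filename'])`, which also avoids the need for write access to the current directory.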

Similar Components

AI-powered semantic similarity - components with related functionality:

function explore_documents 76.1% similar

Explores and tests document accessibility across multiple FileCloud directory paths, attempting to download and validate document content from various locations in a hierarchical search pattern.
From: /tf/active/vicechatdev/contract_validity_analyzer/explore_documents.py

function test_filecloud_connection 65.1% similar

Tests the connection to a FileCloud server by establishing a client connection and performing a document search operation to verify functionality.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py

function _download_current_version 63.4% similar

Downloads the current version of a document from either FileCloud storage or standard storage, handling different storage types and triggering a browser download.
From: /tf/active/vicechatdev/document_controller_backup.py

function test_extraction_debugging 61.7% similar

A test function that validates the extraction debugging functionality of a DocumentProcessor by creating test files, simulating document extraction, and verifying debug log creation.
From: /tf/active/vicechatdev/vice_ai/test_extraction_debug.py

function check_filecloud_structure 61.1% similar

Diagnostic function that checks the FileCloud server structure and verifies accessibility of various paths including root, SHARED, and configured base paths.
From: /tf/active/vicechatdev/SPFCsync/check_filecloud_structure.py
