function debug_download
A diagnostic function that downloads a PDF document from FileCloud, analyzes its content to verify it's a valid PDF, and tests text extraction capabilities.
/tf/active/vicechatdev/contract_validity_analyzer/debug_download.py
16 - 119
moderate
Purpose
This debugging utility helps troubleshoot FileCloud document download issues by: 1) Connecting to FileCloud and downloading the first available PDF, 2) Analyzing whether the downloaded content is binary PDF data or unexpected text/HTML, 3) Checking for common error patterns (HTML responses, browser messages), 4) Validating PDF magic numbers, 5) Testing the DocumentProcessor's ability to extract text from the downloaded file. It prints detailed console output and writes the download to a temporary file in the current directory; that file is deleted at the end of a successful run.
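The content checks in steps 2 to 4 can be sketched as a standalone helper that needs no FileCloud connection. This is a minimal sketch, not the code in debug_download.py: the name classify_download and its return labels are hypothetical, and it tests for the %PDF magic number before attempting UTF-8 decoding, which is a different order from the source.

```python
def classify_download(content: bytes) -> str:
    """Label downloaded bytes as 'pdf', 'html', 'text', or 'binary'.

    Hypothetical helper for illustration; not part of debug_download.py.
    """
    # A PDF file starts with the magic bytes "%PDF".
    if content.startswith(b"%PDF"):
        return "pdf"
    try:
        text = content.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 and no PDF header: unknown binary data.
        return "binary"
    lowered = text.lower()
    # Same HTML markers that the source looks for.
    if "<html" in lowered or "<!doctype" in lowered:
        return "html"
    return "text"
```

Checking the header first means a PDF is recognized even if its bytes happen to decode as UTF-8, while an HTML login or error page is still reported as 'html'.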
Source Code
def debug_download():
    """Debug what's actually being downloaded."""
    print("=" * 60)
    print("FileCloud Download Debug")
    print("=" * 60)

    # Load config
    config = Config()
    fc_config = config.config['filecloud']

    # Connect to FileCloud
    with FileCloudClient(fc_config) as fc_client:
        print("✓ Connected to FileCloud")

        # Search for documents
        documents = fc_client.search_documents(
            path=fc_config['base_path'],
            extensions=['.pdf']
        )
        if not documents:
            print("No documents found")
            return

        # Take the first document for debugging
        doc = documents[0]
        print(f"\nDebugging document: {doc['filename']}")
        print(f"Path: {doc['full_path']}")
        print(f"Size reported: {doc['size']} bytes")

        # Download the document
        print("\nDownloading document...")
        content = fc_client.download_document(doc['full_path'])
        if content is None:
            print("✗ Download failed - no content returned")
            return
        print(f"✓ Downloaded {len(content)} bytes")

        # Examine the content
        print("\n" + "=" * 40)
        print("CONTENT ANALYSIS")
        print("=" * 40)

        # Check if it's binary (PDF) or text
        try:
            # Try to decode as text
            text_content = content.decode('utf-8')
            print("Content appears to be TEXT (not binary PDF)")
            print("First 500 characters:")
            print("-" * 40)
            print(text_content[:500])
            print("-" * 40)

            # Check for common FileCloud error patterns
            if "Adobe Reader" in text_content:
                print("🚨 ISSUE: Content contains Adobe Reader message")
            if "browser" in text_content.lower():
                print("🚨 ISSUE: Content mentions browser")
            if "<html" in text_content.lower() or "<!doctype" in text_content.lower():
                print("🚨 ISSUE: Content appears to be HTML")
        except UnicodeDecodeError:
            print("Content appears to be BINARY (likely valid PDF)")
            print(f"First 50 bytes (hex): {content[:50].hex()}")

            # Check PDF magic number
            if content.startswith(b'%PDF'):
                print("✓ Valid PDF header found")
            else:
                print("✗ Invalid PDF header")

        # Save to temporary file for inspection
        temp_file = f"debug_download_{doc['filename']}"
        with open(temp_file, 'wb') as f:
            f.write(content)
        print(f"\nContent saved to: {temp_file}")
        print("You can inspect this file manually")

        # Try to process with our document processor
        print("\n" + "=" * 40)
        print("DOCUMENT PROCESSOR TEST")
        print("=" * 40)
        from utils.document_processor import DocumentProcessor
        processor = DocumentProcessor({})
        result = processor.process_document(temp_file, doc['filename'])
        if result['success']:
            text = result['text']
            print("✓ Text extraction successful")
            print(f"Extracted {len(text)} characters")
            print("First 200 characters:")
            print("-" * 40)
            print(text[:200])
            print("-" * 40)
        else:
            print(f"✗ Text extraction failed: {result['error']}")

        # Clean up
        os.unlink(temp_file)
Return Value
This function returns None. It is a diagnostic utility that prints debug information to the console and performs side effects (downloading files, creating temporary files). The function may return early with None if no documents are found or if the download fails.
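Because the function reports only through print and returns None, its diagnostics have to be captured from standard output if they are to be saved or compared between runs. The following is a minimal sketch of one way to do that with the standard library; run_and_capture and fake_debug are hypothetical names, and fake_debug merely stands in for debug_download, which needs a live FileCloud server.

```python
import contextlib
import io


def run_and_capture(func):
    """Call func() and return everything it printed as one string."""
    buffer = io.StringIO()
    # Redirect print output into the buffer for the duration of the call.
    with contextlib.redirect_stdout(buffer):
        func()
    return buffer.getvalue()


def fake_debug():
    # Stand-in for debug_download(); prints two of its real messages.
    print("FileCloud Download Debug")
    print("No documents found")


report = run_and_capture(fake_debug)
```

In practice one would pass debug_download itself to run_and_capture and write the returned string to a log file.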
Dependencies
- os
- sys
- pathlib
- config.config
- utils.filecloud_client
- utils.document_processor
Required Imports
import os
import sys
from pathlib import Path
from config.config import Config
from utils.filecloud_client import FileCloudClient
Conditional/Optional Imports
These imports are only needed under specific conditions:
from utils.document_processor import DocumentProcessor
Condition: imported lazily inside the function when testing document processing capabilities
Required (conditional)
Usage Example
# Ensure config is set up with FileCloud credentials
# config/config.py or config.yaml should have:
# filecloud:
#   base_path: '/path/to/documents'
#   host: 'filecloud.example.com'
#   username: 'user'
#   password: 'pass'
# The module name follows the source file, debug_download.py
from debug_download import debug_download
# Run the debug function
# This will:
# - Connect to FileCloud
# - Download the first PDF found
# - Analyze the content
# - Test text extraction
# - Print detailed diagnostic information
debug_download()
# Output will show:
# - Connection status
# - Document details (filename, path, size)
# - Content analysis (binary vs text)
# - PDF validation results
# - Text extraction test results
# - Location of temporary debug file
Best Practices
- This function is intended for debugging only and should not be used in production code
- Ensure FileCloud credentials are properly configured before running
- The function creates temporary files in the current directory - ensure write permissions exist
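- The temporary file is deleted with os.unlink at the end of a successful run, so it is no longer available for manual inspection despite the console message; if document processing raises an exception, the file is left behind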
- Review console output carefully for error patterns like HTML responses or browser messages
- The function only processes the first PDF found - modify if you need to debug specific documents
- Check that the PDF magic number (%PDF) is present in downloaded content to verify valid PDFs
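- If content decodes as UTF-8 text instead of binary, this usually indicates the download returned HTML or an error page rather than the actual PDF; a PDF whose bytes are all valid UTF-8 would also take this branch, so check whether the printed first 500 characters begin with %PDF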
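In the source, os.unlink runs only if every earlier step succeeds, and the file name is built directly from the remote filename. A minimal sketch of cleanup that also runs when processing fails is shown here; with_temp_copy is a hypothetical helper, and its process argument stands in for a call such as DocumentProcessor.process_document.

```python
import os
import tempfile


def with_temp_copy(content: bytes, filename: str, process):
    """Write content to a temp file, run process(path), always clean up.

    Hypothetical helper for illustration; not part of debug_download.py.
    """
    # Keep only the base name so directory parts of the remote name are dropped.
    safe_name = os.path.basename(filename) or "download.pdf"
    fd, path = tempfile.mkstemp(prefix="debug_download_", suffix="_" + safe_name)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(content)
        return process(path)
    finally:
        # Runs on success and on exceptions alike.
        if os.path.exists(path):
            os.unlink(path)
```

Using tempfile.mkstemp places the file in the system temporary directory rather than the current directory, which also avoids the write-permission concern noted in the list above. To keep the file for manual inspection, the finally block would need to be skipped deliberately, for example behind a flag.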
Similar Components
AI-powered semantic similarity - components with related functionality:
- function explore_documents 76.1% similar
- function test_filecloud_connection 65.1% similar
- function _download_current_version 63.4% similar
- function test_extraction_debugging 61.7% similar
- function check_filecloud_structure 61.1% similar