explore_documents - Code Extractor

function explore_documents

Maturity: 46

Explores and tests document accessibility across multiple FileCloud directory paths, attempting to download and validate document content from various locations in a hierarchical search pattern.

File:
/tf/active/vicechatdev/contract_validity_analyzer/explore_documents.py

Lines:
16 - 115

Complexity:
complex

Purpose

This diagnostic function searches a predefined list of FileCloud directory paths, ordered from most specific to most general, for documents (PDF, DOC, DOCX). For each path it groups the first 20 results by folder, downloads the first document from each of up to five folders, and inspects the content: it flags Adobe Reader placeholder messages and Ansarada redirect pages, checks binary content for a %PDF header, and attempts text extraction on valid PDFs. Search progress, document counts, folder listings, and validation results are printed to the console. It is intended for troubleshooting document access issues.
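The content checks described above can be sketched as a small standalone classifier. This is a minimal sketch, not the function's actual code: the name `classify_content` and its return labels are hypothetical, while the marker strings ("Adobe Reader", "ansarada", `%PDF`) are the ones the source checks for. One deliberate difference is that the sketch tests the PDF header before attempting to decode, so an all-ASCII PDF is not reported as plain text.

```python
def classify_content(content: bytes) -> str:
    """Return a coarse label for downloaded bytes (labels are hypothetical)."""
    if not content:
        return "empty"
    # Check the PDF magic bytes first, before any text decoding.
    if content.startswith(b"%PDF"):
        return "pdf"
    try:
        # Same approach as the source: strict UTF-8 decode, then a 100-char preview.
        preview = content.decode("utf-8")[:100]
    except UnicodeDecodeError:
        return "unknown_binary"
    if "Adobe Reader" in preview:
        return "adobe_placeholder"
    if "ansarada" in preview.lower():
        return "ansarada_redirect"
    return "text"


if __name__ == "__main__":
    for sample in (b"%PDF-1.7\n", b"Please install Adobe Reader", b"\xff\xfe\x00"):
        print(classify_content(sample))
```

Returning a label instead of printing keeps the check testable without a FileCloud connection; the caller can still map each label to a console message.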

Source Code

def explore_documents():
    """Explore different document types and locations."""
    print("=" * 60)
    print("FileCloud Document Explorer")
    print("=" * 60)
    
    # Load config
    config = Config()
    fc_config = config.config['filecloud']
    
    # Try different paths
    paths_to_try = [
        "/SHARED/vicebio_shares/00_Company_Governance/08_Third Parties Management/Third Parties",
        "/SHARED/vicebio_shares/00_Company_Governance/08_Third Parties Management",
        "/SHARED/vicebio_shares/00_Company_Governance",
        "/SHARED/vicebio_shares",
    ]
    
    with FileCloudClient(fc_config) as fc_client:
        print("✓ Connected to FileCloud")
        
        for path in paths_to_try:
            print(f"\n🔍 Searching in: {path}")
            try:
                documents = fc_client.search_documents(
                    path=path,
                    extensions=['.pdf', '.doc', '.docx']
                )
                
                print(f"Found {len(documents)} documents")
                
                if documents:
                    # Group by folder
                    folders = {}
                    for doc in documents[:20]:  # First 20 documents
                        folder = doc['path']
                        if folder not in folders:
                            folders[folder] = []
                        folders[folder].append(doc)
                    
                    print("Sample documents by folder:")
                    for folder, docs in list(folders.items())[:5]:  # First 5 folders
                        print(f"  📁 {folder}")
                        for doc in docs[:3]:  # First 3 docs per folder
                            print(f"    📄 {doc['filename']} ({doc['size']} bytes)")
                        
                        # Test download from this folder
                        if docs:
                            test_doc = docs[0]
                            print(f"    🔍 Testing download of: {test_doc['filename']}")
                            content = fc_client.download_document(test_doc['full_path'])
                            
                            if content:
                                print(f"      Downloaded {len(content)} bytes")
                                
                                # Quick content check
                                try:
                                    text_preview = content.decode('utf-8')[:100]
                                    if "Adobe Reader" in text_preview:
                                        print("      ⚠️  Contains Adobe Reader message")
                                    elif "ansarada" in text_preview.lower():
                                        print("      ⚠️  Contains Ansarada redirect")
                                    else:
                                        print("      ✓ May contain actual document content")
                                        print(f"      Preview: {text_preview[:50]}...")
                                except UnicodeDecodeError:
                                    # Binary content - likely a real PDF
                                    if content.startswith(b'%PDF'):
                                        print("      ✓ Valid PDF binary content")
                                        
                                        # Try to extract a bit of text
                                        from utils.document_processor import DocumentProcessor
                                        processor = DocumentProcessor({})
                                        
                                        temp_file = f"test_{test_doc['filename']}"
                                        try:
                                            with open(temp_file, 'wb') as f:
                                                f.write(content)
                                            
                                            result = processor.process_document(temp_file)
                                            if result['success'] and result['text']:
                                                text = result['text']
                                                if "Adobe Reader" not in text and len(text) > 500:
                                                    print(f"      🎉 FOUND GOOD DOCUMENT! Extracted {len(text)} chars")
                                                    print(f"      Preview: {text[:100]}...")
                                                else:
                                                    # Reached when the text is 500 chars or fewer, or mentions Adobe Reader
                                                    print("      ⚠️  Extracted text is too short or contains Adobe Reader message")
                                            else:
                                                print("      ❌ Text extraction failed")
                                        finally:
                                            if os.path.exists(temp_file):
                                                os.unlink(temp_file)
                                    else:
                                        print("      ❌ Invalid PDF header")
                            else:
                                print("      ❌ Download failed")
                        print()
                        
            except Exception as e:
                print(f"❌ Error searching {path}: {e}")
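
The grouping and sampling step in the code above (first 20 documents, first 5 folders, first 3 documents per folder) can be isolated as a helper. The following is a sketch under assumptions: `sample_by_folder` and its parameter names are hypothetical, and it assumes each document is a dict with a `path` key, as in the source.

```python
from collections import defaultdict
from itertools import islice


def sample_by_folder(documents, max_docs=20, max_folders=5, max_per_folder=3):
    """Group the first max_docs documents by their 'path' key and trim the result."""
    folders = defaultdict(list)
    for doc in documents[:max_docs]:
        folders[doc["path"]].append(doc)
    # Dicts preserve insertion order, so this keeps the first folders encountered.
    return {
        folder: docs[:max_per_folder]
        for folder, docs in islice(folders.items(), max_folders)
    }


if __name__ == "__main__":
    demo = [{"path": f"/f{i % 7}", "filename": f"d{i}.pdf"} for i in range(30)]
    for folder, docs in sample_by_folder(demo).items():
        print(folder, [d["filename"] for d in docs])
```

Separating the sampling from the printing makes the limits easy to adjust and to test without a server.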

Return Value

This function does not return any value (implicitly returns None). It outputs diagnostic information directly to the console, including connection status, document counts, folder structures, download test results, and content validation messages.
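
Because the function only prints, a caller that wants to keep or inspect the report has to capture standard output. A minimal sketch using the standard library's `contextlib.redirect_stdout` follows; `capture_output` and `fake_explore` are hypothetical names, and `fake_explore` merely stands in for `explore_documents()`, which needs a live FileCloud server.

```python
import contextlib
import io


def capture_output(func, *args, **kwargs) -> str:
    """Run func and return everything it printed to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        func(*args, **kwargs)
    return buffer.getvalue()


def fake_explore():
    # Stand-in for explore_documents(); prints a fixed, made-up report.
    print("FileCloud Document Explorer")
    print("Found 0 documents")


report = capture_output(fake_explore)
```

With the real function, `capture_output(explore_documents)` would return the full diagnostic text, which could then be written to a log file or searched for warning markers.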

Dependencies

os
sys
pathlib
config.config
utils.filecloud_client
utils.document_processor

Required Imports

import os
import sys
from pathlib import Path
from config.config import Config
from utils.filecloud_client import FileCloudClient

Conditional/Optional Imports

These imports are only needed under specific conditions:

from utils.document_processor import DocumentProcessor

Condition: needed only when downloaded content is not valid UTF-8 text, starts with the %PDF header, and text extraction is attempted. The import is required on that code path.

Usage Example

# Ensure config.py or config file is properly set up with FileCloud credentials
# Example config structure:
# config = {
#     'filecloud': {
#         'server': 'https://filecloud.example.com',
#         'username': 'user@example.com',
#         'password': 'password'
#     }
# }

from explore_documents import explore_documents

# Run the exploration - outputs diagnostic information to console
explore_documents()

# The function will:
# 1. Connect to FileCloud
# 2. Search through predefined paths
# 3. Display found documents grouped by folder
# 4. Test download and content validation
# 5. Report on document accessibility and content quality
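
Before running the exploration, it can help to confirm that the configuration has the expected entries. The sketch below assumes the key names from the commented example structure above (`server`, `username`, `password`); the real `Config` class may use different keys, and `missing_filecloud_keys` is a hypothetical helper.

```python
# Key names are taken from the example structure above; they are an assumption.
REQUIRED_KEYS = ("server", "username", "password")


def missing_filecloud_keys(config: dict) -> list:
    """Return the required 'filecloud' keys that are absent or empty."""
    section = config.get("filecloud") or {}
    return [key for key in REQUIRED_KEYS if not section.get(key)]


if __name__ == "__main__":
    example = {"filecloud": {"server": "https://filecloud.example.com"}}
    print(missing_filecloud_keys(example))
```

If the returned list is non-empty, the caller can stop with a clear message instead of failing later inside the FileCloud connection.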

Best Practices

Ensure FileCloud credentials are properly configured before running this function
The function writes a temporary file named test_<filename> to the current working directory during PDF validation, so that directory must be writable
This is a diagnostic/exploration tool and should not be used in production pipelines
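For each path, output is limited to the first 20 documents, the first 5 folders among them, and 3 documents per folder to keep console output readable; one download is tested per listed folder
Temporary test files are deleted in a finally block even if text extraction fails; a download error, however, is caught by the outer handler and skips the remaining folders for that path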
The function tests multiple paths in order from most specific to most general - adjust paths_to_try list for different directory structures
Console output uses Unicode characters (✓, ❌, 🔍, etc.) - ensure terminal supports UTF-8 encoding
The function performs actual file downloads for testing - be mindful of network bandwidth and FileCloud rate limits
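
The temporary-file advice above can be made more robust by writing into a dedicated temporary directory instead of the current working directory. This is a sketch of one possible approach, not the function's current behavior: `write_temp_copy` is a hypothetical helper, and the sample bytes and filename are made up.

```python
import os
import tempfile


def write_temp_copy(content: bytes, filename: str, directory: str) -> str:
    """Write content under directory, using only the base name of filename."""
    # basename() drops any folder components the remote filename might carry.
    safe_name = os.path.basename(filename) or "document.bin"
    path = os.path.join(directory, f"test_{safe_name}")
    with open(path, "wb") as handle:
        handle.write(content)
    return path


with tempfile.TemporaryDirectory() as tmp_dir:
    temp_path = write_temp_copy(b"%PDF-1.7\n", "folder/contract.pdf", tmp_dir)
    existed_inside = os.path.exists(temp_path)
    # A real run would pass temp_path to the text extractor here.

# The directory and its contents are removed when the with block exits.
removed_after = not os.path.exists(temp_path)
```

Using `tempfile.TemporaryDirectory` removes the need for write access to the working directory and avoids name clashes with existing files there.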

Similar Components

AI-powered semantic similarity - components with related functionality:

function debug_download 76.1% similar
A diagnostic function that downloads a PDF document from FileCloud, analyzes its content to verify it's a valid PDF, and tests text extraction capabilities.
From: /tf/active/vicechatdev/contract_validity_analyzer/debug_download.py

function test_filecloud_connection 74.9% similar
Tests the connection to a FileCloud server by establishing a client connection and performing a document search operation to verify functionality.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py

function check_filecloud_structure 72.0% similar
Diagnostic function that checks the FileCloud server structure and verifies accessibility of various paths including root, SHARED, and configured base paths.
From: /tf/active/vicechatdev/SPFCsync/check_filecloud_structure.py

function test_filecloud_operations 62.9% similar
Tests FileCloud basic operations by creating a test folder to verify connectivity and authentication with a FileCloud server.
From: /tf/active/vicechatdev/SPFCsync/test_connections.py

function test_docx_file 62.2% similar
Tests the ability to open and read a Microsoft Word (.docx) document file, validating file existence, size, and content extraction capabilities.
From: /tf/active/vicechatdev/docchat/test_problematic_files.py
