function explore_documents
Explores and tests document accessibility across multiple FileCloud directory paths, attempting to download and validate document content from various locations in a hierarchical search pattern.
/tf/active/vicechatdev/contract_validity_analyzer/explore_documents.py
16 - 115
complex
Purpose
This diagnostic function walks a predefined list of four FileCloud directory paths, ordered from most specific to most general, and searches each for PDF, DOC, and DOCX files. For up to five sampled folders per path, it downloads the first document and inspects the content: bytes that decode as UTF-8 are checked for an Adobe Reader placeholder message or an Ansarada redirect, while binary content is checked for the %PDF header and, if the header is present, passed through text extraction. The function prints the search progress, document counts, sampled folder contents, and validation results for each path, which makes it useful for troubleshooting document access issues.
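The content check described above can be sketched as a small standalone classifier. This is a minimal illustration that mirrors the branching in the source code below; the function name classify_content and its return labels are hypothetical and do not exist in the original module.

```python
def classify_content(content: bytes) -> str:
    """Classify downloaded bytes following the explorer's quick content check.

    Hypothetical helper: the labels returned here are illustrative only.
    """
    try:
        # Decode the full payload, then keep the first 100 characters.
        preview = content.decode("utf-8")[:100]
    except UnicodeDecodeError:
        # Not valid UTF-8: treat as binary and look for the PDF magic bytes.
        if content.startswith(b"%PDF"):
            return "pdf_binary"
        return "unknown_binary"

    if "Adobe Reader" in preview:
        return "adobe_reader_placeholder"
    if "ansarada" in preview.lower():
        return "ansarada_redirect"
    return "text"


if __name__ == "__main__":
    print(classify_content(b"%PDF-1.7\n\xff\xfe"))
    print(classify_content(b"Please open this file in Adobe Reader"))
```

Note that a payload is only treated as binary when UTF-8 decoding fails, so a PDF whose bytes happen to decode cleanly would fall into the text branch, just as it does in the original function.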
Source Code
def explore_documents():
    """Explore different document types and locations."""
    print("=" * 60)
    print("FileCloud Document Explorer")
    print("=" * 60)

    # Load config
    config = Config()
    fc_config = config.config['filecloud']

    # Try different paths
    paths_to_try = [
        "/SHARED/vicebio_shares/00_Company_Governance/08_Third Parties Management/Third Parties",
        "/SHARED/vicebio_shares/00_Company_Governance/08_Third Parties Management",
        "/SHARED/vicebio_shares/00_Company_Governance",
        "/SHARED/vicebio_shares",
    ]

    with FileCloudClient(fc_config) as fc_client:
        print("✅ Connected to FileCloud")
        for path in paths_to_try:
            print(f"\n🔍 Searching in: {path}")
            try:
                documents = fc_client.search_documents(
                    path=path,
                    extensions=['.pdf', '.doc', '.docx']
                )
                print(f"Found {len(documents)} documents")
                if documents:
                    # Group by folder
                    folders = {}
                    for doc in documents[:20]:  # First 20 documents
                        folder = doc['path']
                        if folder not in folders:
                            folders[folder] = []
                        folders[folder].append(doc)

                    print("Sample documents by folder:")
                    for folder, docs in list(folders.items())[:5]:  # First 5 folders
                        print(f" 📁 {folder}")
                        for doc in docs[:3]:  # First 3 docs per folder
                            print(f" 📄 {doc['filename']} ({doc['size']} bytes)")

                        # Test download from this folder
                        if docs:
                            test_doc = docs[0]
                            print(f" 🧪 Testing download of: {test_doc['filename']}")
                            content = fc_client.download_document(test_doc['full_path'])
                            if content:
                                print(f" Downloaded {len(content)} bytes")
                                # Quick content check
                                try:
                                    text_preview = content.decode('utf-8')[:100]
                                    if "Adobe Reader" in text_preview:
                                        print(" ⚠️ Contains Adobe Reader message")
                                    elif "ansarada" in text_preview.lower():
                                        print(" ⚠️ Contains Ansarada redirect")
                                    else:
                                        print(" ✅ May contain actual document content")
                                        print(f" Preview: {text_preview[:50]}...")
                                except UnicodeDecodeError:
                                    # Binary content - likely a real PDF
                                    if content.startswith(b'%PDF'):
                                        print(" ✅ Valid PDF binary content")
                                        # Try to extract a bit of text
                                        from utils.document_processor import DocumentProcessor
                                        processor = DocumentProcessor({})
                                        temp_file = f"test_{test_doc['filename']}"
                                        try:
                                            with open(temp_file, 'wb') as f:
                                                f.write(content)
                                            result = processor.process_document(temp_file)
                                            if result['success'] and result['text']:
                                                text = result['text']
                                                if "Adobe Reader" not in text and len(text) > 500:
                                                    print(f" 🎉 FOUND GOOD DOCUMENT! Extracted {len(text)} chars")
                                                    print(f" Preview: {text[:100]}...")
                                                else:
                                                    print(f" ⚠️ Extracted text contains Adobe Reader message")
                                            else:
                                                print(f" ❌ Text extraction failed")
                                        finally:
                                            if os.path.exists(temp_file):
                                                os.unlink(temp_file)
                                    else:
                                        print(" ❌ Invalid PDF header")
                            else:
                                print(" ❌ Download failed")
                        print()
            except Exception as e:
                print(f"❌ Error searching {path}: {e}")
Return Value
This function does not return any value (implicitly returns None). It outputs diagnostic information directly to the console, including connection status, document counts, folder structures, download test results, and content validation messages.
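Because the function only prints, a caller that wants to keep or inspect the report has to capture standard output. The sketch below shows one way to do that with the standard library; capture_output and fake_explorer are hypothetical names used for illustration, and fake_explorer stands in for explore_documents so the example runs without a FileCloud connection.

```python
import contextlib
import io


def capture_output(func, *args, **kwargs):
    """Run func and return (return_value, everything_it_printed)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def fake_explorer():
    """Hypothetical stand-in for explore_documents(): prints, returns None."""
    print("Found 3 documents")


result, report = capture_output(fake_explorer)
```

Here result is None, matching the documented return value, while report holds the printed text and could be written to a log file.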
Dependencies
- os
- sys
- pathlib
- config.config
- utils.filecloud_client
- utils.document_processor
Required Imports
import os
import sys
from pathlib import Path
from config.config import Config
from utils.filecloud_client import FileCloudClient
Conditional/Optional Imports
These imports are only needed under specific conditions:
from utils.document_processor import DocumentProcessor
Condition: only when a valid PDF binary is found and text extraction is attempted
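A caller that wants to check in advance whether the optional module is importable could use a guarded import. This is a minimal sketch under the assumption that a missing module should be reported rather than raised; load_optional is a hypothetical helper and is not part of the original code, which imports DocumentProcessor directly inside the PDF branch.

```python
import importlib


def load_optional(module_name: str):
    """Return the imported module, or None if it cannot be imported."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # ModuleNotFoundError is a subclass of ImportError, so a missing
        # package or a missing submodule both end up here.
        return None


if __name__ == "__main__":
    module = load_optional("utils.document_processor")
    if module is None:
        print("Text extraction unavailable: utils.document_processor not found")
```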
Required (conditional)
Usage Example
# Ensure config.py or config file is properly set up with FileCloud credentials
# Example config structure:
# config = {
#     'filecloud': {
#         'server': 'https://filecloud.example.com',
#         'username': 'user@example.com',
#         'password': 'password'
#     }
# }
from explore_documents import explore_documents
# Run the exploration - outputs diagnostic information to console
explore_documents()
# The function will:
# 1. Connect to FileCloud
# 2. Search through predefined paths
# 3. Display found documents grouped by folder
# 4. Test download and content validation
# 5. Report on document accessibility and content quality
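Step 3 above, grouping found documents by folder with sampling limits, can be sketched on its own. This is an illustrative rewrite rather than the function's actual code: sample_by_folder and its parameter names are hypothetical, and the defaults mirror the limits used in the source (20 documents, 5 folders, 3 documents per folder).

```python
from collections import defaultdict
from itertools import islice


def sample_by_folder(documents, max_docs=20, max_folders=5, max_per_folder=3):
    """Group document dicts by their 'path' key and return a bounded sample."""
    folders = defaultdict(list)
    for doc in documents[:max_docs]:
        folders[doc["path"]].append(doc)
    # Dicts preserve insertion order, so islice keeps the first folders seen.
    return {
        folder: docs[:max_per_folder]
        for folder, docs in islice(folders.items(), max_folders)
    }


if __name__ == "__main__":
    docs = [{"path": f"/f{i % 2}", "filename": f"d{i}.pdf"} for i in range(10)]
    for folder, sampled in sample_by_folder(docs).items():
        print(folder, [d["filename"] for d in sampled])
```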
Best Practices
- Ensure FileCloud credentials are properly configured before running this function
- The function creates temporary files during PDF validation - ensure write permissions in the current directory
- This is a diagnostic/exploration tool and should not be used in production pipelines
- For each path, the function samples only the first 20 documents, shows at most 5 folders with up to 3 documents each, and downloads one document per folder, which keeps console output manageable
- Temporary test files are automatically cleaned up, but ensure proper error handling in production use
- The function tests multiple paths in order from most specific to most general - adjust paths_to_try list for different directory structures
- Console output uses Unicode symbols and emoji (check marks, warning signs, folder icons) - ensure the terminal supports UTF-8 encoding
- The function performs actual file downloads for testing - be mindful of network bandwidth and FileCloud rate limits
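The temporary-file points above can be addressed by writing into a dedicated temporary directory instead of the current directory. The following is a minimal sketch using the standard tempfile module; with_temp_copy and the process callback are hypothetical names, and the original function instead writes test_<filename> to the working directory and removes it in a finally block.

```python
import os
import tempfile


def with_temp_copy(content: bytes, filename: str, process):
    """Write content to a temp file, call process(path), then clean up."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        # basename() guards against filenames that contain path separators.
        temp_path = os.path.join(tmp_dir, os.path.basename(filename))
        with open(temp_path, "wb") as f:
            f.write(content)
        # The directory and file are removed when the with block exits,
        # even if process() raises.
        return process(temp_path)


if __name__ == "__main__":
    size = with_temp_copy(b"%PDF-1.4", "report.pdf", os.path.getsize)
    print(f"Processed {size} bytes")
```

This avoids the need for write permission in the current directory and prevents leftover test files if two runs handle documents with the same filename.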
Similar Components
AI-powered semantic similarity - components with related functionality:
- function debug_download (76.1% similar)
- function test_filecloud_connection (74.9% similar)
- function check_filecloud_structure (72.0% similar)
- function test_filecloud_operations (62.9% similar)
- function test_docx_file (62.2% similar)