function scan_wuxi2_folder
Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.
/tf/active/vicechatdev/mailsearch/compare_documents.py
119 - 162
moderate
Purpose
This function indexes the PDF documents in a wuxi2 repository by the document codes embedded in their filenames. It walks the directory tree, keeps only PDF files, extracts each code with extract_document_code(), and collects metadata for every coded document with get_file_info(). It skips hidden directories and 'summary' folders, prints progress while scanning, and returns a mapping from each document code to the list of matching files and their metadata.
Source Code
def scan_wuxi2_folder(wuxi2_folder: str) -> Dict[str, List[Dict]]:
    """
    Recursively scan wuxi2 folder for documents with codes

    Args:
        wuxi2_folder: Path to wuxi2 repository

    Returns:
        Dictionary mapping document codes to list of matching files
    """
    print(f"\nScanning wuxi2 repository: {wuxi2_folder}")

    coded_documents = defaultdict(list)
    total_files = 0
    matched_files = 0

    for root, dirs, files in os.walk(wuxi2_folder):
        # Skip hidden directories and summary folder
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != 'summary']

        for filename in files:
            # Only process PDF files
            if not filename.lower().endswith('.pdf'):
                continue

            total_files += 1
            filepath = os.path.join(root, filename)

            # Extract document code
            code = extract_document_code(filename)
            if code:
                file_info = get_file_info(filepath)
                file_info['filename'] = filename
                file_info['relative_path'] = os.path.relpath(filepath, wuxi2_folder)
                coded_documents[code].append(file_info)
                matched_files += 1

                if matched_files % 50 == 0:
                    print(f" Processed {matched_files} coded documents...")

    print(f"\nTotal PDF files scanned: {total_files}")
    print(f"Total coded documents found: {matched_files}")
    print(f"Unique document codes: {len(coded_documents)}")

    return coded_documents
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| wuxi2_folder | str | - | positional_or_keyword |
Parameter Details
wuxi2_folder: String path to the root directory of the wuxi2 repository to scan. Should be an absolute or relative path to a valid directory containing PDF documents. The function will recursively traverse all subdirectories except hidden ones (starting with '.') and 'summary' folders.
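The traversal rule described above depends on os.walk running top-down (its default), where editing dirs in place prevents descent into the removed entries. The following is a minimal sketch of that rule on a throwaway directory tree; list_pdfs, build_demo_tree, and every file and directory name in it are made up for illustration and are not part of the documented module.

```python
import os
import tempfile


def list_pdfs(root: str) -> list:
    """Return sorted relative paths of PDFs under root, pruning hidden and 'summary' dirs."""
    found = []
    for current, dirs, files in os.walk(root):  # topdown=True is the default
        # Slice assignment mutates the very list os.walk consults next,
        # so the pruned directories are never entered.
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != 'summary']
        for name in files:
            if name.lower().endswith('.pdf'):
                found.append(os.path.relpath(os.path.join(current, name), root))
    return sorted(found)


def build_demo_tree(root: str) -> None:
    """Create a small, made-up layout for the demonstration."""
    for rel in ('reports/a.pdf', 'reports/b.PDF', 'reports/notes.txt',
                '.cache/hidden.pdf', 'summary/overview.pdf',
                'reports/summary/nested.pdf'):
        path = os.path.join(root, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, 'w') as handle:
            handle.write('placeholder')


with tempfile.TemporaryDirectory() as demo_root:
    build_demo_tree(demo_root)
    demo_result = list_pdfs(demo_root)
    print(demo_result)
```

Note that the 'summary' filter applies at every depth, so reports/summary/nested.pdf is skipped along with the top-level summary folder.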
Return Value
Type: Dict[str, List[Dict]]
Returns a dictionary whose keys are the document codes (strings) extracted from filenames and whose values are lists of file-info dictionaries. Each file-info dictionary contains 'filename' (the original filename), 'relative_path' (the path relative to wuxi2_folder), and whatever metadata get_file_info() returns; those fields are not shown in this section. Several files can share one document code, hence the list. The object actually returned is a collections.defaultdict(list), so indexing a code that is absent inserts an empty list instead of raising KeyError; use `in` or .get() for lookups.
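To make the shape of the return value concrete, here is a small sketch that builds a stand-in result by hand, lists the codes that map to more than one file, and shows the defaultdict lookup behaviour. The codes, filenames, and the 'size' field are assumptions for illustration; codes_with_multiple_files is a hypothetical helper, not part of the module.

```python
from collections import defaultdict
from typing import Dict, List

# Hand-built stand-in for the return value. 'size' is an assumed example of
# a field that get_file_info() might contribute.
results = defaultdict(list)
results['DOC-001'].append(
    {'filename': 'DOC-001_v1.pdf', 'relative_path': 'a/DOC-001_v1.pdf', 'size': 1024})
results['DOC-001'].append(
    {'filename': 'DOC-001_v2.pdf', 'relative_path': 'b/DOC-001_v2.pdf', 'size': 2048})
results['DOC-002'].append(
    {'filename': 'DOC-002.pdf', 'relative_path': 'a/DOC-002.pdf', 'size': 512})


def codes_with_multiple_files(mapping: Dict[str, List[Dict]]) -> Dict[str, List[str]]:
    """Return {code: [relative paths]} for codes that map to more than one file."""
    return {code: [info['relative_path'] for info in files]
            for code, files in mapping.items() if len(files) > 1}


duplicates = codes_with_multiple_files(results)

# Indexing a missing key on a defaultdict inserts it. Shown on a copy so the
# stand-in result itself stays unchanged.
probe = defaultdict(list, results)
probe['DOC-999']
grew_by = len(probe) - len(results)

# Converting to a plain dict gives ordinary lookup semantics.
plain = dict(results)
missing = plain.get('DOC-999')
```

Converting with dict(...) right after the scan is one way to avoid accidentally growing the mapping during later lookups.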
Dependencies
os
collections
Required Imports
import os
from collections import defaultdict
from typing import Dict, List
Usage Example
# Assuming helper functions are defined:
# def extract_document_code(filename: str) -> Optional[str]:
#     # Extract code logic
#     pass
#
# def get_file_info(filepath: str) -> Dict:
#     # Return file metadata
#     return {'size': os.path.getsize(filepath)}

from collections import defaultdict
import os
from typing import Dict, List

# Scan the wuxi2 repository
wuxi2_path = '/path/to/wuxi2/repository'
results = scan_wuxi2_folder(wuxi2_path)

# Access documents by code
for code, files in results.items():
    print(f"Code: {code}")
    for file_info in files:
        print(f" File: {file_info['filename']}")
        print(f" Path: {file_info['relative_path']}")

# Get all files for a specific code
if 'DOC-12345' in results:
    doc_files = results['DOC-12345']
    print(f"Found {len(doc_files)} files for DOC-12345")
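The usage example assumes that extract_document_code() and get_file_info() already exist. Their real implementations are not shown in this section, so the following is only a hypothetical pair of stand-ins that lets the example run end to end. The code format (two to five capital letters, a hyphen, three to six digits) and the returned metadata fields are assumptions chosen for illustration.

```python
import os
import re
import tempfile
from typing import Dict, Optional

# ASSUMPTION: the real code format is not documented here. This pattern is
# purely illustrative: 2-5 capital letters, a hyphen, then 3-6 digits.
CODE_PATTERN = re.compile(r'(?<![A-Za-z0-9])([A-Z]{2,5}-\d{3,6})(?![0-9])')


def extract_document_code(filename: str) -> Optional[str]:
    """Hypothetical stand-in: return the first code-like token, or None."""
    match = CODE_PATTERN.search(filename)
    return match.group(1) if match else None


def get_file_info(filepath: str) -> Dict:
    """Hypothetical stand-in: return basic metadata read from the filesystem."""
    stat = os.stat(filepath)
    return {'path': filepath, 'size': stat.st_size, 'modified': stat.st_mtime}


# Try both stand-ins on a temporary file with a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'DOC-12345_report.pdf')
    with open(sample, 'wb') as handle:
        handle.write(b'placeholder')
    sample_code = extract_document_code(os.path.basename(sample))
    sample_info = get_file_info(sample)
```

Because scan_wuxi2_folder adds 'filename' and 'relative_path' to the dictionary itself, a stand-in get_file_info() only needs to return a mutable dict.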
Best Practices
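- Verify that wuxi2_folder exists and is a directory before calling this function. os.walk() ignores directory-listing errors by default, so a missing or mistyped path raises no exception; the scan simply reports zero files and returns an empty mapping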
- The function depends on extract_document_code() and get_file_info() helper functions - ensure these are properly implemented
- For large repositories, be aware that the function prints a progress line after every 50th coded (matched) document, not every 50 files scanned, which can still produce significant console output
- The function uses defaultdict(list) to automatically handle new document codes without key errors
- Hidden directories (starting with '.') and 'summary' folders are automatically excluded from scanning
- Only PDF files (case-insensitive .pdf extension) are processed; other file types are ignored
- The function modifies the dirs list in-place during os.walk to control directory traversal
- Consider the memory implications when scanning very large repositories as all results are held in memory
- The relative_path field is relative to wuxi2_folder, so os.path.join(wuxi2_folder, file_info['relative_path']) reconstructs the full path when needed
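Several of the points above (checking the path first, taming console output, rebuilding full paths) can be combined in a thin wrapper. This is a sketch under stated assumptions: scan_quietly, full_path, and fake_scanner are hypothetical names, and fake_scanner merely stands in for scan_wuxi2_folder so the snippet runs on its own.

```python
import contextlib
import io
import os
import tempfile
from typing import Callable, Dict, List, Tuple


def scan_quietly(scanner: Callable[[str], Dict[str, List[Dict]]],
                 folder: str) -> Tuple[Dict[str, List[Dict]], str]:
    """Validate folder, run scanner with stdout captured, return (plain dict, log)."""
    if not os.path.isdir(folder):
        raise NotADirectoryError(f"Not a directory: {folder}")
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = scanner(folder)
    # dict(...) drops defaultdict behaviour so later lookups cannot add keys.
    return dict(result), buffer.getvalue()


def full_path(folder: str, file_info: Dict) -> str:
    """Rebuild the full path from the scanned root and 'relative_path'."""
    return os.path.join(folder, file_info['relative_path'])


def fake_scanner(folder: str) -> Dict[str, List[Dict]]:
    """Stand-in for scan_wuxi2_folder; returns one made-up entry."""
    print(f"Scanning wuxi2 repository: {folder}")
    return {'DOC-12345': [{'filename': 'DOC-12345.pdf',
                           'relative_path': os.path.join('reports', 'DOC-12345.pdf')}]}


with tempfile.TemporaryDirectory() as demo_root:
    demo_results, demo_log = scan_quietly(fake_scanner, demo_root)
    demo_full = full_path(demo_root, demo_results['DOC-12345'][0])
    demo_expected = os.path.join(demo_root, 'reports', 'DOC-12345.pdf')
```

In real use, scan_wuxi2_folder would be passed in place of fake_scanner; the captured log can then be written to a file or discarded.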
Similar Components
AI-powered semantic similarity - components with related functionality:
- function scan_wuxi2_folder_v1 91.7% similar
- function find_best_folder 70.3% similar
- function scan_output_folder 64.6% similar
- function compare_documents 62.8% similar
- function scan_output_folder_v1 62.2% similar