🔍 Code Extractor

function scan_wuxi2_folder

Maturity: 58

Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.

File: /tf/active/vicechatdev/mailsearch/compare_documents.py
Lines: 119 - 162
Complexity: moderate

Purpose

This function indexes the PDF documents in a wuxi2 repository by the document codes embedded in their filenames. It walks the directory tree, skips hidden directories (names starting with '.') and directories named 'summary', and passes each PDF filename to extract_document_code(). For every file that yields a code, it collects metadata from get_file_info(), adds the filename and the path relative to the repository root, and appends the record to the list for that code. It prints a progress line after every 50 coded documents and a summary of the totals when the scan finishes.

Source Code

def scan_wuxi2_folder(wuxi2_folder: str) -> Dict[str, List[Dict]]:
    """
    Recursively scan wuxi2 folder for documents with codes
    
    Args:
        wuxi2_folder: Path to wuxi2 repository
        
    Returns:
        Dictionary mapping document codes to list of matching files
    """
    print(f"\nScanning wuxi2 repository: {wuxi2_folder}")
    coded_documents = defaultdict(list)
    total_files = 0
    matched_files = 0
    
    for root, dirs, files in os.walk(wuxi2_folder):
        # Skip hidden directories and summary folder
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != 'summary']
        
        for filename in files:
            # Only process PDF files
            if not filename.lower().endswith('.pdf'):
                continue
            
            total_files += 1
            filepath = os.path.join(root, filename)
            
            # Extract document code
            code = extract_document_code(filename)
            if code:
                file_info = get_file_info(filepath)
                file_info['filename'] = filename
                file_info['relative_path'] = os.path.relpath(filepath, wuxi2_folder)
                coded_documents[code].append(file_info)
                matched_files += 1
                
                if matched_files % 50 == 0:
                    print(f"  Processed {matched_files} coded documents...")
    
    print(f"\nTotal PDF files scanned: {total_files}")
    print(f"Total coded documents found: {matched_files}")
    print(f"Unique document codes: {len(coded_documents)}")
    
    return coded_documents

Parameters

Name          Type  Default  Kind
wuxi2_folder  str   -        positional_or_keyword

Parameter Details

wuxi2_folder: String path to the root directory of the wuxi2 repository to scan. Should be an absolute or relative path to a valid directory containing PDF documents. The function will recursively traverse all subdirectories except hidden ones (starting with '.') and 'summary' folders.
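The traversal rule described above can be tried in isolation. The following is a minimal sketch, not the function itself: list_pdfs and the throwaway directory tree are hypothetical names introduced only for illustration, and the filter expression is copied from the source code shown earlier.

```python
import os
import tempfile

def list_pdfs(root_folder):
    """Hypothetical helper: relative paths of PDFs, pruning hidden and 'summary' dirs."""
    found = []
    for root, dirs, files in os.walk(root_folder):
        # Slice assignment edits the list os.walk holds, so pruned subtrees are never entered
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != 'summary']
        for name in files:
            # Case-insensitive extension check, as in scan_wuxi2_folder
            if name.lower().endswith('.pdf'):
                found.append(os.path.relpath(os.path.join(root, name), root_folder))
    return sorted(found)

# Build a throwaway tree to see which files are visited
with tempfile.TemporaryDirectory() as tmp:
    for rel in ('a/doc1.pdf', 'a/.cache/doc2.pdf', 'summary/doc3.pdf',
                'a/notes.txt', 'b/DOC4.PDF'):
        path = os.path.join(tmp, *rel.split('/'))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        open(path, 'w').close()
    visited = list_pdfs(tmp)
    print(visited)
```

Only a/doc1.pdf and b/DOC4.PDF are reported. Note that the filter applies to directory names only: a PDF whose own filename starts with '.' would still be processed, and the 'summary' comparison is an exact, case-sensitive match.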

Return Value

Type: Dict[str, List[Dict]]

Returns a dictionary whose keys are the document codes (strings) extracted from filenames and whose values are lists of file-information dictionaries. Each file-information dictionary contains 'filename' (the original filename), 'relative_path' (the path relative to wuxi2_folder), and whatever metadata get_file_info() returns (likely file size, modification time, hash, etc.). Several files can share one document code, which is why each value is a list. The object actually returned is a collections.defaultdict(list), so indexing it with a code that was not found inserts an empty list for that code; use 'in' or .get() for lookups.
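As a rough illustration of working with this structure, the sketch below builds a stand-in for the returned mapping by hand. The codes, filenames, and paths are invented for the example; only the 'filename' and 'relative_path' keys come from the source.

```python
from collections import defaultdict

# Stand-in for the value returned by scan_wuxi2_folder(); all entries are invented
results = defaultdict(list)
results['DOC-001'].append({'filename': 'DOC-001_v1.pdf', 'relative_path': 'a/DOC-001_v1.pdf'})
results['DOC-001'].append({'filename': 'DOC-001_v2.pdf', 'relative_path': 'b/DOC-001_v2.pdf'})
results['DOC-002'].append({'filename': 'DOC-002.pdf', 'relative_path': 'a/DOC-002.pdf'})

# Codes that map to more than one file
multi = {code: len(files) for code, files in results.items() if len(files) > 1}
print(multi)

# .get() looks up a code without inserting it; results['DOC-999'] would add an empty list
missing = results.get('DOC-999', [])
assert 'DOC-999' not in results

# Convert to a plain dict when the auto-inserting behaviour is not wanted downstream
plain = dict(results)
```

Converting to a plain dict before handing the result to other code is one way to make accidental lookups of unknown codes fail loudly instead of silently growing the mapping.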

Dependencies

  • os
  • collections
  • typing

Required Imports

import os
from collections import defaultdict
from typing import Dict, List

Usage Example

import os
from collections import defaultdict
from typing import Dict, List, Optional

# Assumes scan_wuxi2_folder and its two helper functions are defined or imported:
# def extract_document_code(filename: str) -> Optional[str]:
#     ...  # return the code found in the filename, or None
#
# def get_file_info(filepath: str) -> Dict:
#     ...  # return file metadata, e.g. {'size': os.path.getsize(filepath)}

# Scan the wuxi2 repository
wuxi2_path = '/path/to/wuxi2/repository'
results = scan_wuxi2_folder(wuxi2_path)

# Access documents by code
for code, files in results.items():
    print(f"Code: {code}")
    for file_info in files:
        print(f"  File: {file_info['filename']}")
        print(f"  Path: {file_info['relative_path']}")

# Get all files for a specific code
if 'DOC-12345' in results:
    doc_files = results['DOC-12345']
    print(f"Found {len(doc_files)} files for DOC-12345")
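
To run the example above without the real module, the two helpers can be replaced by stand-ins. The sketch below is hypothetical throughout: the actual extract_document_code() and get_file_info() are not shown on this page, the 'DOC-12345' style pattern is an assumption taken from the example code above, and the metadata keys other than those named in the source are invented.

```python
import os
import re
import tempfile
from typing import Dict, Optional

# ASSUMPTION: codes look like "DOC-12345"; the real pattern is not documented here
_CODE_PATTERN = re.compile(r'[A-Z]+-\d+')

def extract_document_code(filename: str) -> Optional[str]:
    """Hypothetical stand-in: return the first code-like token, or None."""
    match = _CODE_PATTERN.search(filename)
    return match.group(0) if match else None

def get_file_info(filepath: str) -> Dict:
    """Hypothetical stand-in: return basic metadata read from the filesystem."""
    stat = os.stat(filepath)
    return {'filepath': filepath, 'size': stat.st_size, 'modified': stat.st_mtime}

# Try both helpers on a small temporary file
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, 'DOC-12345_report.pdf')
    with open(sample, 'wb') as fh:
        fh.write(b'%PDF')
    code = extract_document_code(os.path.basename(sample))
    info = get_file_info(sample)
    print(code, info['size'])
```

Because scan_wuxi2_folder only tests the truthiness of the extracted code, any stand-in that returns None (or an empty string) for uncoded filenames will cause those files to be counted but not indexed.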

Best Practices

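  • A missing or unreadable wuxi2_folder does not raise an error: os.walk ignores listing errors by default, so the function reports zero files and returns an empty mapping; check the path with os.path.isdir() first if that case should be treated as a failure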
  • The function depends on extract_document_code() and get_file_info() helper functions - ensure these are properly implemented
  • For large repositories, note that the function prints a progress line after every 50 coded (matched) documents, not every 50 files scanned, in addition to a start line and a three-line summary
  • The function uses defaultdict(list) to automatically handle new document codes without key errors
  • Hidden directories (starting with '.') and 'summary' folders are automatically excluded from scanning
  • Only PDF files (case-insensitive .pdf extension) are processed; other file types are ignored
  • The function modifies the dirs list in-place during os.walk to control directory traversal
  • Consider the memory implications when scanning very large repositories as all results are held in memory
  • Each relative_path is relative to the wuxi2_folder argument, so os.path.join(wuxi2_folder, file_info['relative_path']) reconstructs a usable path
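
Two of the points above, validating the folder up front and rebuilding full paths, can be sketched as follows. checked_root and full_path are hypothetical helper names that are not part of compare_documents.py; the sketch only assumes the 'relative_path' key described in the Return Value section.

```python
import os
import tempfile

def checked_root(folder: str) -> str:
    """Hypothetical guard: return the absolute path, or raise if it is not a directory."""
    if not os.path.isdir(folder):
        raise NotADirectoryError(f"Not a directory: {folder}")
    return os.path.abspath(folder)

def full_path(root_folder: str, file_info: dict) -> str:
    """Hypothetical helper: rebuild a path from the scan root and a file-info record."""
    return os.path.join(root_folder, file_info['relative_path'])

with tempfile.TemporaryDirectory() as tmp:
    missing = os.path.join(tmp, 'missing')

    # os.walk on a non-existent folder yields nothing instead of raising
    walked = list(os.walk(missing))

    # The guard turns that silent case into an explicit error
    try:
        checked_root(missing)
        rejected = False
    except NotADirectoryError:
        rejected = True

    root = checked_root(tmp)
    rebuilt = full_path(root, {'relative_path': os.path.join('a', 'x.pdf')})
    print(walked, rejected, rebuilt)
```

Passing an absolute root to the scan, as checked_root returns, keeps the rebuilt paths valid even if the working directory changes later.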

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function scan_wuxi2_folder_v1 91.7% similar

    Recursively scans a directory for PDF files, extracts document codes from filenames, and returns a dictionary mapping each unique document code to a list of file metadata dictionaries.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function find_best_folder 70.3% similar

    Finds the best matching folder in a directory tree by comparing hierarchical document codes with folder names containing numeric codes.

    From: /tf/active/vicechatdev/mailsearch/copy_signed_documents.py
  • function scan_output_folder 64.6% similar

    Scans a specified output folder for PDF files containing document codes, extracts those codes, and returns a dictionary mapping each code to its associated file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function compare_documents 62.8% similar

    Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function scan_output_folder_v1 62.2% similar

    Scans a specified folder for PDF documents with embedded codes in their filenames, extracting metadata and signature information for each coded document found.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py