scan_wuxi2_folder - Code Extractor

function scan_wuxi2_folder

Maturity: 58

Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.

File:
/tf/active/vicechatdev/mailsearch/compare_documents.py

Lines:
119 - 162

Complexity:
moderate

Purpose

This function indexes the PDF documents in a wuxi2 repository by the document codes embedded in their filenames. It walks the directory tree, skips hidden directories and any directory named 'summary', and passes each PDF filename to extract_document_code(). For every file that yields a code, it collects metadata from get_file_info(), adds the filename and the path relative to the repository root, and appends the record under that code. It prints a progress line after every 50 coded documents, prints summary counts at the end, and returns the mapping of document codes to file records.

Source Code

def scan_wuxi2_folder(wuxi2_folder: str) -> Dict[str, List[Dict]]:
    """
    Recursively scan wuxi2 folder for documents with codes
    
    Args:
        wuxi2_folder: Path to wuxi2 repository
        
    Returns:
        Dictionary mapping document codes to list of matching files
    """
    print(f"\nScanning wuxi2 repository: {wuxi2_folder}")
    coded_documents = defaultdict(list)
    total_files = 0
    matched_files = 0
    
    for root, dirs, files in os.walk(wuxi2_folder):
        # Skip hidden directories and summary folder
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != 'summary']
        
        for filename in files:
            # Only process PDF files
            if not filename.lower().endswith('.pdf'):
                continue
            
            total_files += 1
            filepath = os.path.join(root, filename)
            
            # Extract document code
            code = extract_document_code(filename)
            if code:
                file_info = get_file_info(filepath)
                file_info['filename'] = filename
                file_info['relative_path'] = os.path.relpath(filepath, wuxi2_folder)
                coded_documents[code].append(file_info)
                matched_files += 1
                
                if matched_files % 50 == 0:
                    print(f"  Processed {matched_files} coded documents...")
    
    print(f"\nTotal PDF files scanned: {total_files}")
    print(f"Total coded documents found: {matched_files}")
    print(f"Unique document codes: {len(coded_documents)}")
    
    return coded_documents

Parameters

Name	Type	Default	Kind
`wuxi2_folder`	str	-	positional_or_keyword

Parameter Details

wuxi2_folder: String path to the root directory of the wuxi2 repository to scan. Should be an absolute or relative path to a valid directory containing PDF documents. The function will recursively traverse all subdirectories except hidden ones (starting with '.') and 'summary' folders.
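The traversal rule described above can be tried in isolation. The following is a minimal sketch, not the original function: `list_pdfs` and `build_demo_tree` are hypothetical names, and the file names in the demo tree are made up.

```python
import os
import tempfile


def list_pdfs(root_folder: str) -> list:
    """Return PDF paths relative to root_folder, skipping hidden and 'summary' dirs."""
    found = []
    for root, dirs, files in os.walk(root_folder):
        # Slice assignment mutates the list os.walk holds, so pruned
        # directories are never descended into (top-down walk only).
        dirs[:] = [d for d in dirs if not d.startswith('.') and d != 'summary']
        for filename in files:
            if filename.lower().endswith('.pdf'):
                full = os.path.join(root, filename)
                found.append(os.path.relpath(full, root_folder))
    return sorted(found)


def build_demo_tree(base: str) -> None:
    """Create a small throwaway tree (hypothetical file names)."""
    for rel in ("a/one.pdf", "a/notes.txt", ".hidden/two.pdf",
                "summary/three.pdf", "b/c/FOUR.PDF"):
        path = os.path.join(base, *rel.split("/"))
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as handle:
            handle.write(b"%PDF-1.4\n")


with tempfile.TemporaryDirectory() as tmp:
    build_demo_tree(tmp)
    # Normalise separators so the printed result looks the same on any OS.
    demo_result = [p.replace(os.sep, "/") for p in list_pdfs(tmp)]
    print(demo_result)
```

Note that the filter applies to directory names only: a hidden file such as '.draft.pdf' inside a visible directory would still be processed, and the 'summary' comparison is an exact, case-sensitive match.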

Return Value

Type: Dict[str, List[Dict]]

Returns a dictionary whose keys are document codes (strings) extracted from filenames and whose values are lists of file-information dictionaries. Each dictionary holds 'filename' (the original filename), 'relative_path' (the path relative to wuxi2_folder), and whatever keys get_file_info() returns; that helper is not shown here, so fields such as file size, modification time, or hash are likely but unconfirmed. Several files can share one document code, hence the list. The object returned is the defaultdict(list) built during the scan, not a plain dict.
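To make the shape of the return value concrete, here is a hand-written sample and a small query over it. This is only an illustration: the codes, paths, and the 'size' key are made up, and `codes_with_multiple_files` is a hypothetical helper, not part of compare_documents.py.

```python
from typing import Dict, List

# Hand-written sample in the shape scan_wuxi2_folder returns.
# 'size' is an assumed get_file_info() field; codes and paths are invented.
sample: Dict[str, List[Dict]] = {
    "DOC-001": [
        {"filename": "DOC-001 report.pdf",
         "relative_path": "a/DOC-001 report.pdf", "size": 1200},
        {"filename": "DOC-001 report v2.pdf",
         "relative_path": "b/DOC-001 report v2.pdf", "size": 1350},
    ],
    "DOC-002": [
        {"filename": "DOC-002.pdf",
         "relative_path": "a/DOC-002.pdf", "size": 800},
    ],
}


def codes_with_multiple_files(coded_documents: Dict[str, List[Dict]]) -> Dict[str, List[str]]:
    """Map each code that has more than one file to those files' relative paths."""
    return {
        code: [info["relative_path"] for info in files]
        for code, files in coded_documents.items()
        if len(files) > 1
    }


duplicates = codes_with_multiple_files(sample)
print(duplicates)
```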

Dependencies

os
collections
typing

Required Imports

import os
from collections import defaultdict
from typing import Dict, List

Usage Example

# Assuming helper functions are defined:
# def extract_document_code(filename: str) -> Optional[str]:
#     # Extract code logic
#     pass
# 
# def get_file_info(filepath: str) -> Dict:
#     # Return file metadata
#     return {'size': os.path.getsize(filepath)}

from collections import defaultdict
import os
from typing import Dict, List

# Scan the wuxi2 repository
wuxi2_path = '/path/to/wuxi2/repository'
results = scan_wuxi2_folder(wuxi2_path)

# Access documents by code
for code, files in results.items():
    print(f"Code: {code}")
    for file_info in files:
        print(f"  File: {file_info['filename']}")
        print(f"  Path: {file_info['relative_path']}")

# Get all files for a specific code
if 'DOC-12345' in results:
    doc_files = results['DOC-12345']
    print(f"Found {len(doc_files)} files for DOC-12345")
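The example above assumes two helpers that this page does not show. A hypothetical pair might look like the sketch below; the `DOC-\d+` pattern is modelled only on the 'DOC-12345' key used in the example, and the returned fields (size, modification time, SHA-256) are guesses, so neither should be read as the real implementation.

```python
import hashlib
import os
import re
import tempfile
from typing import Dict, Optional

# Hypothetical pattern modelled on the 'DOC-12345' key in the usage example;
# the real code format used by the repository is not documented here.
_CODE_PATTERN = re.compile(r"DOC-\d+")


def extract_document_code(filename: str) -> Optional[str]:
    """Return the first code-like token in the filename, or None."""
    match = _CODE_PATTERN.search(filename)
    return match.group(0) if match else None


def get_file_info(filepath: str) -> Dict:
    """Return assumed metadata: size in bytes, mtime, and a SHA-256 digest."""
    stat = os.stat(filepath)
    with open(filepath, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return {"size": stat.st_size, "modified": stat.st_mtime, "sha256": digest}


with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "DOC-12345 final.pdf")
    with open(demo_path, "wb") as handle:
        handle.write(b"hello")
    demo_code = extract_document_code(os.path.basename(demo_path))
    demo_info = get_file_info(demo_path)

print(demo_code, demo_info["size"])
```

Reading the whole file to hash it is fine for a sketch; for large PDFs a chunked read would keep memory use flat.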

Best Practices

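Check that the wuxi2_folder path exists before calling this function: os.walk ignores directory-listing errors by default, so a missing or unreadable path raises no exception and the function simply returns an empty mapping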
The function depends on extract_document_code() and get_file_info() helper functions - ensure these are properly implemented
The function prints a progress line after every 50 coded documents (not every 50 files scanned), plus three summary lines at the end; account for this console output when calling it from other code
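The function uses defaultdict(list), so new document codes need no key check; because the returned object is still a defaultdict, indexing a missing code inserts an empty list, so test membership with `in` or use .get() when looking up codes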
Hidden directories (starting with '.') and 'summary' folders are automatically excluded from scanning
Only PDF files (case-insensitive .pdf extension) are processed; other file types are ignored
The function modifies the dirs list in-place during os.walk to control directory traversal
Consider the memory implications when scanning very large repositories as all results are held in memory
Join wuxi2_folder with the relative_path field (os.path.join) to rebuild a file's full path when needed
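Several of these points can be handled in a thin wrapper around the scan. The sketch below is one possible approach, not part of the original module: `scan_checked`, `full_path`, and `fake_scan` are hypothetical names, and `fake_scan` stands in for scan_wuxi2_folder with made-up data so the block runs on its own.

```python
import os
import tempfile
from collections import defaultdict
from typing import Callable, Dict, List

ScanResult = Dict[str, List[Dict]]


def scan_checked(folder: str, scan: Callable[[str], ScanResult]) -> ScanResult:
    """Fail loudly on a bad path, then return the scan result as a plain dict."""
    if not os.path.isdir(folder):
        # os.walk would swallow this and the scan would just come back empty.
        raise NotADirectoryError(f"Not a directory: {folder}")
    # dict() drops the default factory, so a missing code raises KeyError
    # instead of silently inserting an empty list.
    return dict(scan(folder))


def full_path(folder: str, file_info: Dict) -> str:
    """Rebuild a file's full path from its relative_path field."""
    return os.path.join(folder, file_info["relative_path"])


def fake_scan(folder: str) -> ScanResult:
    """Stand-in for scan_wuxi2_folder, returning made-up data."""
    result = defaultdict(list)
    result["DOC-001"].append({
        "filename": "DOC-001.pdf",
        "relative_path": os.path.join("a", "DOC-001.pdf"),
    })
    return result


with tempfile.TemporaryDirectory() as tmp:
    checked = scan_checked(tmp, fake_scan)
    rebuilt = full_path(tmp, checked["DOC-001"][0])
    expected = os.path.join(tmp, "a", "DOC-001.pdf")

raw = fake_scan("unused")
_ = raw["MISSING"]  # defaultdict: this lookup inserts an empty list
print(len(raw), len(checked))
```

Passing the scan function as an argument keeps the wrapper testable without the real repository; in practice it would be called as scan_checked(path, scan_wuxi2_folder).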

Similar Components

AI-powered semantic similarity - components with related functionality:

function scan_wuxi2_folder_v1 91.7% similar

Recursively scans a directory for PDF files, extracts document codes from filenames, and returns a dictionary mapping each unique document code to a list of file metadata dictionaries.
From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py

function find_best_folder 70.3% similar

Finds the best matching folder in a directory tree by comparing hierarchical document codes with folder names containing numeric codes.
From: /tf/active/vicechatdev/mailsearch/copy_signed_documents.py

function scan_output_folder 64.6% similar

Scans a specified output folder for PDF files containing document codes, extracts those codes, and returns a dictionary mapping each code to its associated file information.
From: /tf/active/vicechatdev/mailsearch/compare_documents.py

function compare_documents 62.8% similar

Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.
From: /tf/active/vicechatdev/mailsearch/compare_documents.py

function scan_output_folder_v1 62.2% similar

Scans a specified folder for PDF documents with embedded codes in their filenames, extracting metadata and signature information for each coded document found.
From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
