function find_best_folder

Maturity: 45

Finds the best matching folder in a directory tree by comparing hierarchical document codes with folder names containing numeric codes.

File: /tf/active/vicechatdev/mailsearch/copy_signed_documents.py
Lines: 37 - 83
Complexity: moderate

Purpose

This function walks a directory tree (wuxi2_root) to find the most appropriate folder for a document based on its hierarchical code (e.g., '1.2.3'). It extracts numeric codes from folder names and compares them part by part with the document code. Only folder codes with fewer parts than the document code are considered; among those, the function prefers the code that shares the most leading parts with the document code and, on a tie, the longer folder code. This is useful for organizing documents in a hierarchical filing system where folders represent higher-level categories and documents belong in the most specific matching folder.
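
To make the folder-side matching concrete, the short sketch below applies the same regular expression used in the source to a few folder names. The folder names and the helper name folder_codes are made up for illustration and are not taken from the real wuxi2 tree.

```python
import re

# Same pattern as in find_best_folder: a run of digits, optionally followed by
# more dot-separated digit runs.
CODE_PATTERN = r'\d+(?:\.\d+)*'

def folder_codes(folder_name):
    """Return every numeric code found in a folder name, split into parts."""
    return [code.split('.') for code in re.findall(CODE_PATTERN, folder_name)]

# Hypothetical folder names, for illustration only
print(folder_codes('1.2 Contracts'))        # [['1', '2']]
print(folder_codes('Section 3 - Reports'))  # [['3']]
print(folder_codes('1.2 Contracts 2023'))   # [['1', '2'], ['2023']]
print(folder_codes('Unsorted'))             # []
```

The third case shows that any run of digits is treated as a code, so a year in a folder name becomes a one-part candidate alongside the intended code.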

Source Code

def find_best_folder(doc_code, wuxi2_root=WUXI2_ROOT):
    """Find the best matching folder in wuxi2 based on code structure"""
    code_parts = extract_code_parts(doc_code)
    doc_code_length = len(code_parts)
    
    best_match = None
    best_score = 0
    best_folder_length = 0
    
    for root, dirs, files in os.walk(wuxi2_root):
        rel_path = os.path.relpath(root, wuxi2_root)
        
        if rel_path == '.':
            continue
        
        path_parts = rel_path.split(os.sep)
        
        # Look for folders with document codes in their names
        for folder in path_parts:
            folder_codes = re.findall(r'\d+(?:\.\d+)*', folder)
            
            for folder_code in folder_codes:
                folder_parts = folder_code.split('.')
                folder_code_length = len(folder_parts)
                
                # Skip folders with coding same length or longer than the document
                # When codes are equal length, document goes next to folder (in parent), not inside
                if folder_code_length >= doc_code_length:
                    continue
                
                # Calculate match score (how many leading parts match)
                score = 0
                for cp, fp in zip(code_parts, folder_parts):
                    if cp == fp:
                        score += 1
                    else:
                        break
                
                # Update best match if this is better
                # Priority: 1) Higher score, 2) Longer folder code (closer to doc length)
                if (score > best_score or 
                    (score == best_score and folder_code_length > best_folder_length)):
                    best_score = score
                    best_folder_length = folder_code_length
                    best_match = root
    
    return best_match, best_score

Parameters

Name        Type  Default     Kind
doc_code    -     -           positional_or_keyword
wuxi2_root  -     WUXI2_ROOT  positional_or_keyword

Parameter Details

doc_code: A string representing the document's hierarchical code (e.g., '1.2.3.4'). The code should contain numeric parts separated by dots, representing a hierarchical classification system. It is split into parts by extract_code_parts, and those parts are compared with the codes found in folder names.

wuxi2_root: A string path to the root directory where the search should begin. Defaults to WUXI2_ROOT constant. This directory is traversed recursively to find folders with matching codes. The path should be absolute or relative to the current working directory.

Return Value

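Returns a tuple (best_match, best_score). 'best_match' is a string containing the path (as yielded by os.walk, so prefixed with wuxi2_root) of the selected folder, or None if no folder in the tree carries a numeric code with fewer parts than the document code. 'best_score' is an integer: the number of leading code parts shared by the document code and the selected folder code (e.g., if doc_code is '1.2.3.4' and the folder has '1.2', the score is 2). Higher scores indicate better matches. Note that a folder can be returned with a score of 0: any folder code shorter than the document code beats the initial state, even when no leading part matches.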
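
The score can be illustrated in isolation. The sketch below mirrors the inner loop of the source as a standalone helper; the name prefix_score is hypothetical, and it assumes that extract_code_parts returns a list of strings, which the source does not show.

```python
def prefix_score(code_parts, folder_parts):
    """Count how many leading parts of the two codes are equal."""
    score = 0
    for cp, fp in zip(code_parts, folder_parts):
        if cp != fp:
            break  # stop at the first mismatch; later parts are ignored
        score += 1
    return score

print(prefix_score(['1', '2', '3', '4'], ['1', '2']))  # 2
print(prefix_score(['1', '2', '3', '4'], ['1', '5']))  # 1
print(prefix_score(['1', '2', '3', '4'], ['2', '1']))  # 0
```

Parts are compared as strings here, so under this assumption '01' and '1' would not count as a match.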

Dependencies

  • os
  • re

Required Imports

import os
import re

Usage Example

import os
import re

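# Stand-ins for the module-level names the function depends on.
# WUXI2_ROOT must exist before find_best_folder is defined (default values are
# evaluated at definition time); extract_code_parts must exist before the first call.
WUXI2_ROOT = '/path/to/wuxi2'

def extract_code_parts(code):
    """Placeholder helper; the real implementation lives in the source module"""
    return code.split('.')

# Use the function (assumes find_best_folder is defined or imported at this point)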
doc_code = '1.2.3.4'
best_folder, score = find_best_folder(doc_code)

if best_folder:
    print(f"Best folder: {best_folder}")
    print(f"Match score: {score}")
else:
    print("No matching folder found")

# With custom root
custom_root = '/custom/path'
best_folder, score = find_best_folder('2.1.5', wuxi2_root=custom_root)
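
For a version that runs without the real wuxi2 tree, the sketch below builds a small temporary hierarchy and queries it. It is a condensed restatement of the source above rather than the module itself: extract_code_parts is assumed to be a plain split on dots, the default root is dropped, and the folder names and build_demo_tree are invented for the demo.

```python
import os
import re
import tempfile

def extract_code_parts(code):
    # Assumption: the real helper is not shown in the source; a plain split stands in.
    return code.split('.')

def find_best_folder(doc_code, wuxi2_root):
    """Condensed restatement of the documented function (no default root)."""
    code_parts = extract_code_parts(doc_code)
    best_match, best_score, best_len = None, 0, 0
    for root, _dirs, _files in os.walk(wuxi2_root):
        rel_path = os.path.relpath(root, wuxi2_root)
        if rel_path == '.':
            continue
        for folder in rel_path.split(os.sep):
            for folder_code in re.findall(r'\d+(?:\.\d+)*', folder):
                folder_parts = folder_code.split('.')
                # Only folder codes shorter than the document code qualify
                if len(folder_parts) >= len(code_parts):
                    continue
                score = 0
                for cp, fp in zip(code_parts, folder_parts):
                    if cp != fp:
                        break
                    score += 1
                # Higher score wins; on a tie, the longer folder code wins
                if score > best_score or (score == best_score and len(folder_parts) > best_len):
                    best_match, best_score, best_len = root, score, len(folder_parts)
    return best_match, best_score

def build_demo_tree(base):
    """Create a small, made-up folder hierarchy under base."""
    for rel in ('1 General', os.path.join('1.2 Contracts', '1.2.3 Signed'), '9 Archive'):
        os.makedirs(os.path.join(base, rel))

if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as base:
        build_demo_tree(base)
        for doc_code in ('1.2.3.4', '1.2.3', '7.1', '5'):
            folder, score = find_best_folder(doc_code, base)
            shown = os.path.relpath(folder, base) if folder else None
            print(doc_code, '->', shown, 'score', score)
```

On this tree, '1.2.3.4' resolves to the '1.2.3 Signed' folder with a score of 3, and '1.2.3' resolves to '1.2 Contracts' with a score of 2, because the equal-length '1.2.3' folder is skipped. '7.1' still returns a folder (whichever single-part folder os.walk reaches first) with a score of 0, and '5' returns None because no folder code is shorter than a one-part document code.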

Best Practices

  • Ensure the extract_code_parts function is defined before calling find_best_folder, as it's a required dependency
  • The WUXI2_ROOT constant must be defined at module level before find_best_folder itself is defined, because Python evaluates default parameter values at definition time
  • Document codes should follow a consistent hierarchical format (e.g., '1.2.3') for proper matching
  • The function skips folders with codes equal to or longer than the document code, as documents should be placed alongside (not inside) folders of equal hierarchy
  • For large directory trees, this function may be slow as it performs a full recursive walk; consider caching results if calling repeatedly
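  • Folder names can contain multiple numeric codes; the function checks all of them. Any run of digits counts as a code, so a year or version number in a folder name is matched as well
  • The function returns None as best_match only when no folder carries a code shorter than the document code; a folder can still be returned with a score of 0, so check for None and for a minimum score before using the result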
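
The caching and result-checking advice above could be sketched as follows. Both helpers are hypothetical and not part of the source module: make_cached_finder memoizes results per document code with functools.lru_cache, and usable_folder applies an assumed minimum score of 1. The stand-in finder only imitates the (folder, score) return shape so the example runs on its own.

```python
from functools import lru_cache

def make_cached_finder(finder, root_dir):
    """Wrap a (doc_code, root) -> (folder, score) function with a per-code cache."""
    @lru_cache(maxsize=None)
    def cached(doc_code):
        return finder(doc_code, root_dir)
    return cached

def usable_folder(result, min_score=1):
    """Return the folder only when one was found and enough leading parts matched."""
    folder, score = result
    return folder if folder is not None and score >= min_score else None

# Stand-in finder for illustration; in practice find_best_folder would be passed in.
def fake_finder(doc_code, root_dir):
    print('walking', root_dir, 'for', doc_code)
    return (root_dir + '/1 General', 1) if doc_code.startswith('1.') else (None, 0)

cached_find = make_cached_finder(fake_finder, '/path/to/wuxi2')
cached_find('1.2')                        # performs the lookup
cached_find('1.2')                        # served from the cache, no second walk
print(usable_folder(cached_find('1.2')))  # /path/to/wuxi2/1 General
print(usable_folder(cached_find('7.1')))  # None
```

The cache only helps when the same code is looked up more than once, and it goes stale if folders are added or renamed; calling cached_find.cache_clear() resets it.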

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function scan_wuxi2_folder 70.3% similar

    Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function scan_wuxi2_folder_v1 67.1% similar

    Recursively scans a directory for PDF files, extracts document codes from filenames, and returns a dictionary mapping each unique document code to a list of file metadata dictionaries.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function find_best_match 60.9% similar

    Finds the best matching document from a list of candidates by comparing hash, size, filename, and content similarity with configurable confidence thresholds.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function main_v57 59.6% similar

    Main execution function that orchestrates a document comparison workflow between two directories (mailsearch/output and wuxi2 repository), scanning for coded documents, comparing them, and generating results.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function main_v102 58.2% similar

    Main entry point function that orchestrates a document comparison workflow between two folders (mailsearch/output and wuxi2 repository), detecting signatures and generating comparison results.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py