function scan_output_folder

Maturity: 56

Scans a specified output folder for PDF files containing document codes, extracts those codes, and returns a dictionary mapping each code to its associated file information.

File: /tf/active/vicechatdev/mailsearch/compare_documents.py
Lines: 87 - 116
Complexity: simple

Purpose

This function inventories the PDF documents in an output directory whose filenames embed a document code. It skips directories and non-PDF files, extracts the code from each filename with the extract_document_code() helper, retrieves file metadata with the get_file_info() helper, and builds a dictionary keyed by document code. PDF files whose names yield no code are ignored. This is useful for document management systems, batch processing workflows, or audit trails where documents need to be tracked by unique identifiers.

Source Code

def scan_output_folder(output_folder: str) -> Dict[str, Dict]:
    """
    Scan output folder for documents with codes
    
    Args:
        output_folder: Path to output folder
        
    Returns:
        Dictionary mapping document codes to file info
    """
    print(f"\nScanning output folder: {output_folder}")
    coded_documents = {}
    
    for filename in os.listdir(output_folder):
        filepath = os.path.join(output_folder, filename)
        
        # Skip directories and non-PDF files
        if not os.path.isfile(filepath) or not filename.lower().endswith('.pdf'):
            continue
        
        # Extract document code
        code = extract_document_code(filename)
        if code:
            file_info = get_file_info(filepath)
            file_info['filename'] = filename
            coded_documents[code] = file_info
            print(f"  Found: {code} - {filename}")
    
    print(f"\nTotal coded documents in output: {len(coded_documents)}")
    return coded_documents

Parameters

Name           Type  Default  Kind
output_folder  str   -        positional_or_keyword

Parameter Details

output_folder: Path to the directory to scan, given as a string; it may be absolute or relative. The folder must exist and be readable, because the function passes it straight to os.listdir(). Only top-level entries are examined; subdirectories are not traversed.
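Since the function performs no validation of its own, a caller may want to check the path first. The following is a minimal sketch of such a check; resolve_output_folder() is a hypothetical helper introduced here for illustration and is not part of compare_documents.py.

```python
import os
import tempfile


def resolve_output_folder(output_folder: str) -> str:
    """Return the absolute path of output_folder, or raise if it cannot be scanned.

    Hypothetical helper, not part of the documented module.
    """
    path = os.path.abspath(output_folder)
    if not os.path.isdir(path):
        # Covers both a missing path and a path that points at a regular file.
        raise NotADirectoryError(f"Not a directory: {path}")
    if not os.access(path, os.R_OK | os.X_OK):
        # Listing a directory needs read permission; joining paths inside it needs execute.
        raise PermissionError(f"Cannot read directory: {path}")
    return path


# Demonstration against a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    checked = resolve_output_folder(tmp)
    missing = os.path.join(tmp, "does_not_exist")
    try:
        resolve_output_folder(missing)
        rejected = False
    except NotADirectoryError:
        rejected = True
```

The checked path can then be passed to scan_output_folder() unchanged. The check is advisory only: the folder could still disappear between the check and the scan.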

Return Value

Type: Dict[str, Dict]

Returns a dictionary where keys are document codes (strings) extracted from filenames, and values are dictionaries containing file information. Each file info dictionary includes metadata from the get_file_info() helper function plus a 'filename' key with the original filename. If no coded documents are found, returns an empty dictionary. Only PDF files with successfully extracted codes are included in the result.
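To make the shape of the result concrete, here is a self-contained sketch that runs the same scanning logic against a temporary folder. Both helpers are hypothetical stand-ins: the regex in extract_document_code() assumes codes such as "DOC-12345" at the start of the filename, and the keys returned by get_file_info() ("path", "size") are assumptions; the real helpers in compare_documents.py may behave differently.

```python
import os
import re
import tempfile
from typing import Dict, Optional


def extract_document_code(filename: str) -> Optional[str]:
    # Hypothetical stand-in: assumes a code like "DOC-12345" at the start of the name.
    match = re.match(r"([A-Z]+-\d+)", filename)
    return match.group(1) if match else None


def get_file_info(filepath: str) -> Dict:
    # Hypothetical stand-in: the real helper may return different keys.
    return {"path": filepath, "size": os.path.getsize(filepath)}


def scan_output_folder(output_folder: str) -> Dict[str, Dict]:
    # Same logic as the documented function, with the print calls omitted.
    coded_documents = {}
    for filename in os.listdir(output_folder):
        filepath = os.path.join(output_folder, filename)
        if not os.path.isfile(filepath) or not filename.lower().endswith(".pdf"):
            continue
        code = extract_document_code(filename)
        if code:
            file_info = get_file_info(filepath)
            file_info["filename"] = filename
            coded_documents[code] = file_info
    return coded_documents


with tempfile.TemporaryDirectory() as tmp:
    samples = [
        ("DOC-12345_report.pdf", b"abc"),  # coded PDF: included
        ("notes.txt", b"x"),               # not a PDF: skipped
        ("uncoded.pdf", b"y"),             # PDF without a code: skipped
    ]
    for name, payload in samples:
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(payload)
    # A directory is skipped even if its name looks like a coded PDF.
    os.mkdir(os.path.join(tmp, "DOC-99999_archive.pdf"))
    result = scan_output_folder(tmp)

print(result)
```

Under these assumptions the result holds a single entry keyed by "DOC-12345", whose value carries the stand-in metadata plus the 'filename' key added by the scan.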

Dependencies

  • os
  • re
  • hashlib
  • pathlib
  • typing
  • csv
  • datetime
  • collections
  • json

Required Imports

import os
from typing import Dict

Usage Example

# Assumes compare_documents.py is importable, so that scan_output_folder() and its
# helpers extract_document_code() and get_file_info() are available
from compare_documents import scan_output_folder

# Example usage
output_folder = './processed_documents'

# Scan the folder for coded PDF documents
coded_docs = scan_output_folder(output_folder)

# Access results
for code, info in coded_docs.items():
    print(f"Code: {code}")
    print(f"Filename: {info['filename']}")
    print(f"File info: {info}")
    print()

# Check if specific code exists
if 'DOC-12345' in coded_docs:
    print(f"Found document: {coded_docs['DOC-12345']['filename']}")
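Because the function reports progress with print(), a caller that wants the returned dictionary without console output can wrap the call with the standard library's contextlib.redirect_stdout. The sketch below uses a hypothetical call_and_capture() wrapper and a small stand-in function, noisy_scan(), in place of the real scan so that it runs on its own.

```python
import contextlib
import io
from typing import Any, Callable, Tuple


def call_and_capture(func: Callable[..., Any], *args: Any, **kwargs: Any) -> Tuple[Any, str]:
    """Call func and return (result, everything it printed to stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_scan(folder: str) -> dict:
    # Stand-in for scan_output_folder(): prints progress and returns a dict.
    print(f"\nScanning output folder: {folder}")
    print("\nTotal coded documents in output: 0")
    return {}


docs, log_text = call_and_capture(noisy_scan, "./processed_documents")
```

In real use, scan_output_folder would be passed in place of noisy_scan, and log_text could be forwarded to a logger or discarded.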

Best Practices

  • Ensure the output_folder path exists before calling this function to avoid FileNotFoundError
  • The function only scans the top-level directory (non-recursive); subdirectories are skipped
  • Only files with .pdf extension (case-insensitive) are processed
  • The function prints progress information to stdout; consider redirecting or capturing output in production environments
  • The function assumes document codes are unique. If several files share a code, each one overwrites the previous entry and only the last one processed is kept; os.listdir() returns entries in arbitrary order, so which file survives is not guaranteed
  • Ensure extract_document_code() and get_file_info() helper functions are properly implemented and handle edge cases, such as filenames that contain no code
  • Consider adding error handling for permission issues or corrupted files if used in production
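The overwrite behaviour for duplicate codes can be checked before scanning. The sketch below groups PDF filenames by code and reports any code claimed by more than one file. find_duplicate_codes() and demo_extract_code() are hypothetical names introduced for illustration; the extractor is passed in as a parameter so the real extract_document_code() can be supplied instead.

```python
import os
import re
import tempfile
from collections import defaultdict
from typing import Callable, Dict, List, Optional


def find_duplicate_codes(
    folder: str, extract_code: Callable[[str], Optional[str]]
) -> Dict[str, List[str]]:
    """Map each document code that appears on more than one PDF to its filenames."""
    by_code: Dict[str, List[str]] = defaultdict(list)
    # Sorting makes the report deterministic; os.listdir() order is arbitrary.
    for filename in sorted(os.listdir(folder)):
        filepath = os.path.join(folder, filename)
        if not os.path.isfile(filepath) or not filename.lower().endswith(".pdf"):
            continue
        code = extract_code(filename)
        if code:
            by_code[code].append(filename)
    return {code: names for code, names in by_code.items() if len(names) > 1}


def demo_extract_code(filename: str) -> Optional[str]:
    # Hypothetical extractor: assumes a code like "DOC-1" at the start of the name.
    match = re.match(r"([A-Z]+-\d+)", filename)
    return match.group(1) if match else None


with tempfile.TemporaryDirectory() as tmp:
    for name in ("DOC-1_a.pdf", "DOC-1_b.PDF", "DOC-2_c.pdf"):
        with open(os.path.join(tmp, name), "wb") as handle:
            handle.write(b"%PDF-")
    duplicates = find_duplicate_codes(tmp, demo_extract_code)
```

An empty result means every code maps to exactly one file, so the dictionary returned by scan_output_folder() loses nothing.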

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function scan_output_folder_v1 81.9% similar

    Scans a specified folder for PDF documents with embedded codes in their filenames, extracting metadata and signature information for each coded document found.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function scan_wuxi2_folder_v1 66.3% similar

    Recursively scans a directory for PDF files, extracts document codes from filenames, and returns a dictionary mapping each unique document code to a list of file metadata dictionaries.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function scan_wuxi2_folder 64.6% similar

    Recursively scans a wuxi2 folder for PDF documents, extracts document codes from filenames, and organizes them into a dictionary mapping codes to file information.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function compare_documents 52.6% similar

    Compares documents from an output folder with documents in a wuxi2 repository by matching document codes, file hashes, sizes, and filenames to identify identical, similar, or missing documents.

    From: /tf/active/vicechatdev/mailsearch/compare_documents.py
  • function convert_to_pdf 50.2% similar

    Converts a document file to PDF format, automatically generating an output path if not specified.

    From: /tf/active/vicechatdev/CDocs/utils/pdf_utils.py