function scan_output_folder
Scans a specified output folder for PDF files containing document codes, extracts those codes, and returns a dictionary mapping each code to its associated file information.
/tf/active/vicechatdev/mailsearch/compare_documents.py
87 - 116
simple
Purpose
This function is designed to inventory and catalog PDF documents in an output directory that follow a naming convention with embedded document codes. It filters out non-PDF files and directories, extracts document codes from filenames using a helper function, retrieves file metadata, and builds a comprehensive mapping of codes to file details. This is useful for document management systems, batch processing workflows, or audit trails where documents need to be tracked by unique identifiers.
Source Code
def scan_output_folder(output_folder: str) -> Dict[str, Dict]:
"""
Scan output folder for documents with codes
Args:
output_folder: Path to output folder
Returns:
Dictionary mapping document codes to file info
"""
print(f"\nScanning output folder: {output_folder}")
coded_documents = {}
for filename in os.listdir(output_folder):
filepath = os.path.join(output_folder, filename)
# Skip directories and non-PDF files
if not os.path.isfile(filepath) or not filename.lower().endswith('.pdf'):
continue
# Extract document code
code = extract_document_code(filename)
if code:
file_info = get_file_info(filepath)
file_info['filename'] = filename
coded_documents[code] = file_info
print(f" Found: {code} - {filename}")
print(f"\nTotal coded documents in output: {len(coded_documents)}")
return coded_documents
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
output_folder |
str | - | positional_or_keyword |
Parameter Details
output_folder: String path to the directory containing PDF files to scan. Can be absolute or relative path. The folder should exist and be readable. The function will iterate through all files in this directory (non-recursive, only top-level files).
Return Value
Type: Dict[str, Dict]
Returns a dictionary where keys are document codes (strings) extracted from filenames, and values are dictionaries containing file information. Each file info dictionary includes metadata from the get_file_info() helper function plus a 'filename' key with the original filename. If no coded documents are found, returns an empty dictionary. Only PDF files with successfully extracted codes are included in the result.
Dependencies
osrehashlibpathlibtypingcsvdatetimecollectionsjson
Required Imports
import os
from typing import Dict
Usage Example
# Assuming helper functions extract_document_code() and get_file_info() are defined
import os
from typing import Dict
# Example usage
output_folder = './processed_documents'
# Scan the folder for coded PDF documents
coded_docs = scan_output_folder(output_folder)
# Access results
for code, info in coded_docs.items():
print(f"Code: {code}")
print(f"Filename: {info['filename']}")
print(f"File info: {info}")
print()
# Check if specific code exists
if 'DOC-12345' in coded_docs:
print(f"Found document: {coded_docs['DOC-12345']['filename']}")
Best Practices
- Ensure the output_folder path exists before calling this function to avoid FileNotFoundError
- The function only scans the top-level directory (non-recursive); subdirectories are skipped
- Only files with .pdf extension (case-insensitive) are processed
- The function prints progress information to stdout; consider redirecting or capturing output in production environments
- If multiple files have the same document code, only the last one processed will be retained in the dictionary
- Ensure extract_document_code() and get_file_info() helper functions are properly implemented and handle edge cases
- Consider adding error handling for permission issues or corrupted files if used in production
- The function assumes document codes are unique; duplicate codes will overwrite previous entries
Tags
Similar Components
AI-powered semantic similarity - components with related functionality:
-
function scan_output_folder_v1 81.9% similar
-
function scan_wuxi2_folder_v1 66.3% similar
-
function scan_wuxi2_folder 64.6% similar
-
function compare_documents 52.6% similar
-
function convert_to_pdf 50.2% similar