build_document_tree_recursive

function build_document_tree_recursive

Maturity: 51

Recursively builds a complete hierarchical tree structure of documents and folders from a target directory path, filtering for supported file types and skipping hidden/cache directories.

File:
/tf/active/vicechatdev/docchat/app.py

Lines:
404 - 476

Complexity:
moderate

Purpose

This function builds the full document tree used for search functionality in a document management system. Starting from a target path, it walks every subdirectory, collects per-file metadata (size, indexing status, document ID, chunk count), and returns a nested structure of folders and files. It loads the whole tree eagerly in a single call instead of lazy-loading one level at a time, which suits search and full-directory views but means the cost grows with the size of the directory tree.

Source Code

def build_document_tree_recursive(target_path, relative_base=""):
    """Build FULL document tree recursively (for search functionality)"""
    import time
    start_time = time.time()
    items = []
    
    logger.info(f"[BUILD_TREE_RECURSIVE] Scanning: {target_path}")
    
    try:
        entries = sorted(target_path.iterdir(), key=lambda x: (not x.is_dir(), x.name.lower()))
    except PermissionError:
        logger.warning(f"[BUILD_TREE_RECURSIVE] Permission denied: {target_path}")
        return {'children': []}
    
    for entry in entries:
        try:
            # Calculate relative path from document root
            if relative_base:
                relative_path = f"{relative_base}/{entry.name}"
            else:
                relative_path = entry.name
            
            if entry.is_dir():
                # Skip hidden and cache directories
                if entry.name.startswith('.') or entry.name == '__pycache__':
                    continue
                
                # Recursively load all children
                children_result = build_document_tree_recursive(entry, relative_path)
                
                if children_result['children']:  # Only include folders with content
                    items.append({
                        'type': 'folder',
                        'name': entry.name,
                        'path': relative_path,
                        'children': children_result['children'],
                        'hasChildren': True,
                        'loaded': True  # Already loaded recursively
                    })
            
            elif entry.is_file():
                # Check if file extension is supported
                if entry.suffix.lower() in ['.pdf', '.txt', '.md', '.doc', '.docx', '.pptx', '.ppt', '.xlsx', '.xls', '.html']:
                    file_info = get_document_info(entry)
                    items.append({
                        'type': 'file',
                        'name': entry.name,
                        'path': relative_path,
                        'size': entry.stat().st_size,
                        'indexed': file_info['indexed'],
                        'doc_id': file_info.get('doc_id'),
                        'chunk_count': file_info.get('chunk_count', 0)
                    })
                    
        except Exception as e:
            logger.warning(f"[BUILD_TREE_RECURSIVE] Error processing {entry}: {e}")
            continue
    
    elapsed = time.time() - start_time
    logger.info(f"[BUILD_TREE_RECURSIVE] Completed in {elapsed:.2f}s, {len(items)} items")
    
    if relative_base:
        # Return children only for subfolder expansion
        return {'children': items}
    else:
        # Return root structure
        return {
            'type': 'folder',
            'name': target_path.name,
            'path': '',
            'children': items,
            'loaded': True
        }

Parameters

Name	Type	Default	Kind
`target_path`	pathlib.Path (not annotated in source)	-	positional_or_keyword
`relative_base`	str (not annotated in source)	''	positional_or_keyword

Parameter Details

target_path: A Path object (from pathlib) representing the root directory to scan. The path may be absolute or relative and should point to the folder containing the documents. The function recursively traverses all subdirectories from this starting point. A plain string does not work here, because the function calls target_path.iterdir() and reads target_path.name.

relative_base: A string representing the relative path prefix from the document root. Defaults to empty string for the root level. Used internally during recursion to build proper relative paths for nested items. Format: 'parent/child' without leading or trailing slashes.
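As a rough illustration of how these relative paths accumulate during recursion, the sketch below mirrors the joining rule used in the source. The helper name child_relative_path is hypothetical; it does not exist in app.py and only isolates the rule for clarity.

```python
def child_relative_path(relative_base: str, entry_name: str) -> str:
    """Join a parent relative path and an entry name the way the function does."""
    # At the root level relative_base is empty, so the entry name stands alone.
    if relative_base:
        return f"{relative_base}/{entry_name}"
    return entry_name


# Simulate descending two levels below the document root.
level_1 = child_relative_path("", "reports")
level_2 = child_relative_path(level_1, "2024")
leaf = child_relative_path(level_2, "summary.pdf")
print(leaf)
```

Because the separator is always '/', the resulting paths have the same shape on every platform, regardless of the operating system's own path separator.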

Return Value

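Returns a dictionary representing the document tree. The shape depends on how the call ends:

Root call (relative_base is empty): {'type': 'folder', 'name': <folder_name>, 'path': '', 'children': [...], 'loaded': True}
Recursive call (relative_base is not empty): {'children': [...]}
Unreadable directory (PermissionError while listing target_path): {'children': []}. This applies to the root call as well, so a root result can lack 'type', 'name', 'path', and 'loaded'.

Each child item is a dictionary with the keys 'type' ('folder' or 'file'), 'name' (file or folder name), and 'path' (path relative to the root, joined with '/'), plus keys that depend on the type:

Folders: 'children' (list), 'hasChildren' (always True, because folders without content are omitted), 'loaded' (always True).
Files: 'size' (bytes), 'indexed' (the 'indexed' value returned by get_document_info), 'doc_id' (from get_document_info, None if absent), 'chunk_count' (from get_document_info, 0 if absent).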
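To make the nested shape concrete, here is a minimal sketch of a depth-first walk over a returned tree. The function name iter_files and the sample data are illustrative assumptions; the sample is hand-written in the documented shape and is not real scan output.

```python
from typing import Dict, Iterator


def iter_files(node: Dict) -> Iterator[Dict]:
    """Yield every file entry beneath a node, depth first."""
    # .get() tolerates the bare {'children': []} result of an unreadable root.
    for child in node.get("children", []):
        if child.get("type") == "folder":
            yield from iter_files(child)
        elif child.get("type") == "file":
            yield child


# Hand-written sample in the documented shape (illustrative values only).
sample_tree = {
    "type": "folder", "name": "documents", "path": "", "loaded": True,
    "children": [
        {
            "type": "folder", "name": "reports", "path": "reports",
            "hasChildren": True, "loaded": True,
            "children": [
                {"type": "file", "name": "q1.pdf", "path": "reports/q1.pdf",
                 "size": 2048, "indexed": False, "doc_id": None, "chunk_count": 0},
            ],
        },
        {"type": "file", "name": "readme.md", "path": "readme.md",
         "size": 512, "indexed": True, "doc_id": "doc_123", "chunk_count": 5},
    ],
}

all_paths = [f["path"] for f in iter_files(sample_tree)]
unindexed = [f["path"] for f in iter_files(sample_tree) if not f["indexed"]]
total_size = sum(f["size"] for f in iter_files(sample_tree))
```

The same walk works on any subtree, since folder entries carry their own 'children' list.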

Dependencies

pathlib
time
logging

Required Imports

from pathlib import Path
import time
import logging

Conditional/Optional Imports

These imports are made inside the function body rather than at module level:

import time

Condition: imported at the start of every call and used to time the scan. The import is always required; a module-level import would work equally well.

Usage Example

from pathlib import Path
import logging

# Setup logger
logger = logging.getLogger(__name__)

# Define get_document_info function (required dependency)
def get_document_info(file_path):
    # Mock implementation - replace with actual logic
    return {
        'indexed': True,
        'doc_id': 'doc_123',
        'chunk_count': 5
    }

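# Build document tree from a directory.
# The function looks up `logger` and `get_document_info` in the module where it is
# defined, so the stand-ins above apply only if it is defined in this same module.
document_root = Path('/path/to/documents')
tree = build_document_tree_recursive(document_root)

# Access the tree structure.
# An unreadable root yields {'children': []} with no 'name' key, hence .get().
print(f"Root folder: {tree.get('name', document_root.name)}")
print(f"Number of items: {len(tree['children'])}")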

# Iterate through children
for item in tree['children']:
    if item['type'] == 'folder':
        print(f"Folder: {item['name']} with {len(item['children'])} items")
    elif item['type'] == 'file':
        print(f"File: {item['name']}, Size: {item['size']} bytes, Indexed: {item['indexed']}")
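The source does not show how the search feature consumes the tree, so the following is only one possible approach: a sketch that prunes a tree down to the files whose names contain a query string. The name filter_tree and the sample data are assumptions made for illustration.

```python
from typing import Dict, Optional


def filter_tree(node: Dict, query: str) -> Optional[Dict]:
    """Return a pruned copy of node keeping only files whose name contains query."""
    needle = query.lower()
    kept = []
    for child in node.get("children", []):
        if child.get("type") == "file":
            # Case-insensitive substring match on the file name.
            if needle in child["name"].lower():
                kept.append(child)
        elif child.get("type") == "folder":
            pruned = filter_tree(child, query)
            if pruned is not None:
                kept.append(pruned)
    if not kept:
        # Nothing matched beneath this node, so drop it entirely.
        return None
    # Copy the node's own keys and swap in the pruned children list.
    return {**node, "children": kept}


# Hand-written sample in the documented shape (illustrative values only).
search_sample = {
    "type": "folder", "name": "documents", "path": "", "loaded": True,
    "children": [
        {"type": "folder", "name": "specs", "path": "specs",
         "hasChildren": True, "loaded": True,
         "children": [
             {"type": "file", "name": "Budget.xlsx", "path": "specs/Budget.xlsx",
              "size": 100, "indexed": True, "doc_id": "d1", "chunk_count": 2},
             {"type": "file", "name": "notes.txt", "path": "specs/notes.txt",
              "size": 50, "indexed": False, "doc_id": None, "chunk_count": 0},
         ]},
        {"type": "file", "name": "intro.md", "path": "intro.md",
         "size": 10, "indexed": True, "doc_id": "d2", "chunk_count": 1},
    ],
}

matches = filter_tree(search_sample, "budget")
no_matches = filter_tree(search_sample, "missing")
```

The input tree is left unmodified; each matching folder is returned as a shallow copy with a new 'children' list, and None signals that nothing matched.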

Best Practices

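Define a module-level 'logger' before calling this function; it logs at INFO and WARNING level, and a missing 'logger' raises NameError on the first call
The function requires a 'get_document_info' helper in the same module scope; the helper must return a dictionary with an 'indexed' key and may also provide 'doc_id' and 'chunk_count'
A PermissionError while listing a directory is logged as a warning and that directory contributes no children; if the root itself is unreadable, the result is {'children': []} without 'name' or 'type', so check for those keys before using them
Any other exception raised while processing a single entry is logged as a warning and that entry is skipped
Expect scan time to grow with the number of directories and files; every call, including each recursive call, logs its own elapsed time
Hidden directories (names starting with '.') and '__pycache__' directories are skipped automatically; hidden files are not filtered, so a file such as '.notes.md' is still included
Only folders that contain at least one supported file somewhere beneath them appear in the tree; empty folders and folders holding only unsupported files are omitted
Supported file types are hardcoded - modify the extension list in the source if you need to support additional formats
The 'relative_base' parameter is primarily for internal recursion - external callers should typically use the default empty string
The function sorts entries with directories first, then alphabetically by name (case-insensitive)
Consider caching the result for frequently accessed directory trees, since every call rescans the whole directory structure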
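The caching suggestion above could take many forms; the sketch below shows one simple option, a time-based cache that wraps any tree builder. The class TreeCache, its parameters, and the stand-in builder are all assumptions for illustration and are not part of app.py. A time-based expiry can serve a stale tree until the entry expires or is invalidated.

```python
import time
from pathlib import Path
from typing import Callable, Dict, Optional, Tuple


class TreeCache:
    """Cache tree-builder results per root path for a fixed number of seconds."""

    def __init__(self, builder: Callable[[Path], dict],
                 ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._builder = builder
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[str, Tuple[float, dict]] = {}

    def get(self, root: Path) -> dict:
        """Return the cached tree for root, rebuilding it once it has expired."""
        key = str(root)
        now = self._clock()
        cached = self._entries.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        tree = self._builder(root)
        self._entries[key] = (now, tree)
        return tree

    def invalidate(self, root: Optional[Path] = None) -> None:
        """Drop one cached root, or every entry when root is None."""
        if root is None:
            self._entries.clear()
        else:
            self._entries.pop(str(root), None)


# Stand-in builder so the sketch runs on its own; in practice the builder
# would be build_document_tree_recursive.
def demo_builder(root: Path) -> dict:
    return {"type": "folder", "name": root.name, "path": "",
            "children": [], "loaded": True}


demo_cache = TreeCache(demo_builder, ttl_seconds=30.0)
first = demo_cache.get(Path("documents"))
second = demo_cache.get(Path("documents"))  # served from the cache
```

Calling invalidate() after an upload or delete is one way to keep the cached tree from going stale between expiries.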

Similar Components

AI-powered semantic similarity - components with related functionality:

function build_document_tree_lazy 76.5% similar

Builds a single-level document tree structure for lazy loading, scanning only immediate children of a target directory without recursively loading subdirectories.
From: /tf/active/vicechatdev/docchat/app.py
function api_document_tree 66.3% similar

Flask API endpoint that returns a hierarchical document tree structure from a configured document folder, supporting lazy loading and full expansion modes for efficient navigation and search.
From: /tf/active/vicechatdev/docchat/app.py
function api_folders 60.7% similar

Flask API endpoint that returns a hierarchical JSON tree structure of all folders (excluding files) within the configured document folder, used for folder selection in upload interfaces.
From: /tf/active/vicechatdev/docchat/app.py
function create_folder_hierarchy_v2 60.5% similar

Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, establishing parent-child relationships between folders.
From: /tf/active/vicechatdev/offline_parser_docstore.py
function create_folder_hierarchy_v1 59.8% similar

Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, connecting each folder level with PATH relationships.
From: /tf/active/vicechatdev/offline_docstore_multi.py
