function create_folder_hierarchy_v2

Maturity: 47

Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, establishing parent-child relationships between folders.

File:
/tf/active/vicechatdev/offline_parser_docstore.py
Lines:
114 - 169
Complexity:
moderate

Purpose

This function parses a file path (expected to start with './PDF_docs/') and creates a corresponding hierarchy of Subfolder nodes in a Neo4j graph database. Each folder level is represented as a node with properties including UID, Name, Path, Level, and Keys. The function connects folders to their parent folders using PATH relationships, with the root level connecting to a Rootfolder node named 'T001'. It's designed for organizing document storage structures in a graph database, particularly for PDF document management systems.
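The mapping from a path to folder levels described above can be sketched without a database. This is only an illustration, not the function's own code: `folder_levels` is a hypothetical helper, and it joins components with '/' instead of `os.path.join` so that the result is the same on every platform.

```python
PREFIX = "./PDF_docs/"


def folder_levels(file_path):
    """Return (level, name, path) for each folder between PDF_docs and the file."""
    if not file_path.startswith(PREFIX):
        # The documented function falls back to the basename here,
        # which never contains folder components.
        return []
    parts = file_path[len(PREFIX):].split("/")[:-1]  # drop the filename
    levels = []
    current = "./PDF_docs"
    for level, name in enumerate(parts, start=1):
        current = f"{current}/{name}"
        levels.append((level, name, current))
    return levels


for entry in folder_levels("./PDF_docs/research/papers/2024/document.pdf"):
    print(entry)
```

Each tuple corresponds to one Subfolder node: the first is attached to the Rootfolder node, and every later one is attached to the folder before it.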

Source Code

def create_folder_hierarchy(graph, file_path):
    """Create a hierarchy of Subfolder nodes based on the file path"""
    # Get path components from the PDF_docs folder
    if file_path.startswith("./PDF_docs/"):
        rel_path = file_path[11:]  # Remove './PDF_docs/' prefix
    else:
        rel_path = os.path.basename(file_path)  # Just use filename if no expected prefix
    
    # If file is directly in the PDF_docs root
    if "/" not in rel_path:
        return None
    
    # Split into folder components
    folders = rel_path.split("/")
    folders.pop()  # Remove the filename itself
    
    if not folders:  # No subfolders
        return None
    
    current_path = "./PDF_docs"
    parent_uid = None
    # Keys value of the first non-Template Docstores node (None if there is none)
    key = graph.run("match (x:Docstores)  where not ('Template' in labels(x)) return x.Keys").evaluate()
    
    # Create folder hierarchy
    for i, folder in enumerate(folders):
        current_path = os.path.join(current_path, folder)
        folder_escaped = folder.replace("'", "`")
        current_path_escaped = current_path.replace("'", "``")
        
        # Check if this folder node already exists
        result = graph.run(f"MATCH (f:Subfolder {{Path: '{current_path_escaped}'}})"
                          f" RETURN f.UID as uid").data()
        
        if not result:
            # Create new folder node
            folder_uid = str(uuid4())
            if i == 0:
                # Connect to the Rootfolder node 'T001' since it's the first level
                graph.run(f"MATCH (x:Rootfolder {{Name:'T001'}}) "
                         f" MERGE (x)-[:PATH]->(:Subfolder {{UID: '{folder_uid}', "
                         f"Name: '{folder_escaped}', Path: '{current_path_escaped}', "
                         f"Level: '{i+1}',"
                         f"Keys:'{key}'}})")
            else:
                # Connect to parent folder
                graph.run(f"MATCH (p:Subfolder {{UID: '{parent_uid}'}})"
                         f" MERGE (p)-[:PATH]->(:Subfolder {{UID: '{folder_uid}', "
                         f"Name: '{folder_escaped}', Path: '{current_path_escaped}', "
                         f"Level: '{i+1}',"
                         f"Keys:'{key}'}})")
            parent_uid = folder_uid
        else:
            parent_uid = result[0]['uid']
    
    # Return the UID of the deepest subfolder
    return parent_uid

Parameters

Name       Type  Default  Kind
graph      -     -        positional_or_keyword
file_path  -     -        positional_or_keyword

Parameter Details

graph: A Neo4j graph database connection object (likely from py2neo or similar library) that provides a 'run' method to execute Cypher queries. This object is used to query and create nodes and relationships in the database.

file_path: A string representing the file path to process. Expected format is './PDF_docs/subfolder1/subfolder2/.../filename.ext'. The function extracts the folder hierarchy from this path. If the path doesn't start with './PDF_docs/', only the basename is used; a basename has no folder components, so the function returns None before running any query.
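Because the function only calls `graph.run(...)` and then `.evaluate()` or `.data()` on the result, a small stand-in object is enough to exercise it without a live database. The sketch below is a test double built on that assumption, not part of py2neo: `RecordingGraph` and `FakeCursor` are hypothetical names, and the stub pretends that no Subfolder node exists yet.

```python
class FakeCursor:
    """Mimics the two result methods the function relies on."""

    def __init__(self, rows):
        self._rows = rows

    def data(self):
        return list(self._rows)

    def evaluate(self):
        # First value of the first row, or None when there are no rows
        return next(iter(self._rows[0].values())) if self._rows else None


class RecordingGraph:
    """Hypothetical stand-in for the graph connection; records every query."""

    def __init__(self, keys="default_key"):
        self.keys = keys
        self.queries = []

    def run(self, cypher, parameters=None, **kwargs):
        self.queries.append(cypher)
        if "Docstores" in cypher:
            return FakeCursor([{"x.Keys": self.keys}])
        return FakeCursor([])  # every folder lookup reports "not found"


graph = RecordingGraph()
print(graph.run("match (x:Docstores) return x.Keys").evaluate())
print(graph.run("MATCH (f:Subfolder {Path: './PDF_docs/a'}) RETURN f.UID as uid").data())
```

Passing such an object as `graph` lets the generated Cypher strings be inspected in `graph.queries` after a call.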

Return Value

Returns a string containing the UID (Unique Identifier) of the deepest subfolder node created or found in the hierarchy. Returns None when there are no folder components to process: either the file sits directly in the PDF_docs root, or the path lacks the './PDF_docs/' prefix and is reduced to its basename. The returned UID can be used to link documents to their containing folder.

Dependencies

  • neo4j
  • py2neo
  • uuid

Required Imports

from uuid import uuid4
import os

Usage Example

from uuid import uuid4
import os
from py2neo import Graph

# Establish Neo4j connection
graph = Graph('bolt://localhost:7687', auth=('neo4j', 'password'))

# Ensure required nodes exist
graph.run("MERGE (r:Rootfolder {Name:'T001'})")
graph.run("MERGE (d:Docstores {Keys:'default_key'})")

# Create folder hierarchy for a file
file_path = './PDF_docs/research/papers/2024/document.pdf'
deepest_folder_uid = create_folder_hierarchy(graph, file_path)

if deepest_folder_uid:
    print(f'Deepest folder UID: {deepest_folder_uid}')
    # Use the UID to link a document node
    graph.run(f"MATCH (f:Subfolder {{UID: '{deepest_folder_uid}'}}) "
              f"MERGE (f)-[:CONTAINS]->(:Document {{Name: 'document.pdf'}})")
else:
    print('File is in root directory, no subfolders created')

Best Practices

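  • Ensure the Neo4j database has a Rootfolder node with Name='T001' before calling this function; if it is missing, the first-level MATCH finds nothing, no Subfolder node is created, and the function still returns a UID that refers to no node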
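  • Ensure at least one Docstores node exists in the database without a 'Template' label; if the lookup yields None (as py2neo's evaluate() does when no row is returned), each new folder's Keys property is set to the literal string 'None'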
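  • Be aware that single quotes are replaced rather than escaped, and inconsistently: with one backtick in Name and two backticks in Path. Stored values therefore differ from the on-disk names for any folder containing a single quote, and a folder name that already contains backticks cannot be told apart from a rewritten one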
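  • The function builds Cypher queries by string interpolation, which leaves it open to injection through folder names (only single quotes are rewritten; backslashes and other special characters pass through unchanged); use parameterized queries instead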
  • The function assumes './PDF_docs/' as the root path; modify the prefix removal logic if using a different root directory
  • Consider adding error handling for database connection failures or query execution errors
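  • Duplicates are prevented by the preceding MATCH on Path, not by MERGE: each MERGE pattern includes a freshly generated UID, so it always creates a new node. Because the check and the create are separate queries, concurrent callers can still create duplicate nodes for the same path, and each folder level costs up to two round trips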
  • The 'Keys' property is retrieved once and applied to all folders; ensure this is the intended behavior for your use case
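The injection and quote-handling concerns above can both be addressed by passing values as query parameters instead of formatting them into the Cypher text. The following is a minimal sketch under stated assumptions, not the module's implementation: `ensure_subfolder` is a hypothetical helper, it assumes a py2neo-style `graph.run(cypher, parameters)` call whose result offers `.data()`, and it merges on `Path` so that a repeated call reuses the existing node.

```python
from uuid import uuid4

# Shared tail: MERGE on Path only; the other properties are set on first creation.
MERGE_TAIL = (
    "MERGE (p)-[:PATH]->(f:Subfolder {Path: $path}) "
    "ON CREATE SET f.UID = $uid, f.Name = $name, f.Level = $level, f.Keys = $keys "
    "RETURN f.UID AS uid"
)


def ensure_subfolder(graph, name, path, level, keys, parent_uid=None):
    """Create or reuse one Subfolder node and return its UID (None if no parent matched)."""
    params = {
        "path": path,
        "uid": str(uuid4()),
        "name": name,          # stored unchanged, single quotes included
        "level": str(level),   # kept as a string, as in the documented function
        "keys": keys,
    }
    if parent_uid is None:
        # First level hangs off the root folder node
        match = "MATCH (p:Rootfolder {Name: 'T001'}) "
    else:
        match = "MATCH (p:Subfolder {UID: $parent_uid}) "
        params["parent_uid"] = parent_uid
    rows = graph.run(match + MERGE_TAIL, params).data()
    return rows[0]["uid"] if rows else None
```

Returning None when the parent is not found makes a missing Rootfolder node visible to the caller. A single MERGE per level also removes the separate existence check, although a uniqueness constraint on `Path` would still be needed to rule out duplicates under concurrent writers.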

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function create_folder_hierarchy_v1 93.9% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, connecting each folder level with PATH relationships.

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • function create_folder_hierarchy 93.4% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file system path, connecting each folder level with PATH relationships.

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py
  • function create_folder 67.5% similar

    Creates a nested folder structure on a FileCloud server by traversing a path and creating missing directories.

    From: /tf/active/vicechatdev/filecloud_wuxi_sync.py
  • function add_document_to_graph 63.6% similar

    Creates nodes and relationships in a Neo4j graph database for a processed document, including its text and table chunks, connecting it to a folder hierarchy.

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • function add_document_to_graph_v1 62.4% similar

    Creates a Neo4j graph node for a processed document and connects it to a folder hierarchy, along with its text and table chunks.

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py