add_document_to_graph - Code Extractor

function add_document_to_graph

Maturity: 48

Creates nodes and relationships in a Neo4j graph database for a processed document, including its text and table chunks, connecting it to a folder hierarchy.

File:
/tf/active/vicechatdev/offline_docstore_multi.py

Lines:
1181 - 1231

Complexity:
moderate

Purpose

This function integrates a processed document into a Neo4j knowledge graph. It generates a UID, reads the Keys property from the non-template Docstores node, and creates a Document node carrying the file name, path, type, and Keys. The Document is attached by a PATH relationship to the given Subfolder, or to the Rootfolder named 'T001' when no subfolder UID is supplied. Each text and table chunk then becomes a Text_chunk or Table_chunk node linked to the Document by a CHUNK relationship.

Source Code

def add_document_to_graph(session, processed_doc, deepest_folder_uid):
    """Add processed document to Neo4j graph"""
    file_path = processed_doc["file_path"]
    file_path_escaped = file_path.replace("'", "``")
    filename = processed_doc["file_name"]
    filename_escaped = filename.replace("'", "``")
    text_chunks = processed_doc.get("text_chunks", [])
    table_chunks = processed_doc.get("table_chunks", [])    
    
    # Generate UID for the document
    doc_uid = str(uuid4())
    key = evaluate_query(session,"match (x:Docstores) where not ('Template' in labels(x)) return x.Keys")
    
    # Connect document to folder
    if deepest_folder_uid:
        query = f"MATCH (f:Subfolder {{UID: '{deepest_folder_uid}'}}) " \
               f"MERGE (f)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
               f"Name:'{filename_escaped}', " \
               f"File:'{file_path_escaped}', " \
               f"Type:'{processed_doc['file_type']}', " \
               f"Keys:'{key}'}})"
        run_query(session,query)
    else:
        # Connect to root folder
        query = f"MATCH (x:Rootfolder {{Name:'T001'}}) " \
               f"MERGE (x)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
               f"Name:'{filename_escaped}', " \
               f"File:'{file_path_escaped}', " \
               f"Type:'{processed_doc['file_type']}', " \
               f"Keys:'{key}'}})"
        run_query(session,query)
    
    # Connect chunks to the document (unchanged)
    for i,text in enumerate(text_chunks):
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                     f"MERGE (x)-[:CHUNK]->(n:Text_chunk {{UID:'{text[2]}',"
                     f"Name:'{filename}:Text:{str(i)}',"
                     f"Text:'{text[1]}',"
                     f"Parent:'{text[0]}',"
                     f"Keys:'{key}'}})")
    
    for i,text in enumerate(table_chunks):
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                     f"MERGE (x)-[:CHUNK]->(n:Table_chunk {{UID:'{text[3]}',"
                     f"Name:'{filename}:Table:{str(i)}',"
                     f"Text:'{text[2]}',"
                     f"Html:'{text[1]}',"
                     f"Parent:'{text[0]}',"
                     f"Keys:'{key}'}})")
    
    return doc_uid

Parameters

Name	Type	Default	Kind
`session`	-	-	positional_or_keyword
`processed_doc`	-	-	positional_or_keyword
`deepest_folder_uid`	-	-	positional_or_keyword

Parameter Details

session: Neo4j database session object used to execute Cypher queries against the graph database. Should be an active session from neo4j.GraphDatabase.driver().session().

processed_doc: Dictionary containing document metadata and content. Expected keys: 'file_path' (str: full path to file), 'file_name' (str: name of file), 'file_type' (str: document type/extension), 'text_chunks' (list of tuples: [(parent, text, uid), ...]), 'table_chunks' (list of tuples: [(parent, html, text, uid), ...]). Text chunks contain extracted text content, table chunks contain both HTML and text representations.

deepest_folder_uid: String UID of the deepest subfolder in the hierarchy where this document should be attached. If None or empty, the document is connected to the root folder 'T001'. Should be a valid UUID string matching an existing Subfolder node; if no Subfolder has this UID, the MATCH finds nothing, no Document node is created, and the function still returns a UID.
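
Since the function does not check the shape of processed_doc, a caller may want to validate it first. The following is a minimal sketch under the key names and tuple layouts described above; the helper name validate_processed_doc and the wording of its messages are hypothetical and not part of the original module.

```python
# Hypothetical pre-flight check; not part of offline_docstore_multi.py.
REQUIRED_KEYS = ("file_path", "file_name", "file_type")


def validate_processed_doc(processed_doc):
    """Return a list of problems; an empty list means the shape looks right."""
    problems = []
    for key in REQUIRED_KEYS:
        if not isinstance(processed_doc.get(key), str):
            problems.append(f"missing or non-string key: {key}")
    # Text chunks are (parent, text, uid); table chunks are (parent, html, text, uid).
    for name, width in (("text_chunks", 3), ("table_chunks", 4)):
        for i, chunk in enumerate(processed_doc.get(name, [])):
            if len(chunk) != width:
                problems.append(
                    f"{name}[{i}] has {len(chunk)} fields, expected {width}"
                )
    return problems
```

Calling it before add_document_to_graph and refusing to proceed when the list is non-empty avoids a KeyError or IndexError halfway through the graph writes.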

Return Value

Returns a string containing the generated UUID (doc_uid) for the newly created Document node in the Neo4j graph. This UID can be used to reference or query the document later.
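
As a sketch of how the returned UID might be used later, the hypothetical helper below builds a parameterized query that lists the chunks attached to a document. The function name and the returned column aliases are assumptions for illustration; the Document label and the CHUNK relationship come from the source code above.

```python
def chunks_for_document_query(doc_uid):
    """Build (cypher, params) listing every chunk linked to one Document.

    Hypothetical helper for illustration only.
    """
    cypher = (
        "MATCH (d:Document {UID: $doc_uid})-[:CHUNK]->(c) "
        "RETURN labels(c) AS labels, c.Name AS name, c.Text AS text"
    )
    return cypher, {"doc_uid": doc_uid}
```

The pair can be passed to the driver as session.run(cypher, params).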

Dependencies

neo4j
uuid

Required Imports

from uuid import uuid4
from neo4j import GraphDatabase

Usage Example

from neo4j import GraphDatabase
from uuid import uuid4

# Assuming evaluate_query and run_query helper functions exist
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))
session = driver.session()

processed_doc = {
    'file_path': '/documents/report.pdf',
    'file_name': 'report.pdf',
    'file_type': 'pdf',
    'text_chunks': [
        ('section1', 'This is the first paragraph', 'uuid-text-1'),
        ('section2', 'This is the second paragraph', 'uuid-text-2')
    ],
    'table_chunks': [
        ('table1', '<table><tr><td>Data</td></tr></table>', 'Data', 'uuid-table-1')
    ]
}

folder_uid = 'existing-folder-uuid-123'
doc_uid = add_document_to_graph(session, processed_doc, folder_uid)
print(f'Document created with UID: {doc_uid}')

session.close()
driver.close()
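
The usage example above assumes that run_query and evaluate_query are already defined. Their real implementations are not shown in this document, so the following is only a guess at a minimal shape: run_query returns all rows, and evaluate_query returns the first column of the first row. Both the signatures and the return conventions are assumptions.

```python
# Hypothetical stand-ins for the helpers used by add_document_to_graph.
# The actual helpers in offline_docstore_multi.py may behave differently.

def run_query(session, query, params=None):
    """Run a Cypher statement and return all rows as a list of dicts."""
    result = session.run(query, params or {})
    return result.data()


def evaluate_query(session, query, params=None):
    """Run a Cypher statement and return the first value of the first row, or None."""
    record = session.run(query, params or {}).single()
    return record[0] if record is not None else None
```

With helpers of this shape, the key lookup in add_document_to_graph would yield the Keys value of the Docstores node, or None when no such node exists.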

Best Practices

Ensure the Neo4j session is properly opened before calling this function and closed after use
The function uses string interpolation for Cypher queries which is vulnerable to injection attacks; consider using parameterized queries instead
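Single quotes in the file path and filename are replaced with two backticks (``) before the Document node is created, so the stored Name and File values differ from the originals; backslashes and other special characters are not handled, and the chunk Name properties use the unmodified filename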
The function assumes helper functions 'evaluate_query' and 'run_query' exist in the scope; ensure these are imported or defined
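Chunk text, HTML, and parent values are interpolated into the query without any escaping, so a single quote in any of them causes a Cypher syntax error; escape them before calling this function, or switch to query parameters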
The function does not validate input data structure; ensure processed_doc contains all required keys
Consider wrapping the function call in a try-except block to handle Neo4j connection errors
The hardcoded root folder name 'T001' should ideally be configurable
For large documents with many chunks, consider batching the chunk creation queries for better performance
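
The injection and batching points above can be sketched together. The two builders below are hypothetical and not the module's code: they return a Cypher string with $-style parameters plus the matching parameter dict, and the chunk builder uses UNWIND so that all text chunks are written in one query. Unlike the original, the chunk query matches the parent by the Document label rather than by UID alone.

```python
def build_document_query(doc_uid, processed_doc, key, folder_uid=None):
    """Return (cypher, params) attaching a Document to a Subfolder or the root folder."""
    params = {
        "uid": doc_uid,
        "name": processed_doc["file_name"],
        "file": processed_doc["file_path"],
        "type": processed_doc["file_type"],
        "keys": key,
    }
    if folder_uid:
        match = "MATCH (f:Subfolder {UID: $folder_uid}) "
        params["folder_uid"] = folder_uid
    else:
        # 'T001' mirrors the hardcoded root folder name in the original.
        match = "MATCH (f:Rootfolder {Name: $root_name}) "
        params["root_name"] = "T001"
    cypher = (
        match
        + "MERGE (f)-[:PATH]->(n:Document {UID: $uid, Name: $name, "
        "File: $file, Type: $type, Keys: $keys})"
    )
    return cypher, params


def build_text_chunk_query(doc_uid, filename, text_chunks, key):
    """Return (cypher, params) creating all text chunks of one document in one query."""
    rows = [
        {"parent": parent, "text": text, "uid": uid, "name": f"{filename}:Text:{i}"}
        for i, (parent, text, uid) in enumerate(text_chunks)
    ]
    cypher = (
        "MATCH (d:Document {UID: $doc_uid}) "
        "UNWIND $rows AS row "
        "MERGE (d)-[:CHUNK]->(c:Text_chunk {UID: row.uid, Name: row.name, "
        "Text: row.text, Parent: row.parent, Keys: $keys})"
    )
    return cypher, {"doc_uid": doc_uid, "rows": rows, "keys": key}
```

Each pair can be executed with session.run(cypher, params). Because values travel as parameters, quotes in file names or chunk text need no replacement, and the stored values match the originals. A table chunk builder would follow the same pattern with an extra Html field.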

Similar Components

AI-powered semantic similarity - components with related functionality:

function add_document_to_graph_v1 98.1% similar

Creates a Neo4j graph node for a processed document and connects it to a folder hierarchy, along with its text and table chunks.
From: /tf/active/vicechatdev/offline_docstore_multi_vice.py

function create_document_version 64.3% similar

Creates a new controlled document in a document management system with versioning, audit trails, and relationship management in a Neo4j graph database.
From: /tf/active/vicechatdev/CDocs copy/controllers/document_controller.py

function create_document_legacy 63.8% similar

Creates a new controlled document in a document management system with versioning, audit trails, and notifications. Generates document nodes in a graph database with relationships to users and versions.
From: /tf/active/vicechatdev/CDocs/controllers/document_controller.py

function create_folder_hierarchy_v2 63.6% similar

Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, establishing parent-child relationships between folders.
From: /tf/active/vicechatdev/offline_parser_docstore.py

function create_folder_hierarchy_v1 63.2% similar

Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, connecting each folder level with PATH relationships.
From: /tf/active/vicechatdev/offline_docstore_multi.py
