function add_document_to_graph_v1

Maturity: 46

Creates a Neo4j graph node for a processed document and connects it to a folder hierarchy, along with its text and table chunks.

File: /tf/active/vicechatdev/offline_docstore_multi_vice.py

Lines: 1220 - 1275

Complexity: moderate

Purpose

This function integrates a processed document into a Neo4j knowledge graph by creating a Document node with metadata, linking it to either a subfolder or root folder, and creating child nodes for text and table chunks extracted from the document. It's designed for document management systems that use graph databases to represent hierarchical document structures with searchable content chunks.

Source Code

def add_document_to_graph(session, processed_doc, deepest_folder_uid,rootfolder_uid):
    """Add processed document to Neo4j graph"""
    file_path = processed_doc["file_path"]
    file_path_escaped = file_path.replace("'", "``")
    filename = processed_doc["file_name"]
    filename_escaped = filename.replace("'", "``")
    text_chunks = processed_doc.get("text_chunks", [])
    table_chunks = processed_doc.get("table_chunks", [])    
    
    # Generate UID for the document
    doc_uid = str(uuid4())
    key = evaluate_query(session,"match (x:Docstores) where not ('Template' in labels(x)) return x.Keys")
    
    # Connect document to folder
    if deepest_folder_uid:
        query = f"MATCH (f:Subfolder {{UID: '{deepest_folder_uid}'}}) " \
               f"MERGE (f)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
               f"Name:'{filename_escaped}', " \
               f"File:'{file_path_escaped}', " \
               f"Type:'{processed_doc['file_type']}', " \
               f"Keys:'{key}'}})"
        run_query(session,query)
    else:
        # Connect to root folder
        query = f"MATCH (x:Rootfolder {{UID:'{rootfolder_uid}'}}) " \
               f"MERGE (x)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
               f"Name:'{filename_escaped}', " \
               f"File:'{file_path_escaped}', " \
               f"Type:'{processed_doc['file_type']}', " \
               f"Keys:'{key}'}})"
        run_query(session,query)
    
    # Connect chunks to the document (unchanged)
    for i,text in enumerate(text_chunks):
        text1_escaped=text[1].replace("'", "``")
        text0_escaped=text[0].replace("'", "``")
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                     f"MERGE (x)-[:CHUNK]->(n:Text_chunk {{UID:'{text[2]}',"
                     f"Name:'{filename_escaped}:Text:{str(i)}',"
                     f"Text:'{text1_escaped}',"
                     f"Parent:'{text0_escaped}',"
                     f"Keys:'{key}'}})")
    
    for i,text in enumerate(table_chunks):
        text1_escaped=text[1].replace("'", "``")
        text2_escaped=text[2].replace("'", "``")
        text0_escaped=text[0].replace("'", "``")
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                     f"MERGE (x)-[:CHUNK]->(n:Table_chunk {{UID:'{text[3]}',"
                     f"Name:'{filename_escaped}:Table:{str(i)}',"
                     f"Text:'{text2_escaped}',"
                     f"Html:'{text1_escaped}',"
                     f"Parent:'{text0_escaped}',"
                     f"Keys:'{key}'}})")
    
    return doc_uid

Parameters

Name                Type  Default  Kind
session             -     -        positional_or_keyword
processed_doc       -     -        positional_or_keyword
deepest_folder_uid  -     -        positional_or_keyword
rootfolder_uid      -     -        positional_or_keyword

Parameter Details

session: Neo4j database session object used to execute Cypher queries against the graph database. Must be an active session obtained from neo4j.GraphDatabase.driver().session().

processed_doc: Dictionary containing document metadata and content. Expected keys:

  • 'file_path' (str): full path to the file. Required.
  • 'file_name' (str): name of the file. Required.
  • 'file_type' (str): document type or extension. Required.
  • 'text_chunks' (list of tuples): [(parent, text, uid), ...]. Optional; defaults to an empty list.
  • 'table_chunks' (list of tuples): [(parent, html, text, uid), ...]. Optional; defaults to an empty list.

deepest_folder_uid: String UID of the deepest Subfolder node in the hierarchy to which this document should be attached. If it is None, empty, or otherwise falsy, the document is attached to the root folder instead.

rootfolder_uid: String UID of the Rootfolder node in the graph. Used as the fallback when deepest_folder_uid is not provided.
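
The expected shape of processed_doc can be checked before the call. The following is a minimal sketch, not part of the documented module: validate_processed_doc and REQUIRED_KEYS are hypothetical names, and the check is deliberately stricter than the function itself, since it requires every tuple element to be a string and the tuples to have exactly 3 or 4 elements.

```python
# Hypothetical helper; the names below do not exist in the documented module.
REQUIRED_KEYS = ("file_path", "file_name", "file_type")


def validate_processed_doc(processed_doc: dict) -> list:
    """Return a list of problems; an empty list means the dict looks usable."""
    problems = []
    for key in REQUIRED_KEYS:
        if not isinstance(processed_doc.get(key), str):
            problems.append(f"missing or non-string key: {key}")
    # Text chunks are read as (parent, text, uid);
    # table chunks are read as (parent, html, text, uid).
    for name, width in (("text_chunks", 3), ("table_chunks", 4)):
        for i, chunk in enumerate(processed_doc.get(name, [])):
            well_formed = (
                isinstance(chunk, (tuple, list))
                and len(chunk) == width
                and all(isinstance(part, str) for part in chunk)
            )
            if not well_formed:
                problems.append(f"{name}[{i}] must be a {width}-tuple of strings")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller log everything wrong with a document in a single pass.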

Return Value

Returns a string containing the generated UUID (doc_uid) for the newly created Document node in the Neo4j graph. This UID can be used to reference or query this document in subsequent operations.
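
As an illustration of a follow-up operation, the returned UID can be used to look up the chunks attached to the document. This is a sketch under assumptions: fetch_chunk_names is a hypothetical helper, not part of the documented module, and it relies only on the Document label, the CHUNK relationship, and the Name property that the source code above writes.

```python
def fetch_chunk_names(session, doc_uid):
    """Return the Name property of every chunk attached to the given document."""
    query = (
        "MATCH (d:Document {UID: $uid})-[:CHUNK]->(c) "
        "RETURN c.Name AS name"
    )
    # The UID travels as a query parameter rather than being formatted into the string.
    result = session.run(query, {"uid": doc_uid})
    return [record["name"] for record in result]
```

With the document from the usage example, the names would follow the '<file_name>:Text:<index>' and '<file_name>:Table:<index>' patterns set by the function.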

Dependencies

  • neo4j
  • uuid

Required Imports

from uuid import uuid4
from neo4j import GraphDatabase

Usage Example

from neo4j import GraphDatabase

# Assumption: the module documented above is on the import path, together with
# its helpers evaluate_query and run_query.
from offline_docstore_multi_vice import add_document_to_graph

# Prepare processed document data
processed_doc = {
    'file_path': '/documents/report.pdf',
    'file_name': 'report.pdf',
    'file_type': 'pdf',
    'text_chunks': [
        # (parent, text, uid)
        ('Section 1', 'This is the first paragraph', 'chunk-uid-1'),
        ('Section 2', 'This is the second paragraph', 'chunk-uid-2')
    ],
    'table_chunks': [
        # (parent, html, text, uid)
        ('Table 1', '<table><tr><td>Data</td></tr></table>', 'Data', 'table-uid-1')
    ]
}

# Open the driver and session as context managers so both are closed on exit
with GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password')) as driver:
    with driver.session() as session:
        # A Subfolder node with this UID must already exist. If the MATCH finds
        # nothing, no Document node is created, yet a UID is still returned.
        doc_uid = add_document_to_graph(
            session=session,
            processed_doc=processed_doc,
            deepest_folder_uid='subfolder-123',
            rootfolder_uid='root-456'
        )

print(f'Document added with UID: {doc_uid}')

Best Practices

  • Open the Neo4j session with a context manager, or close it explicitly with close(), so that connections are released.
  • Validate that processed_doc contains 'file_path', 'file_name', and 'file_type' before calling this function; a missing key raises KeyError.
  • The function builds Cypher by string formatting and escapes only single quotes (replacing ' with ``). This alters the stored text and leaves other characters, such as backslashes, unhandled, so the function is vulnerable to Cypher injection. Refactor to parameterized queries before production use.
  • Ensure that the helper functions 'evaluate_query' and 'run_query' are defined alongside this function and handle errors appropriately.
  • Consider wrapping the call in a try-except block to handle Neo4j connection errors and query failures.
  • Validate the tuple structure of text_chunks and table_chunks before passing them in; the function indexes each tuple by position.
  • For large documents with many chunks, consider batching the chunk creation queries for better performance.
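
The parameterization and batching advice above can be sketched as follows. This is an illustration, not the module's implementation: add_document_parameterized is a hypothetical name, the keys value is passed in by the caller instead of being read through evaluate_query, and the sketch calls session.run directly rather than run_query. Table chunks are omitted for brevity and would follow the same UNWIND pattern.

```python
from uuid import uuid4


def add_document_parameterized(session, processed_doc, deepest_folder_uid,
                               rootfolder_uid, keys):
    """Sketch: same graph shape as the original, with values sent as parameters."""
    doc_uid = str(uuid4())
    # Labels cannot be query parameters in Cypher, so choose between two fixed literals.
    label = "Subfolder" if deepest_folder_uid else "Rootfolder"
    session.run(
        f"MATCH (f:{label} {{UID: $folder_uid}}) "
        "MERGE (f)-[:PATH]->(:Document {UID: $uid, Name: $name, "
        "File: $file, Type: $type, Keys: $keys})",
        {
            "folder_uid": deepest_folder_uid or rootfolder_uid,
            "uid": doc_uid,
            "name": processed_doc["file_name"],
            "file": processed_doc["file_path"],
            "type": processed_doc["file_type"],
            "keys": keys,
        },
    )
    # Build one row per text chunk; no quote escaping is needed.
    text_rows = [
        {
            "uid": uid,
            "name": f"{processed_doc['file_name']}:Text:{i}",
            "text": text,
            "parent": parent,
        }
        for i, (parent, text, uid) in enumerate(processed_doc.get("text_chunks", []))
    ]
    if text_rows:
        # A single UNWIND query replaces one round trip per chunk.
        session.run(
            "MATCH (d:Document {UID: $doc_uid}) "
            "UNWIND $rows AS row "
            "MERGE (d)-[:CHUNK]->(:Text_chunk {UID: row.uid, Name: row.name, "
            "Text: row.text, Parent: row.parent, Keys: $keys})",
            {"doc_uid": doc_uid, "rows": text_rows, "keys": keys},
        )
    return doc_uid
```

Because values are never spliced into the query text, apostrophes and backslashes are stored unchanged. Matching on the Document label, rather than on an unlabeled node as the original chunk queries do, also lets Neo4j use an index on Document.UID if one exists.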

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function add_document_to_graph 98.1% similar

    Creates nodes and relationships in a Neo4j graph database for a processed document, including its text and table chunks, connecting it to a folder hierarchy.

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • function create_document_legacy 63.6% similar

    Creates a new controlled document in a document management system with versioning, audit trails, and notifications. Generates document nodes in a graph database with relationships to users and versions.

    From: /tf/active/vicechatdev/CDocs/controllers/document_controller.py
  • function create_folder_hierarchy_v2 62.4% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, establishing parent-child relationships between folders.

    From: /tf/active/vicechatdev/offline_parser_docstore.py
  • function create_folder_hierarchy_v1 62.4% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, connecting each folder level with PATH relationships.

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • function create_folder_hierarchy 61.5% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file system path, connecting each folder level with PATH relationships.

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py