function add_document_to_graph

Maturity: 48

Creates nodes and relationships in a Neo4j graph database for a processed document, including its text and table chunks, connecting it to a folder hierarchy.

File: /tf/active/vicechatdev/offline_docstore_multi.py
Lines: 1181 - 1231
Complexity: moderate

Purpose

This function integrates a processed document into a Neo4j knowledge graph by creating a Document node with metadata, linking it to either a specified subfolder or root folder, and creating child nodes for text and table chunks extracted from the document. It maintains a hierarchical structure of folders and documents with their associated content chunks.

Source Code

def add_document_to_graph(session, processed_doc, deepest_folder_uid):
    """Add processed document to Neo4j graph"""
    file_path = processed_doc["file_path"]
    file_path_escaped = file_path.replace("'", "``")
    filename = processed_doc["file_name"]
    filename_escaped = filename.replace("'", "``")
    text_chunks = processed_doc.get("text_chunks", [])
    table_chunks = processed_doc.get("table_chunks", [])    
    
    # Generate UID for the document
    doc_uid = str(uuid4())
    key = evaluate_query(session,"match (x:Docstores) where not ('Template' in labels(x)) return x.Keys")
    
    # Connect document to folder
    if deepest_folder_uid:
        query = f"MATCH (f:Subfolder {{UID: '{deepest_folder_uid}'}}) " \
               f"MERGE (f)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
               f"Name:'{filename_escaped}', " \
               f"File:'{file_path_escaped}', " \
               f"Type:'{processed_doc['file_type']}', " \
               f"Keys:'{key}'}})"
        run_query(session,query)
    else:
        # Connect to root folder
        query = f"MATCH (x:Rootfolder {{Name:'T001'}}) " \
               f"MERGE (x)-[:PATH]->(n:Document {{UID:'{doc_uid}', " \
               f"Name:'{filename_escaped}', " \
               f"File:'{file_path_escaped}', " \
               f"Type:'{processed_doc['file_type']}', " \
               f"Keys:'{key}'}})"
        run_query(session,query)
    
    # Connect chunks to the document (unchanged)
    for i,text in enumerate(text_chunks):
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                     f"MERGE (x)-[:CHUNK]->(n:Text_chunk {{UID:'{text[2]}',"
                     f"Name:'{filename}:Text:{str(i)}',"
                     f"Text:'{text[1]}',"
                     f"Parent:'{text[0]}',"
                     f"Keys:'{key}'}})")
    
    for i,text in enumerate(table_chunks):
        out=run_query(session,f"MATCH (x {{UID:'{doc_uid}'}}) "
                     f"MERGE (x)-[:CHUNK]->(n:Table_chunk {{UID:'{text[3]}',"
                     f"Name:'{filename}:Table:{str(i)}',"
                     f"Text:'{text[2]}',"
                     f"Html:'{text[1]}',"
                     f"Parent:'{text[0]}',"
                     f"Keys:'{key}'}})")
    
    return doc_uid

Parameters

Name                Type  Default  Kind
session             -     -        positional_or_keyword
processed_doc       -     -        positional_or_keyword
deepest_folder_uid  -     -        positional_or_keyword

Parameter Details

session: Neo4j database session object used to execute Cypher queries against the graph database. Should be an open session obtained from a driver created with neo4j.GraphDatabase.driver(...), for example driver.session().

processed_doc: Dictionary containing document metadata and content. 'file_path', 'file_name' and 'file_type' are read directly and must be present; 'text_chunks' and 'table_chunks' are optional and default to empty lists.

  • 'file_path' (str): full path to the file
  • 'file_name' (str): name of the file
  • 'file_type' (str): document type/extension
  • 'text_chunks' (list of tuples): [(parent, text, uid), ...], the extracted text content
  • 'table_chunks' (list of tuples): [(parent, html, text, uid), ...], each table in both HTML and text form

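deepest_folder_uid: String UID of the deepest subfolder in the hierarchy where this document should be attached. If None or empty, the document is connected to the root folder 'T001'. Should be a valid UUID string matching an existing Subfolder node: if no Subfolder has that UID, the MATCH clause returns no rows, so neither the Document node nor its chunks are created, even though a UID is still returned.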
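
Because the function itself performs no validation, a caller may want to check the shape of processed_doc first. The following is a minimal sketch, not part of the original module: validate_processed_doc and REQUIRED_KEYS are hypothetical names, and the tuple lengths are inferred from how the source indexes each chunk.

```python
# Hypothetical pre-check; not part of offline_docstore_multi.py.
REQUIRED_KEYS = ("file_path", "file_name", "file_type")


def validate_processed_doc(processed_doc):
    """Raise ValueError if processed_doc lacks what add_document_to_graph reads."""
    missing = [k for k in REQUIRED_KEYS if k not in processed_doc]
    if missing:
        raise ValueError(f"processed_doc is missing keys: {missing}")

    # Text chunks are indexed up to [2] in the source: (parent, text, uid)
    for i, chunk in enumerate(processed_doc.get("text_chunks", [])):
        if len(chunk) < 3:
            raise ValueError(f"text_chunks[{i}] needs 3 items, got {len(chunk)}")

    # Table chunks are indexed up to [3] in the source: (parent, html, text, uid)
    for i, chunk in enumerate(processed_doc.get("table_chunks", [])):
        if len(chunk) < 4:
            raise ValueError(f"table_chunks[{i}] needs 4 items, got {len(chunk)}")
```

Calling validate_processed_doc(processed_doc) before add_document_to_graph turns a KeyError or IndexError raised midway through graph creation into an early, descriptive failure.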

Return Value

Returns a string containing the generated UUID (doc_uid) for the newly created Document node in the Neo4j graph. This UID can be used to reference or query the document later.
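
As an illustration of using the returned UID later, the sketch below looks up the names of the chunks attached to a document. It assumes the node labels, property names and CHUNK relationship shown in the source code; fetch_chunk_names and CHUNKS_QUERY are hypothetical names, and the query passes the UID as a Cypher parameter rather than interpolating it.

```python
# Hypothetical helper; label and property names follow the source listing.
CHUNKS_QUERY = (
    "MATCH (d:Document {UID: $uid})-[:CHUNK]->(c) "
    "RETURN c.Name AS name"
)


def fetch_chunk_names(session, doc_uid):
    """Return the Name property of every chunk linked to the given document."""
    # Keyword arguments to session.run() are sent as query parameters
    result = session.run(CHUNKS_QUERY, uid=doc_uid)
    return [record["name"] for record in result]
```

With an open session, fetch_chunk_names(session, doc_uid) returns names of the form 'report.pdf:Text:0' or 'report.pdf:Table:0', in whatever order the database yields them.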

Dependencies

  • neo4j
  • uuid

Required Imports

from uuid import uuid4
from neo4j import GraphDatabase

Usage Example

from neo4j import GraphDatabase

# Assumes add_document_to_graph and its helpers evaluate_query and run_query
# are already defined or imported in this scope.
driver = GraphDatabase.driver('bolt://localhost:7687', auth=('neo4j', 'password'))

processed_doc = {
    'file_path': '/documents/report.pdf',
    'file_name': 'report.pdf',
    'file_type': 'pdf',
    'text_chunks': [
        ('section1', 'This is the first paragraph', 'uuid-text-1'),
        ('section2', 'This is the second paragraph', 'uuid-text-2')
    ],
    'table_chunks': [
        ('table1', '<table><tr><td>Data</td></tr></table>', 'Data', 'uuid-table-1')
    ]
}

# Must be the UID of an existing Subfolder node; pass None to attach to the root folder
folder_uid = 'existing-folder-uuid-123'

try:
    # The session context manager closes the session even if a query fails
    with driver.session() as session:
        doc_uid = add_document_to_graph(session, processed_doc, folder_uid)
        print(f'Document created with UID: {doc_uid}')
finally:
    driver.close()

Best Practices

  • Ensure the Neo4j session is properly opened before calling this function and closed after use
  • The function uses string interpolation for Cypher queries which is vulnerable to injection attacks; consider using parameterized queries instead
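  • Single quotes in file paths and filenames are not escaped but replaced with two backticks (``), which changes the stored value and does not handle other special characters such as backslashes; the chunk Name properties are built from the raw, unreplaced filename, so a filename containing a single quote still breaks the chunk queries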
  • The function assumes helper functions 'evaluate_query' and 'run_query' exist in the scope; ensure these are imported or defined
  • Text content in chunks should be properly escaped before passing to this function to avoid Cypher syntax errors
  • The function does not validate input data structure; ensure processed_doc contains all required keys
  • Consider wrapping the function call in a try-except block to handle Neo4j connection errors
  • The hardcoded root folder name 'T001' should ideally be configurable
  • For large documents with many chunks, consider batching the chunk creation queries for better performance
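
The parameterization and batching suggestions above can be sketched as follows. This is an illustrative alternative, not the module's implementation: the function and constant names are hypothetical, the chunk query matches on the Document label (the source matches any node with that UID), and key is converted with str() to mirror the string interpolation in the source. Each function only builds a (query, parameters) pair, so no database is needed to inspect the result.

```python
# Hypothetical query builders; values travel as parameters, never inside the query text.
DOC_FIELDS = "{UID: $uid, Name: $name, File: $file, Type: $type, Keys: $keys}"
DOC_IN_SUBFOLDER = (
    "MATCH (f:Subfolder {UID: $folder_uid}) "
    "MERGE (f)-[:PATH]->(n:Document " + DOC_FIELDS + ")"
)
DOC_IN_ROOT = (
    "MATCH (f:Rootfolder {Name: $root_name}) "
    "MERGE (f)-[:PATH]->(n:Document " + DOC_FIELDS + ")"
)
# UNWIND creates every text chunk in one round trip
TEXT_CHUNKS = (
    "MATCH (d:Document {UID: $doc_uid}) "
    "UNWIND $chunks AS c "
    "MERGE (d)-[:CHUNK]->(n:Text_chunk {UID: c.uid, Name: c.name, "
    "Text: c.text, Parent: c.parent, Keys: $keys})"
)


def document_statement(doc_uid, processed_doc, key,
                       deepest_folder_uid=None, root_name="T001"):
    """Build the query and parameters that create the Document node."""
    params = {
        "uid": doc_uid,
        "name": processed_doc["file_name"],   # stored unmodified, quotes included
        "file": processed_doc["file_path"],
        "type": processed_doc["file_type"],
        "keys": str(key),
    }
    if deepest_folder_uid:
        params["folder_uid"] = deepest_folder_uid
        return DOC_IN_SUBFOLDER, params
    params["root_name"] = root_name           # root folder name made configurable
    return DOC_IN_ROOT, params


def text_chunk_statement(doc_uid, processed_doc, key):
    """Build one batched query covering all text chunks of the document."""
    name = processed_doc["file_name"]
    chunks = [
        {"parent": parent, "text": text, "uid": uid, "name": f"{name}:Text:{i}"}
        for i, (parent, text, uid) in enumerate(processed_doc.get("text_chunks", []))
    ]
    return TEXT_CHUNKS, {"doc_uid": doc_uid, "chunks": chunks, "keys": str(key)}
```

With the official neo4j driver, each pair can be executed as session.run(*document_statement(...)), since run() accepts the query text followed by a parameter dictionary. Table chunks would follow the same pattern with an additional html field.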

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function add_document_to_graph_v1 98.1% similar

    Creates a Neo4j graph node for a processed document and connects it to a folder hierarchy, along with its text and table chunks.

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py
  • function create_document_legacy 63.8% similar

    Creates a new controlled document in a document management system with versioning, audit trails, and notifications. Generates document nodes in a graph database with relationships to users and versions.

    From: /tf/active/vicechatdev/CDocs/controllers/document_controller.py
  • function create_folder_hierarchy_v2 63.6% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, establishing parent-child relationships between folders.

    From: /tf/active/vicechatdev/offline_parser_docstore.py
  • function create_folder_hierarchy_v1 63.2% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file path, connecting each folder level with PATH relationships.

    From: /tf/active/vicechatdev/offline_docstore_multi.py
  • function create_folder_hierarchy 62.8% similar

    Creates a hierarchical structure of Subfolder nodes in a Neo4j graph database based on a file system path, connecting each folder level with PATH relationships.

    From: /tf/active/vicechatdev/offline_docstore_multi_vice.py