function check_document_hash_exists

Maturity: 54

Checks whether a document with a given SHA-256 hash already exists by querying the graph database for a matching DocumentVersion node.

File: /tf/active/vicechatdev/CDocs/utils/document_processor.py
Lines: 509 - 537
Complexity: simple

Purpose

This function helps prevent duplicate document uploads by checking whether a file with the same hash has already been stored. It queries a Neo4j graph database for a DocumentVersion node with a matching hash that is linked to a ControlledDocument through a HAS_VERSION relationship, and returns both the existence status and the UID of that document. A DocumentVersion that is not attached to any ControlledDocument does not count as a match. This is useful for deduplication, version control, and avoiding redundant storage of identical files.

Source Code

def check_document_hash_exists(file_hash: str) -> Tuple[bool, Optional[str]]:
    """
    Check if a document with the given hash already exists.
    
    Args:
        file_hash: SHA-256 hash of the file
        
    Returns:
        Tuple of (exists, document_uid)
    """
    try:
        result = db.run_query(
            """
            MATCH (v:DocumentVersion {hash: $hash})
            MATCH (d:ControlledDocument)-[:HAS_VERSION]->(v)
            RETURN d.UID as docUID, v.UID as versionUID
            LIMIT 1
            """,
            {"hash": file_hash}
        )
        
        if result and 'docUID' in result[0]:
            return True, result[0]['docUID']
            
        return False, None
        
    except Exception as e:
        logger.error(f"Error checking document hash: {e}")
        return False, None

Parameters

Name       Type  Default  Kind
file_hash  str   -        positional_or_keyword

Parameter Details

file_hash: The SHA-256 digest of the file's contents, given as a 64-character hexadecimal string. The query compares it for exact equality with the hash property of DocumentVersion nodes, so it must use the same letter case as the stored values.
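Because the lookup is an exact string match, a malformed or differently cased hash simply returns no match. A minimal sketch of a pre-check is shown here; the helper name is_valid_sha256_hex is hypothetical, and it assumes hashes are stored in lowercase, which is what hashlib's hexdigest() produces.

```python
import re

# Assumption: stored hashes are lowercase hex, as produced by hashlib's hexdigest().
_SHA256_HEX = re.compile(r"[0-9a-f]{64}")


def is_valid_sha256_hex(value) -> bool:
    """Return True if value is a 64-character lowercase hexadecimal string."""
    return isinstance(value, str) and _SHA256_HEX.fullmatch(value) is not None
```

A caller could reject or normalise the input (for example with value.lower()) before passing it to check_document_hash_exists.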

Return Value

Type: Tuple[bool, Optional[str]]

Returns a tuple of two elements: (1) a boolean that is True if a document with the given hash was found and False otherwise, and (2) the document's UID as a string if found, or None if nothing matched or an error occurred. The UID comes from the ControlledDocument node linked to the matching DocumentVersion; the query also fetches the version's UID, but the function does not return it.
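Since (False, None) covers both "no match" and "query failed", a caller that needs to tell the two apart may prefer a stricter variant. The following is only a sketch, not part of CDocs: find_document_uid_by_hash and HashLookupError are hypothetical names, and run_query is passed in as a stand-in for db.run_query, assumed to return a list of dict-like rows as the source code above suggests.

```python
from typing import Any, Callable, Dict, List, Optional

# Hypothetical stand-in for db.run_query: (cypher, params) -> list of row dicts.
QueryFn = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]

_QUERY = """
MATCH (v:DocumentVersion {hash: $hash})
MATCH (d:ControlledDocument)-[:HAS_VERSION]->(v)
RETURN d.UID AS docUID
LIMIT 1
"""


class HashLookupError(RuntimeError):
    """Raised when the lookup itself fails, as opposed to finding no match."""


def find_document_uid_by_hash(run_query: QueryFn, file_hash: str) -> Optional[str]:
    """Return the matching document UID, or None if no document has this hash.

    Unlike the documented function, a query failure raises HashLookupError
    instead of being reported as "not found".
    """
    try:
        rows = run_query(_QUERY, {"hash": file_hash})
    except Exception as exc:
        raise HashLookupError(f"hash lookup failed: {exc}") from exc
    if rows and rows[0].get("docUID") is not None:
        return rows[0]["docUID"]
    return None
```

With this shape, None always means the hash is new, and an upload pipeline can decide separately whether to retry or abort when HashLookupError is raised.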

Dependencies

  • CDocs.db
  • logging

Required Imports

from typing import Tuple, Optional
from CDocs import db
import logging

Usage Example

import hashlib

# Module path inferred from the file location shown above; adjust to your installation
from CDocs.utils.document_processor import check_document_hash_exists

# Calculate file hash
with open('document.pdf', 'rb') as f:
    file_content = f.read()
    file_hash = hashlib.sha256(file_content).hexdigest()

# Check if document exists
exists, doc_uid = check_document_hash_exists(file_hash)

if exists:
    print(f"Document already exists with UID: {doc_uid}")
else:
    print("Document is new, proceed with upload")
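The example above reads the whole file into memory before hashing. For large documents, the same digest can be computed incrementally. This is a small sketch using only the standard library; the helper name sha256_of_file and the 1 MiB chunk size are illustrative choices, not part of CDocs.

```python
import hashlib


def sha256_of_file(path, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # f.read() returns b"" at end of file, which stops the iteration.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

The result is identical to hashing the full contents in one call, so it can be passed to check_document_hash_exists in the same way.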

Best Practices

  • Always ensure the file_hash parameter is a valid SHA-256 hash (64-character hexadecimal string) before calling this function
  • Handle both return values appropriately - check the boolean first before using the document UID
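  • The function returns (False, None) both when no match exists and when the query raises an exception, so the caller cannot tell a database failure from a genuinely new document; the error is only written to the log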
  • This function assumes a 'logger' object exists in the module scope - ensure logging is properly configured
  • The query uses LIMIT 1 without an ORDER BY, so if several documents share the same hash, only one is returned and which one is not guaranteed
  • Each call issues one database query, so checking many hashes one at a time can be slow - a batched query may be more efficient
  • The function catches all exceptions broadly - consider more specific exception handling for production use
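The batching suggestion in the list above could look roughly like the following. This is a sketch under stated assumptions rather than an existing CDocs function: find_existing_hashes is a hypothetical name, run_query stands in for db.run_query and is assumed to return a list of dict-like rows, and the Cypher reuses the node labels and relationship from the source code with a list parameter.

```python
from typing import Any, Callable, Dict, Iterable, List

# Hypothetical stand-in for db.run_query: (cypher, params) -> list of row dicts.
QueryFn = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]

_BATCH_QUERY = """
MATCH (d:ControlledDocument)-[:HAS_VERSION]->(v:DocumentVersion)
WHERE v.hash IN $hashes
RETURN v.hash AS hash, d.UID AS docUID
"""


def find_existing_hashes(run_query: QueryFn, hashes: Iterable[str]) -> Dict[str, str]:
    """Map each hash that already exists to the UID of one document that has it."""
    unique = list(dict.fromkeys(hashes))  # drop duplicates, keep input order
    if not unique:
        return {}  # nothing to look up, so skip the database round trip
    rows = run_query(_BATCH_QUERY, {"hashes": unique})
    found: Dict[str, str] = {}
    for row in rows:
        # Keep the first UID seen per hash if several documents share it.
        found.setdefault(row["hash"], row["docUID"])
    return found
```

Hashes missing from the returned mapping are new. Exceptions from run_query are deliberately left to propagate here, so a failed lookup is not mistaken for an empty result.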

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function check_document_exists_by_uid 62.9% similar

    Queries a Neo4j database to check if a ControlledDocument with a specific UID exists and returns the document object if found.

    From: /tf/active/vicechatdev/CDocs/FC_sync.py
  • function node_exists 60.3% similar

    Checks if a node with a specific UID exists in a Neo4j graph database by querying for the node and returning a boolean result.

    From: /tf/active/vicechatdev/CDocs/db/db_operations.py
  • function check_document_exists_by_doc_number 57.5% similar

    Queries a Neo4j database to check if a ControlledDocument exists with a specific document number and returns the document object if found.

    From: /tf/active/vicechatdev/CDocs/FC_sync.py
  • function compare_document_versions 54.9% similar

    Compares two document versions by their UIDs and generates a summary of changes including metadata differences and hash comparisons.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py
  • function get_document_with_relationships 52.8% similar

    Retrieves a complete document from Neo4j graph database along with all its related entities including versions, reviews, approvals, and authors.

    From: /tf/active/vicechatdev/CDocs/db/db_operations.py