function check_document_hash_exists
Checks if a document with a given SHA-256 hash already exists in the database by querying the graph database for matching DocumentVersion nodes.
/tf/active/vicechatdev/CDocs/utils/document_processor.py
509 - 537
simple
Purpose
This function prevents duplicate document uploads by verifying whether a file with the same hash has already been stored. It queries a Neo4j graph database to find DocumentVersion nodes with matching hashes and returns both the existence status and the associated document UID if found. This is useful for deduplication, version control, and avoiding redundant storage of identical files.
Source Code
def check_document_hash_exists(file_hash: str) -> Tuple[bool, Optional[str]]:
    """
    Check if a document with the given hash already exists.

    Args:
        file_hash: SHA-256 hash of the file

    Returns:
        Tuple of (exists, document_uid)
    """
    try:
        result = db.run_query(
            """
            MATCH (v:DocumentVersion {hash: $hash})
            MATCH (d:ControlledDocument)-[:HAS_VERSION]->(v)
            RETURN d.UID as docUID, v.UID as versionUID
            LIMIT 1
            """,
            {"hash": file_hash}
        )
        if result and 'docUID' in result[0]:
            return True, result[0]['docUID']
        return False, None
    except Exception as e:
        logger.error(f"Error checking document hash: {e}")
        return False, None
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_hash | str | - | positional_or_keyword |
Parameter Details
file_hash: A SHA-256 hash string representing the cryptographic hash of a file. This should be a 64-character hexadecimal string generated from the file's contents. Used to uniquely identify document versions in the database.
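The 64-character hexadecimal format described above can be checked before any database call. The following is a minimal sketch; `is_valid_sha256_hex` is a hypothetical helper and not part of the documented module.

```python
import re

# A SHA-256 digest in hex form is exactly 64 hexadecimal characters.
_SHA256_HEX = re.compile(r"[0-9a-fA-F]{64}")


def is_valid_sha256_hex(value) -> bool:
    """Return True if value is a str that looks like a SHA-256 hex digest.

    Hypothetical helper for illustration only.
    """
    return isinstance(value, str) and _SHA256_HEX.fullmatch(value) is not None
```

Note that `hashlib`'s `hexdigest()` returns lowercase hex. If the stored hashes are lowercase as well (an assumption about the database contents), calling `.lower()` on the input before the lookup avoids a case mismatch in the exact-match query.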
Return Value
Type: Tuple[bool, Optional[str]]
Returns a two-element tuple: (1) a boolean that is True if a document with the given hash exists and False otherwise, and (2) the document's UID as a string if found, or None if no match was found or an error occurred. The UID comes from the ControlledDocument node linked to the matching DocumentVersion.
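Because a failed query and a missing document both yield `(False, None)`, a caller cannot tell them apart from the return value alone. One possible alternative is a three-state result, sketched below. The names `HashLookup` and `lookup_document_hash` are hypothetical, and the query function is passed in as a parameter so that the sketch does not depend on `CDocs.db`. Rows are assumed to be dict-like, as in the source code above.

```python
from enum import Enum
from typing import Any, Callable, Dict, List, Optional, Tuple


class HashLookup(Enum):
    FOUND = "found"
    NOT_FOUND = "not_found"
    ERROR = "error"


# Same Cypher query as in the documented function.
QUERY = """
MATCH (v:DocumentVersion {hash: $hash})
MATCH (d:ControlledDocument)-[:HAS_VERSION]->(v)
RETURN d.UID as docUID, v.UID as versionUID
LIMIT 1
"""

RunQuery = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]


def lookup_document_hash(
    run_query: RunQuery, file_hash: str
) -> Tuple[HashLookup, Optional[str]]:
    """Hypothetical variant that reports query failures separately."""
    try:
        rows = run_query(QUERY, {"hash": file_hash})
    except Exception:
        # The query itself failed; existence of the document is unknown.
        return HashLookup.ERROR, None
    if rows and "docUID" in rows[0]:
        return HashLookup.FOUND, rows[0]["docUID"]
    return HashLookup.NOT_FOUND, None
```

With this shape, an upload flow could retry or abort on `ERROR` instead of treating the file as new.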
Dependencies
CDocs.db
logging
Required Imports
from typing import Tuple, Optional
from CDocs import db
import logging
Usage Example
from typing import Tuple, Optional
from CDocs import db
import logging
import hashlib

# Import path inferred from the file location listed above (an assumption)
from CDocs.utils.document_processor import check_document_hash_exists

# Setup logger (assumed to be in module scope)
logger = logging.getLogger(__name__)

# Calculate file hash
with open('document.pdf', 'rb') as f:
    file_content = f.read()
    file_hash = hashlib.sha256(file_content).hexdigest()

# Check if document exists
exists, doc_uid = check_document_hash_exists(file_hash)
if exists:
    print(f"Document already exists with UID: {doc_uid}")
else:
    print("Document is new, proceed with upload")
Best Practices
- Always ensure the file_hash parameter is a valid SHA-256 hash (64-character hexadecimal string) before calling this function
- Handle both return values appropriately - check the boolean first before using the document UID
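- The function returns (False, None) both when no match is found and when the query fails, so the caller cannot distinguish the two cases from the return value; check the logs, or verify database connectivity separately, before treating a file as new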
- This function assumes a 'logger' object exists in the module scope - ensure logging is properly configured
- The database query uses LIMIT 1, so only the first matching document is returned if multiple exist
- Consider the performance implications when checking large numbers of hashes - batch operations may be more efficient
- The function catches all exceptions broadly - consider more specific exception handling for production use
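The batching suggestion above could be sketched as a single query that checks many hashes at once using Cypher's `UNWIND`. The function name `find_existing_hashes` is hypothetical, and the query function is passed in as a parameter with the same `(query, params)` call shape as `db.run_query` in the source. Rows are assumed to be dict-like.

```python
from typing import Any, Callable, Dict, Iterable, List

# One round trip for many hashes; node labels and properties mirror the
# single-hash query in the documented function.
BATCH_QUERY = """
UNWIND $hashes AS h
MATCH (v:DocumentVersion {hash: h})
MATCH (d:ControlledDocument)-[:HAS_VERSION]->(v)
RETURN h AS hash, d.UID AS docUID
"""

RunQuery = Callable[[str, Dict[str, Any]], List[Dict[str, Any]]]


def find_existing_hashes(
    run_query: RunQuery, file_hashes: Iterable[str]
) -> Dict[str, str]:
    """Map each hash that already exists to one associated document UID.

    Hypothetical helper; hashes absent from the result were not found.
    """
    unique = sorted(set(file_hashes))
    if not unique:
        return {}
    rows = run_query(BATCH_QUERY, {"hashes": unique}) or []
    found: Dict[str, str] = {}
    for row in rows:
        # Keep the first document seen per hash, similar to LIMIT 1.
        found.setdefault(row["hash"], row["docUID"])
    return found
```

Unlike the documented function, this sketch lets exceptions from the query propagate, so a failed lookup is not mistaken for a batch of new files.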
Similar Components
AI-powered semantic similarity - components with related functionality:
- function check_document_exists_by_uid 62.9% similar
- function node_exists 60.3% similar
- function check_document_exists_by_doc_number 57.5% similar
- function compare_document_versions 54.9% similar
- function get_document_with_relationships 52.8% similar