function identify_duplicates
Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.
/tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
22 - 41
simple
Purpose
This function detects duplicate documents in a collection by hashing their text content and identifying which documents share the same hash value. It's useful for deduplication tasks, data cleaning pipelines, and identifying redundant content in document collections. The function only returns groups where actual duplicates exist (2 or more documents with the same hash).
Source Code
def identify_duplicates(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """
    Identify duplicate documents using their hash values.

    Args:
        documents: List of document dictionaries with 'id' and 'text' keys

    Returns:
        Dictionary mapping hash values to lists of document IDs with that hash
    """
    hash_to_ids = {}

    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash not in hash_to_ids:
            hash_to_ids[doc_hash] = []
        hash_to_ids[doc_hash].append(doc['id'])

    # Only return entries with more than one document (actual duplicates)
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| documents | List[Dict[str, Any]] | - | positional_or_keyword |
Parameter Details
documents: A list of dictionaries where each dictionary represents a document. Each dictionary must contain at least two keys: 'id' (a unique identifier for the document, typically a string) and 'text' (the text content of the document as a string). Additional keys in the dictionaries are ignored. The list can be empty, in which case an empty dictionary is returned.
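The input contract described above can be checked before calling the function. The following is a minimal sketch, assuming a hypothetical helper named `find_invalid_documents` that is not part of hash_utils.py; it reports the positions of entries that lack an 'id' key or a string 'text' value.

```python
from typing import Any, Dict, List


def find_invalid_documents(documents: List[Dict[str, Any]]) -> List[int]:
    """Return the list positions of documents lacking a usable 'id' or 'text'.

    Hypothetical helper for illustration; not part of hash_utils.py.
    """
    invalid = []
    for index, doc in enumerate(documents):
        has_id = "id" in doc
        has_text = isinstance(doc.get("text"), str)
        if not (has_id and has_text):
            invalid.append(index)
    return invalid


docs = [
    {"id": "doc1", "text": "Hello world", "source": "a.txt"},  # extra key is fine
    {"id": "doc2"},                                            # missing 'text'
    {"text": "No identifier"},                                 # missing 'id'
]
print(find_invalid_documents(docs))  # [1, 2]
```

Running a check like this first turns a KeyError raised midway through the loop into an explicit list of entries to fix or drop.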
Return Value
Type: Dict[str, List[str]]
Returns a dictionary where keys are hash values (strings) and values are lists of document IDs (strings) that share that hash. Only hash values with 2 or more associated document IDs are included in the result, representing actual duplicates. If no duplicates are found, an empty dictionary is returned. The hash values are generated by the hash_text() function which must be defined elsewhere in the codebase.
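One way to act on this return value is sketched below, assuming the first ID in each group is the copy to keep. The helper name `ids_to_remove` and the placeholder keys are illustrative only and do not come from the codebase.

```python
from typing import Dict, List


def ids_to_remove(duplicates: Dict[str, List[str]]) -> List[str]:
    """Keep the first ID in each duplicate group and return the rest.

    Hypothetical helper; assumes the first-seen document is the one to keep.
    """
    redundant = []
    for ids in duplicates.values():
        redundant.extend(ids[1:])
    return redundant


# Shape of a typical result; the keys are placeholder strings, not real digests.
duplicates = {
    "hash_a": ["doc1", "doc2", "doc4"],
    "hash_b": ["doc7", "doc9"],
}
print(ids_to_remove(duplicates))  # ['doc2', 'doc4', 'doc9']
```

Because every group in the result has at least two IDs, slicing off the first element always leaves at least one ID to remove per group.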
Dependencies
hashlib
Required Imports
from typing import Dict, List, Any
import hashlib
Usage Example
import hashlib
from typing import Dict, List, Any

def hash_text(text: str) -> str:
    # Stand-in for the project's hash_text(); MD5 is used here only for illustration
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def identify_duplicates(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    hash_to_ids = {}
    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash not in hash_to_ids:
            hash_to_ids[doc_hash] = []
        hash_to_ids[doc_hash].append(doc['id'])
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}

# Example usage
documents = [
    {'id': 'doc1', 'text': 'Hello world'},
    {'id': 'doc2', 'text': 'Hello world'},
    {'id': 'doc3', 'text': 'Different text'},
    {'id': 'doc4', 'text': 'Hello world'}
]

duplicates = identify_duplicates(documents)
print(duplicates)
# Output: a single entry keyed by the MD5 hex digest of 'Hello world':
# {'<md5 hex digest of "Hello world">': ['doc1', 'doc2', 'doc4']}
Best Practices
- Ensure all documents in the input list have both 'id' and 'text' keys to avoid KeyError exceptions
- The hash_text() function must be defined or imported before using this function
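- Documents are grouped solely because their hashes match; for critical deduplication tasks, compare the texts within each group before deleting anything, since distinct texts can in principle produce the same hash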
- For large document collections, consider memory usage as all hashes and IDs are stored in memory
- Document IDs should be unique across the input list to avoid confusion in the results
- The function preserves the order of document IDs as they appear in the input list
- Empty text strings will be hashed and can result in duplicates if multiple documents have empty text
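Several of the points above can be combined in a more defensive variant. The sketch below is an illustration under stated assumptions, not the function from hash_utils.py: the name `identify_duplicates_checked` is hypothetical, SHA-256 is swapped in because the algorithm behind hash_text() is not shown here, and malformed entries are skipped rather than raising KeyError.

```python
import hashlib
from collections import defaultdict
from typing import Any, Dict, List


def identify_duplicates_checked(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """Group document IDs by SHA-256 digest, skipping malformed entries.

    Illustrative variant only; the original function relies on hash_text().
    """
    hash_to_ids: Dict[str, List[str]] = defaultdict(list)
    for doc in documents:
        text = doc.get("text")
        if "id" not in doc or not isinstance(text, str):
            continue  # skip the entry instead of raising KeyError
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        hash_to_ids[digest].append(doc["id"])
    # Keep only groups with two or more IDs, as the original function does
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}


docs = [
    {"id": "a", "text": ""},
    {"id": "b", "text": ""},        # empty text still counts as a duplicate
    {"id": "c"},                    # malformed: skipped
    {"id": "d", "text": "unique"},
]
result = identify_duplicates_checked(docs)
print(list(result.values()))  # [['a', 'b']]
```

Silently skipping malformed entries is a design choice; a pipeline that must account for every document may prefer to log or raise instead.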
Similar Components
AI-powered semantic similarity - components with related functionality:
- function get_unique_documents 89.5% similar
- function find_similar_documents 67.5% similar
- class HashCleaner 64.8% similar
- function summarize_text 56.1% similar
- function test_remove_identical_chunks 52.5% similar