function get_unique_documents
Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.
/tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
44 - 67
simple
Purpose
This function processes a list of document dictionaries to filter out duplicates based on text content hashing. It maintains the order of first occurrence for unique documents while collecting all duplicate instances separately. This is useful for deduplication pipelines, data cleaning workflows, and preventing redundant document processing in information retrieval or NLP systems.
Source Code
def get_unique_documents(documents: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """
    Get unique documents based on hash values.

    Args:
        documents: List of document dictionaries with 'id' and 'text' keys

    Returns:
        Tuple of (unique_documents, duplicate_documents)
    """
    unique_docs = []
    duplicate_docs = []
    seen_hashes = set()

    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash in seen_hashes:
            duplicate_docs.append(doc)
        else:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)

    return unique_docs, duplicate_docs
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| documents | List[Dict[str, Any]] | - | positional_or_keyword |
Parameter Details
documents: A list of dictionary objects where each dictionary must contain at least a 'text' key with string content to be hashed. Typically also includes an 'id' key for document identification. The function expects well-formed dictionaries with the 'text' key present; missing keys will cause KeyError exceptions.
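Because a missing 'text' key raises a KeyError mid-iteration, callers working with untrusted input may want to pre-screen the list. The helper below is a hypothetical sketch (not part of hash_utils.py) that separates well-formed documents from malformed ones before deduplication:

```python
from typing import Any, Dict, List, Tuple

def split_valid_documents(documents: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """Separate documents with a string 'text' field from malformed ones."""
    valid, malformed = [], []
    for doc in documents:
        if isinstance(doc.get('text'), str):
            valid.append(doc)
        else:
            malformed.append(doc)
    return valid, malformed

docs = [{'id': 1, 'text': 'ok'}, {'id': 2}, {'id': 3, 'text': None}]
valid, malformed = split_valid_documents(docs)
# Only the document with id 1 passes the check
```

Only the `valid` list would then be passed to get_unique_documents().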
Return Value
Type: Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]
Returns a tuple containing two lists: (1) unique_documents - a list of document dictionaries that appear for the first time based on their text hash, preserving insertion order; (2) duplicate_documents - a list of document dictionaries whose text content hash matches a previously seen document. Both lists contain the original dictionary objects from the input.
Dependencies
hashlib, typing
Required Imports
from typing import Dict, List, Tuple, Any
import hashlib
Usage Example
# Assuming hash_text function is defined
import hashlib
from typing import Dict, List, Tuple, Any

def hash_text(text: str) -> str:
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Sample documents
documents = [
    {'id': 1, 'text': 'Hello world'},
    {'id': 2, 'text': 'Python programming'},
    {'id': 3, 'text': 'Hello world'},
    {'id': 4, 'text': 'Data science'}
]
# Get unique and duplicate documents
unique_docs, duplicate_docs = get_unique_documents(documents)
print(f"Unique documents: {len(unique_docs)}")
print(f"Duplicate documents: {len(duplicate_docs)}")
print(f"Unique: {unique_docs}")
print(f"Duplicates: {duplicate_docs}")
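Note that exact hashing treats texts differing only in whitespace or casing as distinct. If such near-duplicates should collide, one option (an assumption about your corpus, not behavior of hash_utils.py) is a normalizing variant of hash_text():

```python
import hashlib

def hash_text_normalized(text: str) -> str:
    # Collapse runs of whitespace and lowercase before hashing,
    # so trivially different copies produce the same digest
    normalized = ' '.join(text.lower().split())
    return hashlib.sha256(normalized.encode('utf-8')).hexdigest()

assert hash_text_normalized('Hello  World') == hash_text_normalized('hello world')
```

Whether normalization is appropriate depends on the pipeline; it loses the ability to distinguish formatting-only revisions.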
Best Practices
- Ensure all documents in the input list have a 'text' key to avoid KeyError exceptions
- The hash_text() function must be available in scope before calling this function
- Consider memory usage when processing large document collections as all hashes are stored in memory
- The function preserves the first occurrence of duplicate documents; if you need different behavior, modify the logic accordingly
- Hash collisions are theoretically possible but extremely rare with cryptographic hash functions like SHA-256
- For very large datasets, consider using a database or streaming approach instead of in-memory sets
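For the streaming case mentioned above, a generator-based sketch (a hypothetical alternative, not part of hash_utils.py) avoids materializing the full document list, though the set of seen hashes still grows in memory:

```python
import hashlib
from typing import Any, Dict, Iterable, Iterator

def iter_unique_documents(documents: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    """Yield only first occurrences of each text; accepts any iterable, including streams."""
    seen_hashes = set()
    for doc in documents:
        doc_hash = hashlib.sha256(doc['text'].encode('utf-8')).hexdigest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            yield doc

# Works on a generator without building a list of all documents first
stream = ({'id': i, 'text': t} for i, t in enumerate(['a', 'b', 'a']))
unique = list(iter_unique_documents(stream))
# Yields the documents with text 'a' and 'b'; the repeated 'a' is skipped
```

For truly unbounded streams, the in-memory set would itself need replacing with a database-backed or probabilistic structure (e.g. a Bloom filter, at the cost of false positives).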
Similar Components
AI-powered semantic similarity - components with related functionality:
- function identify_duplicates (89.5% similar)
- function find_similar_documents (65.4% similar)
- class HashCleaner (57.3% similar)
- function summarize_text (53.4% similar)
- function test_remove_identical_chunks (49.2% similar)