🔍 Code Extractor

function get_unique_documents

Maturity: 52

Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.

File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
Lines: 44-67
Complexity: simple

Purpose

This function processes a list of document dictionaries to filter out duplicates based on text content hashing. It maintains the order of first occurrence for unique documents while collecting all duplicate instances separately. This is useful for deduplication pipelines, data cleaning workflows, and preventing redundant document processing in information retrieval or NLP systems.

Source Code

def get_unique_documents(documents: List[Dict[str, Any]]) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]:
    """
    Get unique documents based on hash values.
    
    Args:
        documents: List of document dictionaries with 'id' and 'text' keys
        
    Returns:
        Tuple of (unique_documents, duplicate_documents)
    """
    unique_docs = []
    duplicate_docs = []
    seen_hashes = set()
    
    for doc in documents:
        doc_hash = hash_text(doc['text'])
        
        if doc_hash in seen_hashes:
            duplicate_docs.append(doc)
        else:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)
    
    return unique_docs, duplicate_docs

Parameters

Name Type Default Kind
documents List[Dict[str, Any]] - positional_or_keyword

Parameter Details

documents: A list of dictionary objects where each dictionary must contain at least a 'text' key with string content to be hashed. Typically also includes an 'id' key for document identification. The function expects well-formed dictionaries with the 'text' key present; missing keys will cause KeyError exceptions.
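If your pipeline cannot guarantee that every dictionary carries a 'text' key, a defensive variant can route malformed entries into a third list instead of raising. This is a hypothetical sketch, not part of hash_utils.py; the function name and the three-way return are assumptions, and hash_text is re-declared here as SHA-256 to keep the example self-contained:

```python
import hashlib
from typing import Any, Dict, List, Tuple

def hash_text(text: str) -> str:
    # SHA-256 of the UTF-8 encoded text (matches the usage example below)
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def get_unique_documents_safe(
    documents: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[Dict[str, Any]], List[Dict[str, Any]]]:
    # Hypothetical variant: documents missing a 'text' key are collected
    # separately instead of raising KeyError.
    unique_docs: List[Dict[str, Any]] = []
    duplicate_docs: List[Dict[str, Any]] = []
    malformed_docs: List[Dict[str, Any]] = []
    seen_hashes = set()
    for doc in documents:
        text = doc.get('text')
        if text is None:
            malformed_docs.append(doc)
            continue
        doc_hash = hash_text(text)
        if doc_hash in seen_hashes:
            duplicate_docs.append(doc)
        else:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)
    return unique_docs, duplicate_docs, malformed_docs
```

The extra list keeps malformed input visible for logging rather than silently dropping it.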

Return Value

Type: Tuple[List[Dict[str, Any]], List[Dict[str, Any]]]

Returns a tuple containing two lists: (1) unique_documents - a list of document dictionaries that appear for the first time based on their text hash, preserving insertion order; (2) duplicate_documents - a list of document dictionaries whose text content hash matches a previously seen document. Both lists contain the original dictionary objects from the input.
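Because both lists hold the original dictionary objects rather than copies, mutating a returned document also mutates the corresponding input entry. A quick identity check, with the function copied from the Source Code section and hash_text assumed to be SHA-256 as in the usage example:

```python
import hashlib

def hash_text(text: str) -> str:
    # Assumed SHA-256 implementation, matching the usage example below
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def get_unique_documents(documents):
    # Copied from the Source Code section above
    unique_docs, duplicate_docs, seen_hashes = [], [], set()
    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash in seen_hashes:
            duplicate_docs.append(doc)
        else:
            seen_hashes.add(doc_hash)
            unique_docs.append(doc)
    return unique_docs, duplicate_docs

docs = [{'id': 1, 'text': 'x'}, {'id': 2, 'text': 'x'}]
unique_docs, duplicate_docs = get_unique_documents(docs)
assert unique_docs[0] is docs[0]      # same object, not a copy
assert duplicate_docs[0] is docs[1]   # mutating it also mutates docs[1]
```

Deep-copy the returned documents if downstream code edits them in place.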

Dependencies

  • hashlib
  • typing

Required Imports

from typing import Dict, List, Tuple, Any
import hashlib

Usage Example

# Minimal hash_text implementation (SHA-256 of the UTF-8 text), as used by get_unique_documents
import hashlib
from typing import Dict, List, Tuple, Any

def hash_text(text: str) -> str:
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

# Sample documents
documents = [
    {'id': 1, 'text': 'Hello world'},
    {'id': 2, 'text': 'Python programming'},
    {'id': 3, 'text': 'Hello world'},
    {'id': 4, 'text': 'Data science'}
]

# Get unique and duplicate documents
unique_docs, duplicate_docs = get_unique_documents(documents)

print(f"Unique documents: {len(unique_docs)}")
print(f"Duplicate documents: {len(duplicate_docs)}")
print(f"Unique: {unique_docs}")
print(f"Duplicates: {duplicate_docs}")

Best Practices

  • Ensure all documents in the input list have a 'text' key to avoid KeyError exceptions
  • The hash_text() function must be available in scope before calling this function
  • Consider memory usage when processing large document collections as all hashes are stored in memory
  • The unique list keeps the first occurrence of each text; all later occurrences go to the duplicate list. Modify the logic if you need different behavior (e.g. keeping the last occurrence)
  • Hash collisions are theoretically possible but extremely rare with cryptographic hash functions like SHA-256
  • For very large datasets, consider using a database or streaming approach instead of in-memory sets
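The last point above can be sketched as a generator that never materializes the full input and stores only the hash set in memory (a raw 32-byte digest per unique text). This streaming variant is hypothetical, not part of hash_utils.py:

```python
import hashlib
from typing import Any, Dict, Iterable, Iterator

def iter_unique_documents(documents: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
    # Hypothetical streaming variant: yields unique documents lazily, so the
    # input can be any iterable (e.g. a file reader or database cursor).
    # Only the set of seen hashes is held in memory; using digest() instead
    # of hexdigest() halves the per-entry footprint.
    seen_hashes = set()
    for doc in documents:
        doc_hash = hashlib.sha256(doc['text'].encode('utf-8')).digest()
        if doc_hash not in seen_hashes:
            seen_hashes.add(doc_hash)
            yield doc
```

Unlike get_unique_documents, this drops duplicates instead of collecting them; add a second sink if you need them.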

Similar Components

Components with related functionality, ranked by AI-powered semantic similarity:

  • function identify_duplicates 89.5% similar

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
  • function find_similar_documents 65.4% similar

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • class HashCleaner 57.3% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
  • function summarize_text 53.4% similar

    A deprecated standalone function that was originally designed to summarize groups of similar documents but now only returns the input documents unchanged with a deprecation warning.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py
  • function test_remove_identical_chunks 49.2% similar

    A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py