🔍 Code Extractor

function identify_duplicates

Maturity: 55

Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
Lines: 22-41
Complexity: simple

Purpose

This function detects duplicate documents in a collection by hashing their text content and identifying which documents share the same hash value. It's useful for deduplication tasks, data cleaning pipelines, and identifying redundant content in document collections. The function only returns groups where actual duplicates exist (2 or more documents with the same hash).

Source Code

def identify_duplicates(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """
    Identify duplicate documents using their hash values.
    
    Args:
        documents: List of document dictionaries with 'id' and 'text' keys
        
    Returns:
        Dictionary mapping hash values to lists of document IDs with that hash
    """
    hash_to_ids = {}
    
    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash not in hash_to_ids:
            hash_to_ids[doc_hash] = []
        hash_to_ids[doc_hash].append(doc['id'])
    
    # Only return entries with more than one document (actual duplicates)
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}
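
For comparison, the same grouping can be expressed a little more compactly with collections.defaultdict. This is a behavior-equivalent sketch, not the code that ships in hash_utils.py, and it still assumes the same hash_text() helper is importable; the name identify_duplicates_alt is illustrative only.

from collections import defaultdict
from typing import Any, Dict, List

def identify_duplicates_alt(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    # Group document IDs by the hash of their text content.
    hash_to_ids: Dict[str, List[str]] = defaultdict(list)
    for doc in documents:
        hash_to_ids[hash_text(doc['text'])].append(doc['id'])
    # Keep only groups with 2+ IDs (actual duplicates).
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}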

Parameters

Name        Type                   Default  Kind
documents   List[Dict[str, Any]]   -        positional_or_keyword

Parameter Details

documents: A list of dictionaries where each dictionary represents a document. Each dictionary must contain at least two keys: 'id' (a unique identifier for the document, typically a string) and 'text' (the text content of the document as a string). Additional keys in the dictionaries are ignored. The list can be empty, in which case an empty dictionary is returned.
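
A minimal sketch of a valid documents argument; the 'source' key below is purely illustrative and, like any extra key, is ignored by the function:

documents = [
    {'id': 'doc1', 'text': 'hello world'},
    {'id': 'doc2', 'text': 'hello world', 'source': 'upload.pdf'},  # extra keys are ignored
    {'id': 'doc3', 'text': ''},  # empty text is hashed like any other string
]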

Return Value

Type: Dict[str, List[str]]

Returns a dictionary where keys are hash values (strings) and values are lists of document IDs (strings) that share that hash. Only hash values with 2 or more associated document IDs are included in the result, representing actual duplicates. If no duplicates are found, an empty dictionary is returned. The hash values are generated by the hash_text() function, which must be defined elsewhere in the codebase.

Dependencies

  • hashlib

Required Imports

from typing import Dict, List, Any
import hashlib

Usage Example

import hashlib
from typing import Dict, List, Any

def hash_text(text: str) -> str:
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def identify_duplicates(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    hash_to_ids = {}
    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash not in hash_to_ids:
            hash_to_ids[doc_hash] = []
        hash_to_ids[doc_hash].append(doc['id'])
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}

# Example usage
documents = [
    {'id': 'doc1', 'text': 'hello world'},
    {'id': 'doc2', 'text': 'hello world'},
    {'id': 'doc3', 'text': 'Different text'},
    {'id': 'doc4', 'text': 'hello world'}
]

duplicates = identify_duplicates(documents)
print(duplicates)
# Output: {'5eb63bbbe01eeed093cb22bb8f5acdc3': ['doc1', 'doc2', 'doc4']}
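
Building on the result above, a common follow-up in deduplication pipelines is to keep the first document of each duplicate group and flag the rest for removal. This is only a sketch of one possible convention, not necessarily how HashCleaner in this codebase handles it:

# Keep the first ID in each duplicate group, collect the remaining IDs for removal.
ids_to_remove = [doc_id for ids in duplicates.values() for doc_id in ids[1:]]
print(ids_to_remove)
# Output: ['doc2', 'doc4']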

Best Practices

  • Ensure all documents in the input list have both 'id' and 'text' keys to avoid KeyError exceptions (a defensive variant is sketched after this list)
  • The hash_text() function must be defined or imported before using this function
  • Consider the hash collision probability when using this for critical deduplication tasks
  • For large document collections, consider memory usage as all hashes and IDs are stored in memory
  • Document IDs should be unique across the input list to avoid confusion in the results
  • The function preserves the order of document IDs as they appear in the input list
  • Empty text strings will be hashed and can result in duplicates if multiple documents have empty text
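
If missing keys or hash collisions are a concern, the points above can be folded into a more defensive variant. The names identify_duplicates_safe and hash_text_sha256 below are illustrative and not part of hash_utils.py; this sketch swaps in SHA-256 for a lower collision probability and skips documents that lack the required keys.

import hashlib
from typing import Any, Dict, List

def hash_text_sha256(text: str) -> str:
    # SHA-256 offers a far lower collision probability than MD5.
    return hashlib.sha256(text.encode('utf-8')).hexdigest()

def identify_duplicates_safe(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    hash_to_ids: Dict[str, List[str]] = {}
    for doc in documents:
        # Skip malformed entries instead of raising KeyError.
        if 'id' not in doc or 'text' not in doc:
            continue
        hash_to_ids.setdefault(hash_text_sha256(doc['text']), []).append(doc['id'])
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}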

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function get_unique_documents 89.5% similar

    Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
  • function find_similar_documents 67.5% similar

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • class HashCleaner 64.8% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
  • function summarize_text 56.1% similar

    A deprecated standalone function that was originally designed to summarize groups of similar documents but now only returns the input documents unchanged with a deprecation warning.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py
  • function test_remove_identical_chunks 52.5% similar

    A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py