identify_duplicates - Code Extractor

function identify_duplicates

Maturity: 55

Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

File:
/tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

Lines:
22 - 41

Complexity:
simple

Purpose

This function detects duplicate documents in a collection by hashing their text content and identifying which documents share the same hash value. It's useful for deduplication tasks, data cleaning pipelines, and identifying redundant content in document collections. The function only returns groups where actual duplicates exist (2 or more documents with the same hash).

Source Code

def identify_duplicates(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    """
    Identify duplicate documents using their hash values.
    
    Args:
        documents: List of document dictionaries with 'id' and 'text' keys
        
    Returns:
        Dictionary mapping hash values to lists of document IDs with that hash
    """
    hash_to_ids = {}
    
    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash not in hash_to_ids:
            hash_to_ids[doc_hash] = []
        hash_to_ids[doc_hash].append(doc['id'])
    
    # Only return entries with more than one document (actual duplicates)
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}

Parameters

Name	Type	Default	Kind
documents	List[Dict[str, Any]]	-	positional_or_keyword

Parameter Details

documents: A list of dictionaries where each dictionary represents a document. Each dictionary must contain at least two keys: 'id' (a unique identifier for the document, typically a string) and 'text' (the text content of the document as a string). Additional keys in the dictionaries are ignored. The list can be empty, in which case an empty dictionary is returned.
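Because the function indexes doc['id'] and doc['text'] directly, one malformed entry aborts the whole call. The following is a minimal sketch of a pre-check that could run before identify_duplicates; the helper name split_valid_documents is hypothetical and not part of hash_utils.py.

```python
from typing import Any, Dict, List, Tuple


def split_valid_documents(
    documents: List[Any],
) -> Tuple[List[Dict[str, Any]], List[int]]:
    """Separate usable documents from malformed ones.

    Hypothetical helper (not in the original module). A document is
    treated as usable when it is a dict with an 'id' key and a string
    'text' value. Returns the usable documents and the input positions
    of the rejected ones.
    """
    valid: List[Dict[str, Any]] = []
    rejected: List[int] = []
    for index, doc in enumerate(documents):
        if isinstance(doc, dict) and "id" in doc and isinstance(doc.get("text"), str):
            valid.append(doc)
        else:
            rejected.append(index)
    return valid, rejected


docs = [
    {"id": "doc1", "text": "Hello world"},
    {"id": "doc2"},            # missing 'text'
    {"text": "orphan text"},   # missing 'id'
]
valid, rejected = split_valid_documents(docs)
print(len(valid), rejected)
```

Only the documents in valid would then be passed on to identify_duplicates; the rejected positions can be logged or repaired separately.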

Return Value

Type: Dict[str, List[str]]

Returns a dictionary where keys are hash values (strings) and values are lists of document IDs (strings) that share that hash. Only hash values with 2 or more associated document IDs are included in the result, representing actual duplicates. If no duplicates are found, an empty dictionary is returned. The hash values are generated by the hash_text() function which must be defined elsewhere in the codebase.
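One common way to consume this return value is to keep one document per group and treat the rest as redundant. The sketch below assumes that "keep the first ID in each group" is the desired policy; the helper name ids_to_remove and the sample hash keys are illustrative only.

```python
from typing import Dict, List


def ids_to_remove(duplicates: Dict[str, List[str]]) -> List[str]:
    """Return every ID except the first one in each duplicate group.

    Hypothetical helper: 'duplicates' is assumed to have the shape
    returned by identify_duplicates (hash -> list of document IDs).
    """
    redundant: List[str] = []
    for ids in duplicates.values():
        redundant.extend(ids[1:])  # keep ids[0], drop the rest
    return redundant


# Illustrative input; real keys would be hash strings from hash_text().
duplicates = {
    "hash-a": ["doc1", "doc2", "doc4"],
    "hash-b": ["doc7", "doc9"],
}
print(ids_to_remove(duplicates))  # ['doc2', 'doc4', 'doc9']
```

Because the ID lists follow input order, "first" here means the document that appeared earliest in the original list.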

Dependencies

hashlib

Required Imports

from typing import Dict, List, Any
import hashlib

Usage Example

import hashlib
from typing import Dict, List, Any

def hash_text(text: str) -> str:
    return hashlib.md5(text.encode('utf-8')).hexdigest()

def identify_duplicates(documents: List[Dict[str, Any]]) -> Dict[str, List[str]]:
    hash_to_ids = {}
    for doc in documents:
        doc_hash = hash_text(doc['text'])
        if doc_hash not in hash_to_ids:
            hash_to_ids[doc_hash] = []
        hash_to_ids[doc_hash].append(doc['id'])
    return {h: ids for h, ids in hash_to_ids.items() if len(ids) > 1}

# Example usage
documents = [
    {'id': 'doc1', 'text': 'Hello world'},
    {'id': 'doc2', 'text': 'Hello world'},
    {'id': 'doc3', 'text': 'Different text'},
    {'id': 'doc4', 'text': 'Hello world'}
]

duplicates = identify_duplicates(documents)
print(duplicates)
# Output: a single group keyed by the MD5 hex digest of 'Hello world':
# {'<md5 hex digest>': ['doc1', 'doc2', 'doc4']}

Best Practices

Ensure every document in the input list has both 'id' and 'text' keys; a missing key raises KeyError
Define or import hash_text() before calling this function; otherwise the call raises NameError
Account for hash collisions in critical deduplication tasks: two different texts that hash to the same value would be reported as duplicates
For large document collections, consider memory usage, since every hash and document ID is held in memory until the function returns
Keep document IDs unique across the input list; a repeated ID with the same text appears more than once in its group and is reported as a duplicate of itself
The function preserves the order of document IDs as they appear in the input list
Empty text strings are hashed like any other text, so multiple documents with empty text are reported as duplicates of each other
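Where hash collisions matter, the hash groups can be confirmed by comparing the underlying text directly. This is a sketch under two assumptions: document IDs are unique, and the original documents are still available. The helper name confirm_duplicates is hypothetical, and the collision in the example is simulated by hand rather than produced by a real hash function.

```python
from collections import defaultdict
from typing import Any, Dict, List


def confirm_duplicates(
    documents: List[Dict[str, Any]],
    duplicates: Dict[str, List[str]],
) -> Dict[str, List[List[str]]]:
    """Split each hash group into sub-groups of truly identical text.

    Hypothetical helper. Sub-groups with a single member are dropped,
    so a hash group made up only of colliding, different texts
    disappears from the result.
    """
    text_by_id = {doc["id"]: doc["text"] for doc in documents}  # assumes unique IDs
    confirmed: Dict[str, List[List[str]]] = {}
    for doc_hash, ids in duplicates.items():
        by_text: Dict[str, List[str]] = defaultdict(list)
        for doc_id in ids:
            by_text[text_by_id[doc_id]].append(doc_id)
        groups = [group for group in by_text.values() if len(group) > 1]
        if groups:
            confirmed[doc_hash] = groups
    return confirmed


documents = [
    {"id": "doc1", "text": "Hello world"},
    {"id": "doc2", "text": "Hello world"},
    {"id": "doc3", "text": "Other text"},  # pretend this collides with doc1/doc2
]
simulated = {"h1": ["doc1", "doc2", "doc3"]}
print(confirm_duplicates(documents, simulated))  # {'h1': [['doc1', 'doc2']]}
```

The extra pass costs one text comparison per grouped document, which is usually small next to the hashing itself.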

Similar Components

AI-powered semantic similarity - components with related functionality:

function get_unique_documents 89.5% similar

Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.
From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

function find_similar_documents 67.5% similar

Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.
From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py

class HashCleaner 64.8% similar

A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.
From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py

function summarize_text 56.1% similar

A deprecated standalone function that was originally designed to summarize groups of similar documents but now only returns the input documents unchanged with a deprecation warning.
From: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py

function test_remove_identical_chunks 52.5% similar

A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.
From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py
