function hash_text

Maturity: 53

Creates a SHA-256 hash of normalized text content to serve as a content fingerprint for documents, enabling duplicate detection and content comparison.

File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
Lines: 5 - 19
Complexity: simple

Purpose

This function generates a deterministic hash value for text content by first normalizing the input (converting to lowercase, collapsing each run of whitespace to a single space, and trimming leading and trailing whitespace) and then computing a SHA-256 hash of the UTF-8 encoded result. It is primarily used for identifying duplicate documents, creating content fingerprints, or implementing content-based deduplication. The normalization step ensures that differences in letter case or whitespace alone do not produce different hashes; any other difference, such as punctuation, still does.
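
The normalization described above can be traced step by step. The sketch below is an illustration rather than code from hash_utils.py: the helper name `normalize` is hypothetical and only isolates the expression that hash_text applies before hashing.

```python
import hashlib


def normalize(text: str) -> str:
    """Hypothetical helper: the expression hash_text applies before hashing."""
    # lower() folds case; split() with no arguments splits on any run of
    # whitespace (spaces, tabs, newlines) and drops leading/trailing runs.
    return " ".join(text.lower().split())


def hash_text(text: str) -> str:
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


# Case and whitespace differences disappear after normalization.
same = hash_text("  Hello\tWorld\n") == hash_text("hello world")

# Punctuation is untouched, so these remain distinct documents.
punct_differs = hash_text("Hello, World") != hash_text("Hello World")

# Unicode composition is also untouched: precomposed U+00E9 and
# "e" followed by the combining accent U+0301 hash differently.
unicode_differs = hash_text("caf\u00e9") != hash_text("cafe\u0301")

print(normalize("  Hello\tWorld\n"))  # hello world
print(same, punct_differs, unicode_differs)
```

If composed and decomposed Unicode forms should also match, one option is to apply unicodedata.normalize("NFC", text) before hashing, though that would change the hashes the current function produces.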

Source Code

def hash_text(text: str) -> str:
    """
    Create a hash of the text content to identify identical documents.
    
    Args:
        text: The text to hash
        
    Returns:
        A hexadecimal hash string
    """
    # Normalize text before hashing (lowercase, strip whitespace)
    normalized_text = " ".join(text.lower().split())
    
    # Create SHA-256 hash
    return hashlib.sha256(normalized_text.encode('utf-8')).hexdigest()

Parameters

Name  Type  Default  Kind
text  str   -        positional_or_keyword

Parameter Details

text: A string containing the text content to be hashed. It can be of any length, including empty, and can contain any Unicode characters that are encodable as UTF-8. The text is normalized (lowercased and whitespace-collapsed) before hashing, so inputs that differ only in case or whitespace produce the same result.
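
Because the parameter is a str, raw bytes need to be decoded before they are hashed. A minimal sketch, assuming the bytes are UTF-8 encoded; the variable names are illustrative only.

```python
import hashlib


def hash_text(text: str) -> str:
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()


raw = "Hello World".encode("utf-8")  # e.g. bytes read from a file or socket

# Passing non-empty bytes fails inside " ".join(), which expects str items.
try:
    hash_text(raw)  # type: ignore[arg-type]
    bytes_rejected = False
except TypeError:
    bytes_rejected = True

# Decoding first gives the same hash as the equivalent str.
decoded_hash = hash_text(raw.decode("utf-8"))
print(bytes_rejected, decoded_hash == hash_text("hello world"))
```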

Return Value

Type: str

Returns a string containing the lowercase hexadecimal representation of the SHA-256 hash of the normalized input text. The returned string is always 64 characters long (256 bits, 4 bits per hex digit). The hash is deterministic: inputs that normalize to the same text always produce the same value.
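
The properties of the return value can be checked directly. This is a small sketch that redefines hash_text so it runs on its own; the sample strings are arbitrary.

```python
import hashlib
import string


def hash_text(text: str) -> str:
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()


digest = hash_text("Hello World")

print(len(digest))  # 64
# Every character is a hex digit, and hexdigest() uses lowercase letters.
print(all(c in string.hexdigits for c in digest), digest == digest.lower())
# Deterministic for inputs that normalize to the same text.
print(digest == hash_text("hello   world"))
# Empty and whitespace-only inputs both normalize to the empty string.
print(hash_text("") == hash_text("  \n\t"))
```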

Dependencies

  • hashlib

Required Imports

import hashlib

Usage Example

import hashlib

def hash_text(text: str) -> str:
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode('utf-8')).hexdigest()

# Example usage
text1 = "Hello World"
text2 = "hello   world"  # Different spacing and case
text3 = "Hello Universe"

hash1 = hash_text(text1)
hash2 = hash_text(text2)
hash3 = hash_text(text3)

print(f"Hash 1: {hash1}")
print(f"Hash 2: {hash2}")
print(f"Hash 3: {hash3}")
print(f"Text 1 and 2 are identical: {hash1 == hash2}")  # True
print(f"Text 1 and 3 are identical: {hash1 == hash3}")  # False

Best Practices

  • The function normalizes text by converting to lowercase and collapsing whitespace, so 'Hello World' and 'hello world' will produce the same hash
  • Use this function when you need to identify duplicate content regardless of minor formatting differences
  • The normalization step means that case-sensitive or whitespace-sensitive comparisons should use a different approach
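  • SHA-256 is a cryptographic hash; if you only need fast deduplication (not security), consider a faster hash like MD5 or a non-cryptographic hash, keeping in mind that MD5 has known collision attacks and should not be used where inputs may be adversarial
  • The function accepts a str and encodes it as UTF-8 internally; decode bytes to str (using the correct source encoding) before passing them to this function
  • For very large texts, consider streaming the hash computation to reduce memory usage; a streaming version must still collapse whitespace runs that span chunk boundaries to produce the same hash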
  • Store the returned hash in a database or cache to enable efficient duplicate lookups
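
The last point above can be sketched with an in-memory dictionary standing in for the database or cache. This is an illustration under assumptions: `find_duplicates`, the `docs` mapping, and the document IDs are hypothetical and are not taken from hash_utils.py or from its identify_duplicates function.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def hash_text(text: str) -> str:
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()


def find_duplicates(docs: Dict[str, str]) -> List[List[str]]:
    """Hypothetical helper: group document IDs whose normalized text matches."""
    by_hash = defaultdict(list)  # hash -> IDs of documents with that hash
    for doc_id, text in docs.items():
        by_hash[hash_text(text)].append(doc_id)
    # Only hashes shared by more than one document indicate duplicates.
    return [ids for ids in by_hash.values() if len(ids) > 1]


docs = {
    "a": "Hello World",
    "b": "hello   world",  # differs from "a" only in case and spacing
    "c": "Hello Universe",
}
print(find_duplicates(docs))  # [['a', 'b']]
```

Keying the lookup on the hash keeps each duplicate check to a single dictionary access, and the same idea carries over to an indexed hash column in a database.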

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function clean_text 52.4% similar

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function html_to_text 50.5% similar

    Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.

    From: /tf/active/vicechatdev/CDocs/utils/notifications.py
  • class HashCleaner 50.3% similar

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py
  • function identify_duplicates 48.2% similar

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
  • class HashGenerator 47.9% similar

    A class that provides cryptographic hashing functionality for PDF documents, including hash generation, embedding, and verification for document integrity checking.

    From: /tf/active/vicechatdev/document_auditor/src/security/hash_generator.py