hash_text - Code Extractor

function hash_text

Maturity: 53

Creates a SHA-256 hash of normalized text content to serve as a content-based identifier for documents, enabling duplicate detection and content comparison.

File:
/tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

Lines:
5 - 19

Complexity:
simple

Purpose

This function generates a deterministic hash value for text content by first normalizing the input (converting to lowercase, collapsing runs of whitespace into single spaces, and trimming leading and trailing whitespace) and then computing a SHA-256 hash of the UTF-8 encoded result. It is primarily used for identifying duplicate documents, creating content fingerprints, or implementing content-based deduplication. The normalization step ensures that texts differing only in letter case or whitespace produce the same hash; any other difference, such as punctuation, produces a different hash.

Source Code

def hash_text(text: str) -> str:
    """
    Create a hash of the text content to identify identical documents.
    
    Args:
        text: The text to hash
        
    Returns:
        A hexadecimal hash string
    """
    # Normalize text before hashing (lowercase, collapse whitespace runs, trim ends)
    normalized_text = " ".join(text.lower().split())
    
    # Create SHA-256 hash
    return hashlib.sha256(normalized_text.encode('utf-8')).hexdigest()

Parameters

Name	Type	Default	Kind
`text`	str	-	positional_or_keyword

Parameter Details

text: A string containing the text content to be hashed. It can be of any length. The text is normalized (lowercased and whitespace-collapsed) before hashing, so inputs that differ only in letter case or whitespace produce the same result. The normalized text must be encodable as UTF-8; a string containing lone surrogate code points raises UnicodeEncodeError.
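
As a rough illustration of what the normalization absorbs and what it does not, the sketch below repeats hash_text as shown in the source and hashes a few variants of one string. The sample strings are invented for the example and do not come from the original project.

```python
import hashlib

def hash_text(text: str) -> str:
    # Same logic as the documented function, repeated so the snippet runs on its own
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()

# Illustrative inputs (assumed for this example only)
variants = ["Quarterly Report", "  quarterly\treport\n", "QUARTERLY   REPORT"]
variant_hashes = {hash_text(v) for v in variants}

# Case, tabs, newlines, and repeated spaces are all absorbed by normalization
print(len(variant_hashes))  # 1

# Punctuation is not normalized, so a trailing period changes the hash
punctuated_hash = hash_text("Quarterly report.")
print(punctuated_hash in variant_hashes)  # False
```

If punctuation or markup should also be ignored, that cleanup would have to happen before the text reaches hash_text.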

Return Value

Type: str

Returns a string containing the hexadecimal representation of the SHA-256 hash of the normalized input text. The returned string is always 64 characters long (256 bits represented as hex). This hash is deterministic - the same normalized input will always produce the same hash value.
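
The properties described above can be checked with a short sketch. It also shows one edge case that follows from the normalization: empty and whitespace-only inputs both normalize to the empty string, so they share the SHA-256 digest of empty input. The input strings are placeholders chosen for illustration.

```python
import hashlib

def hash_text(text: str) -> str:
    # Same logic as the documented function, repeated so the snippet runs on its own
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()

digest = hash_text("Some document text")
is_lower_hex = all(c in "0123456789abcdef" for c in digest)
print(len(digest), is_lower_hex)  # 64 True

# Empty and whitespace-only inputs normalize to "", so they hash identically
empty_digest = hashlib.sha256(b"").hexdigest()
print(hash_text("") == empty_digest)       # True
print(hash_text(" \n\t ") == empty_digest)  # True
```

In a deduplication pipeline this means all blank documents are treated as duplicates of one another, which may or may not be desired.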

Dependencies

hashlib

Required Imports

import hashlib

Usage Example

import hashlib

def hash_text(text: str) -> str:
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode('utf-8')).hexdigest()

# Example usage
text1 = "Hello World"
text2 = "hello   world"  # Different spacing and case
text3 = "Hello Universe"

hash1 = hash_text(text1)
hash2 = hash_text(text2)
hash3 = hash_text(text3)

print(f"Hash 1: {hash1}")
print(f"Hash 2: {hash2}")
print(f"Hash 3: {hash3}")
print(f"Text 1 and 2 are identical: {hash1 == hash2}")  # True
print(f"Text 1 and 3 are identical: {hash1 == hash3}")  # False

Best Practices

The function normalizes text by converting to lowercase and collapsing whitespace, so 'Hello World' and 'hello   world' produce the same hash
Use this function when you need to identify duplicate content regardless of differences in letter case or whitespace
Because of the normalization step, case-sensitive or whitespace-sensitive comparisons should hash the raw text instead
SHA-256 is a cryptographic hash, which is stronger than deduplication strictly requires. If hashing proves to be a bottleneck, measure before switching: MD5 is no longer collision-resistant and is not always faster on modern hardware, while a non-cryptographic hash trades collision resistance for speed
The function takes a str and encodes it as UTF-8 internally; decode any bytes input to str with the correct encoding before passing it to this function
For very large texts, note that normalization builds a full lowercased copy of the input in memory; a streaming variant would need to normalize chunk by chunk and feed each piece to the hash object's update() method
Store the returned hash in a database or cache to enable efficient duplicate lookups
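
A minimal sketch of the lookup idea from the last point, using an in-memory dictionary in place of a database or cache. The helper name group_by_hash and the sample documents are hypothetical; this is not the identify_duplicates function from the same module, whose implementation is not shown here.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

def hash_text(text: str) -> str:
    # Same logic as the documented function, repeated so the snippet runs on its own
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()

def group_by_hash(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each content hash to the IDs of documents sharing it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id, text in documents.items():
        groups[hash_text(text)].append(doc_id)
    return dict(groups)

# Illustrative documents keyed by an assumed ID
docs = {"a": "Hello World", "b": "hello   world", "c": "Hello Universe"}
groups = group_by_hash(docs)

# Any hash with more than one ID marks a set of duplicates
duplicates = [ids for ids in groups.values() if len(ids) > 1]
print(duplicates)  # [['a', 'b']]
```

Keying the index by hash keeps each duplicate check to a single dictionary lookup instead of comparing every pair of documents.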

Similar Components

AI-powered semantic similarity - components with related functionality:

function clean_text 52.4% similar

Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.
From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py

function html_to_text 50.5% similar

Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.
From: /tf/active/vicechatdev/CDocs/utils/notifications.py

class HashCleaner 50.3% similar

A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.
From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py

function identify_duplicates 48.2% similar

Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.
From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

class HashGenerator 47.9% similar

A class that provides cryptographic hashing functionality for PDF documents, including hash generation, embedding, and verification for document integrity checking.
From: /tf/active/vicechatdev/document_auditor/src/security/hash_generator.py
