function hash_text
Creates a SHA-256 hash of normalized text content to generate a unique identifier for documents, enabling duplicate detection and content comparison.
/tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py
5 - 19
simple
Purpose
This function generates a deterministic hash value for text content by first normalizing the input (converting to lowercase and collapsing whitespace) and then computing a SHA-256 hash. It's primarily used for identifying duplicate documents, creating content fingerprints, or implementing content-based deduplication systems. The normalization step ensures that minor formatting differences don't result in different hashes for semantically identical content.
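To make the normalization step concrete, the sketch below applies the same expression used in the function body to a few inputs. The helper name normalize is hypothetical and exists only for this illustration; it is not part of the module.

```python
def normalize(text: str) -> str:
    # Hypothetical helper: the same expression hash_text applies before hashing.
    # str.split() with no arguments splits on any run of whitespace
    # (spaces, tabs, newlines) and ignores leading/trailing whitespace.
    return " ".join(text.lower().split())


samples = ["Hello World", "  hello\tWORLD\n", "hello\n\nworld"]
normalized = [normalize(s) for s in samples]

# All three inputs reduce to the same string, so they would hash identically.
print(normalized)
print(len(set(normalized)) == 1)
```

Only case and whitespace are normalized. Punctuation is left untouched, so "hello, world" would still normalize (and hash) differently from "hello world".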
Source Code
def hash_text(text: str) -> str:
    """
    Create a hash of the text content to identify identical documents.

    Args:
        text: The text to hash

    Returns:
        A hexadecimal hash string
    """
    # Normalize text before hashing (lowercase, collapse whitespace runs)
    normalized_text = " ".join(text.lower().split())

    # Create SHA-256 hash
    return hashlib.sha256(normalized_text.encode('utf-8')).hexdigest()
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| text | str | - | positional_or_keyword |
Parameter Details
text: A string containing the text content to be hashed. Can be of any length and contain any Unicode characters. The text will be normalized (lowercased and whitespace-collapsed) before hashing to ensure consistent results for similar content with minor formatting differences.
Return Value
Type: str
Returns a string containing the hexadecimal representation of the SHA-256 hash of the normalized input text. The returned string is always 64 characters long (256 bits represented as hex). This hash is deterministic - the same normalized input will always produce the same hash value.
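The properties described above can be checked directly. The following is a minimal sketch that copies the documented function so it runs on its own; the sample input string is arbitrary.

```python
import hashlib
import string


def hash_text(text: str) -> str:
    # Copy of the documented function so this snippet is self-contained.
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()


digest = hash_text("Some document text")

# 256 bits at 4 bits per hex digit gives 64 characters.
assert len(digest) == 64
# hexdigest() emits lowercase hexadecimal digits only.
assert all(ch in string.hexdigits for ch in digest)
assert digest == digest.lower()
# Deterministic: hashing the same input again yields the same digest.
assert digest == hash_text("Some document text")
# Even an empty string produces a full-length digest.
assert len(hash_text("")) == 64
```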
Dependencies
hashlib
Required Imports
import hashlib
Usage Example
import hashlib

def hash_text(text: str) -> str:
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode('utf-8')).hexdigest()

# Example usage
text1 = "Hello World"
text2 = "hello    world"  # Different spacing and case
text3 = "Hello Universe"

hash1 = hash_text(text1)
hash2 = hash_text(text2)
hash3 = hash_text(text3)

print(f"Hash 1: {hash1}")
print(f"Hash 2: {hash2}")
print(f"Hash 3: {hash3}")
print(f"Text 1 and 2 are identical: {hash1 == hash2}")  # True
print(f"Text 1 and 3 are identical: {hash1 == hash3}")  # False
Best Practices
- The function normalizes text by converting to lowercase and collapsing whitespace, so 'Hello World' and 'hello world' will produce the same hash
- Use this function when you need to identify duplicate content regardless of minor formatting differences
- The normalization step means that case-sensitive or whitespace-sensitive comparisons should use a different approach
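- SHA-256 is a cryptographic hash, which is more than deduplication strictly requires. If hashing speed becomes a bottleneck and inputs are not adversarial, a faster alternative such as hashlib.blake2b or a non-cryptographic hash may be worth benchmarking; MD5 is also an option but is vulnerable to deliberately crafted collisions
- The function expects a str and encodes it as UTF-8 internally; if your content arrives as bytes, decode it with the correct encoding before passing it to this function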
- For very large texts, consider streaming the hash computation to reduce memory usage
- Store the returned hash in a database or cache to enable efficient duplicate lookups
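As one way to apply the practices above, the sketch below groups documents by their content hash using an in-memory dictionary. The helper group_by_hash and the sample document IDs are hypothetical and stand in for whatever store (database, cache) the surrounding system actually uses.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def hash_text(text: str) -> str:
    # Copy of the documented function so this snippet is self-contained.
    normalized_text = " ".join(text.lower().split())
    return hashlib.sha256(normalized_text.encode("utf-8")).hexdigest()


def group_by_hash(documents: Dict[str, str]) -> Dict[str, List[str]]:
    """Hypothetical helper: map each content hash to the IDs that share it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for doc_id, text in documents.items():
        groups[hash_text(text)].append(doc_id)
    return dict(groups)


# Hypothetical sample data: "a" and "b" differ only in case and spacing.
docs = {
    "a": "Hello World",
    "b": "hello   world",
    "c": "Hello Universe",
}

groups = group_by_hash(docs)
# Any hash shared by more than one ID marks a set of duplicates.
duplicates = [ids for ids in groups.values() if len(ids) > 1]
print(duplicates)  # [['a', 'b']]
```

Keying the lookup on the hash keeps each duplicate check to a single dictionary (or indexed database) lookup instead of comparing full texts pairwise.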
Similar Components
AI-powered semantic similarity - components with related functionality:
- function clean_text 52.4% similar
- function html_to_text 50.5% similar
- class HashCleaner 50.3% similar
- function identify_duplicates 48.2% similar
- class HashGenerator 47.9% similar