function calculate_file_hash
Calculates the MD5 hash of a file by reading it in chunks to handle large files efficiently.
/tf/active/vicechatdev/mailsearch/compare_documents.py
43 - 62
simple
Purpose
This function computes the MD5 checksum of a file for integrity verification, duplicate detection, or file comparison. It reads the file in chunks of configurable size instead of loading it into memory at once, so memory use stays bounded regardless of file size. If any error occurs, it prints an error message to the console and returns an empty string.
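As one illustration of the duplicate-detection use case, the sketch below groups the files under a directory by digest. `find_duplicates` is a hypothetical helper and not part of compare_documents.py; the documented function is repeated only so that the block runs on its own.

```python
import hashlib
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def calculate_file_hash(filepath: str, chunk_size: int = 8192) -> str:
    """Same behaviour as the documented function: hex digest, or "" on error."""
    md5 = hashlib.md5()
    try:
        with open(filepath, "rb") as f:
            while chunk := f.read(chunk_size):
                md5.update(chunk)
        return md5.hexdigest()
    except Exception as e:
        print(f"Error hashing {filepath}: {e}")
        return ""


def find_duplicates(directory: str) -> Dict[str, List[str]]:
    """Hypothetical helper: map each MD5 digest to the files that share it."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).rglob("*")):
        if not path.is_file():
            continue
        digest = calculate_file_hash(str(path))
        if digest:  # skip files that could not be hashed
            groups[digest].append(str(path))
    # Keep only digests shared by two or more files.
    return {d: paths for d, paths in groups.items() if len(paths) > 1}
```

Files that fail to hash are skipped on purpose: because every failure returns the same empty string, grouping on it would report all unreadable files as duplicates of each other.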
Source Code
def calculate_file_hash(filepath: str, chunk_size: int = 8192) -> str:
    """
    Calculate MD5 hash of a file

    Args:
        filepath: Path to the file
        chunk_size: Size of chunks to read

    Returns:
        MD5 hash as hex string
    """
    md5 = hashlib.md5()
    try:
        with open(filepath, 'rb') as f:
            while chunk := f.read(chunk_size):
                md5.update(chunk)
        return md5.hexdigest()
    except Exception as e:
        print(f"Error hashing {filepath}: {e}")
        return ""
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| filepath | str | - | positional_or_keyword |
| chunk_size | int | 8192 | positional_or_keyword |
Parameter Details
filepath: Path to the file to hash, given as a string. Absolute and relative paths both work. If the file does not exist or cannot be read, the exception is caught and an empty string is returned.
chunk_size: Number of bytes to read from the file per iteration. The default is 8192 bytes (8 KiB). Larger values may improve throughput on large files at the cost of a larger read buffer. Pass a positive integer: the function does not validate this argument. With 0 the read loop exits immediately and the result is the MD5 of empty input, and a negative value makes the first read return the entire file at once.
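The chunk size changes how the file is read, not what is hashed. A minimal sketch that checks this on a temporary file follows; `md5_in_chunks` is a hypothetical stand-in that mirrors the documented read loop without the error handling.

```python
import hashlib
import os
import tempfile


def md5_in_chunks(filepath: str, chunk_size: int) -> str:
    """Hypothetical stand-in: same read loop as calculate_file_hash, no try/except."""
    md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        while chunk := f.read(chunk_size):
            md5.update(chunk)
    return md5.hexdigest()


# Write 100,000 random bytes to a temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(100_000))
    sample_path = tmp.name

try:
    # Hash the same file with several chunk sizes.
    digests = {size: md5_in_chunks(sample_path, size) for size in (7, 512, 8192, 65536)}
    all_equal = len(set(digests.values())) == 1
    print(f"All chunk sizes give the same digest: {all_equal}")
finally:
    os.remove(sample_path)
```

Because the digest depends only on the byte stream, `chunk_size` can be tuned per call for speed without affecting comparisons between files.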
Return Value
Type: str
Returns a string containing the hexadecimal representation of the MD5 hash (32 characters). Returns an empty string ('') if any error occurs during file reading or hashing (e.g., file not found, permission denied, I/O error).
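An empty but readable file is not an error case: it yields the MD5 of zero bytes, which is a normal 32-character digest. The sketch below contrasts that with a missing file; the temporary paths are illustrative, and the function is repeated so that the block runs on its own.

```python
import hashlib
import os
import tempfile


def calculate_file_hash(filepath: str, chunk_size: int = 8192) -> str:
    """Same behaviour as the documented function: hex digest, or "" on error."""
    md5 = hashlib.md5()
    try:
        with open(filepath, "rb") as f:
            while chunk := f.read(chunk_size):
                md5.update(chunk)
        return md5.hexdigest()
    except Exception as e:
        print(f"Error hashing {filepath}: {e}")
        return ""


with tempfile.TemporaryDirectory() as tmp_dir:
    empty_path = os.path.join(tmp_dir, "empty.bin")
    open(empty_path, "wb").close()                    # create a zero-byte file
    missing_path = os.path.join(tmp_dir, "missing.bin")  # never created

    empty_digest = calculate_file_hash(empty_path)
    missing_digest = calculate_file_hash(missing_path)  # prints an error message

print(empty_digest)          # d41d8cd98f00b204e9800998ecf8427e (MD5 of zero bytes)
print(repr(missing_digest))  # ''
```

Callers that need to tell "file is empty" from "file could not be read" should therefore test for the empty string, not for a particular digest.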
Dependencies
hashlib
Required Imports
import hashlib
Usage Example
import hashlib
# Calculate hash of a file
file_hash = calculate_file_hash('/path/to/myfile.txt')
print(f"MD5 Hash: {file_hash}")
# Use custom chunk size for large files
large_file_hash = calculate_file_hash('/path/to/largefile.bin', chunk_size=65536)
print(f"Large file hash: {large_file_hash}")
# Handle potential errors (empty string returned)
hash_result = calculate_file_hash('/nonexistent/file.txt')
if hash_result:
    print(f"Success: {hash_result}")
else:
    print("Failed to calculate hash")
Best Practices
- MD5 is not collision-resistant and should not be relied on for security; use SHA-256 or a stronger algorithm for security-critical applications
- On error the function prints a message and returns an empty string instead of raising; check the return value, or modify the function to raise exceptions for production use
- The default chunk_size of 8192 bytes is reasonable for most use cases; a larger value may improve performance on very large files
- Ensure the file exists and is readable before calling to avoid error messages being printed
- chunk_size affects only read performance, not the resulting hash, so files hashed with different chunk sizes can still be compared
- The file is opened with a 'with' statement, so it is closed even if reading fails; keep this pattern if you modify the function
- The walrus operator (:=) requires Python 3.8+; ensure compatibility with your Python version
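The first two points above could be combined in a variant along the following lines. This is a sketch under stated assumptions, not code from compare_documents.py: the name `calculate_file_sha256`, the 64 KiB default, and the `chunk_size` check are all hypothetical choices.

```python
import hashlib


def calculate_file_sha256(filepath: str, chunk_size: int = 65536) -> str:
    """Hypothetical variant: SHA-256 digest of a file; I/O errors propagate to the caller."""
    if chunk_size <= 0:
        # Assumed validation; the documented function does not check this.
        raise ValueError("chunk_size must be a positive integer")
    sha256 = hashlib.sha256()
    with open(filepath, "rb") as f:
        while chunk := f.read(chunk_size):
            sha256.update(chunk)
    return sha256.hexdigest()
```

With this shape the caller handles failures explicitly, for example by catching `OSError` around the call, and no sentinel return value is needed.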
Similar Components
AI-powered semantic similarity - components with related functionality:
- function calculate_file_hash_v1 76.9% similar
- function get_file_info 62.3% similar
- function calculate_crc32c 41.5% similar
- function check_file_exists 40.5% similar
- function process_document 39.0% similar