function test_similarity_threshold_effect
A pytest test function that validates the behavior of SimilarityCleaner with different similarity threshold values, ensuring that higher thresholds retain more texts while lower thresholds are more aggressive in removing similar content.
/tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py
41 - 59
simple
Purpose
This test function verifies that the SimilarityCleaner class correctly removes duplicate or similar texts based on configurable similarity thresholds. It tests two scenarios: a high threshold (0.9) that should only remove nearly identical texts, and a low threshold (0.5) that should be more aggressive in removing similar content. The test ensures that the cleaner's behavior is predictable and threshold-dependent, which is critical for text deduplication and data cleaning pipelines.
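The threshold effect described above can be illustrated with a small stand-in. This is a sketch under stated assumptions, not the project's implementation: `KeepLastSimilarityCleaner` is a hypothetical class, similarity is measured with the standard library's `difflib.SequenceMatcher`, and a text is dropped whenever a later text is at least `threshold` similar to it. That keep-last policy is one retention rule that happens to satisfy the assertions in this test; the source does not say how the real `SimilarityCleaner` decides which text survives.

```python
from difflib import SequenceMatcher


class KeepLastSimilarityCleaner:
    """Hypothetical stand-in; NOT the project's SimilarityCleaner."""

    def __init__(self, threshold):
        self.threshold = threshold

    def _similar(self, a, b):
        # Character-level similarity in [0, 1]; 1.0 means identical strings.
        return SequenceMatcher(None, a, b).ratio() >= self.threshold

    def clean(self, texts):
        kept = []
        for i, text in enumerate(texts):
            # Assumed policy: a text survives only if no later text is
            # at least `threshold` similar to it.
            if not any(self._similar(text, later) for later in texts[i + 1:]):
                kept.append(text)
        return kept


texts = [
    "Text that is very similar.",
    "Text that is very similar.",
    "Text that is not similar.",
]

cleaned_high = KeepLastSimilarityCleaner(threshold=0.9).clean(texts)
cleaned_low = KeepLastSimilarityCleaner(threshold=0.5).clean(texts)
print(cleaned_high)
print(cleaned_low)
```

With this stand-in, the two distinct sentences score between 0.5 and 0.9, so they count as different at the high threshold and as similar at the low one. A keep-first policy would produce the same lengths but would retain the first sentence at the low threshold, which the membership assertions in this test would reject.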
Source Code
def test_similarity_threshold_effect(setup_similarity_cleaner):
    texts = [
        "Text that is very similar.",
        "Text that is very similar.",
        "Text that is not similar."
    ]
    cleaner_high_threshold = SimilarityCleaner(threshold=0.9)
    cleaner_low_threshold = SimilarityCleaner(threshold=0.5)
    cleaned_high = cleaner_high_threshold.clean(texts)
    cleaned_low = cleaner_low_threshold.clean(texts)
    assert len(cleaned_high) == 2  # Should remove one similar text
    assert len(cleaned_low) == 1  # Should remove both similar texts
    assert "Text that is very similar." in cleaned_high
    assert "Text that is not similar." in cleaned_high
    assert "Text that is very similar." not in cleaned_low
    assert "Text that is not similar." in cleaned_low
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| setup_similarity_cleaner | - | - | positional_or_keyword |
Parameter Details
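setup_similarity_cleaner: A pytest fixture from the SimilarityCleaner test suite. Its definition is not shown here; it likely initializes the test environment, mocks, or shared resources needed by the similarity cleaning tests. Because the test names the fixture as a parameter, pytest resolves and runs it before the test body executes. The test body never uses the fixture's value: it constructs its own SimilarityCleaner instances, so the fixture matters here only for any side effects of its setup.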
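Since the fixture's definition is not part of this page, the following is only a sketch of what a fixture of this shape could look like. Everything in it is an assumption made for illustration: `StubSimilarityCleaner`, `build_default_cleaner`, and the default threshold of 0.8 are hypothetical names and values, and the real `setup_similarity_cleaner` may do something entirely different.

```python
import pytest


class StubSimilarityCleaner:
    """Hypothetical placeholder so the sketch runs without the project code."""

    def __init__(self, threshold=0.8):  # 0.8 is an assumed default
        self.threshold = threshold


def build_default_cleaner():
    # Plain helper so the setup logic can also be called outside pytest.
    return StubSimilarityCleaner(threshold=0.8)


@pytest.fixture
def setup_similarity_cleaner():
    # Hypothetical body: the real fixture is not shown in the source.
    cleaner = build_default_cleaner()
    yield cleaner
    # Any teardown (closing resources, removing temp files) would go here.


def test_fixture_provides_cleaner(setup_similarity_cleaner):
    # pytest injects the yielded object under the fixture's name.
    assert setup_similarity_cleaner.threshold == 0.8
```

Keeping the setup in a plain helper is a design choice for the sketch: pytest does not allow fixture functions to be called directly, so the helper is what other code can reuse.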
Return Value
This function does not return any value (implicitly returns None). As a pytest test function, it uses assertions to validate expected behavior. The test passes if all assertions succeed, and fails if any assertion raises an AssertionError.
Dependencies
- pytest
- src.cleaners.similarity_cleaner
Required Imports
import pytest
from src.cleaners.similarity_cleaner import SimilarityCleaner
Usage Example
# This is a test function meant to be run by pytest
# Run from command line:
# pytest path/to/test_file.py::test_similarity_threshold_effect
# Or run all tests in the file:
# pytest path/to/test_file.py
# Example of how SimilarityCleaner is used within the test:
from src.cleaners.similarity_cleaner import SimilarityCleaner
texts = [
    "Text that is very similar.",
    "Text that is very similar.",
    "Text that is not similar."
]
# High threshold - less aggressive deduplication
cleaner_high = SimilarityCleaner(threshold=0.9)
cleaned_high = cleaner_high.clean(texts)
print(f"High threshold result: {len(cleaned_high)} texts remaining")
# Low threshold - more aggressive deduplication
cleaner_low = SimilarityCleaner(threshold=0.5)
cleaned_low = cleaner_low.clean(texts)
print(f"Low threshold result: {len(cleaned_low)} texts remaining")
Best Practices
- This test should be run as part of a pytest test suite, not as a standalone function
- The setup_similarity_cleaner fixture should be properly defined before running this test
- When implementing SimilarityCleaner, ensure threshold values between 0 and 1 are supported, where higher values mean stricter similarity matching
- The test demonstrates that threshold=0.9 should keep 2 texts (removing only exact duplicates) while threshold=0.5 should keep only 1 text (removing all similar texts)
- Consider running this test alongside other SimilarityCleaner tests to ensure comprehensive coverage of edge cases
- The test assumes deterministic behavior from SimilarityCleaner - if the implementation uses randomness, the test may need adjustment
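The recommendation to support threshold values between 0 and 1 can be sketched as a small validation helper. `validate_threshold` is a hypothetical function introduced for illustration; the source does not show whether or how `SimilarityCleaner` validates its `threshold` argument.

```python
def validate_threshold(threshold):
    """Return the threshold as a float if it lies in [0, 1]; raise otherwise.

    Hypothetical helper, not part of the documented SimilarityCleaner API.
    """
    value = float(threshold)
    # The chained comparison is False for NaN, so NaN is rejected as well.
    if not 0.0 <= value <= 1.0:
        raise ValueError(f"threshold must be between 0 and 1, got {threshold!r}")
    return value


for candidate in (0.5, 0.9, 1.5):
    try:
        print(candidate, "accepted as", validate_threshold(candidate))
    except ValueError as exc:
        print(candidate, "rejected:", exc)
```

A constructor could call a helper like this once so that out-of-range values fail early with a clear message rather than silently changing which texts are removed.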
Similar Components
AI-powered semantic similarity - components with related functionality:
- function test_nearly_similar_text_handling 83.9% similar
- function test_identical_text_removal 81.2% similar
- function setup_similarity_cleaner 80.1% similar
- function test_single_text_input 76.0% similar
- function test_empty_input 69.8% similar