class TestCombinedCleaner

Maturity: 36

A unittest test class for the CombinedCleaner class. It checks that exact duplicate texts are removed from a list while similar but non-identical texts and unrelated texts are kept.

File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_combined_cleaner.py
Lines: 6 - 47
Complexity: simple

Purpose

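This test class contains three unit tests for the CombinedCleaner component. Each one calls clean() on a small list of strings using a cleaner built with its default configuration: (1) test_identical_text_removal checks that an exact duplicate collapses to a single entry; (2) test_similarity_screening checks that the exact duplicate is removed while a similar but non-identical sentence ("This is a similar test.") and an unrelated sentence are both kept; (3) test_combined_functionality repeats that check with three copies of one sentence, a similar sentence, and a unique sentence. None of the tests asserts that a near-duplicate is removed, so the similarity stage is exercised only on inputs it is expected to keep.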
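
The behavior these tests expect can be illustrated with a minimal sketch of a two-stage cleaner: exact duplicates are dropped by hash first, then near-duplicates by a similarity score. This is not the project's CombinedCleaner. The class name SketchCombinedCleaner, the SHA-256 hashing, the difflib-based similarity measure, and the 0.9 threshold are all assumptions made for illustration.

```python
import hashlib
from difflib import SequenceMatcher


class SketchCombinedCleaner:
    """Hypothetical stand-in for CombinedCleaner (not the project's code)."""

    def __init__(self, threshold=0.9):
        # Assumed default: pairs scoring at or above this ratio count as near-duplicates.
        self.threshold = threshold

    def clean(self, texts):
        # Stage 1: drop exact duplicates by content hash, keeping first occurrences.
        seen = set()
        unique = []
        for text in texts:
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                unique.append(text)

        # Stage 2: drop any text that is too similar to one already kept.
        kept = []
        for text in unique:
            if all(SequenceMatcher(None, text, k).ratio() < self.threshold for k in kept):
                kept.append(text)
        return kept


cleaned = SketchCombinedCleaner().clean([
    "This is a test.",
    "This is a test.",
    "This is a similar test.",
    "Completely different text.",
])
print(cleaned)
```

Under this assumed measure, "This is a test." and "This is a similar test." score below 0.9, so both survive, which matches what test_similarity_screening asserts. A real implementation may use a different similarity measure and threshold.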

Source Code

class TestCombinedCleaner(unittest.TestCase):

    def setUp(self):
        self.cleaner = CombinedCleaner()

    def test_identical_text_removal(self):
        texts = [
            "This is a test.",
            "This is a test.",
            "This is another test."
        ]
        cleaned_texts = self.cleaner.clean(texts)
        self.assertEqual(len(cleaned_texts), 2)
        self.assertIn("This is a test.", cleaned_texts)
        self.assertIn("This is another test.", cleaned_texts)

    def test_similarity_screening(self):
        texts = [
            "This is a test.",
            "This is a test.",
            "This is a similar test.",
            "Completely different text."
        ]
        cleaned_texts = self.cleaner.clean(texts)
        self.assertEqual(len(cleaned_texts), 3)
        self.assertIn("This is a test.", cleaned_texts)
        self.assertIn("This is a similar test.", cleaned_texts)
        self.assertIn("Completely different text.", cleaned_texts)

    def test_combined_functionality(self):
        texts = [
            "This is a test.",
            "This is a test.",
            "This is a similar test.",
            "This is a test.",
            "Another unique text."
        ]
        cleaned_texts = self.cleaner.clean(texts)
        self.assertEqual(len(cleaned_texts), 3)
        self.assertIn("This is a test.", cleaned_texts)
        self.assertIn("This is a similar test.", cleaned_texts)
        self.assertIn("Another unique text.", cleaned_texts)

Parameters

Name | Type | Default | Kind
bases | unittest.TestCase | - |

Parameter Details

bases: Inherits from unittest.TestCase, which provides the testing framework infrastructure including assertion methods and test execution capabilities

Return Value

As a test class, it does not return values directly. When instantiated and run by a test runner, it produces test results (pass/fail) for each test method. Individual test methods use assertions to validate expected behavior and raise AssertionError on failure.
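
As a rough illustration of how those results can be read programmatically, the sketch below runs a small test case and inspects the TestResult object returned by the runner. ExampleTests is a hypothetical class used only so the example is self-contained; the same pattern should apply to TestCombinedCleaner.

```python
import io
import unittest


class ExampleTests(unittest.TestCase):
    """Hypothetical test case, used only to have something to run."""

    def test_set_removes_duplicates(self):
        self.assertEqual(len({"a", "a", "b"}), 2)

    def test_membership(self):
        self.assertIn("a", ["a", "b"])


suite = unittest.TestLoader().loadTestsFromTestCase(ExampleTests)
# Write the runner's report to a buffer so that only the summary below is printed.
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)

print("tests run:", result.testsRun)
print("failures:", len(result.failures))
print("errors:", len(result.errors))
print("all passed:", result.wasSuccessful())
```

A failing assertion would show up in result.failures as a pair of the test instance and its formatted traceback.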

Class Interface

Methods

setUp(self) -> None

Purpose: Initializes test fixtures before each test method runs, creating a fresh CombinedCleaner instance

Returns: None - sets up instance attributes for use in test methods

test_identical_text_removal(self) -> None

Purpose: Tests that the CombinedCleaner correctly removes exact duplicate texts from a list, keeping only unique entries

Returns: None - raises AssertionError if test fails

test_similarity_screening(self) -> None

Purpose: Tests that, given an exact duplicate, a similar sentence, and an unrelated sentence, the CombinedCleaner removes only the exact duplicate and keeps the similar and unrelated texts

Returns: None - raises AssertionError if test fails

test_combined_functionality(self) -> None

Purpose: Tests the complete functionality of CombinedCleaner with a complex scenario involving multiple identical duplicates, similar texts, and unique texts

Returns: None - raises AssertionError if test fails

Attributes

Name | Type | Description | Scope
cleaner | CombinedCleaner | Instance of CombinedCleaner being tested, initialized fresh before each test method in setUp | instance

Dependencies

  • unittest
  • src.cleaners.combined_cleaner
  • src.utils.hash_utils
  • src.utils.similarity_utils

Required Imports

import unittest
from src.cleaners.combined_cleaner import CombinedCleaner
# The two imports below are not referenced by TestCombinedCleaner as shown above
from src.utils.hash_utils import hash_text
from src.utils.similarity_utils import calculate_similarity

Usage Example

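import unittest
# Assumes the project root is the working directory so that tests/ is importable
from tests.test_combined_cleaner import TestCombinedCleaner

# Run a single test method through the runner (setUp is called automatically)
single = unittest.TestSuite([TestCombinedCleaner('test_identical_text_removal')])
unittest.TextTestRunner().run(single)

# Run every test method in the class
suite = unittest.TestLoader().loadTestsFromTestCase(TestCombinedCleaner)
unittest.TextTestRunner().run(suite)

# Inside the test module itself, run all tests it defines.
# unittest.main() exits the interpreter when done, so keep it last.
if __name__ == '__main__':
    unittest.main()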

Best Practices

  • The setUp method is called before each test method, ensuring a fresh CombinedCleaner instance for each test to avoid state pollution
  • Tests are independent and can be run in any order without affecting each other
  • Each test method focuses on a specific aspect of functionality (single responsibility)
  • Test method names clearly describe what is being tested
  • Use unittest.main() to run all tests or unittest.TestLoader() for selective test execution
  • Assertions verify both the count of results and the presence of expected items
  • Tests cover repeated exact duplicates and mixes of duplicates with similar texts; they do not cover empty input or a case where a near-duplicate is expected to be removed
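
As a possible tightening of the count-plus-membership pattern noted above, unittest's assertCountEqual compares two collections by element and multiplicity while ignoring order. The sketch below assumes clean() returns a list of strings; the cleaned_texts value is a hypothetical stand-in for that output.

```python
import unittest

# Hypothetical output standing in for self.cleaner.clean(texts).
cleaned_texts = ["This is another test.", "This is a test."]

case = unittest.TestCase()

# Passes: same elements with the same multiplicities, order ignored.
case.assertCountEqual(cleaned_texts, ["This is a test.", "This is another test."])

# A leftover duplicate is caught, because multiplicities are compared too.
try:
    case.assertCountEqual(
        ["This is a test.", "This is a test.", "This is another test."],
        ["This is a test.", "This is another test."],
    )
    caught = False
except AssertionError:
    caught = True
print("duplicate detected:", caught)
```

Unlike a length check followed by assertIn calls, a single assertCountEqual also fails when an unexpected extra item replaces an expected one.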

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function test_identical_text_removal 72.3% similar

    A pytest test function that verifies the SimilarityCleaner's ability to remove identical duplicate text entries from a list while preserving unique documents.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py
  • function test_nearly_similar_text_handling 72.0% similar

    A pytest test function that verifies the SimilarityCleaner's ability to identify and remove nearly similar text entries while preserving distinct ones.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py
  • class CombinedCleaner 70.3% similar

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py
  • function test_identical_chunks_with_different_cases 68.3% similar

    A unit test function that verifies the HashCleaner's ability to remove duplicate text chunks while being case-sensitive, ensuring that strings differing only in case are treated as distinct entries.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py
  • function test_similarity_threshold_effect 68.0% similar

    A pytest test function that validates the behavior of SimilarityCleaner with different similarity threshold values, ensuring that higher thresholds retain more texts while lower thresholds are more aggressive in removing similar content.

    From: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py