Browse Components

  • class BaseCleaner

    Abstract base class that defines the interface for document cleaning implementations, providing methods to remove redundancy from document collections and track cleaning statistics.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/base_cleaner.py | Lines: 6-46

    abstract-base-class document-processing data-cleaning redundancy-removal statistics
  • class SimilarityCleaner

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py | Lines: 8-79

    document-processing deduplication similarity embeddings clustering
  • class CombinedCleaner

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py | Lines: 8-45

    document-cleaning deduplication data-processing hash-based similarity-based
  • class HashCleaner

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py | Lines: 7-36

    deduplication data-cleaning hash-based document-processing duplicate-removal
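The four cleaner classes above share one interface, which the following minimal sketch illustrates. It is not the repository's code: the method names `clean` and `_record`, the `stats` dictionary, the `stages` parameter, and the assumption that each document is a dict with a `"text"` key are all hypothetical choices made for illustration.

```python
import hashlib
from abc import ABC, abstractmethod


class BaseCleaner(ABC):
    """Interface sketch: subclasses remove redundancy and record statistics."""

    def __init__(self):
        # Hypothetical statistics layout; the real class may track other fields.
        self.stats = {"input": 0, "output": 0, "removed": 0}

    @abstractmethod
    def clean(self, documents):
        """Return the documents that survive cleaning."""

    def _record(self, before, after):
        self.stats = {"input": before, "output": after, "removed": before - after}


class HashCleaner(BaseCleaner):
    """Keeps the first document seen for each distinct text hash."""

    def clean(self, documents):
        seen = set()
        kept = []
        for doc in documents:
            # Assumption: documents are dicts with a "text" key; case is preserved.
            digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
            if digest not in seen:
                seen.add(digest)
                kept.append(doc)
        self._record(len(documents), len(kept))
        return kept


class CombinedCleaner(BaseCleaner):
    """Runs cleaners in sequence, e.g. exact-hash first, then similarity."""

    def __init__(self, stages):
        super().__init__()
        self.stages = list(stages)

    def clean(self, documents):
        result = documents
        for stage in self.stages:
            result = stage.clean(result)
        self._record(len(documents), len(result))
        return result
```

Running the cheap exact-match stage first means a later similarity stage has fewer embeddings to compare, which is one plausible reason for the two-stage order described above.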
  • class SummarizationConfig

    A configuration wrapper class that manages settings for a text summarization model by encapsulating a SummarizationModel instance.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/models.py | Lines: 8-17

    configuration summarization model-config wrapper nlp
  • class SummarizationModel

    A Pydantic data model class that defines the configuration schema for a text summarization model, including model name, token limits, and temperature settings.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/models.py | Lines: 3-6

    pydantic data-model configuration validation summarization
  • function create_summary

    Creates a text summary using OpenAI's GPT models or returns a truncated version as fallback when API key is unavailable.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py | Lines: 38-79

    summarization text-processing openai gpt nlp
  • function summarize_text

    A deprecated standalone function that was originally designed to summarize groups of similar documents but now only returns the input documents unchanged with a deprecation warning.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py | Lines: 22-35

    deprecated text-summarization document-processing nlp text-clustering
  • function init_openai_client

    Initializes the OpenAI client by setting the API key from either a provided parameter or environment variable.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/summarization/summarizer.py | Lines: 7-19

    initialization authentication openai api-key configuration
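The key-resolution and fallback behaviour described for `init_openai_client` and `create_summary` can be sketched without calling any external service. Everything here is an assumption for illustration: the helper name `resolve_api_key`, the `OPENAI_API_KEY` environment variable name, the `max_chars` and `summarize_fn` parameters, and the trailing `"..."` marker. The actual model call is left to an injected callable rather than guessed at.

```python
import os


def resolve_api_key(api_key=None):
    # Prefer the explicit parameter, then fall back to the environment.
    # The variable name OPENAI_API_KEY is an assumption.
    return api_key or os.environ.get("OPENAI_API_KEY")


def create_summary(text, max_chars=200, api_key=None, summarize_fn=None):
    """Summarize via summarize_fn when a key is available, else truncate."""
    key = resolve_api_key(api_key)
    if key and summarize_fn is not None:
        # summarize_fn is a hypothetical stand-in for the real model call.
        return summarize_fn(text, key)
    # Fallback path: return the text unchanged if short, otherwise truncate.
    if len(text) <= max_chars:
        return text
    return text[:max_chars].rstrip() + "..."
```

Injecting the model call keeps the fallback path testable offline, at the cost of not showing how the real function talks to the API.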
  • function get_unique_documents

    Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py | Lines: 44-67

    deduplication document-processing data-cleaning hashing text-processing
  • function identify_duplicates

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py | Lines: 22-41

    deduplication document-processing hashing data-cleaning duplicate-detection
  • function hash_text

    Creates a SHA-256 hash of normalized text content to generate a unique identifier for documents, enabling duplicate detection and content comparison.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py | Lines: 5-19

    hashing text-processing deduplication content-fingerprinting sha256
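The three hash utilities above fit together roughly as in the sketch below. The function names come from the listing, but the bodies are assumptions: the normalization step (trimming and collapsing whitespace, with case preserved), the dict-with-`"text"` document shape, and the return types are illustrative guesses rather than the repository's implementation.

```python
import hashlib
from collections import defaultdict


def hash_text(text):
    # Assumed normalization: strip and collapse whitespace, keep letter case.
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def identify_duplicates(documents):
    """Group documents by text hash; return only groups with 2+ members."""
    groups = defaultdict(list)
    for doc in documents:
        groups[hash_text(doc["text"])].append(doc)
    return {h: docs for h, docs in groups.items() if len(docs) > 1}


def get_unique_documents(documents):
    """Split documents into (first occurrences, later duplicates)."""
    seen = set()
    unique, duplicates = [], []
    for doc in documents:
        digest = hash_text(doc["text"])
        if digest in seen:
            duplicates.append(doc)
        else:
            seen.add(digest)
            unique.append(doc)
    return unique, duplicates
```

Because the sketch does not lowercase the text, two strings that differ only in case hash differently, which matches the case-sensitive behaviour the HashCleaner tests in this listing describe.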
  • function find_similar_documents

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py | Lines: 38-71

    document-similarity embedding-comparison duplicate-detection cosine-similarity nlp
  • function build_similarity_matrix

    Computes a pairwise cosine similarity matrix for a collection of embedding vectors, where each cell (i,j) represents the similarity between embedding i and embedding j.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py | Lines: 24-35

    embeddings similarity cosine-similarity matrix nlp
  • function calculate_similarity

    Computes the cosine similarity between two embedding vectors, returning a normalized score between 0 and 1 that measures their directional alignment.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py | Lines: 6-21

    cosine-similarity vector-comparison embeddings similarity-metric machine-learning
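The three similarity utilities above build on one another, as this dependency-free sketch shows. The function names come from the listing; the signatures, the default `threshold` of 0.9, the `>=` comparison, the descending sort order, and the zero-vector handling are assumptions. The sketch returns the raw cosine, so any rescaling or clipping to the 0 to 1 range mentioned above is not shown.

```python
import math


def calculate_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_similarity_matrix(embeddings):
    """Cell (i, j) holds the similarity between embedding i and embedding j."""
    n = len(embeddings)
    return [
        [calculate_similarity(embeddings[i], embeddings[j]) for j in range(n)]
        for i in range(n)
    ]


def find_similar_documents(embeddings, threshold=0.9):
    """Return (i, j, score) for pairs i < j at or above threshold, best first."""
    matrix = build_similarity_matrix(embeddings)
    n = len(embeddings)
    pairs = [
        (i, j, matrix[i][j])
        for i in range(n)
        for j in range(i + 1, n)
        if matrix[i][j] >= threshold
    ]
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs
```

The full matrix costs O(n²) comparisons, so for large collections a vectorized or approximate nearest-neighbour approach would likely be preferable to this pure-Python version.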
  • class TextClusterer

    A class that clusters similar documents based on their embeddings using various clustering algorithms (K-means, Agglomerative, DBSCAN) and optionally generates summaries for each cluster.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/clustering/text_clusterer.py | Lines: 8-171

    clustering document-clustering embeddings machine-learning kmeans
  • function test_identical_chunks_with_different_cases

    A unit test function that verifies the HashCleaner deduplicates case-sensitively: text chunks that differ only in letter case are treated as distinct entries and are not removed.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 38-49

    unit-test pytest deduplication case-sensitive text-processing
  • function test_no_identical_chunks

    A unit test function that verifies the HashCleaner's behavior when processing a list of unique text chunks, ensuring no chunks are removed when all are distinct.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 28-36

    unit-test pytest hash-cleaner deduplication text-processing
  • function test_empty_input_v1

    A pytest test function that verifies the HashCleaner's behavior when processing an empty list of text chunks.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 22-26

    testing unit-test pytest edge-case boundary-condition
  • function test_remove_identical_chunks

    A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py | Lines: 8-20

    testing pytest unit-test deduplication text-processing