🔍 Code Extractor

Search Components

Full-Text: Fast keyword matching | Semantic: AI-powered understanding of intent (finds similar concepts)

Search Results for "deduplication"

Found 32 matching components

  • function test_reference_system_completeness

    A diagnostic test function that prints a comprehensive overview of a reference system's architecture, including backend storage, API endpoints, reference types, and content flow verification.

    File: /tf/active/vicechatdev/reference_system_verification.py

    testing documentation diagnostic reference-system api-endpoints
  • class OneCo_hybrid_RAG_v2

    A class named OneCo_hybrid_RAG

    File: /tf/active/vicechatdev/OneCo_hybrid_RAG.py

    class oneco_hybrid_rag
  • class DocChatRAG

    Main RAG engine with three operating modes: (1) Basic RAG (similarity search); (2) Extensive (full document retrieval with preprocessing); (3) Full Reading (process all documents).

    File: /tf/active/vicechatdev/docchat/rag_engine.py

    class docchatrag
  • function main_v59

    Command-line interface function that orchestrates the cleaning of ChromaDB collections by removing duplicates and similar documents, with options to skip collections and customize the cleaning process.

    File: /tf/active/vicechatdev/chromadb-cleanup/main.py

    cli command-line chromadb database-cleaning deduplication
  • function clean_collection

    Cleans a ChromaDB collection by removing duplicate and similar documents using hash-based and similarity-based deduplication techniques, then saves the cleaned data to a new collection.

    File: /tf/active/vicechatdev/chromadb-cleanup/main.py

    data-cleaning deduplication chromadb vector-database similarity-detection
  • function load_data_from_chromadb

    Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.

    File: /tf/active/vicechatdev/chromadb-cleanup/main.py

    chromadb vector-database data-loading document-retrieval embeddings
  • function main_v50

    Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.

    File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py

    cli command-line data-cleaning deduplication chromadb
  • function setup_similarity_cleaner

    A pytest fixture that creates and returns a configured SimilarityCleaner instance with a threshold of 0.8 for use in test cases.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    pytest fixture testing similarity data-cleaning
  • function test_identical_text_removal

    A pytest test function that verifies the SimilarityCleaner's ability to remove identical duplicate text entries from a list while preserving unique documents.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    testing pytest unit-test deduplication text-processing
  • function test_nearly_similar_text_handling

    A pytest test function that verifies the SimilarityCleaner's ability to identify and remove near-identical text entries while preserving distinct ones.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    testing pytest text-processing similarity-detection deduplication
  • function test_similarity_threshold_effect

    A pytest test function that validates the behavior of SimilarityCleaner with different similarity threshold values, ensuring that higher thresholds retain more texts while lower thresholds are more aggressive in removing similar content.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    testing pytest text-deduplication similarity-detection data-cleaning
  • class TestCombinedCleaner

    A unittest test class that validates the functionality of the CombinedCleaner class, testing its ability to remove duplicate and similar texts from collections.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_combined_cleaner.py

    unittest testing text-cleaning deduplication similarity-detection
  • function test_remove_identical_chunks

    A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py

    testing pytest unit-test deduplication text-processing
  • function test_no_identical_chunks

    A unit test function that verifies the HashCleaner's behavior when processing a list of unique text chunks, ensuring no chunks are removed when all are distinct.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py

    unit-test pytest hash-cleaner deduplication text-processing
  • function test_identical_chunks_with_different_cases

    A unit test function that verifies that the HashCleaner's duplicate removal is case-sensitive: strings differing only in case are treated as distinct entries and are not removed.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py

    unit-test pytest deduplication case-sensitive text-processing
  • function find_similar_documents

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py

    document-similarity embedding-comparison duplicate-detection cosine-similarity nlp
  • function hash_text

    Creates a SHA-256 hash of normalized text content to generate a unique identifier for documents, enabling duplicate detection and content comparison.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

    hashing text-processing deduplication content-fingerprinting sha256
  • function identify_duplicates

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

    deduplication document-processing hashing data-cleaning duplicate-detection
  • function get_unique_documents

    Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

    deduplication document-processing data-cleaning hashing text-processing
  • class HashCleaner

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py

    deduplication data-cleaning hash-based document-processing duplicate-removal
  • class CombinedCleaner

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py

    document-cleaning deduplication data-processing hash-based similarity-based
  • class SimilarityCleaner

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py

    document-processing deduplication similarity embeddings clustering
  • class BaseCleaner

    Abstract base class that defines the interface for document cleaning implementations, providing methods to remove redundancy from document collections and track cleaning statistics.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/base_cleaner.py

    abstract-base-class document-processing data-cleaning redundancy-removal statistics
  • class OneCo_hybrid_RAG_v3

    A class named OneCo_hybrid_RAG

    File: /tf/active/vicechatdev/vice_ai/hybrid_rag_engine.py

    class oneco_hybrid_rag
  • function check_document_hash_exists

    Checks if a document with a given SHA-256 hash already exists in the database by querying the graph database for matching DocumentVersion nodes.

    File: /tf/active/vicechatdev/CDocs/utils/document_processor.py

    database graph-database neo4j document-management deduplication
  • function load_vendor_list

    Loads unique vendor names from the first column of an Excel file, removing any null values and returning them as a list.

    File: /tf/active/vicechatdev/find_email/extract_vendor_batch.py

    data-loading excel pandas vendor-management file-processing
  • function deephash

    Computes a hash value for any Python object by serializing it to JSON using a custom HashableJSON encoder and returning the hash of the resulting string.

    File: /tf/active/vicechatdev/patches/util.py

    hashing serialization json object-comparison caching
  • function unique_iterator

    A generator function that yields unique elements from an input sequence in order of first appearance, filtering out duplicates.

    File: /tf/active/vicechatdev/patches/util.py

    iterator generator deduplication unique sequence-processing
  • function unique_zip

    Returns a unique list of tuples created by zipping multiple iterables together, removing any duplicate tuples while preserving order.

    File: /tf/active/vicechatdev/patches/util.py

    data-processing itertools zip unique deduplication
  • function unique_array

    Returns an array of unique values from the input array while preserving the original order of first occurrence.

    File: /tf/active/vicechatdev/patches/util.py

    array-processing deduplication unique-values data-cleaning order-preserving
  • function merge_dimensions

    Merges multiple lists of Dimension objects by combining their values while preserving unique dimensions and maintaining order of first appearance.

    File: /tf/active/vicechatdev/patches/util.py

    dimension-merging data-consolidation holoviews metadata-processing deduplication
  • function get_unique_keys

    Extracts unique key values from an ndmapping object for specified dimensions, returning an iterator of unique tuples.

    File: /tf/active/vicechatdev/patches/util.py

    data-extraction multi-dimensional key-extraction deduplication iterator
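
Several of the results above (hash_text, HashCleaner, SimilarityCleaner, CombinedCleaner) describe a two-stage cleanup: exact duplicates are removed by hashing the text, then near-duplicates are removed by comparing embeddings against a threshold. The following is a minimal sketch of that idea, not the code in the listed files. The whitespace-only normalization, the greedy keep-first strategy, the document dictionary layout, and the helper names remove_exact_duplicates, remove_similar, and combined_clean are all assumptions made for illustration; the 0.8 default mirrors the threshold mentioned for the setup_similarity_cleaner fixture.

```python
import hashlib
import math


def hash_text(text):
    """SHA-256 of whitespace-normalized text.

    Assumption: normalization only collapses whitespace, so the hash
    stays case-sensitive.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def remove_exact_duplicates(docs):
    """Keep the first document seen for each distinct hash, preserving order."""
    seen = set()
    unique = []
    for doc in docs:
        digest = hash_text(doc["text"])
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def remove_similar(docs, threshold=0.8):
    """Greedy pass: keep a document only if it is below the threshold
    against every document kept so far."""
    kept = []
    for doc in docs:
        if all(
            cosine_similarity(doc["embedding"], k["embedding"]) < threshold
            for k in kept
        ):
            kept.append(doc)
    return kept


def combined_clean(docs, threshold=0.8):
    """Stage 1: hash-based exact dedup. Stage 2: embedding-based near-dup removal."""
    return remove_similar(remove_exact_duplicates(docs), threshold)
```

Under these assumptions, raising the threshold makes the second stage less aggressive, which matches the behavior described for test_similarity_threshold_effect. The greedy pass compares every document with every kept document, so it is quadratic in the worst case and is only meant for small collections.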
