🔍 Code Extractor

Search Results for "cleaning"

Found 50 matching components

  • function clean_text

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    File: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py

    text-processing text-cleaning normalization html-removal markdown-removal
  • function clean_text_for_xml

    Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.

    File: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py

    text-processing xml sanitization word-documents character-encoding
  • function extract_warranty_data_improved

    Parses markdown-formatted warranty documentation to extract structured warranty data including IDs, titles, sections, disclosure text, and reference citations.

    File: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py

    markdown-parsing text-extraction warranty-processing document-parsing regex
  • function clean_text_for_xml_v1

    Sanitizes text strings to ensure XML 1.0 compatibility by removing or replacing invalid control characters and ensuring all characters meet XML specification requirements for Word document generation.

    File: /tf/active/vicechatdev/enhanced_word_converter_fixed.py

    text-processing xml sanitization data-cleaning word-documents
  • function quick_clean

    Cleans flock data by identifying and removing flocks that have treatment records with timing inconsistencies (treatments administered outside the flock's start/end date range).

    File: /tf/active/vicechatdev/quick_cleaner.py

    data-cleaning data-quality flock-management livestock poultry
  • class ImprovedProjectVictoriaGenerator

    Improved Project Victoria Disclosure Generator with proper reference management.

    File: /tf/active/vicechatdev/improved_project_victoria_generator.py

    class improvedprojectvictoriagenerator
  • function create_data_quality_dashboard

    Creates an interactive command-line dashboard for analyzing data quality issues in treatment timing data, specifically focusing on treatments administered outside of flock lifecycle dates.

    File: /tf/active/vicechatdev/data_quality_dashboard.py

    data-quality dashboard interactive menu-driven timing-analysis
  • function show_critical_errors

    Displays critical data quality errors in treatment records, focusing on date anomalies including 1900 dates, extreme future dates, and extreme past dates relative to flock lifecycles.

    File: /tf/active/vicechatdev/data_quality_dashboard.py

    data-quality validation error-reporting date-validation data-cleaning
  • function show_problematic_flocks

    Analyzes and displays problematic flocks from a dataset by identifying those with systematic timing issues in their treatment records, categorizing them by severity and volume.

    File: /tf/active/vicechatdev/data_quality_dashboard.py

    data-quality reporting diagnostics livestock-management data-validation
  • function compare_datasets

    Analyzes and compares two pandas DataFrames containing flock data (original vs cleaned), printing detailed statistics about removed records, type distributions, and impact assessment.

    File: /tf/active/vicechatdev/data_quality_dashboard.py

    data-quality comparison analysis reporting statistics
  • function parse_email_address

    Parses email address strings by handling multiple addresses separated by semicolons and converting them to comma-separated format.

    File: /tf/active/vicechatdev/msg_to_eml.py

    email parsing string-manipulation formatting address-normalization
  • class FileCloudEmailProcessor

    A class that processes email files (.msg format) stored in FileCloud by finding, downloading, converting them to EML and PDF formats, and organizing them into mail_archive folders.

    File: /tf/active/vicechatdev/msg_to_eml.py

    email-processing file-conversion cloud-storage filecloud msg-to-eml
  • function test_collection_creation

    A diagnostic test function that verifies Chroma DB functionality by creating a test collection, adding a document, querying it, and cleaning up.

    File: /tf/active/vicechatdev/test_chroma_collections.py

    testing debugging chroma-db vector-database health-check
  • class ProjectVictoriaDisclosureGenerator

    Main class for generating Project Victoria disclosures from warranty claims.

    File: /tf/active/vicechatdev/project_victoria_disclosure_generator.py

    class projectvictoriadisclosuregenerator
  • function test_incremental_indexing

    Comprehensive test function that validates incremental indexing functionality of a document indexing system, including initial indexing, change detection, re-indexing, and force re-indexing scenarios.

    File: /tf/active/vicechatdev/docchat/test_incremental_indexing.py

    testing incremental-indexing document-indexing integration-test file-system
  • function main_v61

    Command-line interface function that orchestrates the cleaning of ChromaDB collections by removing duplicates and similar documents, with options to skip collections and customize the cleaning process.

    File: /tf/active/vicechatdev/chromadb-cleanup/main.py

    cli command-line chromadb database-cleaning deduplication
  • function clean_collection

    Cleans a ChromaDB collection by removing duplicate and similar documents using hash-based and similarity-based deduplication techniques, then saves the cleaned data to a new collection.

    File: /tf/active/vicechatdev/chromadb-cleanup/main.py

    data-cleaning deduplication chromadb vector-database similarity-detection
  • function load_data_from_chromadb

    Connects to a ChromaDB instance and retrieves all documents from a specified collection, returning them as a list of dictionaries with document IDs, text content, embeddings, and metadata.

    File: /tf/active/vicechatdev/chromadb-cleanup/main.py

    chromadb vector-database data-loading document-retrieval embeddings
  • function main_v52

    Command-line interface function that orchestrates a ChromaDB collection cleaning pipeline by removing duplicate and similar documents through hashing and similarity screening.

    File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py

    cli command-line data-cleaning deduplication chromadb
  • function load_data_from_chromadb_v1

    Retrieves all documents from a specified ChromaDB collection, including their IDs, text content, embeddings, and metadata.

    File: /tf/active/vicechatdev/chromadb-cleanup/main copy.py

    chromadb database document-retrieval vector-database embeddings
  • function setup_similarity_cleaner

    A pytest fixture that creates and returns a configured SimilarityCleaner instance with a threshold of 0.8 for use in test cases.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    pytest fixture testing similarity data-cleaning
  • function test_identical_text_removal

    A pytest test function that verifies the SimilarityCleaner's ability to remove identical duplicate text entries from a list while preserving unique documents.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    testing pytest unit-test deduplication text-processing
  • function test_nearly_similar_text_handling

    A pytest test function that verifies the SimilarityCleaner's ability to identify and remove nearly similar text entries while preserving distinct ones.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    testing pytest text-processing similarity-detection deduplication
  • function test_single_text_input

    A pytest test function that verifies the SimilarityCleaner correctly handles a single text document by returning it unchanged.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    testing unit-test pytest text-processing similarity
  • function test_similarity_threshold_effect

    A pytest test function that validates the behavior of SimilarityCleaner with different similarity threshold values, ensuring that higher thresholds retain more texts while lower thresholds are more aggressive in removing similar content.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_similarity_cleaner.py

    testing pytest text-deduplication similarity-detection data-cleaning
  • class TestCombinedCleaner

    A unittest test class that validates the functionality of the CombinedCleaner class, testing its ability to remove duplicate and similar texts from collections.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_combined_cleaner.py

    unittest testing text-cleaning deduplication similarity-detection
  • function test_remove_identical_chunks

    A pytest test function that verifies the HashCleaner's ability to remove duplicate text chunks from a list while preserving order and unique entries.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py

    testing pytest unit-test deduplication text-processing
  • function test_empty_input_v1

    A pytest test function that verifies the HashCleaner's behavior when processing an empty list of text chunks.

    File: /tf/active/vicechatdev/chromadb-cleanup/tests/test_hash_cleaner.py

    testing unit-test pytest edge-case boundary-condition
  • class Config_v6

    A dataclass that stores configuration settings for a ChromaDB cleanup process, including connection parameters, cleaning/clustering options, and summarization settings.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/config.py

    configuration dataclass chromadb settings cleanup
  • function identify_duplicates

    Identifies duplicate documents by computing hash values of their text content and grouping documents with identical hashes.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

    deduplication document-processing hashing data-cleaning duplicate-detection
  • function get_unique_documents

    Identifies and separates unique documents from duplicates in a list by comparing hash values of document text content.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/hash_utils.py

    deduplication document-processing data-cleaning hashing text-processing
  • class HashCleaner

    A document deduplication cleaner that removes documents with identical content by comparing hash values of document text.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/hash_cleaner.py

    deduplication data-cleaning hash-based document-processing duplicate-removal
  • class CombinedCleaner

    A document cleaner that combines hash-based and similarity-based cleaning approaches to remove both exact and near-duplicate documents in a two-stage process.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/combined_cleaner.py

    document-cleaning deduplication data-processing hash-based similarity-based
  • class SimilarityCleaner

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py

    document-processing deduplication similarity embeddings clustering
  • class BaseCleaner

    Abstract base class that defines the interface for document cleaning implementations, providing methods to remove redundancy from document collections and track cleaning statistics.

    File: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/base_cleaner.py

    abstract-base-class document-processing data-cleaning redundancy-removal statistics
  • function raw_cleanup_database

    Performs raw database cleanup on a SQLite database to identify and fix corrupted chat_session_id values in the text_sections table by converting invalid string representations ('{}', '[]', 'null', '') to NULL.

    File: /tf/active/vicechatdev/vice_ai/raw_database_cleanup.py

    database cleanup maintenance sqlite data-integrity
  • class AgentExecutor

    Agent-based script executor that generates standalone Python files, manages dependencies, and provides iterative debugging capabilities.

    File: /tf/active/vicechatdev/vice_ai/agent_executor.py

    class agentexecutor
  • function check_and_fix_corruption

    Scans a SQLite database for corrupted chat_session_id values in the text_sections table and automatically fixes them by setting invalid entries to NULL.

    File: /tf/active/vicechatdev/vice_ai/direct_corruption_checker.py

    database sqlite data-integrity corruption-detection data-cleaning
  • function clean_html_tags

    Removes HTML tags and entities from text strings, returning clean plain text suitable for PDF display or other formatted output.

    File: /tf/active/vicechatdev/vice_ai/complex_app.py

    html text-processing sanitization string-manipulation pdf-generation
  • class ScriptExecutor

    A sandboxed Python script executor that safely runs user-provided Python code with timeout controls, security restrictions, and isolated execution environments for data analysis tasks.

    File: /tf/active/vicechatdev/vice_ai/script_executor.py

    sandbox script-execution security code-validation data-analysis
  • function cleanup_old_documents

    Periodically removes documents and their associated files that are older than 2 hours from the uploaded_documents dictionary, cleaning up both file system storage and memory.

    File: /tf/active/vicechatdev/vice_ai/app.py

    cleanup maintenance file-management document-management scheduled-task
  • function convert_european_decimals

    Detects and converts numeric data with European decimal format (comma as decimal separator) to standard format (dot as decimal separator) in a pandas DataFrame, handling mixed formats and missing data patterns.

    File: /tf/active/vicechatdev/vice_ai/smartstat_service.py

    data-processing data-cleaning decimal-conversion european-format locale-handling
  • function extract_table_as_markdown

    Extracts a specified row range from a pandas DataFrame and converts it into a properly formatted markdown table with automatic header detection and data cleaning.

    File: /tf/active/vicechatdev/vice_ai/smartstat_service.py

    markdown table-formatting data-conversion pandas dataframe
  • class SmartStatService

    Service for running SmartStat analysis sessions in Vice AI.

    File: /tf/active/vicechatdev/vice_ai/smartstat_service.py

    class smartstatservice
  • class DocumentProcessor_v7

    Lightweight document processor for chat upload functionality.

    File: /tf/active/vicechatdev/vice_ai/document_processor.py

    class documentprocessor
  • function clean_for_json_v1

    Recursively traverses nested data structures (dicts, lists) and replaces NaN and Infinity float values with None to ensure JSON serialization compatibility.

    File: /tf/active/vicechatdev/vice_ai/new_app.py

    json serialization data-cleaning nan-handling infinity-handling
  • function clean_html_tags_v1

    Removes all HTML tags from a given text string using regular expression pattern matching, returning clean text without markup.

    File: /tf/active/vicechatdev/vice_ai/new_app.py

    html text-processing sanitization regex string-manipulation
  • function direct_fix

    A database maintenance function that detects and fixes corrupted chat_session_id values in a SQLite database's text_sections table by identifying invalid patterns and setting them to NULL.

    File: /tf/active/vicechatdev/vice_ai/direct_sqlite_fix.py

    database sqlite maintenance corruption-fix data-cleaning
  • function clean_nan_for_json

    Recursively traverses nested data structures (dicts, lists) and converts NaN, null, and invalid numeric values to None for safe JSON serialization.

    File: /tf/active/vicechatdev/vice_ai/data_analysis_service.py

    json-serialization data-cleaning nan-handling recursive data-preprocessing
  • function check_specific_corruption

    Detects and fixes specific corruption patterns in the chat_session_id column of a SQLite database's text_sections table, replacing invalid values with NULL.

    File: /tf/active/vicechatdev/vice_ai/check_specific_corruption.py

    database sqlite data-cleaning corruption-detection data-repair
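
Several results above (clean_for_json_v1, clean_nan_for_json) are described as recursively replacing NaN and Infinity values with None so that nested data can be serialized to JSON. The listing does not show their source, so the following is only a minimal sketch of that technique; the function name `replace_non_finite` and the sample data are hypothetical and are not taken from the indexed files.

```python
import json
import math


def replace_non_finite(value):
    """Return a copy of `value` with NaN/Infinity floats replaced by None.

    Hypothetical helper: illustrates the recursive traversal described in
    the search results, not the actual indexed implementation.
    """
    if isinstance(value, dict):
        # Rebuild the dict, cleaning each value recursively.
        return {key: replace_non_finite(item) for key, item in value.items()}
    if isinstance(value, list):
        # Rebuild the list, cleaning each element recursively.
        return [replace_non_finite(item) for item in value]
    if isinstance(value, float) and not math.isfinite(value):
        # NaN, +Infinity and -Infinity are not valid JSON numbers.
        return None
    return value


sample = {
    "mean": float("nan"),
    "values": [1.0, float("inf"), {"min": float("-inf")}],
    "n": 3,
}
cleaned = replace_non_finite(sample)

# allow_nan=False makes json.dumps raise if a non-finite float slipped through.
print(json.dumps(cleaned, allow_nan=False))
```

Passing `allow_nan=False` is a design choice in this sketch: by default `json.dumps` emits `NaN` and `Infinity` tokens, which strict JSON parsers reject, so the strict flag acts as a check that the cleaning step covered every value.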

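The hash-based results above (identify_duplicates, get_unique_documents, HashCleaner) are described as hashing each document's text and treating documents with identical hashes as duplicates. A rough sketch of that idea follows. The choice of SHA-256, the `id` and `text` dictionary keys, and both function names are assumptions made for illustration, since the listing does not show the actual code.

```python
import hashlib
from collections import defaultdict


def text_digest(text):
    """Hash the text content (SHA-256 is an assumed choice)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def group_duplicates(documents):
    """Map each digest shared by 2+ documents to the list of their IDs."""
    groups = defaultdict(list)
    for doc in documents:
        groups[text_digest(doc["text"])].append(doc["id"])
    return {digest: ids for digest, ids in groups.items() if len(ids) > 1}


def keep_first_occurrences(documents):
    """Drop later documents whose text hash was already seen, preserving order."""
    seen = set()
    unique = []
    for doc in documents:
        digest = text_digest(doc["text"])
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique


# Hypothetical sample documents; "a" and "c" share identical text.
docs = [
    {"id": "a", "text": "same content"},
    {"id": "b", "text": "different content"},
    {"id": "c", "text": "same content"},
]
duplicate_groups = group_duplicates(docs)
unique_docs = keep_first_occurrences(docs)
print(list(duplicate_groups.values()))
print([doc["id"] for doc in unique_docs])
```

Exact hashing only catches byte-identical text; near-duplicates would need a similarity-based pass such as the one the SimilarityCleaner entry describes.
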
Search Examples