compare_datasets - Code Extractor

function compare_datasets

Maturity: 44

Analyzes and compares two pandas DataFrames containing flock data (original vs cleaned), printing detailed statistics about removed records, type distributions, and impact assessment.

File:
/tf/active/vicechatdev/data_quality_dashboard.py

Lines:
375 - 435

Complexity:
moderate

Purpose

This function compares an original and a cleaned flock dataset for data quality assessment. It computes the overall removal count and percentage, tabulates per-type counts for the 10 most common types, lists the three types with the highest removal rates (considering only types with at least 10 flocks in the original data), and prints a recommendation based on the overall removal percentage: low impact below 5%, moderate impact below 10%, high impact otherwise. The output is a console report with tables and metrics to help data scientists understand the impact of the cleaning process.
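The per-type arithmetic described above can also be expressed as a table of values rather than printed lines. The following is a minimal sketch, not part of the source file: `removal_summary` and its `min_count` parameter are hypothetical names, and the `ranked` column simply marks the types that would qualify for the "most affected" ranking.

```python
import pandas as pd

def removal_summary(original_flocks, cleaned_flocks, min_count=10):
    """Return per-type removal statistics as a DataFrame (hypothetical helper)."""
    original = original_flocks["Type"].value_counts()
    # Align cleaned counts to the original types; a type that vanished counts as 0.
    cleaned = cleaned_flocks["Type"].value_counts().reindex(original.index, fill_value=0)
    summary = pd.DataFrame({"original": original, "cleaned": cleaned})
    summary["removed"] = summary["original"] - summary["cleaned"]
    summary["pct_removed"] = summary["removed"] / summary["original"] * 100
    # Same rule as the report: only types with at least `min_count` flocks are ranked.
    summary["ranked"] = summary["original"] >= min_count
    return summary

# Illustrative data: 12 flocks of type A and 3 of type B; cleaning keeps 9 A flocks.
original = pd.DataFrame({"Type": ["A"] * 12 + ["B"] * 3})
cleaned = original.iloc[3:12]
summary = removal_summary(original, cleaned)
print(summary)
```

Returning a DataFrame makes the same numbers available for sorting, filtering, or export, which the printed report does not allow.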

Source Code

def compare_datasets(original_flocks, cleaned_flocks):
    """Compare original and cleaned datasets."""
    print("\nDATASET COMPARISON ANALYSIS")
    print("=" * 40)
    
    # Basic statistics
    original_count = len(original_flocks)
    cleaned_count = len(cleaned_flocks)
    removed_count = original_count - cleaned_count
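    # Guard against an empty original dataset, which would otherwise raise ZeroDivisionError
    removal_pct = (removed_count / original_count * 100) if original_count > 0 else 0.0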
    
    print(f"OVERVIEW:")
    print(f"Original dataset: {original_count:,} flocks")
    print(f"Cleaned dataset: {cleaned_count:,} flocks")
    print(f"Removed flocks: {removed_count:,} ({removal_pct:.2f}%)")
    
    # Type distribution comparison
    print(f"\nTYPE DISTRIBUTION COMPARISON:")
    original_types = original_flocks['Type'].value_counts()
    cleaned_types = cleaned_flocks['Type'].value_counts()
    
    print(f"{'Type':<8} {'Original':<10} {'Cleaned':<10} {'Removed':<8} {'% Removed':<10}")
    print("-" * 50)
    
    for flock_type in original_types.index[:10]:  # Top 10 types
        orig_count = original_types.get(flock_type, 0)
        clean_count = cleaned_types.get(flock_type, 0)
        removed = orig_count - clean_count
        removed_pct = (removed / orig_count * 100) if orig_count > 0 else 0
        
        # No width on the last column, so the percent sign stays attached to the number
        print(f"{flock_type:<8} {orig_count:<10,} {clean_count:<10,} {removed:<8,} {removed_pct:.1f}%")
    
    # Impact assessment
    print(f"\nIMPACT ASSESSMENT:")
    print(f"- Overall data quality: {100 - removal_pct:.1f}% of flocks retained")
    print(f"- Most affected types: ", end="")
    
    # Find types with highest removal rates
    removal_rates = {}
    for flock_type in original_types.index:
        orig_count = original_types.get(flock_type, 0)
        clean_count = cleaned_types.get(flock_type, 0)
        if orig_count >= 10:  # Only consider types with at least 10 flocks
            removal_rate = ((orig_count - clean_count) / orig_count * 100) if orig_count > 0 else 0
            removal_rates[flock_type] = removal_rate
    
    top_affected = sorted(removal_rates.items(), key=lambda x: x[1], reverse=True)[:3]
    affected_types = [f"{t[0]} ({t[1]:.1f}%)" for t in top_affected]
    print(", ".join(affected_types))
    
    # Recommendations
    print("\nRECOMMENDATIONS:")
    if removal_pct < 5:
        print("✓ Low impact cleaning - dataset remains highly representative")
    elif removal_pct < 10:
        print("⚠ Moderate impact - consider validating removed flocks")
    else:
        print("⚠ High impact cleaning - review removed flocks for false positives")
    
    print(f"✓ Cleaned dataset ready for analysis")
    print(f"✓ {cleaned_count:,} high-quality flocks available")

Parameters

Name	Type	Default	Kind
`original_flocks`	-	-	positional_or_keyword
`cleaned_flocks`	-	-	positional_or_keyword

Parameter Details

original_flocks: A pandas DataFrame containing the original, uncleaned flock dataset. Must have a 'Type' column containing categorical flock type values. Expected to be the raw dataset before any cleaning operations.

cleaned_flocks: A pandas DataFrame containing the cleaned flock dataset after data quality operations. Must have the same structure as original_flocks with a 'Type' column. Should be a subset of original_flocks with problematic records removed.
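These requirements are not checked by the function itself. A minimal pre-flight check might look like the sketch below; `check_inputs` is a hypothetical helper, and the subset test assumes that cleaning filtered rows without resetting the index, so every row label in the cleaned frame also exists in the original.

```python
import pandas as pd

def check_inputs(original_flocks, cleaned_flocks):
    """Return a list of problems found (hypothetical helper, not in the source file)."""
    problems = []
    for name, df in (("original_flocks", original_flocks),
                     ("cleaned_flocks", cleaned_flocks)):
        if "Type" not in df.columns:
            problems.append(f"{name} has no 'Type' column")
    if len(original_flocks) == 0:
        problems.append("original_flocks is empty; the removal percentage is undefined")
    # Assumption: row labels were preserved during cleaning (no reset_index).
    if not cleaned_flocks.index.isin(original_flocks.index).all():
        problems.append("cleaned_flocks contains rows not present in original_flocks")
    return problems

original = pd.DataFrame({"Type": ["Broiler", "Layer", "Duck"]})
cleaned = original[original["Type"] != "Layer"]
ok_result = check_inputs(original, cleaned)
bad_result = check_inputs(original, cleaned.rename(columns={"Type": "Kind"}))
print(ok_result, bad_result)
```

If the cleaning step does reset the index, the subset check would need a stable identifier column instead of row labels.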

Return Value

This function returns None. It produces side effects by printing a formatted comparison report to stdout, including overview statistics, type distribution tables, impact assessment, and recommendations.
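Because the report only goes to stdout, a caller who needs the text (for a log, a test, or a notebook cell) has to capture it. One way to do that with the standard library is sketched below; `capture_report` is a hypothetical wrapper, and `demo_report` is a stand-in for `compare_datasets` so that the block runs on its own.

```python
import io
from contextlib import redirect_stdout

def capture_report(report_func, *args, **kwargs):
    """Run a printing function and return everything it wrote to stdout."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        report_func(*args, **kwargs)
    return buffer.getvalue()

# Stand-in for compare_datasets, used only to keep this sketch self-contained.
def demo_report(original_count, cleaned_count):
    print(f"Removed flocks: {original_count - cleaned_count:,}")

text = capture_report(demo_report, 1200, 1100)
print(text, end="")
```

With the real function, the call would be `capture_report(compare_datasets, original_flocks, cleaned_flocks)`.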

Dependencies

pandas

Required Imports

import pandas as pd

Usage Example

import pandas as pd

# Create sample datasets
original_flocks = pd.DataFrame({
    'Type': ['Broiler', 'Layer', 'Broiler', 'Turkey', 'Layer', 'Broiler', 'Duck'],
    'Size': [1000, 500, 1200, 800, 600, 1100, 300],
    'Quality': ['Good', 'Bad', 'Good', 'Good', 'Bad', 'Good', 'Good']
})

# Simulate cleaning by removing 'Bad' quality flocks
cleaned_flocks = original_flocks[original_flocks['Quality'] == 'Good'].copy()

# Compare datasets
compare_datasets(original_flocks, cleaned_flocks)

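# Output will show:
# - Total flocks removed (2 of 7, i.e. 28.57%)
# - Percentage of data retained (71.4%)
# - Type-by-type comparison
# - Most affected flock types (empty for this sample: no type has at least 10 flocks)
# - Recommendation based on removal percentage (high impact here, since 28.57% is not below 10%)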

Best Practices

Ensure both DataFrames have a 'Type' column before calling this function
The cleaned_flocks DataFrame should be derived from original_flocks (subset relationship)
Use this function after data cleaning operations to document and validate the cleaning impact
Review the recommendations output to determine if additional validation is needed
Consider redirecting output to a file for documentation purposes using standard output redirection
Only flock types with at least 10 records in the original dataset are considered when ranking the most affected types; smaller types still appear in the comparison table
Top 10 most common types are displayed in the comparison table; adjust the slice [:10] if needed for your dataset
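Since the recommendation is printed rather than returned, a pipeline that wants to act on it has to recompute the level. The sketch below mirrors the cut-offs used in the source code (below 5%, below 10%, otherwise); `impact_level` and its return labels are hypothetical and not part of the source file.

```python
def impact_level(removal_pct):
    """Classify a removal percentage using the same cut-offs as the printed report."""
    if removal_pct < 5:
        return "low"
    if removal_pct < 10:
        return "moderate"
    return "high"

# Boundary values fall into the higher band because the comparisons are strict.
levels = [impact_level(p) for p in (2.0, 5.0, 9.99, 28.57)]
print(levels)
```

A caller could, for example, pause for manual review of the removed flocks whenever the level is "high".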

Similar Components

AI-powered semantic similarity - components with related functionality:

function select_dataset 70.8% similar

Interactive command-line function that prompts users to select between original, cleaned, or comparison of flock datasets for analysis.
From: /tf/active/vicechatdev/data_quality_dashboard.py

function quick_clean 70.2% similar

Cleans flock data by identifying and removing flocks that have treatment records with timing inconsistencies (treatments administered outside the flock's start/end date range).
From: /tf/active/vicechatdev/quick_cleaner.py

function show_problematic_flocks 69.9% similar

Analyzes and displays problematic flocks from a dataset by identifying those with systematic timing issues in their treatment records, categorizing them by severity and volume.
From: /tf/active/vicechatdev/data_quality_dashboard.py

function analyze_flock_type_patterns 67.5% similar

Analyzes and prints timing pattern statistics for flock data by categorizing issues that occur before start time and after end time, grouped by flock type.
From: /tf/active/vicechatdev/data_quality_dashboard.py

function load_analysis_data 61.0% similar

Loads CSV dataset(s) into pandas DataFrames based on dataset configuration, supporting both single dataset loading and comparison mode with two datasets.
From: /tf/active/vicechatdev/data_quality_dashboard.py
