
function compare_datasets

Maturity: 44

Analyzes and compares two pandas DataFrames containing flock data (original vs cleaned), printing detailed statistics about removed records, type distributions, and impact assessment.

File: /tf/active/vicechatdev/data_quality_dashboard.py
Lines: 375 - 435
Complexity: moderate

Purpose

This function compares an original dataset with its cleaned counterpart for flock data quality assessment. It computes the overall removal count and percentage, tabulates per-type counts for the 10 most common flock types, lists the three types with the highest removal rates (among types with at least 10 original records), and prints a recommendation based on the overall removal percentage: under 5% is reported as low impact, 5% to under 10% as moderate, and 10% or more as high. The output is a console report with tables and metrics that helps data scientists judge the impact of the cleaning step.
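
The recommendation thresholds can be isolated as a small pure function, which makes them easier to test than the printed report. This is only a sketch: `classify_cleaning_impact` is a hypothetical name that does not exist in data_quality_dashboard.py, and the cut-offs are copied from the source code shown below.

```python
def classify_cleaning_impact(removal_pct):
    """Map an overall removal percentage to an impact label.

    Hypothetical helper; mirrors the if/elif/else chain in compare_datasets.
    """
    if removal_pct < 5:
        return "low"
    if removal_pct < 10:
        return "moderate"
    return "high"


# Boundary values fall into the higher category, as in the original chain.
labels = [classify_cleaning_impact(p) for p in (4.9, 5, 9.99, 10)]
```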

Source Code

def compare_datasets(original_flocks, cleaned_flocks):
    """Compare original and cleaned datasets."""
    print("\nDATASET COMPARISON ANALYSIS")
    print("=" * 40)
    
    # Basic statistics
    original_count = len(original_flocks)
    cleaned_count = len(cleaned_flocks)
    removed_count = original_count - cleaned_count
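    # Guard against an empty original dataset to avoid ZeroDivisionError
    removal_pct = (removed_count / original_count * 100) if original_count > 0 else 0.0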
    
    print(f"OVERVIEW:")
    print(f"Original dataset: {original_count:,} flocks")
    print(f"Cleaned dataset: {cleaned_count:,} flocks")
    print(f"Removed flocks: {removed_count:,} ({removal_pct:.2f}%)")
    
    # Type distribution comparison
    print(f"\nTYPE DISTRIBUTION COMPARISON:")
    original_types = original_flocks['Type'].value_counts()
    cleaned_types = cleaned_flocks['Type'].value_counts()
    
    print(f"{'Type':<8} {'Original':<10} {'Cleaned':<10} {'Removed':<8} {'% Removed':<10}")
    print("-" * 50)
    
    for flock_type in original_types.index[:10]:  # Top 10 types
        orig_count = original_types.get(flock_type, 0)
        clean_count = cleaned_types.get(flock_type, 0)
        removed = orig_count - clean_count
        removed_pct = (removed / orig_count * 100) if orig_count > 0 else 0
        
        # Format the percentage without padding so the % sign follows the number directly
        print(f"{flock_type:<8} {orig_count:<10,} {clean_count:<10,} {removed:<8,} {removed_pct:.1f}%")
    
    # Impact assessment
    print(f"\nIMPACT ASSESSMENT:")
    print(f"- Overall data quality: {100 - removal_pct:.1f}% of flocks retained")
    print(f"- Most affected types: ", end="")
    
    # Find types with highest removal rates
    removal_rates = {}
    for flock_type in original_types.index:
        orig_count = original_types.get(flock_type, 0)
        clean_count = cleaned_types.get(flock_type, 0)
        if orig_count >= 10:  # Only consider types with at least 10 flocks
            removal_rate = ((orig_count - clean_count) / orig_count * 100) if orig_count > 0 else 0
            removal_rates[flock_type] = removal_rate
    
    top_affected = sorted(removal_rates.items(), key=lambda x: x[1], reverse=True)[:3]
    affected_types = [f"{t[0]} ({t[1]:.1f}%)" for t in top_affected]
    print(", ".join(affected_types))
    
    # Recommendations
    print("\nRECOMMENDATIONS:")
    if removal_pct < 5:
        print("✓ Low impact cleaning - dataset remains highly representative")
    elif removal_pct < 10:
        print("⚠ Moderate impact - consider validating removed flocks")
    else:
        print("⚠ High impact cleaning - review removed flocks for false positives")
    
    print(f"✓ Cleaned dataset ready for analysis")
    print(f"✓ {cleaned_count:,} high-quality flocks available")

Parameters

Name             Type  Default  Kind
original_flocks  -     -        positional_or_keyword
cleaned_flocks   -     -        positional_or_keyword

Parameter Details

original_flocks: A pandas DataFrame containing the original, uncleaned flock dataset. Must have a 'Type' column containing categorical flock type values. Expected to be the raw dataset before any cleaning operations.

cleaned_flocks: A pandas DataFrame containing the cleaned flock dataset after data quality operations. Must have the same structure as original_flocks with a 'Type' column. Should be a subset of original_flocks with problematic records removed.
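
The requirements above can be checked before the comparison is run. The following is a minimal sketch, not part of the original module: `check_comparison_inputs` is a hypothetical helper, and its subset check assumes the cleaning step filtered rows without resetting the DataFrame index.

```python
import pandas as pd


def check_comparison_inputs(original_flocks, cleaned_flocks, type_col="Type"):
    """Return a list of problems found; an empty list means the inputs look usable.

    Hypothetical helper, not defined in data_quality_dashboard.py.
    """
    problems = []
    for name, df in (("original_flocks", original_flocks),
                     ("cleaned_flocks", cleaned_flocks)):
        if type_col not in df.columns:
            problems.append(f"{name} is missing the '{type_col}' column")
    if len(original_flocks) == 0:
        problems.append("original_flocks is empty")
    if len(cleaned_flocks) > len(original_flocks):
        problems.append("cleaned_flocks has more rows than original_flocks")
    # Assumption: row labels were preserved by the cleaning step.
    if not cleaned_flocks.index.isin(original_flocks.index).all():
        problems.append("cleaned_flocks contains index labels not in original_flocks")
    return problems


original = pd.DataFrame({"Type": ["Broiler", "Layer", "Broiler"]})
cleaned = original[original["Type"] == "Broiler"]
problems = check_comparison_inputs(original, cleaned)
```

Returning a list instead of raising lets the caller decide whether a problem is fatal or only worth a warning.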

Return Value

This function returns None. It produces side effects by printing a formatted comparison report to stdout, including overview statistics, type distribution tables, impact assessment, and recommendations.
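
Because the report is only printed, a caller that needs the text (for logging or a saved report) has to capture stdout. One way to do this uses `contextlib.redirect_stdout` from the standard library. In this sketch `capture_report` is a hypothetical wrapper and `demo_report` is a stand-in for `compare_datasets` so that the block runs on its own.

```python
import contextlib
import io


def capture_report(report_func, *args, **kwargs):
    """Run a printing function and return everything it wrote to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        report_func(*args, **kwargs)
    return buffer.getvalue()


def demo_report(original_count, cleaned_count):
    # Stand-in for compare_datasets; prints one line in a similar style.
    print(f"Removed flocks: {original_count - cleaned_count:,}")


report_text = capture_report(demo_report, 1200, 1150)
```

With the real function the call would look like `capture_report(compare_datasets, original_flocks, cleaned_flocks)`, and the returned string can then be written to a file.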

Dependencies

  • pandas

Required Imports

import pandas as pd

Usage Example

import pandas as pd

# Create sample datasets
original_flocks = pd.DataFrame({
    'Type': ['Broiler', 'Layer', 'Broiler', 'Turkey', 'Layer', 'Broiler', 'Duck'],
    'Size': [1000, 500, 1200, 800, 600, 1100, 300],
    'Quality': ['Good', 'Bad', 'Good', 'Good', 'Bad', 'Good', 'Good']
})

# Simulate cleaning by removing 'Bad' quality flocks
cleaned_flocks = original_flocks[original_flocks['Quality'] == 'Good'].copy()

# Compare datasets
compare_datasets(original_flocks, cleaned_flocks)

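# For this sample the report shows:
# - 7 original flocks, 5 cleaned, 2 removed (28.57%)
# - 71.4% of flocks retained
# - Per-type counts: Broiler 3 -> 3, Layer 2 -> 0, Turkey 1 -> 1, Duck 1 -> 1
# - An empty "Most affected types" list, because no type has at least 10 flocks
# - The high-impact recommendation, because 10% or more of the flocks were removed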

Best Practices

  • Ensure both DataFrames have a 'Type' column before calling this function
  • The cleaned_flocks DataFrame should be derived from original_flocks (subset relationship)
  • Use this function after data cleaning operations to document and validate the cleaning impact
  • Review the recommendations output to determine if additional validation is needed
  • Consider redirecting output to a file for documentation purposes using standard output redirection
  • Only flock types with at least 10 records in the original dataset are ranked in the "Most affected types" line, so that the reported removal rates are meaningful
  • Top 10 most common types are displayed in the comparison table; adjust the slice [:10] if needed for your dataset
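
When the per-type figures are needed as data and not as printed text, the same comparison can be built as a DataFrame. This is a sketch under stated assumptions: `removal_rate_table` and its `min_count` parameter are hypothetical and not part of the original module, and both inputs are assumed to have a 'Type' column.

```python
import pandas as pd


def removal_rate_table(original_flocks, cleaned_flocks, min_count=10):
    """Per-type removal statistics, sorted by removal rate (highest first).

    Hypothetical helper; applies the same 10-record minimum that
    compare_datasets uses when ranking the most affected types.
    """
    original_types = original_flocks["Type"].value_counts()
    cleaned_types = cleaned_flocks["Type"].value_counts()
    table = pd.DataFrame({
        "original": original_types,
        # Types removed entirely are absent from cleaned_types; count them as 0.
        "cleaned": cleaned_types.reindex(original_types.index, fill_value=0),
    })
    table["removed"] = table["original"] - table["cleaned"]
    table["removed_pct"] = table["removed"] / table["original"] * 100
    table = table[table["original"] >= min_count]
    return table.sort_values("removed_pct", ascending=False)


original = pd.DataFrame({"Type": ["Broiler"] * 20 + ["Layer"] * 10 + ["Duck"] * 3})
# Drop five Broiler rows (labels 0-4) and one Layer row (label 20).
cleaned = original.drop(index=[0, 1, 2, 3, 4, 20])
table = removal_rate_table(original, cleaned)
```

Here Broiler and Layer appear in the result while Duck is excluded, because it has fewer than `min_count` original records.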

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function select_dataset 70.8% similar

    Interactive command-line function that prompts users to select between original, cleaned, or comparison of flock datasets for analysis.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function quick_clean 70.2% similar

    Cleans flock data by identifying and removing flocks that have treatment records with timing inconsistencies (treatments administered outside the flock's start/end date range).

    From: /tf/active/vicechatdev/quick_cleaner.py
  • function show_problematic_flocks 69.9% similar

    Analyzes and displays problematic flocks from a dataset by identifying those with systematic timing issues in their treatment records, categorizing them by severity and volume.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function analyze_flock_type_patterns 67.5% similar

    Analyzes and prints timing pattern statistics for flock data by categorizing issues that occur before start time and after end time, grouped by flock type.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function load_analysis_data 61.0% similar

    Loads CSV dataset(s) into pandas DataFrames based on dataset configuration, supporting both single dataset loading and comparison mode with two datasets.

    From: /tf/active/vicechatdev/data_quality_dashboard.py