function compare_datasets
Analyzes and compares two pandas DataFrames containing flock data (original vs cleaned), printing detailed statistics about removed records, type distributions, and impact assessment.
/tf/active/vicechatdev/data_quality_dashboard.py
375 - 435
moderate
Purpose
This function compares an original flock dataset with its cleaned version for data quality assessment. It computes overall removal statistics, compares type distributions for the 10 most common flock types, identifies the three types with the highest removal rates (among types with at least 10 flocks in the original dataset), and prints a recommendation based on the overall removal percentage. The output is a console report with tables and metrics that helps data scientists understand the impact of the cleaning process.
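The per-type comparison described above can also be expressed as a DataFrame rather than printed text. The following is a minimal sketch, not part of the documented module: the helper name `removal_summary` and its `type_col` parameter are hypothetical, and it assumes both inputs carry the same type column.

```python
import pandas as pd


def removal_summary(original_flocks, cleaned_flocks, type_col="Type"):
    """Hypothetical helper: per-type counts and removal rates as a DataFrame."""
    original_types = original_flocks[type_col].value_counts()
    cleaned_types = cleaned_flocks[type_col].value_counts()
    summary = pd.DataFrame({
        "original": original_types,
        # Types that were removed entirely get a cleaned count of 0
        "cleaned": cleaned_types.reindex(original_types.index, fill_value=0),
    })
    summary["removed"] = summary["original"] - summary["cleaned"]
    summary["pct_removed"] = summary["removed"] / summary["original"] * 100
    return summary


# Small illustrative dataset (made up for this sketch)
demo_original = pd.DataFrame({"Type": ["Broiler", "Layer", "Broiler", "Layer", "Duck"]})
demo_cleaned = demo_original[demo_original["Type"] != "Layer"]
print(removal_summary(demo_original, demo_cleaned))
```

Returning a DataFrame makes the same numbers available for sorting, filtering, or export, which the printed report does not allow.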
Source Code
def compare_datasets(original_flocks, cleaned_flocks):
    """Compare original and cleaned datasets."""
    print("\nDATASET COMPARISON ANALYSIS")
    print("=" * 40)

    # Basic statistics
    original_count = len(original_flocks)
    cleaned_count = len(cleaned_flocks)
    removed_count = original_count - cleaned_count
    # Guard against an empty original dataset to avoid ZeroDivisionError
    removal_pct = (removed_count / original_count * 100) if original_count > 0 else 0.0

    print("OVERVIEW:")
    print(f"Original dataset: {original_count:,} flocks")
    print(f"Cleaned dataset: {cleaned_count:,} flocks")
    print(f"Removed flocks: {removed_count:,} ({removal_pct:.2f}%)")

    # Type distribution comparison (value_counts sorts by frequency, descending)
    print("\nTYPE DISTRIBUTION COMPARISON:")
    original_types = original_flocks['Type'].value_counts()
    cleaned_types = cleaned_flocks['Type'].value_counts()

    print(f"{'Type':<8} {'Original':<10} {'Cleaned':<10} {'Removed':<8} {'% Removed':<10}")
    print("-" * 50)

    for flock_type in original_types.index[:10]:  # Top 10 types
        orig_count = original_types.get(flock_type, 0)
        clean_count = cleaned_types.get(flock_type, 0)
        removed = orig_count - clean_count
        removed_pct = (removed / orig_count * 100) if orig_count > 0 else 0
        print(f"{flock_type:<8} {orig_count:<10,} {clean_count:<10,} {removed:<8,} {removed_pct:<10.1f}%")

    # Impact assessment
    print("\nIMPACT ASSESSMENT:")
    print(f"- Overall data quality: {100 - removal_pct:.1f}% of flocks retained")
    print("- Most affected types: ", end="")

    # Find types with highest removal rates
    removal_rates = {}
    for flock_type in original_types.index:
        orig_count = original_types.get(flock_type, 0)
        clean_count = cleaned_types.get(flock_type, 0)
        if orig_count >= 10:  # Only consider types with at least 10 flocks
            removal_rates[flock_type] = (orig_count - clean_count) / orig_count * 100

    top_affected = sorted(removal_rates.items(), key=lambda x: x[1], reverse=True)[:3]
    affected_types = [f"{t[0]} ({t[1]:.1f}%)" for t in top_affected]
    print(", ".join(affected_types))

    # Recommendations
    print("\nRECOMMENDATIONS:")
    if removal_pct < 5:
        print("✓ Low impact cleaning - dataset remains highly representative")
    elif removal_pct < 10:
        print("⚠ Moderate impact - consider validating removed flocks")
    else:
        print("⚠ High impact cleaning - review removed flocks for false positives")

    print("✓ Cleaned dataset ready for analysis")
    print(f"✓ {cleaned_count:,} high-quality flocks available")
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| original_flocks | - | - | positional_or_keyword |
| cleaned_flocks | - | - | positional_or_keyword |
Parameter Details
original_flocks: A pandas DataFrame containing the original, uncleaned flock dataset. Must have a 'Type' column containing categorical flock type values. Expected to be the raw dataset before any cleaning operations.
cleaned_flocks: A pandas DataFrame containing the cleaned flock dataset after data quality operations. Must have the same structure as original_flocks with a 'Type' column. Should be a subset of original_flocks with problematic records removed.
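The requirements on both parameters can be checked before calling the function. The sketch below is one possible pre-flight check, not part of the documented module: `check_inputs` is a hypothetical name, and the subset test assumes the cleaning step preserved the original index labels (as boolean filtering does).

```python
import pandas as pd


def check_inputs(original_flocks, cleaned_flocks, type_col="Type"):
    """Hypothetical helper: return a list of problems found (empty if none)."""
    problems = []
    for name, df in (("original_flocks", original_flocks),
                     ("cleaned_flocks", cleaned_flocks)):
        if type_col not in df.columns:
            problems.append(f"{name} has no '{type_col}' column")
    if len(original_flocks) == 0:
        problems.append("original_flocks is empty")
    if len(cleaned_flocks) > len(original_flocks):
        problems.append("cleaned_flocks has more rows than original_flocks")
    # Assumption: cleaning kept the original index, so labels can be compared
    if not cleaned_flocks.index.isin(original_flocks.index).all():
        problems.append("cleaned_flocks has index labels missing from original_flocks")
    return problems


# Illustrative data (made up for this sketch)
raw = pd.DataFrame({"Type": ["Broiler", "Layer", "Duck"], "Size": [1000, 500, 300]})
kept = raw[raw["Size"] >= 500]
print(check_inputs(raw, kept))                          # no problems expected
print(check_inputs(raw, kept.drop(columns=["Type"])))   # reports the missing column
```

If the cleaning pipeline resets the index, the last check would need to compare a stable identifier column instead.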
Return Value
This function returns None. It produces side effects by printing a formatted comparison report to stdout, including overview statistics, type distribution tables, impact assessment, and recommendations.
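Because the report goes to stdout and nothing is returned, a caller that needs the text (for logging or saving to a file) has to capture it. A minimal sketch using the standard library's `contextlib.redirect_stdout` follows; `capture_report` is a hypothetical wrapper name, and it works with any callable that prints.

```python
import io
from contextlib import redirect_stdout


def capture_report(report_func, *args, **kwargs):
    """Hypothetical wrapper: run a printing function and return its output as a string."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        report_func(*args, **kwargs)
    return buffer.getvalue()


# Demonstration with the built-in print; with the documented function this would be
# capture_report(compare_datasets, original_flocks, cleaned_flocks)
text = capture_report(print, "DATASET COMPARISON ANALYSIS")
print(repr(text))
```

The captured string can then be written to a file or attached to a cleaning log.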
Dependencies
pandas
Required Imports
import pandas as pd
Usage Example
import pandas as pd

# Create sample datasets
original_flocks = pd.DataFrame({
    'Type': ['Broiler', 'Layer', 'Broiler', 'Turkey', 'Layer', 'Broiler', 'Duck'],
    'Size': [1000, 500, 1200, 800, 600, 1100, 300],
    'Quality': ['Good', 'Bad', 'Good', 'Good', 'Bad', 'Good', 'Good']
})

# Simulate cleaning by removing 'Bad' quality flocks
cleaned_flocks = original_flocks[original_flocks['Quality'] == 'Good'].copy()

# Compare datasets
compare_datasets(original_flocks, cleaned_flocks)

# Output will show:
# - Total flocks removed (2 of 7 in this sample)
# - Percentage of data retained
# - Type-by-type comparison
# - Most affected flock types (empty for this sample: no type has at least 10 flocks)
# - Recommendations based on removal percentage
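The recommendation at the end of the report depends only on the overall removal percentage, using the 5% and 10% thresholds from the source code. The sketch below isolates that arithmetic; `impact_level` is a hypothetical helper, not part of the documented module, and the labels "low", "moderate", and "high" are shorthand for the printed messages.

```python
def impact_level(original_count, cleaned_count):
    """Hypothetical helper: return (removal_pct, level) using the 5% / 10% thresholds."""
    if original_count <= 0:
        raise ValueError("original_count must be positive")
    removal_pct = (original_count - cleaned_count) / original_count * 100
    if removal_pct < 5:
        level = "low"
    elif removal_pct < 10:
        level = "moderate"
    else:
        level = "high"
    return removal_pct, level


# A dataset of 7 flocks with 2 removed: 2 / 7 * 100 = 28.57%
pct, level = impact_level(7, 5)
print(f"{pct:.2f}% removed -> {level} impact")  # 28.57% removed -> high impact
```

With small samples a single removed record moves the percentage a lot, so the thresholds are more meaningful on datasets with hundreds of flocks or more.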
Best Practices
- Ensure both DataFrames have a 'Type' column before calling this function
- The cleaned_flocks DataFrame should be derived from original_flocks (subset relationship)
- Use this function after data cleaning operations to document and validate the cleaning impact
- Review the recommendations output to determine if additional validation is needed
- Consider redirecting output to a file for documentation purposes using standard output redirection
- Only flock types with at least 10 records in the original dataset are considered when ranking the most affected types; rarer types are skipped because their removal rates are not meaningful
- Top 10 most common types are displayed in the comparison table; adjust the slice [:10] if needed for your dataset
Similar Components
AI-powered semantic similarity - components with related functionality:
- function select_dataset 70.8% similar
- function quick_clean 70.2% similar
- function show_problematic_flocks 69.9% similar
- function analyze_flock_type_patterns 67.5% similar
- function load_analysis_data 61.0% similar