function select_dataset

Maturity: 44

Interactive command-line function that prompts the user to choose the original flock dataset, the cleaned flock dataset, or a comparison of both for analysis.

File: /tf/active/vicechatdev/data_quality_dashboard.py
Lines: 16 - 54
Complexity: simple

Purpose

Provides a menu interface for dataset selection in a flock data analysis workflow. The function checks whether a cleaned dataset exists. If it does, the function prints the row counts of both datasets (including the number and percentage of flocks removed), prompts for a choice, and returns a dictionary containing the selected dataset path(s) and type. If the cleaned dataset does not exist, it skips the menu and returns the original dataset.
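
The removal statistics described above are plain arithmetic on two row counts. The following is a minimal sketch of that calculation, not the function's own code: removal_stats is a hypothetical helper name, the sample counts are invented for illustration, and the zero guard is an addition (the source divides by original_size directly, which would raise ZeroDivisionError for an empty original file).

```python
from typing import Tuple


def removal_stats(original_size: int, clean_size: int) -> Tuple[int, float]:
    """Return (removed_count, removal_pct) for two row counts.

    Hypothetical helper: mirrors the arithmetic in select_dataset,
    plus a guard for an empty original dataset.
    """
    removed_count = original_size - clean_size
    # Assumption: report 0.0% instead of dividing by zero when the
    # original dataset has no rows.
    if original_size == 0:
        return removed_count, 0.0
    removal_pct = (removed_count / original_size) * 100
    return removed_count, removal_pct


# Illustrative counts only; these are not real dataset sizes.
removed, pct = removal_stats(12_000, 11_400)
print(f"Removed {removed:,} flocks with timing issues ({pct:.2f}%)")
```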

Source Code

def select_dataset():
    """Allow user to select which dataset to analyze."""
    print("\nDATASET SELECTION")
    print("-" * 30)
    
    # Check if cleaned dataset exists
    clean_path = "/tf/active/pehestat_data/dbo_Flocks_clean.csv"
    original_path = "/tf/active/pehestat_data/dbo_Flocks.csv"
    
    if os.path.exists(clean_path):
        # Get file stats
        original_size = len(pd.read_csv(original_path))
        clean_size = len(pd.read_csv(clean_path))
        removed_count = original_size - clean_size
        removal_pct = (removed_count / original_size) * 100
        
        print(f"Available datasets:")
        print(f"1. Original dataset ({original_size:,} flocks)")
        print(f"2. Cleaned dataset ({clean_size:,} flocks)")
        print(f"   └─ Removed {removed_count:,} flocks with timing issues ({removal_pct:.2f}%)")
        print(f"3. Compare both datasets")
        print(f"4. Exit")
        
        while True:
            choice = input("\nSelect dataset (1-4): ").strip()
            if choice == '1':
                return {'type': 'original', 'path': original_path}
            elif choice == '2':
                return {'type': 'cleaned', 'path': clean_path}
            elif choice == '3':
                return {'type': 'compare', 'original': original_path, 'cleaned': clean_path}
            elif choice == '4':
                return None
            else:
                print("Invalid choice. Please select 1-4.")
    else:
        print("Cleaned dataset not found. Using original dataset.")
        print("Run the flock data cleaner first to generate cleaned dataset.")
        return {'type': 'original', 'path': original_path}

Return Value

Returns a dictionary describing the selection, or None if the user selects option 4 (Exit). For the 'original' and 'cleaned' types the dictionary is {'type': str, 'path': str}. For the 'compare' type it is {'type': 'compare', 'original': str, 'cleaned': str}. If the cleaned dataset does not exist, the function returns {'type': 'original', 'path': original_path} without prompting.
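
Because the dictionary uses different keys depending on 'type', callers may find it convenient to normalize it first. A small sketch of one way to do that follows; selected_paths is a hypothetical helper that is not part of data_quality_dashboard.py, and the file names in the example call are placeholders.

```python
from typing import Dict, List, Optional


def selected_paths(result: Optional[Dict[str, str]]) -> List[str]:
    """Flatten a select_dataset() result into a list of CSV paths.

    Hypothetical helper: returns [] for None (user chose Exit),
    two paths for 'compare', and one path otherwise.
    """
    if result is None:
        return []
    if result["type"] == "compare":
        # Compare mode stores the paths under 'original' and 'cleaned'.
        return [result["original"], result["cleaned"]]
    # 'original' and 'cleaned' modes store a single path under 'path'.
    return [result["path"]]


# Placeholder file names, for illustration only.
print(selected_paths({"type": "compare", "original": "a.csv", "cleaned": "b.csv"}))
```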

Dependencies

  • pandas
  • os

Required Imports

import pandas as pd
import os

Usage Example

import pandas as pd
import os

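def select_dataset():
    # Placeholder: substitute the full definition from the Source Code section.
    # As written, this stub returns None, so the "exit" branch below runs.
    pass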

# Call the function to get user's dataset selection
result = select_dataset()

if result is None:
    print("User chose to exit")
elif result['type'] == 'original':
    df = pd.read_csv(result['path'])
    print(f"Loaded original dataset with {len(df)} records")
elif result['type'] == 'cleaned':
    df = pd.read_csv(result['path'])
    print(f"Loaded cleaned dataset with {len(df)} records")
elif result['type'] == 'compare':
    df_original = pd.read_csv(result['original'])
    df_cleaned = pd.read_csv(result['cleaned'])
    print("Loaded both datasets for comparison")

Best Practices

  • Ensure the /tf/active/pehestat_data/ directory exists and contains dbo_Flocks.csv before calling this function
  • Handle the None return value to gracefully exit the program when user selects option 4
  • Check the 'type' key in the returned dictionary to determine which dataset(s) were selected
  • The function loads entire datasets into memory to count rows - consider optimizing for very large files by using row counting methods that don't load all data
  • Input validation loop ensures only valid choices (1-4) are accepted
  • Function assumes CSV files are readable by pandas - ensure proper file permissions and format
  • Hardcoded file paths make this function specific to a particular environment - consider parameterizing paths for reusability
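
Two of the suggestions above (counting rows without loading whole files, and parameterizing the paths) can be sketched together. This is one possible approach under stated assumptions, not the module's implementation: count_csv_rows and dataset_sizes are hypothetical names, the files are assumed to be UTF-8 CSVs with a single header row, and the standard-library csv module is swapped in for pandas. Counts may differ from len(pd.read_csv(...)) for unusual files, for example ones with comment lines or malformed rows.

```python
import csv
import os


def count_csv_rows(path):
    """Count data rows by streaming the file, one row in memory at a time.

    Assumption: the first row is a header and is excluded from the count.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row, if the file has one
        # Skip completely blank lines, which csv.reader yields as [].
        return sum(1 for row in reader if row)


def dataset_sizes(original_path, clean_path):
    """Return row counts for both datasets; 'cleaned' is None if missing.

    Paths are parameters rather than hardcoded constants, so the same
    logic works in other environments and in tests.
    """
    if not os.path.exists(original_path):
        # Fail early with a clear message instead of a parser error.
        raise FileNotFoundError(f"Original dataset not found: {original_path}")
    sizes = {"original": count_csv_rows(original_path), "cleaned": None}
    if os.path.exists(clean_path):
        sizes["cleaned"] = count_csv_rows(clean_path)
    return sizes
```

A caller could pass the two /tf/active/pehestat_data/ paths from the source to reproduce the current behavior, or temporary files when testing.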

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function compare_datasets 70.8% similar

    Analyzes and compares two pandas DataFrames containing flock data (original vs cleaned), printing detailed statistics about removed records, type distributions, and impact assessment.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function create_data_quality_dashboard 60.4% similar

    Creates an interactive command-line dashboard for analyzing data quality issues in treatment timing data, specifically focusing on treatments administered outside of flock lifecycle dates.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function quick_clean 58.2% similar

    Cleans flock data by identifying and removing flocks that have treatment records with timing inconsistencies (treatments administered outside the flock's start/end date range).

    From: /tf/active/vicechatdev/quick_cleaner.py
  • function show_problematic_flocks 57.5% similar

    Analyzes and displays problematic flocks from a dataset by identifying those with systematic timing issues in their treatment records, categorizing them by severity and volume.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function load_analysis_data 56.5% similar

    Loads CSV dataset(s) into pandas DataFrames based on dataset configuration, supporting both single dataset loading and comparison mode with two datasets.

    From: /tf/active/vicechatdev/data_quality_dashboard.py