function select_dataset
Interactive command-line function that prompts the user to choose the original flock dataset, the cleaned flock dataset, or a comparison of both for analysis.
/tf/active/vicechatdev/data_quality_dashboard.py
16 - 54
simple
Purpose
Provides a menu interface for dataset selection in a flock data analysis workflow. The function checks whether a cleaned dataset exists, displays row counts for both datasets (including the number and percentage of flocks removed by cleaning), and returns a dictionary containing the selected dataset path(s) and type. If the cleaned dataset does not exist, it skips the menu and defaults to the original dataset.
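The removal statistics described above reduce to a small calculation. The sketch below isolates it in a hypothetical helper, `removal_stats`, which is not part of data_quality_dashboard.py. It also guards against an empty original dataset, which the documented function does not do. The row counts in the usage lines are made-up illustrative values, not real figures.

```python
def removal_stats(original_size, clean_size):
    """Return (removed_count, removal_pct) for two row counts.

    Hypothetical helper. Returns 0.0 percent when the original dataset
    is empty, avoiding the ZeroDivisionError the plain formula would raise.
    """
    removed_count = original_size - clean_size
    if original_size == 0:
        return removed_count, 0.0
    return removed_count, (removed_count / original_size) * 100


# Illustrative (made-up) row counts
removed, pct = removal_stats(12_000, 11_400)
print(f"Removed {removed:,} flocks with timing issues ({pct:.2f}%)")
```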
Source Code
def select_dataset():
    """Allow user to select which dataset to analyze."""
    print("\nDATASET SELECTION")
    print("-" * 30)

    # Check if cleaned dataset exists
    clean_path = "/tf/active/pehestat_data/dbo_Flocks_clean.csv"
    original_path = "/tf/active/pehestat_data/dbo_Flocks.csv"

    if os.path.exists(clean_path):
        # Get file stats
        original_size = len(pd.read_csv(original_path))
        clean_size = len(pd.read_csv(clean_path))
        removed_count = original_size - clean_size
        removal_pct = (removed_count / original_size) * 100

        print(f"Available datasets:")
        print(f"1. Original dataset ({original_size:,} flocks)")
        print(f"2. Cleaned dataset ({clean_size:,} flocks)")
        print(f" └─ Removed {removed_count:,} flocks with timing issues ({removal_pct:.2f}%)")
        print(f"3. Compare both datasets")
        print(f"4. Exit")

        while True:
            choice = input("\nSelect dataset (1-4): ").strip()
            if choice == '1':
                return {'type': 'original', 'path': original_path}
            elif choice == '2':
                return {'type': 'cleaned', 'path': clean_path}
            elif choice == '3':
                return {'type': 'compare', 'original': original_path, 'cleaned': clean_path}
            elif choice == '4':
                return None
            else:
                print("Invalid choice. Please select 1-4.")
    else:
        print("Cleaned dataset not found. Using original dataset.")
        print("Run the flock data cleaner first to generate cleaned dataset.")
        return {'type': 'original', 'path': original_path}
Return Value
Returns a dictionary describing the selection, or None if the user selects option 4 (Exit). For the 'original' and 'cleaned' types the dictionary is {'type': str, 'path': str}; for the 'compare' type it is {'type': 'compare', 'original': str, 'cleaned': str}. If the cleaned dataset does not exist, the function returns {'type': 'original', 'path': original_path} without prompting.
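Because the return value has three shapes (None, a single-path dictionary, and a two-path 'compare' dictionary), callers may find it convenient to normalize it first. The following is a minimal sketch of one way to do that; `selected_paths` is a hypothetical helper, not part of the documented module.

```python
def selected_paths(selection):
    """Hypothetical helper: list the CSV path(s) named by a selection dict.

    Returns an empty list for None (the user chose Exit), two paths for
    the 'compare' type, and one path for 'original' or 'cleaned'.
    """
    if selection is None:
        return []
    if selection["type"] == "compare":
        return [selection["original"], selection["cleaned"]]
    return [selection["path"]]
```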
Dependencies
pandas
os
Required Imports
import pandas as pd
import os
Usage Example
import pandas as pd
import os

def select_dataset():
    # ... function code ...
    pass

# Call the function to get user's dataset selection
result = select_dataset()

if result is None:
    print("User chose to exit")
elif result['type'] == 'original':
    df = pd.read_csv(result['path'])
    print(f"Loaded original dataset with {len(df)} records")
elif result['type'] == 'cleaned':
    df = pd.read_csv(result['path'])
    print(f"Loaded cleaned dataset with {len(df)} records")
elif result['type'] == 'compare':
    df_original = pd.read_csv(result['original'])
    df_cleaned = pd.read_csv(result['cleaned'])
    print("Loaded both datasets for comparison")
Best Practices
- Ensure the /tf/active/pehestat_data/ directory exists and contains dbo_Flocks.csv before calling this function
- Handle the None return value to gracefully exit the program when user selects option 4
- Check the 'type' key in the returned dictionary to determine which dataset(s) were selected
- The function loads both datasets fully into memory just to count rows; for very large files, consider a row-counting method that streams the file instead of building a DataFrame
- Input validation loop ensures only valid choices (1-4) are accepted
- Function assumes CSV files are readable by pandas - ensure proper file permissions and format
- Hardcoded file paths make this function specific to a particular environment - consider parameterizing paths for reusability
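Two of the points above, counting rows without loading whole files and passing paths as arguments, can be sketched with the standard library alone. `count_csv_rows` and `available_datasets` are hypothetical names, not part of data_quality_dashboard.py. The sketch assumes UTF-8 files whose first non-blank row is a header, and its counts may differ from `len(pd.read_csv(...))` for files that pandas parses differently (for example, files with malformed rows).

```python
import csv
import os


def count_csv_rows(path):
    """Count data rows in a CSV file without building a DataFrame.

    Streams the file record by record, skips blank lines, and discards
    the first non-blank row as the header (an assumption of this sketch).
    """
    with open(path, newline="", encoding="utf-8") as f:
        rows = (row for row in csv.reader(f) if row)
        next(rows, None)  # discard the header row, if any
        return sum(1 for _ in rows)


def available_datasets(original_path, clean_path):
    """Hypothetical helper: map dataset name to row count for files that exist.

    Paths are parameters rather than hardcoded constants, so the same
    code works outside the original environment.
    """
    sizes = {}
    for name, path in (("original", original_path), ("cleaned", clean_path)):
        if os.path.exists(path):
            sizes[name] = count_csv_rows(path)
    return sizes
```

A caller could pass the two paths from the documented function, or any other pair, and build the menu text from the returned counts.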
Similar Components
AI-powered semantic similarity - components with related functionality:
- function compare_datasets (70.8% similar)
- function create_data_quality_dashboard (60.4% similar)
- function quick_clean (58.2% similar)
- function show_problematic_flocks (57.5% similar)
- function load_analysis_data (56.5% similar)