function identify_variables
Categorizes DataFrame columns into Eimeria infection variables, performance measure variables, and grouping variables based on keyword matching in column names.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/5a059cb7-3903-4020-8519-14198d1f39c9/analysis_1.py
103 - 141
simple
Purpose
This function classifies the columns of a dataset from Eimeria research (Eimeria is a genus of protozoan parasites that causes coccidiosis in poultry). It scans column names for specific keywords and sorts them into three groups: infection-related metrics (oocyst counts, lesion scores), performance metrics (weight gain, feed conversion ratio), and experimental grouping variables (treatment groups, challenge types). Infection and performance columns are kept only if they also appear in the supplied list of numerical variables. This is useful for preprocessing veterinary or agricultural research data before statistical analysis.
Source Code
def identify_variables(df, numerical_vars):
    """Identify Eimeria infection and performance measure variables"""
    # Keywords for Eimeria infection measures
    eimeria_keywords = ['eimeria', 'oocyst', 'lesion', 'coccidia', 'infection']
    # Keywords for performance measures
    performance_keywords = ['weight', 'gain', 'fcr', 'feed_conversion', 'feed_intake',
                            'mortality', 'growth', 'performance', 'body_weight', 'bw']
    eimeria_vars = []
    performance_vars = []
    grouping_vars = []
    for col in df.columns:
        col_lower = col.lower()
        # Check for Eimeria variables
        if any(keyword in col_lower for keyword in eimeria_keywords):
            if col in numerical_vars:
                eimeria_vars.append(col)
        # Check for performance variables
        elif any(keyword in col_lower for keyword in performance_keywords):
            if col in numerical_vars:
                performance_vars.append(col)
        # Check for grouping variables (treatment, challenge, etc.)
        elif any(keyword in col_lower for keyword in ['treatment', 'challenge', 'group', 'regimen']):
            grouping_vars.append(col)
    print("\n" + "="*80)
    print("VARIABLE IDENTIFICATION")
    print("="*80)
    print(f"\nEimeria Infection Variables: {eimeria_vars}")
    print(f"\nPerformance Measure Variables: {performance_vars}")
    print(f"\nGrouping Variables: {grouping_vars}")
    return eimeria_vars, performance_vars, grouping_vars
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| df | - | - | positional_or_keyword |
| numerical_vars | - | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame containing the dataset to analyze. The function examines the column names of this DataFrame to identify variable types. Expected to have column names that may contain keywords related to Eimeria infection, performance measures, or grouping factors.
numerical_vars: A list, set, or pandas Index of column names from the DataFrame that hold numerical data (continuous or discrete). Only columns present in this collection are considered for classification as Eimeria or performance variables, so the parameter acts as a filter that keeps non-quantitative columns out of the measurement categories. Avoid passing a pandas Series: the function tests membership with `in`, which on a Series checks the index labels rather than the values.
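One way to build `numerical_vars` is with `DataFrame.select_dtypes`. The sketch below is illustrative only; the sample column names (including `pen_id`) are made up for the example and do not come from the original script.

```python
import pandas as pd

# Small illustrative dataset; column names are assumptions for this sketch.
df = pd.DataFrame({
    "treatment_group": ["A", "B", "A", "B"],
    "eimeria_oocyst_count": [1000, 2000, 1500, 2500],
    "lesion_score": [2.5, 3.0, 2.8, 3.2],
    "pen_id": ["p1", "p2", "p3", "p4"],
})

# Keep only columns with a numeric dtype (integers and floats);
# string columns such as treatment_group and pen_id are excluded.
numerical_vars = df.select_dtypes(include="number").columns.tolist()
print(numerical_vars)  # ['eimeria_oocyst_count', 'lesion_score']
```

Converting to a plain list with `.tolist()` keeps the later `col in numerical_vars` checks straightforward.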
Return Value
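Returns a tuple of three lists: (eimeria_vars, performance_vars, grouping_vars). 'eimeria_vars' contains column names that match an Eimeria-related keyword and are also in numerical_vars. 'performance_vars' contains column names that match a performance-related keyword (but no Eimeria keyword) and are also in numerical_vars. 'grouping_vars' contains column names that match a grouping-related keyword and matched neither of the other two keyword sets; these can be of any data type. All three lists contain column names from the input DataFrame, in column order. A column that matches no keyword, or that matches an Eimeria or performance keyword without being numerical, appears in none of the lists.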
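The branch order in the source code decides where a column ends up. The following sketch restates that order for a single column name; `classify_column` is a hypothetical helper written for this illustration (it is not part of the original script), and the example column names are invented.

```python
EIMERIA_KEYWORDS = ['eimeria', 'oocyst', 'lesion', 'coccidia', 'infection']
PERFORMANCE_KEYWORDS = ['weight', 'gain', 'fcr', 'feed_conversion', 'feed_intake',
                        'mortality', 'growth', 'performance', 'body_weight', 'bw']
GROUPING_KEYWORDS = ['treatment', 'challenge', 'group', 'regimen']


def classify_column(col, numerical_vars):
    """Return 'eimeria', 'performance', 'grouping', or None for one column.

    Hypothetical helper mirroring the if/elif order of identify_variables.
    """
    col_lower = col.lower()
    if any(k in col_lower for k in EIMERIA_KEYWORDS):
        return 'eimeria' if col in numerical_vars else None
    elif any(k in col_lower for k in PERFORMANCE_KEYWORDS):
        return 'performance' if col in numerical_vars else None
    elif any(k in col_lower for k in GROUPING_KEYWORDS):
        return 'grouping'
    return None


numerical_vars = ['lesion_weight_index', 'cobweb_count']

# Matches both 'lesion' and 'weight'; the Eimeria branch is checked first.
print(classify_column('lesion_weight_index', numerical_vars))  # eimeria
# Matches 'infection' but is not numerical, so it is dropped and never
# reaches the grouping check even though the name contains 'group'.
print(classify_column('infection_group', numerical_vars))      # None
# Substring matching: 'bw' occurs inside 'cobweb'.
print(classify_column('cobweb_count', numerical_vars))         # performance
print(classify_column('treatment_group', numerical_vars))      # grouping
```

The second and third calls show two edge cases worth checking in real data: categorical columns whose names contain a measurement keyword, and short keywords such as 'bw' matching inside unrelated words.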
Dependencies
pandas
Required Imports
import pandas as pd
Usage Example
import pandas as pd

# Assumes identify_variables is defined or imported in the current scope

# Create sample dataset
df = pd.DataFrame({
    'treatment_group': ['A', 'B', 'A', 'B'],
    'eimeria_oocyst_count': [1000, 2000, 1500, 2500],
    'lesion_score': [2.5, 3.0, 2.8, 3.2],
    'body_weight_gain': [450, 420, 440, 410],
    'feed_conversion_ratio': [1.8, 2.1, 1.9, 2.2],
    'age_days': [21, 21, 21, 21]
})

# Define numerical variables
numerical_vars = ['eimeria_oocyst_count', 'lesion_score', 'body_weight_gain',
                  'feed_conversion_ratio', 'age_days']

# Identify variables (also prints a summary to the console)
eimeria_vars, performance_vars, grouping_vars = identify_variables(df, numerical_vars)

# Results:
# eimeria_vars: ['eimeria_oocyst_count', 'lesion_score']
# performance_vars: ['body_weight_gain', 'feed_conversion_ratio']
# grouping_vars: ['treatment_group']
# 'age_days' matches no keyword and is left out of all three lists
Best Practices
- Ensure column names in the DataFrame are descriptive and contain relevant keywords for accurate classification
- The numerical_vars parameter should be pre-computed using appropriate pandas methods, e.g. df.select_dtypes(include='number').columns.tolist() (the equivalent include=[np.number] form additionally requires import numpy as np)
- Column name matching is case-insensitive, so 'Eimeria', 'EIMERIA', and 'eimeria' will all match
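- The function uses 'elif' logic, so each column is tested against the categories in a fixed order (Eimeria, then performance, then grouping) and only the first keyword match counts
- A column that matches an Eimeria or performance keyword but is not in numerical_vars is dropped entirely; it does not fall through to the grouping check, so a categorical column such as 'infection_group' ends up in none of the lists
- Keywords are matched as substrings, so short keywords such as 'bw' or 'gain' can match inside unrelated column names
- Grouping variables are not filtered by numerical_vars, so they can be categorical or any data type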
- The function prints results to console; consider capturing or suppressing output if using in automated pipelines
- If no variables match the keywords, empty lists will be returned for those categories
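For automated pipelines, the console output can be captured with the standard library's `contextlib.redirect_stdout`. The sketch below is a minimal illustration: `run_quietly` is a hypothetical wrapper, and `noisy_stand_in` is a stand-in that only imitates the print-then-return shape of `identify_variables`.

```python
import contextlib
import io


def run_quietly(func, *args, **kwargs):
    """Call func and return (result, text_it_printed_to_stdout)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        result = func(*args, **kwargs)
    return result, buffer.getvalue()


def noisy_stand_in(names):
    # Hypothetical stand-in for identify_variables: prints a banner,
    # then returns a tuple of three lists.
    print("VARIABLE IDENTIFICATION")
    return names, [], []


result, captured = run_quietly(noisy_stand_in, ["lesion_score"])
# 'captured' now holds the banner text instead of it reaching the console;
# it can be written to a log or discarded.
```

In practice the real function would be passed in place of the stand-in, e.g. `run_quietly(identify_variables, df, numerical_vars)`.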
Similar Components
AI-powered semantic similarity - components with related functionality:
- function main_v56 (72.4% similar)
- function grouped_correlation_analysis (69.8% similar)
- function generate_conclusions (66.2% similar)
- function main_v26 (66.0% similar)
- function calculate_correlations (62.0% similar)