function calculate_correlations
Calculates both Pearson and Spearman correlation coefficients between Eimeria variables and performance variables, filtering out missing values and identifying statistically significant relationships.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/5a059cb7-3903-4020-8519-14198d1f39c9/analysis_1.py
147 - 185
moderate
Purpose
This function runs a pairwise correlation analysis between two sets of variables (Eimeria-related and performance-related metrics) in a dataset. For each pair it computes both a parametric (Pearson) and a non-parametric, rank-based (Spearman) coefficient, so that linear and monotonic non-linear relationships are both covered. A pair is flagged as significant when its Pearson p-value is below 0.05. The function prints the formatted results to the console and returns a DataFrame containing all correlation statistics. This is particularly useful for exploratory data analysis in biological or veterinary research examining relationships between parasitic infections (Eimeria) and performance outcomes.
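To illustrate why both coefficients are reported, here is a minimal sketch on made-up data. The arrays `x` and `y` are assumptions chosen for illustration, not values from the original analysis. Because `y` rises with `x` at every step but not along a straight line, the rank-based Spearman coefficient reaches 1 while the Pearson coefficient stays below it.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# Hypothetical data: y increases with x at every step, but not linearly.
x = np.arange(1, 11, dtype=float)
y = np.exp(x)

pearson_r, pearson_p = pearsonr(x, y)
spearman_r, spearman_p = spearmanr(x, y)

# Spearman works on ranks, so a strictly increasing relationship gives 1.0;
# Pearson measures straight-line fit and comes out lower for this curve.
print(f"Pearson r:  {pearson_r:.3f}")
print(f"Spearman r: {spearman_r:.3f}")
```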
Source Code
def calculate_correlations(df, eimeria_vars, performance_vars):
    """Calculate correlations between Eimeria and performance variables"""
    print("\n" + "="*80)
    print("OVERALL CORRELATION ANALYSIS")
    print("="*80)

    results = []

    for eimeria_var in eimeria_vars:
        for perf_var in performance_vars:
            # Remove missing values
            valid_data = df[[eimeria_var, perf_var]].dropna()

            if len(valid_data) > 3:
                # Pearson correlation
                pearson_r, pearson_p = pearsonr(valid_data[eimeria_var],
                                                valid_data[perf_var])

                # Spearman correlation (for non-normal data)
                spearman_r, spearman_p = spearmanr(valid_data[eimeria_var],
                                                   valid_data[perf_var])

                results.append({
                    'Eimeria_Variable': eimeria_var,
                    'Performance_Variable': perf_var,
                    'Pearson_r': pearson_r,
                    'Pearson_p': pearson_p,
                    'Spearman_r': spearman_r,
                    'Spearman_p': spearman_p,
                    'N': len(valid_data),
                    'Significant': 'Yes' if pearson_p < 0.05 else 'No'
                })

    results_df = pd.DataFrame(results)

    print("\nCorrelation Results:")
    print(results_df.to_string(index=False))

    return results_df
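The missing-value step in the loop can be seen in isolation in the sketch below. The column names and values are hypothetical. It applies the same `df[[a, b]].dropna()` selection to two pairs and shows that each pair keeps its own number of rows, and that a pair left with three rows would fail the `len(valid_data) > 3` check and be skipped.

```python
import numpy as np
import pandas as pd

# Hypothetical data with gaps in different rows of different columns.
df = pd.DataFrame({
    "oocyst_count": [100, 200, np.nan, 300, 250, 180],
    "weight_gain": [500, 450, 480, np.nan, np.nan, 470],
    "feed_efficiency": [1.8, 1.6, 1.7, 1.5, 1.55, 1.65],
})

# Same selection as in the function: keep rows that are complete for this pair only.
n_weight = len(df[["oocyst_count", "weight_gain"]].dropna())
n_feed = len(df[["oocyst_count", "feed_efficiency"]].dropna())

print(n_weight, n_weight > 3)  # 3 False -> this pair would be skipped
print(n_feed, n_feed > 3)      # 5 True  -> this pair would be analysed
```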
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| df | - | - | positional_or_keyword |
| eimeria_vars | - | - | positional_or_keyword |
| performance_vars | - | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame containing the dataset with both Eimeria and performance variables as columns. Must include all columns specified in eimeria_vars and performance_vars parameters.
eimeria_vars: A list or iterable of column names (strings) from the DataFrame representing Eimeria-related variables (e.g., infection levels, oocyst counts). These will be correlated against performance variables.
performance_vars: A list or iterable of column names (strings) from the DataFrame representing performance metrics (e.g., weight gain, feed conversion ratio). These will be correlated with Eimeria variables.
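Since the function indexes the DataFrame directly with the supplied names, a name that is not a column raises a `KeyError` partway through the run. A minimal pre-check might look like the following sketch; `find_missing_columns` is a hypothetical helper, not part of the original script.

```python
import pandas as pd

def find_missing_columns(df, *column_lists):
    """Return requested column names that are absent from df (hypothetical helper)."""
    requested = [col for cols in column_lists for col in cols]
    return [col for col in requested if col not in df.columns]

# Hypothetical frame that lacks one of the requested Eimeria columns.
df = pd.DataFrame({"eimeria_count": [100, 200], "weight_gain": [500, 450]})
missing = find_missing_columns(
    df,
    ["eimeria_count", "eimeria_severity"],  # would be passed as eimeria_vars
    ["weight_gain"],                        # would be passed as performance_vars
)
print(missing)  # ['eimeria_severity']
```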
Return Value
Returns a pandas DataFrame with the following columns: 'Eimeria_Variable' (name of Eimeria variable), 'Performance_Variable' (name of performance variable), 'Pearson_r' (Pearson correlation coefficient), 'Pearson_p' (Pearson p-value), 'Spearman_r' (Spearman correlation coefficient), 'Spearman_p' (Spearman p-value), 'N' (number of valid observations used), and 'Significant' (string 'Yes' or 'No' indicating whether the Pearson p-value is below 0.05; the Spearman p-value does not affect this flag). Each row represents one pairwise correlation between an Eimeria variable and a performance variable. Pairs with three or fewer valid observations are skipped and produce no row; if every pair is skipped, the returned DataFrame is empty and has none of these columns.
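The empty-result case is easy to overlook, because filtering on a column that does not exist raises a `KeyError`. The sketch below reproduces that case with `pd.DataFrame([])`, which is what the function builds when no rows were collected, and guards the filter. The variable names are illustrative.

```python
import pandas as pd

# What the function builds when every pair was skipped: a frame from an empty list.
results_df = pd.DataFrame([])

# Guard before filtering: an empty result has no 'Significant' column.
if results_df.empty:
    significant = results_df  # nothing to filter
else:
    significant = results_df[results_df["Significant"] == "Yes"]

print(len(significant))
```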
Dependencies
- pandas
- scipy
Required Imports
import pandas as pd
from scipy.stats import pearsonr
from scipy.stats import spearmanr
Usage Example
import pandas as pd
from scipy.stats import pearsonr, spearmanr
# Create sample data
data = {
    'eimeria_count': [100, 200, 150, 300, 250, 180],
    'eimeria_severity': [1, 3, 2, 4, 3, 2],
    'weight_gain': [500, 450, 480, 400, 420, 470],
    'feed_efficiency': [1.8, 1.6, 1.7, 1.5, 1.55, 1.65]
}
df = pd.DataFrame(data)
# Define variable lists
eimeria_vars = ['eimeria_count', 'eimeria_severity']
performance_vars = ['weight_gain', 'feed_efficiency']
# Calculate correlations
results = calculate_correlations(df, eimeria_vars, performance_vars)
# Access results
print(results[results['Significant'] == 'Yes'])
print(f"\nMean Pearson correlation: {results['Pearson_r'].mean():.3f}")
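Every call prints a banner and the full results table. When only the returned DataFrame is wanted, the output can be captured with the standard library's `contextlib.redirect_stdout`. The sketch below uses a stand-in function, `noisy_analysis`, which is hypothetical and only mimics the print-then-return pattern; in practice the call inside the `with` block would be `calculate_correlations(df, eimeria_vars, performance_vars)`.

```python
import contextlib
import io

def noisy_analysis():
    """Stand-in for calculate_correlations: prints, then returns a value."""
    print("OVERALL CORRELATION ANALYSIS")
    return {"rows": 4}

# Redirect everything printed inside the block into an in-memory buffer.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    result = noisy_analysis()

captured = buffer.getvalue()  # the printed text, available if it is needed later
```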
Best Practices
- Ensure the DataFrame contains sufficient non-missing data for meaningful correlations (function requires >3 valid observations per pair)
- Variable names in eimeria_vars and performance_vars must exactly match column names in the DataFrame
- The significance threshold of p < 0.05 is hard-coded and is applied to the Pearson p-value only; when analyzing many variable pairs, apply a multiple comparison correction (e.g., Bonferroni) to the returned p-values rather than relying on the 'Significant' column
- Pearson correlation measures linear association, and its p-value assumes approximately normally distributed data; Spearman correlation is rank-based, so it captures any monotonic relationship and is more robust to non-normal distributions and outliers
- Review both Pearson and Spearman results, as they may differ substantially for non-linear or non-normal data
- The function prints results to console; redirect stdout if you need to suppress output
- Missing values are handled via pairwise deletion (dropna), which may result in different sample sizes for different variable pairs
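As a rough illustration of the multiple-comparison point above, a Bonferroni adjustment can be applied to the returned table after the fact. The p-values below are invented for the example, and the added column names (`Pearson_p_bonferroni`, `Significant_bonferroni`) are assumptions rather than anything the function produces.

```python
import pandas as pd

# Hypothetical results table with the same p-value column the function returns.
results = pd.DataFrame({
    "Eimeria_Variable": ["eimeria_count", "eimeria_count",
                         "eimeria_severity", "eimeria_severity"],
    "Performance_Variable": ["weight_gain", "feed_efficiency",
                             "weight_gain", "feed_efficiency"],
    "Pearson_p": [0.004, 0.030, 0.012, 0.200],
})

alpha = 0.05
n_tests = len(results)

# Bonferroni: multiply each p-value by the number of tests, capped at 1.
results["Pearson_p_bonferroni"] = (results["Pearson_p"] * n_tests).clip(upper=1.0)
results["Significant_bonferroni"] = results["Pearson_p_bonferroni"] < alpha

print(results[["Pearson_p", "Pearson_p_bonferroni", "Significant_bonferroni"]])
```

With four tests, the pair with an unadjusted p-value of 0.030 no longer passes the 0.05 threshold after adjustment. Bonferroni is conservative; other corrections may suit exploratory work better.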
Similar Components
AI-powered semantic similarity - components with related functionality:
- function grouped_correlation_analysis 83.0% similar
- function generate_conclusions 76.0% similar
- function create_correlation_heatmap 75.9% similar
- function main_v54 73.6% similar
- function main_v24 72.9% similar