function correlation_significance
Calculates Pearson correlation coefficient and statistical significance (p-value) between two numeric arrays, handling NaN values automatically.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/d1e252f5-950c-4ad7-b425-86b4b02c3c62/analysis_7.py
348 - 359
simple
Purpose
This function computes the Pearson correlation coefficient, which measures the strength of the linear relationship between two variables, together with the p-value used to assess statistical significance. It drops every pair in which either value is NaN and requires at least 3 remaining pairs to perform the calculation. If fewer than 3 valid pairs remain, it returns (None, None, 0).
Source Code
def correlation_significance(x, y):
    """Calculate correlation and p-value"""
    # Remove NaN values
    mask = ~(np.isnan(x) | np.isnan(y))
    x_clean = x[mask]
    y_clean = y[mask]

    if len(x_clean) < 3:
        return None, None, 0

    corr, p_value = stats.pearsonr(x_clean, y_clean)
    return corr, p_value, len(x_clean)
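The pairwise NaN filtering in the first lines of the function can be looked at in isolation. The following is a minimal sketch with made-up sample values, showing that an index is kept only when both arrays have a value there:

```python
import numpy as np

# Illustrative data (not from the original script)
x = np.array([1.0, 2.0, np.nan, 4.0])
y = np.array([2.0, np.nan, 6.0, 8.0])

# True only where neither x[i] nor y[i] is NaN
mask = ~(np.isnan(x) | np.isnan(y))

print(mask)     # [ True False False  True]
print(x[mask])  # [1. 4.]
print(y[mask])  # [2. 8.]
```

Indices 1 and 2 are dropped from both arrays even though each has a NaN in only one of them, so the remaining values stay aligned as pairs.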
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| x | - | - | positional_or_keyword |
| y | - | - | positional_or_keyword |
Parameter Details
x: First numeric array (numpy array or pandas Series). Can contain NaN values, which are removed automatically. Should be numeric data representing one variable in the correlation analysis. A plain Python list is not supported as written, because the function indexes x with a boolean mask; convert lists with np.asarray first.
y: Second numeric array (numpy array or pandas Series). Must have the same length as x. Can contain NaN values, which are removed automatically. Should be numeric data representing the second variable in the correlation analysis. As with x, convert plain lists with np.asarray first.
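Because the function applies a boolean mask directly to x and y, plain Python lists need to be converted before the call. One possible way to do this is sketched below; `correlation_from_sequences` is a hypothetical helper introduced here for illustration and is not part of the original script.

```python
import numpy as np
from scipy import stats


def correlation_significance(x, y):
    """Calculate correlation and p-value (copied from the source above)."""
    mask = ~(np.isnan(x) | np.isnan(y))
    x_clean = x[mask]
    y_clean = y[mask]
    if len(x_clean) < 3:
        return None, None, 0
    corr, p_value = stats.pearsonr(x_clean, y_clean)
    return corr, p_value, len(x_clean)


def correlation_from_sequences(x, y):
    """Hypothetical wrapper: accept lists or tuples and validate lengths."""
    x_arr = np.asarray(x, dtype=float)
    y_arr = np.asarray(y, dtype=float)
    if x_arr.shape != y_arr.shape:
        raise ValueError("x and y must have the same length")
    return correlation_significance(x_arr, y_arr)


# Plain lists, with one missing value encoded as NaN
corr, p_value, n = correlation_from_sequences(
    [1, 2, 3, 4, 5, float("nan")],
    [2, 4, 5, 4, 5, 6],
)
print(corr, p_value, n)
```

The explicit shape check is a design choice of this sketch: it reports a length mismatch with a clear message instead of relying on whatever error the mask computation would raise.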
Return Value
Returns a tuple of three values: (corr, p_value, n_samples). 'corr' is the Pearson correlation coefficient (float between -1 and 1, or None if insufficient data). 'p_value' is the two-tailed p-value for testing non-correlation (float between 0 and 1, or None if insufficient data). 'n_samples' is the number of valid (non-NaN) paired observations used in the calculation (integer). Note that 'n_samples' is reported as 0 whenever fewer than 3 valid pairs remain, even if 1 or 2 valid pairs exist; otherwise it is at least 3.
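Since corr and p_value can be None, callers generally need to check for that case before formatting or comparing the values. A minimal sketch of such a check follows; the `describe_correlation` helper and the default alpha of 0.05 are assumptions made for illustration, not part of the original script.

```python
def describe_correlation(corr, p_value, n_samples, alpha=0.05):
    """Hypothetical helper: turn the returned tuple into a readable summary."""
    if corr is None:
        # The function returned (None, None, 0)
        return "insufficient data (fewer than 3 valid pairs)"
    verdict = "significant" if p_value < alpha else "not significant"
    return f"r={corr:.3f}, p={p_value:.3f}, n={n_samples} ({verdict} at alpha={alpha})"


print(describe_correlation(None, None, 0))
print(describe_correlation(0.775, 0.124, 5))
```

In practice the three arguments would come straight from a call such as `describe_correlation(*correlation_significance(x, y))`.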
Dependencies
numpy, scipy
Required Imports
import numpy as np
from scipy import stats
Usage Example
import numpy as np
from scipy import stats
def correlation_significance(x, y):
    mask = ~(np.isnan(x) | np.isnan(y))
    x_clean = x[mask]
    y_clean = y[mask]
    if len(x_clean) < 3:
        return None, None, 0
    corr, p_value = stats.pearsonr(x_clean, y_clean)
    return corr, p_value, len(x_clean)
# Example usage
x = np.array([1, 2, 3, 4, 5, np.nan, 7])
y = np.array([2, 4, 5, 4, 5, 6, np.nan])
corr, p_val, n = correlation_significance(x, y)
print(f"Correlation: {corr:.3f}, P-value: {p_val:.3f}, N: {n}")
# Output: Correlation: 0.775, P-value: 0.124, N: 5
Best Practices
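- Ensure both input arrays have the same length before calling the function; it does not validate this itself, and mismatched lengths will generally fail when the two NaN masks are combined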
- The function requires at least 3 valid (non-NaN) paired observations to calculate correlation; otherwise it returns (None, None, 0)
- NaN values are automatically removed pairwise - if either x[i] or y[i] is NaN, both values at index i are excluded
- The p-value tests the null hypothesis that there is no linear correlation between the variables
- Pearson correlation assumes linear relationships and is sensitive to outliers
- Consider checking the returned n_samples value to ensure sufficient data was available for meaningful analysis
- For non-linear relationships, consider using Spearman or Kendall correlation instead
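The last point above can be sketched as a rank-based counterpart that reuses the same pairwise NaN filtering and the same three-value return shape. `spearman_significance` is a hypothetical name chosen for this example and is not part of the original script; it relies on `scipy.stats.spearmanr`.

```python
import numpy as np
from scipy import stats


def spearman_significance(x, y):
    """Hypothetical variant: Spearman rank correlation with pairwise NaN removal."""
    mask = ~(np.isnan(x) | np.isnan(y))
    x_clean = x[mask]
    y_clean = y[mask]
    if len(x_clean) < 3:
        return None, None, 0
    rho, p_value = stats.spearmanr(x_clean, y_clean)
    return rho, p_value, len(x_clean)


# Monotonic but non-linear relationship (illustrative data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, np.nan])
y = x ** 3

rho, p_value, n = spearman_significance(x, y)
print(rho, p_value, n)
```

Because Spearman works on ranks, a perfectly monotonic relationship such as y = x ** 3 yields a coefficient of 1 even though the relationship is not linear.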
Similar Components
AI-powered semantic similarity - components with related functionality:
- function calculate_correlations (64.4% similar)
- function create_correlation_heatmap (52.8% similar)
- function grouped_correlation_analysis (49.8% similar)
- function calculate_cv_v1 (47.7% similar)
- function export_results (47.6% similar)