function detect_outliers_zscore
Detects outliers in numerical data using the Z-score statistical method, identifying data points that deviate significantly from the mean.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/328d2f87-3367-495e-89f7-e633ff8c5b3d/analysis_2.py
86 - 93
simple
Purpose
This function identifies outliers in a dataset by calculating Z-scores (standard deviations from the mean) for each data point and flagging those that exceed a specified threshold. It's commonly used in data cleaning, anomaly detection, and statistical analysis to identify unusual observations that may warrant further investigation or removal. The Z-score method assumes the data follows a normal distribution and is most effective for univariate outlier detection.
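As a rough illustration of the calculation described above, the sketch below computes Z-scores by hand with NumPy and compares them with `scipy.stats.zscore`. The sample values are made up for illustration, and the comparison relies on both NumPy and SciPy defaulting to the population standard deviation (`ddof=0`).

```python
import numpy as np
from scipy import stats

# Hypothetical sample values, chosen only for illustration.
values = np.array([4.0, 5.0, 6.0, 5.0, 4.0, 6.0, 5.0, 20.0])

# Z-score: distance from the mean, measured in standard deviations.
# np.std defaults to ddof=0 (population standard deviation), which is
# also the default used by scipy.stats.zscore.
manual_z = (values - values.mean()) / values.std()

scipy_z = stats.zscore(values)

print("Manual and SciPy results agree:", np.allclose(manual_z, scipy_z))
print("Z-scores:", np.round(manual_z, 2))
```

The point with the largest absolute Z-score is the one furthest from the mean, which is the quantity `detect_outliers_zscore` compares against `threshold`.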
Source Code
def detect_outliers_zscore(data, threshold=3):
    """
    Detect outliers using Z-score method
    Returns: indices of outliers
    """
    z_scores = np.abs(stats.zscore(data))
    outliers = z_scores > threshold
    return outliers
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| data | - | - | positional_or_keyword |
| threshold | - | 3 | positional_or_keyword |
Parameter Details
data: A 1D array-like object (list, NumPy array, pandas Series) containing the numerical values to analyze for outliers. It should be free of NaN values, which by default propagate into every Z-score. The method works best when the data is approximately normally distributed.
threshold: A positive number (default=3) giving how many standard deviations from the mean a data point must lie to be considered an outlier. Common values are 2, 2.5, and 3; for normally distributed data, about 95% of values fall within 2 standard deviations of the mean and about 99.7% within 3. Higher thresholds flag fewer points.
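Because missing values interfere with the Z-score calculation, one possible approach is to score only the finite entries and leave the rest unflagged. The following is a minimal sketch under that assumption; the name `detect_outliers_zscore_finite` is hypothetical and not part of the original script, and treating NaN or infinite entries as "not an outlier" is a design choice that may not suit every analysis.

```python
import numpy as np
from scipy import stats

def detect_outliers_zscore_finite(data, threshold=3):
    """Hypothetical variant: score finite values only, never flag NaN/inf."""
    values = np.asarray(data, dtype=float)
    finite = np.isfinite(values)

    # Start with nothing flagged; non-finite positions stay False.
    mask = np.zeros(values.shape, dtype=bool)

    # A Z-score needs at least two finite values to be meaningful.
    if finite.sum() > 1:
        z_scores = np.abs(stats.zscore(values[finite]))
        mask[finite] = z_scores > threshold
    return mask

# Made-up sample with one missing value and one extreme value.
sample = [10, 12, 11, 13, 12, np.nan, 11, 12, 14, 13, 12, 11, 100]
mask = detect_outliers_zscore_finite(sample)
print("Flagged indices:", np.where(mask)[0])
```

The returned mask has the same length as the input, so it can still be used to index the original data.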
Return Value
Returns a boolean numpy array of the same length as the input data, where True indicates an outlier at that index position and False indicates a normal value. Despite the docstring saying 'indices of outliers', the function actually returns a boolean mask that can be used to filter or index the original data.
Dependencies
- numpy
- scipy
Required Imports
import numpy as np
from scipy import stats
Usage Example
import numpy as np
import pandas as pd
from scipy import stats

def detect_outliers_zscore(data, threshold=3):
    z_scores = np.abs(stats.zscore(data))
    outliers = z_scores > threshold
    return outliers

# Example usage
# With only 10 points, no |Z| can exceed sqrt(10 - 1) = 3, so the default
# threshold of 3 would flag nothing here; 2.5 is used instead.
data = [10, 12, 11, 13, 12, 100, 11, 12, 14, 13]
outlier_mask = detect_outliers_zscore(data, threshold=2.5)
print(f"Outlier mask: {outlier_mask}")
print(f"Outlier values: {np.array(data)[outlier_mask]}")
print(f"Outlier indices: {np.where(outlier_mask)[0]}")

# With pandas Series
df = pd.DataFrame({'values': data})
df['is_outlier'] = detect_outliers_zscore(df['values'], threshold=2.5)
print(df)
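For a sample of ten values such as the one above, the choice of threshold deserves a quick check. With the population standard deviation that `scipy.stats.zscore` uses by default, no absolute Z-score in a sample of size n can exceed sqrt(n - 1), which is exactly 3 when n = 10. The sketch below, using the same sample data, shows how close the value 100 comes to that ceiling and how many points each threshold flags.

```python
import numpy as np
from scipy import stats

data = np.array([10, 12, 11, 13, 12, 100, 11, 12, 14, 13])

z_scores = np.abs(stats.zscore(data))  # population std (ddof=0) by default
ceiling = np.sqrt(len(data) - 1)       # largest |Z| any single point can reach

print(f"Largest |Z| in the sample: {z_scores.max():.3f}")
print(f"Upper bound sqrt(n - 1):   {ceiling:.3f}")
print(f"Flagged at threshold 3:    {(z_scores > 3).sum()}")
print(f"Flagged at threshold 2.5:  {(z_scores > 2.5).sum()}")
```

Since the comparison in `detect_outliers_zscore` is strict (`>`), a threshold of 3 cannot flag anything in a sample of ten or fewer points; a lower threshold or a larger sample is needed.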
Best Practices
- Ensure input data does not contain NaN or infinite values: by default a single such value makes scipy.stats.zscore return NaN for every element, so the comparison with the threshold flags nothing
- The Z-score method assumes data follows a normal distribution; consider using alternative methods (IQR, modified Z-score) for skewed distributions
- A threshold of 3 is a common default (about 99.7% of normally distributed values lie within 3 standard deviations of the mean), but adjust it based on your domain and tolerance for false positives
- The function returns a boolean mask, not indices; use np.where(outliers)[0] to get actual indices
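- For small datasets (n < 30), the Z-score method may not be reliable: with SciPy's default population standard deviation no |Z| can exceed sqrt(n - 1), so a threshold of 3 flags nothing when n <= 10; consider using other outlier detection methods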
- Consider handling outliers appropriately after detection (removal, transformation, or investigation) rather than automatic deletion
- For multivariate outlier detection, consider using Mahalanobis distance or other multivariate methods instead
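The modified Z-score mentioned above as an alternative for skewed data can be sketched as follows. This is only an illustration, not code from the original script: the function name `detect_outliers_modified_zscore` is hypothetical, and the constant 0.6745 and cutoff 3.5 follow the commonly cited Iglewicz and Hoaglin recommendation. It replaces the mean and standard deviation with the median and the median absolute deviation (MAD), which a single extreme value influences far less.

```python
import numpy as np

def detect_outliers_modified_zscore(data, threshold=3.5):
    """Hypothetical robust variant based on the median and MAD."""
    values = np.asarray(data, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))

    # If more than half the values are identical, MAD is 0 and the score
    # is undefined; this sketch simply flags nothing in that case.
    if mad == 0:
        return np.zeros(values.shape, dtype=bool)

    modified_z = 0.6745 * (values - median) / mad
    return np.abs(modified_z) > threshold

data = [10, 12, 11, 13, 12, 100, 11, 12, 14, 13]
mask = detect_outliers_modified_zscore(data)
print("Outlier indices:", np.where(mask)[0])
```

Like `detect_outliers_zscore`, this returns a boolean mask, so the two can be swapped in downstream filtering code.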
Similar Components
AI-powered semantic similarity - components with related functionality:
- function detect_outliers_iqr_v2 (65.7% similar)
- function detect_outliers_iqr_v1 (65.2% similar)
- function detect_outliers_iqr (59.8% similar)
- function remove_outliers_iqr (58.1% similar)
- function remove_outliers (57.3% similar)