function detect_outliers_zscore

Maturity: 40

Detects outliers in numerical data using the Z-score statistical method, identifying data points that deviate significantly from the mean.

File: /tf/active/vicechatdev/vice_ai/smartstat_scripts/328d2f87-3367-495e-89f7-e633ff8c5b3d/analysis_2.py
Lines: 86 - 93
Complexity: simple

Purpose

This function identifies outliers in a dataset by calculating Z-scores (standard deviations from the mean) for each data point and flagging those that exceed a specified threshold. It's commonly used in data cleaning, anomaly detection, and statistical analysis to identify unusual observations that may warrant further investigation or removal. The Z-score method assumes the data follows a normal distribution and is most effective for univariate outlier detection.
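
As a rough illustration of the calculation described above, the sketch below computes Z-scores by hand with NumPy only. The helper name zscores_by_hand is hypothetical and not part of the source file; it uses the population standard deviation (ddof=0), which is also the default in scipy.stats.zscore.

```python
import numpy as np

def zscores_by_hand(data):
    """Hypothetical helper: (x - mean) / std with population std (ddof=0)."""
    x = np.asarray(data, dtype=float)
    return (x - x.mean()) / x.std()  # np.std defaults to ddof=0

# Small sample with mean 5 and population standard deviation 2
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
z = zscores_by_hand(sample)
print(z)  # [-1.5 -0.5 -0.5 -0.5  0.   0.   1.   2. ]
```

Taking the absolute value of these scores and comparing against the threshold gives the same kind of mask that detect_outliers_zscore returns.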

Source Code

def detect_outliers_zscore(data, threshold=3):
    """
    Detect outliers using Z-score method
    Returns: indices of outliers
    """
    z_scores = np.abs(stats.zscore(data))
    outliers = z_scores > threshold
    return outliers

Parameters

Name       Type  Default  Kind
data       -     -        positional_or_keyword
threshold  -     3        positional_or_keyword

Parameter Details

data: A 1D array-like object (list, NumPy array, pandas Series) of numerical values to check for outliers. It should not contain NaN values: under scipy.stats.zscore's default nan_policy='propagate', a single NaN turns every Z-score into NaN, and the comparison against the threshold then flags nothing. Results are most meaningful when the data is approximately normally distributed.

threshold: A positive number (default 3) giving how many standard deviations from the mean a value must exceed to be flagged as an outlier. Common values are 2, 2.5, and 3; for normally distributed data, about 95% of values lie within 2 standard deviations of the mean and about 99.7% within 3. Higher thresholds flag fewer points.
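
To make the threshold trade-off concrete, the following sketch estimates what fraction of a perfectly normal distribution lies within k standard deviations of the mean. It uses only the standard library; normal_coverage is a hypothetical helper name, and real data will rarely match the normal assumption exactly.

```python
import math

def normal_coverage(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (2, 2.5, 3):
    inside = normal_coverage(k)
    # The remainder is the share of ordinary points a threshold of k would flag
    print(f"threshold={k}: {inside:.2%} inside, {1 - inside:.2%} flagged")
```

Under this assumption, a threshold of 2 flags roughly 1 in 22 ordinary points, while a threshold of 3 flags roughly 1 in 370.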

Return Value

Returns a boolean numpy array of the same length as the input data, where True indicates an outlier at that index position and False indicates a normal value. Despite the docstring saying 'indices of outliers', the function actually returns a boolean mask that can be used to filter or index the original data.

Dependencies

  • numpy
  • scipy

Required Imports

import numpy as np
from scipy import stats

Usage Example

import numpy as np
from scipy import stats

def detect_outliers_zscore(data, threshold=3):
    z_scores = np.abs(stats.zscore(data))
    outliers = z_scores > threshold
    return outliers

# Example usage
# threshold=2.5 is used because the value 100 only reaches |z| of about 2.997
# in this 10-point sample, so the default threshold of 3 would flag nothing.
data = [10, 12, 11, 13, 12, 100, 11, 12, 14, 13]
outlier_mask = detect_outliers_zscore(data, threshold=2.5)
print(f"Outlier mask: {outlier_mask}")
print(f"Outlier values: {np.array(data)[outlier_mask]}")  # [100]
print(f"Outlier indices: {np.where(outlier_mask)[0]}")    # [5]

# With a pandas Series
import pandas as pd
df = pd.DataFrame({'values': data})
df['is_outlier'] = detect_outliers_zscore(df['values'], threshold=2.5)
print(df)
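
Sample size limits which thresholds can ever fire. With the population standard deviation (ddof=0, the scipy.stats.zscore default), no |z| can exceed sqrt(n - 1). The sketch below checks this on the 10-point sample from the example, using NumPy only so that it runs without SciPy.

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 100, 11, 12, 14, 13], dtype=float)

# Same calculation as scipy.stats.zscore with its default ddof=0
z = np.abs((data - data.mean()) / data.std())
bound = np.sqrt(len(data) - 1)  # upper limit on |z| for n points

print(f"largest |z|: {z.max():.3f}, bound sqrt(n-1): {bound:.3f}")
print("flagged at threshold 3:  ", np.where(z > 3)[0])    # []
print("flagged at threshold 2.5:", np.where(z > 2.5)[0])  # [5]
```

The extreme value inflates the very standard deviation it is measured against, which is why it stays just under 3 here.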

Best Practices

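  • Ensure input data does not contain NaN or infinite values; they make the mean and standard deviation undefined, so scipy.stats.zscore (with its default nan_policy='propagate') returns NaN for every element and the function silently reports no outliers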
  • The Z-score method assumes data follows a normal distribution; consider using alternative methods (IQR, modified Z-score) for skewed distributions
  • A threshold of 3 is standard (about 99.7% of normally distributed values fall within 3 standard deviations of the mean), but adjust based on your domain and tolerance for false positives
  • The function returns a boolean mask, not indices; use np.where(outliers)[0] to get actual indices
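  • For small datasets (n < 30), the Z-score method may not be reliable, because a single extreme value inflates the standard deviation it is measured against; with scipy's default ddof=0, |z| can never exceed sqrt(n - 1), so with 10 or fewer points nothing can exceed a threshold of 3; consider using other outlier detection methods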
  • Consider handling outliers appropriately after detection (removal, transformation, or investigation) rather than automatic deletion
  • For multivariate outlier detection, consider using Mahalanobis distance or other multivariate methods instead
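
The list above mentions the modified Z-score as an alternative for skewed data and warns about NaN inputs. A minimal sketch of both ideas follows. The function detect_outliers_modified_zscore is hypothetical and not part of the source file; the constant 0.6745 and the cutoff 3.5 are the commonly cited defaults for this method, and returning an all-False mask when the MAD is zero is an assumption made here for simplicity.

```python
import numpy as np

def detect_outliers_modified_zscore(data, threshold=3.5):
    """Hypothetical helper: boolean outlier mask from the median/MAD-based score."""
    x = np.asarray(data, dtype=float)
    median = np.nanmedian(x)                 # NaN values are ignored
    mad = np.nanmedian(np.abs(x - median))   # median absolute deviation
    if mad == 0:
        # At least half the values equal the median; the score is undefined
        return np.zeros(x.shape, dtype=bool)
    modified_z = 0.6745 * (x - median) / mad
    with np.errstate(invalid="ignore"):
        # NaN compares as False, so missing values are never flagged
        return np.abs(modified_z) > threshold

values = [10, 12, 11, 13, 12, 100, 11, 12, 14, 13, np.nan]
mask = detect_outliers_modified_zscore(values)
print(np.where(mask)[0])  # [5]
```

Because the median and MAD are barely affected by the extreme value, the point at index 5 is flagged even in this small sample, and the trailing NaN does not disturb the other scores.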

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function detect_outliers_iqr_v2 65.7% similar

    Detects statistical outliers in a dataset using the Interquartile Range (IQR) method with a conservative 3×IQR threshold.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/84b9ac09-e646-4422-9d3a-e9f96529a553/analysis_1.py
  • function detect_outliers_iqr_v1 65.2% similar

    Detects outliers in a dataset using the Interquartile Range (IQR) method, returning boolean indices of outliers and the calculated bounds.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/328d2f87-3367-495e-89f7-e633ff8c5b3d/analysis_2.py
  • function detect_outliers_iqr 59.8% similar

    Detects extreme outliers in a pandas Series using the Interquartile Range (IQR) method with a configurable multiplier (default 3.0).

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/5021ab2a-8cdd-44cb-81ad-201598352e39/analysis_1.py
  • function remove_outliers_iqr 58.1% similar

    Removes outliers from a pandas DataFrame column using the Interquartile Range (IQR) method with a conservative 3*IQR threshold.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/42b81361-ba7e-4d79-9598-3090af68384b/analysis_2.py
  • function remove_outliers 57.3% similar

    Removes outliers from a pandas DataFrame based on the Interquartile Range (IQR) method for a specified column.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/f5da873e-41e6-4f34-b3e4-f7443d4d213b/analysis_5.py