🔍 Code Extractor

function remove_outliers_iqr

Maturity: 44

Removes outliers from a pandas DataFrame column using the Interquartile Range (IQR) method with a conservative 3*IQR threshold.

File: /tf/active/vicechatdev/vice_ai/smartstat_scripts/42b81361-ba7e-4d79-9598-3090af68384b/analysis_2.py
Lines: 63 - 74
Complexity: simple

Purpose

This function identifies and removes statistical outliers from a specified numeric column in a pandas DataFrame. It computes the first and third quartiles (Q1, Q3), takes IQR = Q3 - Q1, and keeps only rows whose value lies between Q1 - 3*IQR and Q3 + 3*IQR, inclusive. The 3*IQR multiplier (instead of the standard 1.5*IQR) makes removal more conservative: only extreme outliers are dropped. The function returns both the filtered dataset and a count of the values that fell outside the bounds, which makes it useful for data preprocessing, exploratory data analysis, and preparing datasets for statistical modeling or machine learning.
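To make the effect of the multiplier concrete, the following sketch computes the bounds for both 1.5*IQR and 3*IQR on a small sample. The helper names `iqr_fences` and `count_flagged` and the sample values are assumptions made for illustration; they are not part of the documented source file.

```python
import pandas as pd


def iqr_fences(series, multiplier):
    """Return (lower, upper) bounds for a given IQR multiplier (hypothetical helper)."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


def count_flagged(series, multiplier):
    """Count values strictly outside the bounds (hypothetical helper)."""
    lower, upper = iqr_fences(series, multiplier)
    return int(((series < lower) | (series > upper)).sum())


# Illustrative sample: two extreme values (-50, 100) and one mild one (19)
values = pd.Series([10, 12, 11, 13, 12, 14, 100, 11, 13, 12, -50, 14, 19])

for multiplier in (1.5, 3):
    lower, upper = iqr_fences(values, multiplier)
    flagged = count_flagged(values, multiplier)
    print(f"{multiplier}*IQR bounds: [{lower}, {upper}], flagged: {flagged}")
```

For this sample, Q1 = 11 and Q3 = 14, so IQR = 3. The 1.5*IQR bounds are 6.5 and 18.5 and flag three values (-50, 19, 100), while the 3*IQR bounds are 2 and 23 and flag only the two extreme values.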

Source Code

def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR  # Using 3*IQR for more conservative outlier removal
    upper_bound = Q3 + 3 * IQR
    
    outliers_count = ((data[column] < lower_bound) | (data[column] > upper_bound)).sum()
    clean_data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    
    return clean_data, outliers_count

Parameters

Name    Type  Default  Kind
data    -     -        positional_or_keyword
column  -     -        positional_or_keyword

Parameter Details

data: A pandas DataFrame containing the dataset to be cleaned. Must be a valid DataFrame object with at least one numeric column.

column: String name of the column in the DataFrame from which to remove outliers. The column must exist in the DataFrame and should contain numeric data (int or float) for quantile calculations to work properly.

Return Value

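Returns a tuple containing two elements: (1) clean_data - a new pandas DataFrame with the same columns as the input, containing only the rows whose value in the specified column lies within the bounds (the original index labels are preserved, and the input DataFrame is left unchanged), and (2) outliers_count - an integer count (a NumPy integer as returned by Series.sum()) of the values that fell below the lower bound or above the upper bound. Note that rows with a missing value (NaN) in the column are excluded from clean_data but are not included in outliers_count, so the count matches the number of removed rows only when the column has no missing values.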
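The relationship between the two return values can be checked with a short sketch. It copies the function from the Source Code section and runs it on sample data that includes one missing value; the sample data and variable names are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd


def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method (copied from the Source Code section)."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR

    outliers_count = ((data[column] < lower_bound) | (data[column] > upper_bound)).sum()
    clean_data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

    return clean_data, outliers_count


# Illustrative data: two extreme values plus one missing value
df = pd.DataFrame(
    {"values": [10, 12, 11, 13, 12, 14, 100, 11, 13, 12, -50, 14, np.nan]}
)

clean, n_outliers = remove_outliers_iqr(df, "values")
rows_dropped = len(df) - len(clean)

# Comparisons against NaN evaluate to False, so the NaN row is neither
# counted as an outlier nor kept in the filtered result.
print(f"outliers_count: {int(n_outliers)}, rows actually dropped: {rows_dropped}")
```

Here the count is 2 (the values 100 and -50) while three rows are dropped, because the NaN row fails both the outlier test and the keep test. Wrapping the count in `int()` converts the NumPy integer to a plain Python integer when that matters, for example before JSON serialization.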

Dependencies

  • pandas

Required Imports

import pandas as pd

Usage Example

import pandas as pd

# Create sample data with outliers
data = pd.DataFrame({
    'values': [10, 12, 11, 13, 12, 14, 100, 11, 13, 12, -50, 14],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

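# Remove outliers from 'values' column
# (assumes remove_outliers_iqr is defined as in the Source Code section)
clean_data, num_outliers = remove_outliers_iqr(data, 'values')

print(f"Original data shape: {data.shape}")            # Original data shape: (12, 2)
print(f"Clean data shape: {clean_data.shape}")         # Clean data shape: (10, 2)
print(f"Number of outliers removed: {num_outliers}")   # Number of outliers removed: 2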

Best Practices

  • Ensure the specified column contains numeric data before calling this function; quantile calculations typically raise an error on non-numeric columns, so validate data types beforehand
  • Be aware that this function uses 3*IQR instead of the standard 1.5*IQR, making it more conservative and only removing extreme outliers
  • Always check the outliers_count return value to understand how much data was removed
  • Consider visualizing the data distribution before and after outlier removal to validate the cleaning process
  • The function does not modify the input DataFrame; it returns a filtered result that keeps the original index labels, so call reset_index(drop=True) on the result if a contiguous index is needed
  • The function assumes the column exists in the DataFrame; add error handling for missing columns in production code
  • Rows with missing values (NaN) in the column are dropped from the result but not included in outliers_count; handle missing values beforehand if the count must match the number of rows removed
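
The validation steps suggested above could be sketched as a thin wrapper. The name `safe_remove_outliers_iqr` and the choice of exceptions are assumptions for illustration, not part of the documented source; the original function is copied here so the example is self-contained.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method (copied from the Source Code section)."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR

    outliers_count = ((data[column] < lower_bound) | (data[column] > upper_bound)).sum()
    clean_data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]

    return clean_data, outliers_count


def safe_remove_outliers_iqr(data, column):
    """Hypothetical wrapper: validate inputs, then delegate to remove_outliers_iqr."""
    if column not in data.columns:
        raise KeyError(f"Column {column!r} not found in DataFrame")
    if not is_numeric_dtype(data[column]):
        raise TypeError(f"Column {column!r} must be numeric, got {data[column].dtype}")
    return remove_outliers_iqr(data, column)


df = pd.DataFrame({
    "values": [10, 12, 11, 13, 12, 14, 100, 11, 13, 12, -50, 14],
    "category": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
})

clean, n_outliers = safe_remove_outliers_iqr(df, "values")

# The input is left untouched; the result keeps the original index labels,
# so reset the index if a contiguous 0..n-1 index is needed.
clean = clean.reset_index(drop=True)
print(len(df), len(clean), int(n_outliers))
```

Calling the wrapper with `"category"` raises a `TypeError`, and calling it with a column name that does not exist raises a `KeyError`, in both cases before any quantile is computed.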

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function remove_outliers_iqr_v1 97.3% similar

    Removes outliers from a pandas DataFrame column using the Interquartile Range (IQR) method with a 3×IQR threshold.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/42b81361-ba7e-4d79-9598-3090af68384b/project_1/analysis.py
  • function remove_outliers 95.1% similar

    Removes outliers from a pandas DataFrame based on the Interquartile Range (IQR) method for a specified column.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/f5da873e-41e6-4f34-b3e4-f7443d4d213b/analysis_5.py
  • function detect_outliers_iqr_v2 89.6% similar

    Detects statistical outliers in a dataset using the Interquartile Range (IQR) method with a conservative 3×IQR threshold.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/84b9ac09-e646-4422-9d3a-e9f96529a553/analysis_1.py
  • function detect_outliers_iqr 88.6% similar

    Detects extreme outliers in a pandas Series using the Interquartile Range (IQR) method with a configurable multiplier (default 3.0).

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/5021ab2a-8cdd-44cb-81ad-201598352e39/analysis_1.py
  • function detect_outliers_iqr_v1 83.9% similar

    Detects outliers in a dataset using the Interquartile Range (IQR) method, returning boolean indices of outliers and the calculated bounds.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/328d2f87-3367-495e-89f7-e633ff8c5b3d/analysis_2.py