
function remove_outliers

Maturity: 29

Removes outliers from a pandas DataFrame based on the Interquartile Range (IQR) method for a specified column.

File: /tf/active/vicechatdev/vice_ai/smartstat_scripts/f5da873e-41e6-4f34-b3e4-f7443d4d213b/analysis_5.py
Lines: 26 - 32
Complexity: simple

Purpose

This function identifies and filters out statistical outliers using the IQR method, which is robust because quartiles are largely unaffected by extreme values. It calculates the first quartile (Q1), the third quartile (Q3), and the IQR (Q3 - Q1), then keeps only the rows whose value lies between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, inclusive. The technique is common in data preprocessing and exploratory data analysis, where extreme values can skew summary statistics and model fits.
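
The bound calculation can be traced by hand on a small series. The sketch below uses made-up sample values (not taken from the original script) and relies on the default linear interpolation of pandas' quantile().

```python
import pandas as pd

# Hypothetical sample values; 100 and 200 are the intended outliers.
values = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 100, 200])

q1 = values.quantile(0.25)      # 13.25 (linear interpolation between 13 and 14)
q3 = values.quantile(0.75)      # 17.75 (linear interpolation between 17 and 18)
iqr = q3 - q1                   # 4.5
lower_bound = q1 - 1.5 * iqr    # 6.5
upper_bound = q3 + 1.5 * iqr    # 24.5

# Values outside [lower_bound, upper_bound] are the ones the filter drops.
outliers = values[(values < lower_bound) | (values > upper_bound)]

print(lower_bound, upper_bound)  # 6.5 24.5
print(outliers.tolist())         # [100, 200]
```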

Source Code

def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

Parameters

Name    Type  Default  Kind
df      -     -        positional_or_keyword
column  -     -        positional_or_keyword

Parameter Details

df: A pandas DataFrame containing the data to be filtered. Must be a valid DataFrame object with at least one numeric column.

column: String representing the name of the column in the DataFrame to check for outliers. The column must exist in the DataFrame and should contain numeric data (int or float) for quantile calculations to work properly.

Return Value

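Returns a filtered pandas DataFrame containing only the rows where the specified column's values fall within the acceptable range (between lower_bound and upper_bound, inclusive). The result has the same columns as the input and keeps the original index labels of the surviving rows; it is not re-indexed. Rows whose value in the column is NaN are also dropped, because comparisons against NaN evaluate to False. If no rows are excluded, the result has the same contents as the input, but it is still a new DataFrame object rather than the original.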
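
The return behaviour can be checked with a short sketch. The function body is copied from the Source Code listing; the sample frames are invented for illustration.

```python
import numpy as np
import pandas as pd

def remove_outliers(df, column):
    # Same logic as the Source Code listing.
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]

# Hypothetical data: one missing value (index 3) and one extreme value (index 5).
df = pd.DataFrame({"values": [10.0, 12.0, 13.0, np.nan, 15.0, 500.0]})
result = remove_outliers(df, "values")

print(result.index.tolist())  # [0, 1, 2, 4] -> NaN row and 500.0 row are gone
print(len(df))                # 6 -> the input frame is untouched

# Even when nothing is filtered out, a new object comes back.
no_outliers = pd.DataFrame({"values": [1, 2, 3, 4]})
same_rows = remove_outliers(no_outliers, "values")
print(same_rows is no_outliers)  # False
```

If a contiguous index is needed afterwards, calling reset_index(drop=True) on the result is one option.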

Dependencies

  • pandas

Required Imports

import pandas as pd

Usage Example

import pandas as pd

# Create sample data with outliers
data = {'values': [10, 12, 13, 14, 15, 16, 17, 18, 100, 200],
        'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']}
df = pd.DataFrame(data)

# Remove outliers from 'values' column
df_cleaned = remove_outliers(df, 'values')

print(f"Original shape: {df.shape}")        # Original shape: (10, 2)
print(f"Cleaned shape: {df_cleaned.shape}")  # Cleaned shape: (8, 2); rows with 100 and 200 are removed
print(df_cleaned)

Best Practices

  • Ensure the specified column contains numeric data before calling this function to avoid errors
  • The function does not modify the input DataFrame, but the returned DataFrame may be significantly smaller if many values fall outside the bounds
  • The 1.5 * IQR multiplier is a standard threshold, but consider creating a parameterized version if different sensitivity levels are needed
  • Always inspect the data before and after outlier removal to understand the impact on your dataset
  • This method assumes a roughly symmetric distribution; for highly skewed data, consider alternative outlier detection methods
  • Boolean indexing returns a new DataFrame, so the original remains unchanged; assign the result to a variable to keep the filtered data
  • Decide how to handle missing values (NaN) before applying this function: quantile() skips them when computing Q1 and Q3, but the range filter drops every row where the column is NaN
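
Several of these points can be combined into a parameterized variant. The following is only a sketch: the function name remove_outliers_with_multiplier and the multiplier and keep_na parameters are hypothetical and do not appear in the original script.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

def remove_outliers_with_multiplier(df, column, multiplier=1.5, keep_na=False):
    """Hypothetical variant of remove_outliers; not part of the original script."""
    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found in DataFrame")
    if not is_numeric_dtype(df[column]):
        raise TypeError(f"Column {column!r} must be numeric")

    q1 = df[column].quantile(0.25)  # NaN values are skipped here
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1

    # between() is inclusive on both ends by default and is False for NaN.
    mask = df[column].between(q1 - multiplier * iqr, q3 + multiplier * iqr)
    if keep_na:
        mask = mask | df[column].isna()
    return df[mask].copy()

# Made-up data: 30 is a mild outlier, 200 an extreme one, plus one missing value.
df = pd.DataFrame({"values": [10, 12, 13, 14, 15, 16, 17, 18, 30, 200, np.nan]})

strict = remove_outliers_with_multiplier(df, "values")                   # drops 30, 200, NaN
lenient = remove_outliers_with_multiplier(df, "values", multiplier=3.0)  # drops 200, NaN
with_na = remove_outliers_with_multiplier(df, "values", keep_na=True)    # drops 30, 200

print(len(strict), len(lenient), len(with_na))  # 8 9 9
```

The explicit copy() at the end is a design choice: it makes it unambiguous that later edits to the result do not touch the input frame.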

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function remove_outliers_iqr_v1 95.8% similar

    Removes outliers from a pandas DataFrame column using the Interquartile Range (IQR) method with a 3×IQR threshold.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/42b81361-ba7e-4d79-9598-3090af68384b/project_1/analysis.py
  • function remove_outliers_iqr 95.1% similar

    Removes outliers from a pandas DataFrame column using the Interquartile Range (IQR) method with a conservative 3*IQR threshold.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/42b81361-ba7e-4d79-9598-3090af68384b/analysis_2.py
  • function detect_outliers_iqr_v2 88.5% similar

    Detects statistical outliers in a dataset using the Interquartile Range (IQR) method with a conservative 3×IQR threshold.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/84b9ac09-e646-4422-9d3a-e9f96529a553/analysis_1.py
  • function detect_outliers_iqr 86.8% similar

    Detects extreme outliers in a pandas Series using the Interquartile Range (IQR) method with a configurable multiplier (default 3.0).

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/5021ab2a-8cdd-44cb-81ad-201598352e39/analysis_1.py
  • function detect_outliers_iqr_v1 85.3% similar

    Detects outliers in a dataset using the Interquartile Range (IQR) method, returning boolean indices of outliers and the calculated bounds.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/328d2f87-3367-495e-89f7-e633ff8c5b3d/analysis_2.py