function remove_outliers_iqr
Removes outliers from a pandas DataFrame column using the Interquartile Range (IQR) method with a conservative 3*IQR threshold.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/42b81361-ba7e-4d79-9598-3090af68384b/analysis_2.py
63 - 74
simple
Purpose
This function identifies and removes statistical outliers from a specified numeric column in a pandas DataFrame. It uses the IQR method with a 3*IQR multiplier (instead of the standard 1.5*IQR) for more conservative outlier detection, meaning it only removes extreme outliers. The function returns both the cleaned dataset and a count of removed outliers, making it useful for data preprocessing, exploratory data analysis, and preparing datasets for statistical modeling or machine learning.
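To make the difference between the two multipliers concrete, here is a minimal sketch that computes the IQR fences for both 1.5 and 3. The helper name `iqr_bounds`, its parameter `k`, and the sample values are assumptions made for illustration; they are not part of the documented function, which hard-codes the multiplier 3.

```python
import pandas as pd

def iqr_bounds(series, k):
    """Hypothetical helper: return (lower, upper) fences at k * IQR beyond the quartiles."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative sample: 20 is a moderate outlier, 30 is an extreme one
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 20, 30])

standard = iqr_bounds(values, 1.5)     # the usual 1.5*IQR rule
conservative = iqr_bounds(values, 3)   # the multiplier used by remove_outliers_iqr

# Count how many values fall outside each pair of fences
flagged_standard = int(((values < standard[0]) | (values > standard[1])).sum())
flagged_conservative = int(((values < conservative[0]) | (values > conservative[1])).sum())

print(standard, flagged_standard)
print(conservative, flagged_conservative)
```

On this sample the 1.5 multiplier flags both 20 and 30, while the 3 multiplier flags only 30, which is the behavior the description calls "conservative".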
Source Code
def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR  # Using 3*IQR for more conservative outlier removal
    upper_bound = Q3 + 3 * IQR
    outliers_count = ((data[column] < lower_bound) | (data[column] > upper_bound)).sum()
    clean_data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    return clean_data, outliers_count
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| data | - | - | positional_or_keyword |
| column | - | - | positional_or_keyword |
Parameter Details
data: A pandas DataFrame containing the dataset to be cleaned. Must be a valid DataFrame object with at least one numeric column.
column: String name of the column in the DataFrame from which to remove outliers. The column must exist in the DataFrame and should contain numeric data (int or float) for quantile calculations to work properly.
Return Value
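Returns a tuple containing two elements: (1) clean_data - a pandas DataFrame with the same columns as the input, containing only the rows whose value in the specified column lies within the lower and upper bounds, and (2) outliers_count - an integer count (a NumPy integer, as returned by Series.sum()) of the values that fall outside those bounds. Rows with a missing (NaN) value in the column fail both bound comparisons, so they are excluded from clean_data but are not included in outliers_count; when the column contains NaN, the number of rows dropped is therefore larger than outliers_count.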
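The following sketch shows how the two return values relate when the column contains a missing value. The function body is copied from the Source Code section so the block runs on its own; the sample DataFrame is an assumption made for illustration.

```python
import numpy as np
import pandas as pd

def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method (copied from the Source Code section)."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    outliers_count = ((data[column] < lower_bound) | (data[column] > upper_bound)).sum()
    clean_data = data[(data[column] >= lower_bound) & (data[column] <= upper_bound)]
    return clean_data, outliers_count

# Illustrative data: one extreme value (500) and one missing value
df = pd.DataFrame({"values": [10, 12, 11, 13, np.nan, 12, 14, 500]})

clean, n_out = remove_outliers_iqr(df, "values")
rows_dropped = len(df) - len(clean)

print(f"outliers_count: {n_out}")       # counts only values outside the bounds
print(f"rows dropped:   {rows_dropped}")  # also includes the NaN row
```

Comparisons against NaN evaluate to False, so the NaN row is neither counted as an outlier nor kept in the result. If missing values should be preserved or counted, handle them before calling the function.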
Dependencies
pandas
Required Imports
import pandas as pd
Usage Example
import pandas as pd
# Create sample data with outliers
data = pd.DataFrame({
    'values': [10, 12, 11, 13, 12, 14, 100, 11, 13, 12, -50, 14],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})
# Remove outliers from 'values' column
clean_data, num_outliers = remove_outliers_iqr(data, 'values')
print(f"Original data shape: {data.shape}")
print(f"Clean data shape: {clean_data.shape}")
print(f"Number of outliers removed: {num_outliers}")
Best Practices
- Ensure the specified column contains numeric data before calling this function to avoid errors with quantile calculations
- Be aware that this function uses 3*IQR instead of the standard 1.5*IQR, making it more conservative and only removing extreme outliers
- Always check the outliers_count return value to understand how much data was removed
- Consider visualizing the data distribution before and after outlier removal to validate the cleaning process
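- The function does not modify the input DataFrame: boolean indexing returns a new, filtered DataFrame and the original is left intact. If you plan to assign to the result afterwards, call .copy() on it first, since pandas can otherwise emit a SettingWithCopyWarning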
- The function assumes the column exists in the DataFrame - add error handling for missing columns in production code
- For columns with non-numeric data types, this function will raise an error - validate data types beforehand
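The validation steps suggested above could be sketched as a thin wrapper. The name `remove_outliers_iqr_checked`, the `k` parameter, and the choice of exception types are assumptions made for illustration and do not appear in the original script; the filtering logic mirrors the documented function.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

def remove_outliers_iqr_checked(data, column, k=3.0):
    """Hypothetical wrapper: validate inputs, then apply the same IQR filter."""
    if column not in data.columns:
        raise KeyError(f"Column {column!r} not found in DataFrame")
    if not is_numeric_dtype(data[column]):
        raise TypeError(f"Column {column!r} must be numeric, got {data[column].dtype}")

    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr

    outliers_count = int(((data[column] < lower) | (data[column] > upper)).sum())
    in_range = (data[column] >= lower) & (data[column] <= upper)
    # .copy() returns an independent DataFrame that is safe to modify later
    return data[in_range].copy(), outliers_count

data = pd.DataFrame({
    'values': [10, 12, 11, 13, 12, 14, 100, 11, 13, 12, -50, 14],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})

clean_data, num_outliers = remove_outliers_iqr_checked(data, 'values')
print(clean_data.shape, num_outliers)

try:
    remove_outliers_iqr_checked(data, 'category')
except TypeError as exc:
    print(f"Rejected: {exc}")
```

Raising early with a descriptive message keeps failures close to their cause, rather than surfacing later as an error from inside the quantile calculation.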
Similar Components
AI-powered semantic similarity - components with related functionality:
- function remove_outliers_iqr_v1 97.3% similar
- function remove_outliers 95.1% similar
- function detect_outliers_iqr_v2 89.6% similar
- function detect_outliers_iqr 88.6% similar
- function detect_outliers_iqr_v1 83.9% similar