function remove_outliers_iqr_v1
Removes outliers from a pandas DataFrame column using the Interquartile Range (IQR) method with a 3×IQR threshold.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/42b81361-ba7e-4d79-9598-3090af68384b/project_1/analysis.py
74 - 85
simple
Purpose
This function identifies and removes statistical outliers from a specified column in a pandas DataFrame. It uses the IQR method, defining outliers as values that fall below Q1 - 3×IQR or above Q3 + 3×IQR. This is useful for data cleaning and preprocessing tasks where extreme values need to be filtered out to improve data quality and statistical analysis accuracy.
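The bounds described above can be worked through by hand on a small sample. The following is a minimal sketch, not part of the original module; the sample values are illustrative, and the quartiles rely on pandas' default linear interpolation.

```python
import pandas as pd

# Illustrative sample values (an assumption made for this sketch).
values = pd.Series([10, 12, 13, 14, 15, 100, 11, 13, 14, 12])

q1 = values.quantile(0.25)   # first quartile
q3 = values.quantile(0.75)   # third quartile
iqr = q3 - q1                # interquartile range

# Values outside [Q1 - 3*IQR, Q3 + 3*IQR] are treated as outliers.
lower_bound = q1 - 3 * iqr
upper_bound = q3 + 3 * iqr
is_outlier = (values < lower_bound) | (values > upper_bound)

print(q1, q3, iqr)                  # 12.0 14.0 2.0
print(lower_bound, upper_bound)     # 6.0 20.0
print(values[is_outlier].tolist())  # [100]
```

Only the value 100 falls outside the interval from 6 to 20, so it is the only one flagged.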
Source Code
def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method"""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    outliers_mask = (data[column] < lower_bound) | (data[column] > upper_bound)
    n_outliers = outliers_mask.sum()
    return data[~outliers_mask], n_outliers
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| data | - | - | positional_or_keyword |
| column | - | - | positional_or_keyword |
Parameter Details
data: A pandas DataFrame containing the data to be processed. Must be a valid DataFrame object with at least one column.
column: String or column identifier specifying which column in the DataFrame to analyze for outliers. The column must exist in the DataFrame and contain numeric data suitable for quantile calculations.
Return Value
Returns a tuple containing two elements: (1) a pandas DataFrame with the outlier rows removed, keeping all columns and the original index labels of the remaining rows, and (2) the number of outlier rows that were removed. The count is the sum of a boolean mask, so it is returned as a NumPy integer rather than a built-in int; wrap it in int() if a plain Python integer is required.
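The shape of the return value can be checked with a short script. This is a sketch under the assumption that the function is defined exactly as in the Source Code section; the sample DataFrame is illustrative.

```python
import pandas as pd


def remove_outliers_iqr(data, column):
    """Remove outliers using IQR method (copied from the source above)."""
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    outliers_mask = (data[column] < lower_bound) | (data[column] > upper_bound)
    n_outliers = outliers_mask.sum()
    return data[~outliers_mask], n_outliers


# Illustrative data: the value 100 sits at index label 5.
df = pd.DataFrame({"values": [10, 12, 13, 14, 15, 100, 11, 13, 14, 12]})
cleaned_df, n_outliers = remove_outliers_iqr(df, "values")

# The input DataFrame is left untouched; the result is a filtered DataFrame.
print(len(df), len(cleaned_df))  # 10 9

# The count comes from summing a boolean mask; int() gives a plain integer.
removed = int(n_outliers)
print(removed)  # 1

# Surviving rows keep their original index labels, so label 5 is missing.
kept_labels = cleaned_df.index.tolist()
print(kept_labels)  # [0, 1, 2, 3, 4, 6, 7, 8, 9]

# Optional: renumber the rows if a contiguous index is needed downstream.
renumbered = cleaned_df.reset_index(drop=True)
```

Whether to call reset_index depends on whether later code relies on the original row labels.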
Dependencies
pandas
Required Imports
import pandas as pd
Usage Example
import pandas as pd

def remove_outliers_iqr(data, column):
    Q1 = data[column].quantile(0.25)
    Q3 = data[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 3 * IQR
    upper_bound = Q3 + 3 * IQR
    outliers_mask = (data[column] < lower_bound) | (data[column] > upper_bound)
    n_outliers = outliers_mask.sum()
    return data[~outliers_mask], n_outliers

# Example usage
df = pd.DataFrame({
    'values': [10, 12, 13, 14, 15, 100, 11, 13, 14, 12],
    'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']
})
cleaned_df, num_outliers = remove_outliers_iqr(df, 'values')
print(f'Removed {num_outliers} outliers')
print(cleaned_df)
Best Practices
- Ensure the specified column contains numeric data before calling this function to avoid errors
- The function uses a 3×IQR multiplier, which flags only extreme values and therefore removes fewer rows than the common 1.5×IQR rule; adjust the multiplier in the code if different sensitivity is needed
- The function does not modify the input DataFrame; it returns a new, filtered DataFrame, so the original data remains available after the call
- Check the n_outliers return value to understand how much data was removed
- This method assumes a roughly symmetric distribution; for highly skewed data, consider alternative outlier detection methods
- The function removes entire rows where outliers are detected in the specified column, affecting all columns in the DataFrame
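Two of the practices above, checking that the column is numeric and adjusting the multiplier, can be folded into the function itself. The sketch below is a hypothetical variant, not part of the original module: the name `remove_outliers_iqr_k` and the `multiplier` parameter are assumptions introduced for illustration.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def remove_outliers_iqr_k(data, column, multiplier=3.0):
    """Hypothetical variant of remove_outliers_iqr with a configurable multiplier."""
    # Fail early with a clear message if the column is not numeric.
    if not is_numeric_dtype(data[column]):
        raise TypeError(f"Column {column!r} must be numeric")

    q1 = data[column].quantile(0.25)
    q3 = data[column].quantile(0.75)
    iqr = q3 - q1
    lower_bound = q1 - multiplier * iqr
    upper_bound = q3 + multiplier * iqr

    outliers_mask = (data[column] < lower_bound) | (data[column] > upper_bound)
    # Convert the NumPy count to a built-in int for convenience.
    return data[~outliers_mask], int(outliers_mask.sum())


# Illustrative data: 19 lies between the 1.5*IQR and 3*IQR upper bounds.
df = pd.DataFrame({"values": [10, 12, 13, 14, 15, 19, 11, 13, 14, 12]})

_, removed_default = remove_outliers_iqr_k(df, "values")
_, removed_strict = remove_outliers_iqr_k(df, "values", multiplier=1.5)
print(removed_default, removed_strict)  # 0 1
```

With Q1 = 12 and Q3 = 14, the upper bound is 20 at the default multiplier and 17 at 1.5, so the value 19 is removed only in the second call.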
Similar Components
AI-powered semantic similarity - components with related functionality:
- function remove_outliers_iqr 97.3% similar
- function remove_outliers 95.8% similar
- function detect_outliers_iqr_v2 87.7% similar
- function detect_outliers_iqr 86.3% similar
- function detect_outliers_iqr_v1 83.2% similar