function remove_outliers
Removes outliers from a pandas DataFrame based on the Interquartile Range (IQR) method for a specified column.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/f5da873e-41e6-4f34-b3e4-f7443d4d213b/analysis_5.py
26 - 32
simple
Purpose
This function filters statistical outliers out of a dataset using the IQR method, which relies on quartiles rather than the mean and standard deviation and is therefore less sensitive to the extreme values it is trying to detect. It calculates the first quartile (Q1), the third quartile (Q3), and the IQR (Q3 - Q1), then keeps only the rows whose value in the specified column lies between Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, inclusive. This is commonly used in data preprocessing and exploratory data analysis to remove extreme values that may skew summary statistics or model fits.
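The bound calculation described above can be worked through by hand on a small series. The following is a minimal sketch, assuming an illustrative set of values that is not taken from the original script; it relies on the default linear interpolation of `Series.quantile`.

```python
import pandas as pd

# Illustrative values (an assumption for this sketch); 100 and 200 are the extremes.
values = pd.Series([10, 12, 13, 14, 15, 16, 17, 18, 100, 200])

q1 = values.quantile(0.25)    # 13.25 with the default linear interpolation
q3 = values.quantile(0.75)    # 17.75
iqr = q3 - q1                 # 4.5

lower_bound = q1 - 1.5 * iqr  # 6.5
upper_bound = q3 + 1.5 * iqr  # 24.5

# Keep only the values inside the inclusive [lower_bound, upper_bound] range.
inside = values[(values >= lower_bound) & (values <= upper_bound)]
print(lower_bound, upper_bound)  # 6.5 24.5
print(inside.tolist())           # [10, 12, 13, 14, 15, 16, 17, 18]
```

With these values, the range works out to 6.5 through 24.5, so only 100 and 200 are excluded.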
Source Code
def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| df | - | - | positional_or_keyword |
| column | - | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame containing the data to be filtered. Must be a valid DataFrame object with at least one numeric column.
column: String representing the name of the column in the DataFrame to check for outliers. The column must exist in the DataFrame and should contain numeric data (int or float) for quantile calculations to work properly.
Return Value
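Returns a filtered pandas DataFrame containing only the rows where the specified column's values fall within the acceptable range (between lower_bound and upper_bound, inclusive). The returned DataFrame has the same columns as the input and keeps the original index labels of the surviving rows, but may have fewer rows. Rows whose value in the column is NaN are also dropped, because NaN fails both comparisons. If no rows are excluded, the result is a new DataFrame with the same rows as the input, not the input object itself.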
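The return behaviour can be checked with a short sketch. The function body below is copied from the Source Code section; the sample data, including the NaN entry, is an assumption made for illustration.

```python
import pandas as pd


def remove_outliers(df, column):
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    IQR = Q3 - Q1
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    return df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]


# Illustrative data: one extreme value (100) and one missing value (NaN).
df = pd.DataFrame(
    {"values": [10, 12, 13, 14, 15, 16, 17, 18, 100, float("nan")]}
)

cleaned = remove_outliers(df, "values")

print(len(df))                  # 10: the input DataFrame is not modified
print(cleaned.index.tolist())   # [0, 1, 2, 3, 4, 5, 6, 7]: original labels kept
print(cleaned["values"].isna().any())  # False: the NaN row was dropped too
```

If a fresh 0-based index is needed after filtering, `cleaned.reset_index(drop=True)` is one option.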
Dependencies
pandas
Required Imports
import pandas as pd
Usage Example
import pandas as pd
# Create sample data with outliers
data = {'values': [10, 12, 13, 14, 15, 16, 17, 18, 100, 200],
        'category': ['A', 'B', 'A', 'B', 'A', 'B', 'A', 'B', 'A', 'B']}
df = pd.DataFrame(data)
# Remove outliers from 'values' column
df_cleaned = remove_outliers(df, 'values')
print(f"Original shape: {df.shape}")
print(f"Cleaned shape: {df_cleaned.shape}")
print(df_cleaned)
Best Practices
- Ensure the specified column contains numeric data before calling this function to avoid errors
- Be aware that the returned DataFrame may be significantly smaller than the input if many rows fall outside the bounds
- The 1.5 * IQR multiplier is a standard threshold, but consider creating a parameterized version if different sensitivity levels are needed
- Always inspect the data before and after outlier removal to understand the impact on your dataset
- The bounds are placed symmetrically around the quartiles, which suits roughly symmetric distributions; for highly skewed data the rule can flag much of the long tail, so consider alternative outlier detection methods
- The function returns a new DataFrame and does not modify the input in place, so assign the result to a variable to keep it
- `quantile` skips missing values (NaN) by default, so they do not affect the bounds, but rows with NaN in the column are dropped by the filter; handle them beforehand if those rows should be kept
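Two of the suggestions above, checking that the column is numeric and parameterizing the multiplier, could be combined as follows. This is a sketch rather than part of the original script: the name `remove_outliers_k` and the parameter `k` are hypothetical, and the choice of exceptions is an assumption.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def remove_outliers_k(df, column, k=1.5):
    """Hypothetical variant of remove_outliers with an adjustable IQR multiplier."""
    if column not in df.columns:
        raise KeyError(f"Column {column!r} not found in DataFrame")
    if not is_numeric_dtype(df[column]):
        raise TypeError(f"Column {column!r} must be numeric")

    q1 = df[column].quantile(0.25)
    q3 = df[column].quantile(0.75)
    iqr = q3 - q1
    # between() is inclusive on both ends by default, matching >= and <=.
    mask = df[column].between(q1 - k * iqr, q3 + k * iqr)
    return df[mask]


df = pd.DataFrame({
    "values": [10, 12, 13, 14, 15, 16, 17, 18, 100, 200],
    "category": ["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"],
})

print(remove_outliers_k(df, "values").shape)        # (8, 2)
print(remove_outliers_k(df, "values", k=50).shape)  # (10, 2): very wide bounds keep every row
```

A larger `k` widens the accepted range and removes fewer rows; a smaller `k` does the opposite.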
Similar Components
AI-powered semantic similarity - components with related functionality:
- function remove_outliers_iqr_v1 (95.8% similar)
- function remove_outliers_iqr (95.1% similar)
- function detect_outliers_iqr_v2 (88.5% similar)
- function detect_outliers_iqr (86.8% similar)
- function detect_outliers_iqr_v1 (85.3% similar)