function explore_data
Performs comprehensive exploratory data analysis on a pandas DataFrame, printing dataset overview, data types, missing values, descriptive statistics, and identifying categorical and numerical variables.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/5a059cb7-3903-4020-8519-14198d1f39c9/analysis_1.py
72 - 97
simple
Purpose
This function serves as an initial data exploration tool for data science workflows. It provides a quick overview of a dataset's structure, quality, and composition by displaying key information about the DataFrame including sample rows, data types, missing value counts, statistical summaries, and automatically categorizing variables into categorical and numerical types. This is typically used as a first step in data analysis pipelines to understand the dataset before preprocessing or modeling.
Source Code
def explore_data(df):
    """Perform initial data exploration"""
    print("\n" + "="*80)
    print("DATA EXPLORATION")
    print("="*80)
    print("\nDataset Overview:")
    print(df.head(10))
    print("\nData Types:")
    print(df.dtypes)
    print("\nMissing Values:")
    print(df.isnull().sum())
    print("\nDescriptive Statistics:")
    print(df.describe())
    # Identify variable types
    categorical_vars = df.select_dtypes(include=['object']).columns.tolist()
    numerical_vars = df.select_dtypes(include=[np.number]).columns.tolist()
    print(f"\nCategorical Variables: {categorical_vars}")
    print(f"\nNumerical Variables: {numerical_vars}")
    return categorical_vars, numerical_vars
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| df | - | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame object containing the dataset to be explored. Expected to be a valid DataFrame with any combination of numerical and categorical columns. No specific constraints on size or structure, though very large DataFrames may produce verbose output.
Return Value
Returns a tuple of two lists: (categorical_vars, numerical_vars). The first holds the names of columns with object dtype, which the function treats as categorical variables; the second holds the names of columns with numeric dtypes (int, float, etc.). Columns of any other dtype, such as category, bool, or datetime64, appear in neither list. Entries are the column labels of the input DataFrame, which are usually strings.
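To make the dtype-based split concrete, the sketch below applies the same two select_dtypes calls used in the source code to a small mixed-dtype frame. The DataFrame and its column names are hypothetical and chosen only for illustration; the `unclassified` list is not part of explore_data and is added here to show which columns end up in neither returned list.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mixing several dtypes; column names are illustrative only.
df = pd.DataFrame({
    "dept": pd.Series(["HR", "IT", "HR"], dtype=object),
    "grade": pd.Categorical(["a", "b", "a"]),
    "age": [25, 30, 35],
    "score": [1.5, 2.5, 3.5],
    "active": [True, False, True],
    "hired": pd.to_datetime(["2020-01-01", "2021-06-15", "2022-03-30"]),
})

# The same two selections that explore_data performs.
categorical_vars = df.select_dtypes(include=["object"]).columns.tolist()
numerical_vars = df.select_dtypes(include=[np.number]).columns.tolist()

# Not part of explore_data: columns that land in neither list.
classified = set(categorical_vars) | set(numerical_vars)
unclassified = [col for col in df.columns if col not in classified]

print("categorical:", categorical_vars)
print("numerical:", numerical_vars)
print("unclassified:", unclassified)
```

If columns such as `grade` or `active` matter for later steps, one option is to add them to the appropriate list manually after calling the function.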
Dependencies
- pandas
- numpy
Required Imports
import pandas as pd
import numpy as np
Usage Example
import pandas as pd
import numpy as np
# Create sample DataFrame
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['HR', 'IT', 'Sales', 'IT', 'HR'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
})
# Explore the data
categorical_vars, numerical_vars = explore_data(df)
# Use the returned variables
print(f"Found {len(categorical_vars)} categorical variables")
print(f"Found {len(numerical_vars)} numerical variables")
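As one possible follow-up to the usage example, the returned lists can drive type-specific summaries. The sketch below is an assumption about how a caller might use them, not something explore_data does itself: it builds the lists with the same select_dtypes calls so that it runs on its own, then computes value counts for categorical columns and a correlation matrix for numerical ones. The names `category_counts` and `correlations` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "salary": [50000, 60000, 70000, 80000, 90000],
    "department": pd.Series(["HR", "IT", "Sales", "IT", "HR"], dtype=object),
})

# Stand-ins for the lists that explore_data(df) would return for this frame.
categorical_vars = df.select_dtypes(include=["object"]).columns.tolist()
numerical_vars = df.select_dtypes(include=[np.number]).columns.tolist()

# Frequency table for each categorical column.
category_counts = {col: df[col].value_counts() for col in categorical_vars}

# Pairwise Pearson correlation between the numerical columns.
correlations = df[numerical_vars].corr()

print(category_counts["department"])
print(correlations)
```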
Best Practices
- This function prints output directly to console, so it's best used in interactive environments (Jupyter notebooks, scripts) rather than production pipelines
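- The function always prints 10 rows via df.head(10), so output length grows mainly with the number of columns; for very wide DataFrames, consider passing a column subset (for example, df[selected_columns]) to reduce output verbosity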
- The function assumes standard pandas DataFrame structure; ensure your data is loaded as a DataFrame before calling
- The categorical/numerical variable identification is based on dtype only; consider manual verification for edge cases like numeric IDs stored as integers
- This function does not modify the input DataFrame, making it safe to use without side effects
- Consider capturing the output or redirecting stdout if you need to log the exploration results to a file
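The last point above can be sketched with the standard library's contextlib.redirect_stdout. To keep the example runnable on its own, it defines a condensed stand-in for explore_data that prints fewer headers but returns the same two lists; the sample DataFrame and the log file name in the final comment are hypothetical.

```python
import contextlib
import io

import numpy as np
import pandas as pd


def explore_data(df):
    """Condensed stand-in for the documented function (same return value)."""
    print("DATA EXPLORATION")
    print(df.head(10))
    print(df.dtypes)
    print(df.isnull().sum())
    print(df.describe())
    categorical_vars = df.select_dtypes(include=["object"]).columns.tolist()
    numerical_vars = df.select_dtypes(include=[np.number]).columns.tolist()
    return categorical_vars, numerical_vars


df = pd.DataFrame({
    "age": [25, 30, 35],
    "department": pd.Series(["HR", "IT", "Sales"], dtype=object),
})

# Everything printed inside the with-block goes to the buffer, not the console.
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    categorical_vars, numerical_vars = explore_data(df)

report = buffer.getvalue()
# The captured text could then be saved, for example:
# pathlib.Path("exploration.log").write_text(report)
```

Capturing into a string keeps the function itself unchanged, which may be preferable to editing its print calls when the same script is also run interactively.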
Similar Components
AI-powered semantic similarity - components with related functionality:
- function load_analysis_data 59.8% similar
- function main_v56 56.0% similar
- function compare_datasets 54.6% similar
- function load_dataset 53.9% similar
- function load_data 53.4% similar