function explore_data

Maturity: 42

Performs exploratory data analysis on a pandas DataFrame: prints the first 10 rows, data types, missing-value counts, and descriptive statistics, then returns the names of the categorical (object-dtype) and numerical columns.

File: /tf/active/vicechatdev/vice_ai/smartstat_scripts/5a059cb7-3903-4020-8519-14198d1f39c9/analysis_1.py
Lines: 72 - 97
Complexity: simple

Purpose

This function is an initial data exploration tool for data science workflows. It gives a quick overview of a dataset's structure, quality, and composition by printing the first 10 rows, the dtype of each column, missing-value counts per column, and summary statistics, and it sorts column names into categorical and numerical groups based on dtype. It is typically the first step in an analysis pipeline, run before preprocessing or modeling.

Source Code

def explore_data(df):
    """Perform initial data exploration"""
    print("\n" + "="*80)
    print("DATA EXPLORATION")
    print("="*80)
    
    print("\nDataset Overview:")
    print(df.head(10))
    
    print("\nData Types:")
    print(df.dtypes)
    
    print("\nMissing Values:")
    print(df.isnull().sum())
    
    print("\nDescriptive Statistics:")
    print(df.describe())
    
    # Identify variable types
    categorical_vars = df.select_dtypes(include=['object']).columns.tolist()
    numerical_vars = df.select_dtypes(include=[np.number]).columns.tolist()
    
    print(f"\nCategorical Variables: {categorical_vars}")
    print(f"\nNumerical Variables: {numerical_vars}")
    
    return categorical_vars, numerical_vars

Parameters

Name  Type  Default  Kind
df    -     -        positional_or_keyword

Parameter Details

df: A pandas DataFrame containing the dataset to be explored. It may hold any combination of numerical and categorical columns. There are no constraints on size or structure, though DataFrames with many columns produce long output because the dtype and missing-value listings print one entry per column.

Return Value

Returns a tuple of two lists: (categorical_vars, numerical_vars). The first list holds the names of columns with object dtype, which the function treats as categorical variables. The second holds the names of columns with numerical dtypes (int, float, etc.). Both lists contain column names from the input DataFrame, in column order. Columns of any other dtype, such as bool, category, or datetime64, appear in neither list.
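To make the dtype rule concrete, the sketch below applies the same two select_dtypes calls used in the source to a small DataFrame with mixed dtypes. The column names and values are made up for illustration, and the string column is given dtype=object explicitly because how plain strings are inferred can depend on the pandas version.

```python
import numpy as np
import pandas as pd

# Illustrative frame; every column name and value here is hypothetical
df = pd.DataFrame({
    'name': pd.Series(['a', 'b', 'c'], dtype=object),
    'count': [1, 2, 3],
    'ratio': [0.1, 0.2, 0.3],
    'flag': [True, False, True],
    'grade': pd.Categorical(['x', 'y', 'x']),
    'when': pd.to_datetime(['2024-01-01', '2024-01-02', '2024-01-03']),
})

# Same selection logic as explore_data
categorical_vars = df.select_dtypes(include=['object']).columns.tolist()
numerical_vars = df.select_dtypes(include=[np.number]).columns.tolist()

# Columns that fall into neither list
unclassified = [c for c in df.columns
                if c not in categorical_vars + numerical_vars]

print(categorical_vars)  # ['name']
print(numerical_vars)    # ['count', 'ratio']
print(unclassified)      # ['flag', 'grade', 'when']
```

If bool or category columns should count as categorical in your analysis, one option is to check for unclassified columns like this after calling the function and assign them by hand.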

Dependencies

  • pandas
  • numpy

Required Imports

import pandas as pd
import numpy as np

Usage Example

import pandas as pd
import numpy as np

# Create sample DataFrame
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['HR', 'IT', 'Sales', 'IT', 'HR'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
})

# Explore the data
categorical_vars, numerical_vars = explore_data(df)

# Use the returned variables
print(f"Found {len(categorical_vars)} categorical variables")
print(f"Found {len(numerical_vars)} numerical variables")
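The returned lists can also drive follow-up steps rather than just being counted. The sketch below is one possible continuation, not part of the original script: it builds a frequency table for each categorical column and a correlation matrix for the numerical ones. So that the block runs on its own, the two lists are written out by hand with the values explore_data is expected to return for the sample DataFrame above.

```python
import pandas as pd

# Same sample DataFrame as in the usage example
df = pd.DataFrame({
    'age': [25, 30, 35, 40, 45],
    'salary': [50000, 60000, 70000, 80000, 90000],
    'department': ['HR', 'IT', 'Sales', 'IT', 'HR'],
    'city': ['NYC', 'LA', 'Chicago', 'NYC', 'LA']
})

# Assumed here: the lists explore_data(df) returns for this frame
categorical_vars = ['department', 'city']
numerical_vars = ['age', 'salary']

# Frequency table for each categorical column
category_counts = {col: df[col].value_counts() for col in categorical_vars}

# Pairwise Pearson correlation between the numerical columns
correlations = df[numerical_vars].corr()

print(category_counts['department'])
print(correlations)
```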

Best Practices

  • This function prints directly to the console, so it suits interactive sessions and ad hoc scripts (for example, Jupyter notebooks) better than production pipelines
  • The function always prints the first 10 rows, so output length depends on the number of columns rather than the number of rows; for wide DataFrames, pass a subset of columns (for example, df[['age', 'salary']]) to reduce verbosity
  • The function does not validate its input; ensure your data is loaded as a pandas DataFrame before calling
  • The categorical/numerical identification is based on dtype only; verify edge cases manually, such as numeric IDs or integer-encoded categories, which are reported as numerical
  • This function does not modify the input DataFrame, making it safe to use without side effects
  • Consider capturing the output or redirecting stdout if you need to log the exploration results to a file
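As a minimal sketch of the last point, the standard library's contextlib.redirect_stdout can collect everything the function prints into a string, which can then be written to a file. The explore_data below is an abbreviated stand-in for the documented function so that the block runs on its own, and the report file name is a hypothetical choice.

```python
import contextlib
import io
import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def explore_data(df):
    """Abbreviated stand-in for the documented function."""
    print("\n" + "=" * 80)
    print("DATA EXPLORATION")
    print("=" * 80)
    print("\nMissing Values:")
    print(df.isnull().sum())
    categorical_vars = df.select_dtypes(include=['object']).columns.tolist()
    numerical_vars = df.select_dtypes(include=[np.number]).columns.tolist()
    return categorical_vars, numerical_vars


# Small illustrative frame with one missing value
df = pd.DataFrame({
    'age': [25, 30, None],
    'department': pd.Series(['HR', 'IT', 'HR'], dtype=object),
})

# Redirect everything printed inside the block into an in-memory buffer
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    categorical_vars, numerical_vars = explore_data(df)
report = buffer.getvalue()

# Hypothetical log location; adjust to your own logging setup
log_path = Path(tempfile.gettempdir()) / "exploration_report.txt"
log_path.write_text(report, encoding="utf-8")
```

The return values are still available after the with block, so the same call serves both logging and downstream use.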

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function load_analysis_data 59.8% similar

    Loads CSV dataset(s) into pandas DataFrames based on dataset configuration, supporting both single dataset loading and comparison mode with two datasets.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function main_v56 56.0% similar

    Performs comprehensive exploratory data analysis on a broiler chicken performance dataset, analyzing the correlation between Eimeria infection and performance measures (weight gain, feed conversion ratio, mortality rate) across different treatments and challenge regimens.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/343f5578-64e0-4101-84bd-5824b3c15deb/project_1/analysis.py
  • function compare_datasets 54.6% similar

    Analyzes and compares two pandas DataFrames containing flock data (original vs cleaned), printing detailed statistics about removed records, type distributions, and impact assessment.

    From: /tf/active/vicechatdev/data_quality_dashboard.py
  • function load_dataset 53.9% similar

    Loads a CSV dataset from a specified file path using pandas and returns it as a DataFrame with error handling for file not found and general exceptions.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/e1ecec5f-4ea5-49c5-b4f5-d051ce851294/project_1/analysis.py
  • function load_data 53.4% similar

    Loads a CSV dataset from a specified filepath using pandas, with fallback to creating sample data if the file is not found.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/5a059cb7-3903-4020-8519-14198d1f39c9/analysis_1.py