
function create_sample_data_v1

Maturity: 46

Generates a reproducible synthetic dataset of 200 rows (group labels, measurements, quality scores, temperature, humidity, daily dates, and pass/fail outcomes), saves it to a CSV file, and returns the file path together with the DataFrame.

File: /tf/active/vicechatdev/full_smartstat/demo.py
Lines: 24 - 70
Complexity: simple

Purpose

Creates demonstration data for testing statistical analysis workflows. The function generates a structured dataset with three groups (Group_A, Group_B, Group_C) that have different baseline characteristics, along with associated measurements, quality scores, temperature, humidity, dates, and pass/fail outcomes. This is useful for testing data analysis pipelines, statistical models, and visualization tools without requiring real data.

Source Code

def create_sample_data():
    """Create sample dataset for demonstration"""
    np.random.seed(42)
    
    # Generate sample data
    n_samples = 200
    
    # Groups: repeat the three labels, then assign the remainder (200 % 3 = 2 rows) to Group_A
    groups = ['Group_A', 'Group_B', 'Group_C'] * (n_samples // 3) + ['Group_A'] * (n_samples % 3)
    
    # Measurements with group effects
    measurements = []
    quality_scores = []
    
    for group in groups:
        if group == 'Group_A':
            base_value = 100
            quality_base = 85
        elif group == 'Group_B':
            base_value = 105
            quality_base = 90
        else:  # Group_C
            base_value = 95
            quality_base = 80
        
        measurements.append(base_value + np.random.normal(0, 10))
        quality_scores.append(quality_base + np.random.normal(0, 5))
    
    # Create DataFrame
    df = pd.DataFrame({
        'Group': groups,
        'Measurement': measurements,
        'Quality_Score': quality_scores,
        'Temperature': np.random.normal(25, 3, n_samples),
        'Humidity': np.random.normal(60, 8, n_samples),
        'Date': pd.date_range('2024-01-01', periods=n_samples, freq='D'),
        'Pass_Fail': np.random.choice(['Pass', 'Fail'], n_samples, p=[0.85, 0.15])
    })
    
    # Save to CSV
    csv_path = '/tf/active/smartstat/demo_data.csv'
    df.to_csv(csv_path, index=False)
    print(f"✅ Sample data created: {csv_path}")
    print(f"   - {len(df)} rows, {len(df.columns)} columns")
    print(f"   - Groups: {df['Group'].value_counts().to_dict()}")
    
    return csv_path, df

Return Value

Returns a tuple containing two elements: (1) a string with the file path where the CSV was saved ('/tf/active/smartstat/demo_data.csv'), and (2) a pandas DataFrame containing the generated sample data with columns: Group, Measurement, Quality_Score, Temperature, Humidity, Date, and Pass_Fail. The DataFrame has 200 rows: 68 in Group_A and 66 each in Group_B and Group_C. Measurement, Quality_Score, Temperature, and Humidity are normally distributed, Date runs daily from 2024-01-01, and Pass_Fail is either 'Pass' or 'Fail'.
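
The group split follows from how the group list is built in the source. A minimal sketch of that arithmetic, using only the standard library (the variable name labels is introduced here for illustration and does not appear in demo.py):

```python
from collections import Counter

n_samples = 200
labels = ['Group_A', 'Group_B', 'Group_C']

# Same construction as in the source: repeat the three labels, then pad the
# remainder (n_samples % 3) with 'Group_A' only.
groups = labels * (n_samples // 3) + ['Group_A'] * (n_samples % 3)

counts = Counter(groups)
print(dict(counts))  # {'Group_A': 68, 'Group_B': 66, 'Group_C': 66}
```

Because 200 is not divisible by 3, the two leftover rows both land in Group_A.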

Dependencies

  • numpy
  • pandas

Required Imports

import numpy as np
import pandas as pd

Usage Example

import numpy as np
import pandas as pd

def create_sample_data():
    """Create sample dataset for demonstration"""
    np.random.seed(42)
    n_samples = 200
    groups = ['Group_A', 'Group_B', 'Group_C'] * (n_samples // 3) + ['Group_A'] * (n_samples % 3)
    measurements = []
    quality_scores = []
    for group in groups:
        if group == 'Group_A':
            base_value = 100
            quality_base = 85
        elif group == 'Group_B':
            base_value = 105
            quality_base = 90
        else:
            base_value = 95
            quality_base = 80
        measurements.append(base_value + np.random.normal(0, 10))
        quality_scores.append(quality_base + np.random.normal(0, 5))
    df = pd.DataFrame({
        'Group': groups,
        'Measurement': measurements,
        'Quality_Score': quality_scores,
        'Temperature': np.random.normal(25, 3, n_samples),
        'Humidity': np.random.normal(60, 8, n_samples),
        'Date': pd.date_range('2024-01-01', periods=n_samples, freq='D'),
        'Pass_Fail': np.random.choice(['Pass', 'Fail'], n_samples, p=[0.85, 0.15])
    })
    csv_path = '/tf/active/smartstat/demo_data.csv'
    df.to_csv(csv_path, index=False)
    return csv_path, df

# Usage
file_path, data = create_sample_data()
print(f"Data saved to: {file_path}")
print(data.head())
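
Since the function also writes the data to disk, a later step may reload the CSV rather than use the returned DataFrame. CSV does not store column types, so the Date column comes back as text unless it is parsed explicitly. The sketch below illustrates this on a three-row stand-in frame and an in-memory buffer (both are assumptions for illustration, not part of demo.py):

```python
import io

import pandas as pd

# Tiny stand-in for the generated frame (only two of the seven columns).
df = pd.DataFrame({
    'Group': ['Group_A', 'Group_B', 'Group_C'],
    'Date': pd.date_range('2024-01-01', periods=3, freq='D'),
})

# Write to an in-memory buffer instead of the hard-coded csv_path.
buffer = io.StringIO()
df.to_csv(buffer, index=False)

# Without parse_dates, the Date column is read back as text.
buffer.seek(0)
raw = pd.read_csv(buffer)

# With parse_dates, the datetime dtype is restored.
buffer.seek(0)
parsed = pd.read_csv(buffer, parse_dates=['Date'])

print(raw['Date'].dtype, parsed['Date'].dtype)
```

The same parse_dates=['Date'] argument should apply when reading the file at the returned path.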

Best Practices

  • The function calls np.random.seed(42), so it generates the same data on each run. Note that this reseeds NumPy's global random state, which also affects any later np.random calls in the calling code
  • Ensure the target directory '/tf/active/smartstat/' exists before calling this function, or modify the csv_path to a valid location
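  • The function creates 200 samples with an uneven group split: 68 samples for Group_A and 66 each for Group_B and Group_C, because the two remainder rows (200 % 3) are both assigned to Group_A
  • Group_A has a baseline measurement of 100 and a baseline quality score of 85, Group_B has 105/90, and Group_C has 95/80
  • Measurement and Quality_Score add zero-mean normal noise (standard deviation 10 and 5, respectively) to the group baselines; Temperature has mean 25 and standard deviation 3, and Humidity has mean 60 and standard deviation 8
  • Each Pass_Fail value is 'Pass' with probability 0.85 and 'Fail' with probability 0.15, drawn independently of the group and of the other columns, so the observed pass rate is only approximately 85%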
  • Consider parameterizing the file path, number of samples, and group characteristics for more flexible reuse
  • The function prints status information to stdout, which may need to be suppressed in production environments
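
The last two suggestions, together with the directory check, could be combined along the following lines. This is only a sketch under stated assumptions: make_sample_data, DEFAULT_GROUPS, and the parameters n_samples, csv_path, seed, group_params, and verbose are hypothetical names that do not exist in demo.py, and the sketch uses np.random.default_rng instead of np.random.seed, so its values differ from those of the original function even with seed 42.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical defaults mirroring the baselines in the original function:
# group -> (measurement mean, quality-score mean).
DEFAULT_GROUPS = {
    'Group_A': (100, 85),
    'Group_B': (105, 90),
    'Group_C': (95, 80),
}


def make_sample_data(n_samples=200, csv_path=None, seed=42,
                     group_params=None, verbose=False):
    """Hypothetical parameterized variant of create_sample_data."""
    params = DEFAULT_GROUPS if group_params is None else group_params
    names = list(params)
    rng = np.random.default_rng(seed)  # local generator, no global seeding

    # Same split rule as the original: remainder rows go to the first group.
    groups = (names * (n_samples // len(names))
              + [names[0]] * (n_samples % len(names)))
    meas_mean = np.array([params[g][0] for g in groups], dtype=float)
    qual_mean = np.array([params[g][1] for g in groups], dtype=float)

    df = pd.DataFrame({
        'Group': groups,
        'Measurement': meas_mean + rng.normal(0, 10, n_samples),
        'Quality_Score': qual_mean + rng.normal(0, 5, n_samples),
        'Temperature': rng.normal(25, 3, n_samples),
        'Humidity': rng.normal(60, 8, n_samples),
        'Date': pd.date_range('2024-01-01', periods=n_samples, freq='D'),
        'Pass_Fail': rng.choice(['Pass', 'Fail'], n_samples, p=[0.85, 0.15]),
    })

    if csv_path is not None:
        path = Path(csv_path)
        # Create the target directory if it is missing, then write the CSV.
        path.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(path, index=False)
    if verbose:
        print(f"Sample data created: {len(df)} rows, {len(df.columns)} columns")
    return csv_path, df


# Usage: build the frame in memory only, without writing or printing.
_, data = make_sample_data()
```

Keeping the CSV write optional and the status output behind a verbose flag lets the same function serve both interactive demos and automated tests.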

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function create_test_dataset 72.9% similar

    Creates a test CSV dataset with sample product sales data across different regions and months, saving it to a temporary file.

    From: /tf/active/vicechatdev/vice_ai/test_integration.py
  • function create_sample_data_v2 70.7% similar

    Generates a synthetic dataset of 200 poultry research records with multiple treatment groups, challenge regimens, and performance metrics for demonstration purposes.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/5a059cb7-3903-4020-8519-14198d1f39c9/analysis_1.py
  • function demo_statistical_agent 56.6% similar

    Demonstrates the capabilities of a statistical agent by testing query interpretation on sample data with various statistical analysis queries.

    From: /tf/active/vicechatdev/full_smartstat/demo.py
  • function demo_analysis_workflow 54.4% similar

    Demonstrates a complete end-to-end statistical analysis workflow using the SmartStat system, including session creation, data loading, natural language query processing, analysis execution, and result interpretation.

    From: /tf/active/vicechatdev/full_smartstat/demo.py
  • function create_data_quality_dashboard_v1 53.1% similar

    Creates an interactive data quality dashboard for analyzing treatment timing issues in poultry flock management data by loading and processing CSV files containing timing anomalies.

    From: /tf/active/vicechatdev/data_quality_dashboard.py