function create_sample_data_v1
Generates a synthetic dataset with 200 samples containing group-based measurements, quality scores, environmental data, and temporal information, then saves it to a CSV file.
/tf/active/vicechatdev/full_smartstat/demo.py
24 - 70
simple
Purpose
Creates demonstration data for testing statistical analysis workflows. The function generates a structured dataset with three groups (Group_A, Group_B, Group_C) that have different baseline characteristics, along with associated measurements, quality scores, temperature, humidity, dates, and pass/fail outcomes. This is useful for testing data analysis pipelines, statistical models, and visualization tools without requiring real data.
Source Code
def create_sample_data():
    """Create sample dataset for demonstration"""
    np.random.seed(42)
    # Generate sample data
    n_samples = 200
    # Groups
    groups = ['Group_A', 'Group_B', 'Group_C'] * (n_samples // 3) + ['Group_A'] * (n_samples % 3)
    # Measurements with group effects
    measurements = []
    quality_scores = []
    for group in groups:
        if group == 'Group_A':
            base_value = 100
            quality_base = 85
        elif group == 'Group_B':
            base_value = 105
            quality_base = 90
        else:  # Group_C
            base_value = 95
            quality_base = 80
        measurements.append(base_value + np.random.normal(0, 10))
        quality_scores.append(quality_base + np.random.normal(0, 5))
    # Create DataFrame
    df = pd.DataFrame({
        'Group': groups,
        'Measurement': measurements,
        'Quality_Score': quality_scores,
        'Temperature': np.random.normal(25, 3, n_samples),
        'Humidity': np.random.normal(60, 8, n_samples),
        'Date': pd.date_range('2024-01-01', periods=n_samples, freq='D'),
        'Pass_Fail': np.random.choice(['Pass', 'Fail'], n_samples, p=[0.85, 0.15])
    })
    # Save to CSV
    csv_path = '/tf/active/smartstat/demo_data.csv'
    df.to_csv(csv_path, index=False)
    print(f"✅ Sample data created: {csv_path}")
    print(f" - {len(df)} rows, {len(df.columns)} columns")
    print(f" - Groups: {df['Group'].value_counts().to_dict()}")
    return csv_path, df
Return Value
Returns a tuple of two elements: (1) a string with the file path where the CSV was saved ('/tf/active/smartstat/demo_data.csv'), and (2) a pandas DataFrame containing the generated sample data with columns Group, Measurement, Quality_Score, Temperature, Humidity, Date, and Pass_Fail. The DataFrame has 200 rows: 68 for Group_A and 66 each for Group_B and Group_C. The four numeric columns hold normally distributed values, Date holds consecutive daily timestamps starting at 2024-01-01, and Pass_Fail holds the strings 'Pass' or 'Fail'.
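The group sizes follow directly from how the group list is built: the three labels are repeated 200 // 3 = 66 times, and the 200 % 3 = 2 leftover rows are all assigned to Group_A. A minimal check using only the standard library, with the list construction copied from the source (the variable name group_counts is introduced here for illustration):

```python
from collections import Counter

n_samples = 200

# Same construction as in create_sample_data(): repeat the three labels
# n_samples // 3 times, then pad the remainder with 'Group_A'.
groups = ['Group_A', 'Group_B', 'Group_C'] * (n_samples // 3) + ['Group_A'] * (n_samples % 3)

group_counts = dict(Counter(groups))
print(group_counts)  # {'Group_A': 68, 'Group_B': 66, 'Group_C': 66}
```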
Dependencies
- numpy
- pandas
Required Imports
import numpy as np
import pandas as pd
Usage Example
import numpy as np
import pandas as pd
def create_sample_data():
    """Create sample dataset for demonstration"""
    np.random.seed(42)
    n_samples = 200
    groups = ['Group_A', 'Group_B', 'Group_C'] * (n_samples // 3) + ['Group_A'] * (n_samples % 3)
    measurements = []
    quality_scores = []
    for group in groups:
        if group == 'Group_A':
            base_value = 100
            quality_base = 85
        elif group == 'Group_B':
            base_value = 105
            quality_base = 90
        else:
            base_value = 95
            quality_base = 80
        measurements.append(base_value + np.random.normal(0, 10))
        quality_scores.append(quality_base + np.random.normal(0, 5))
    df = pd.DataFrame({
        'Group': groups,
        'Measurement': measurements,
        'Quality_Score': quality_scores,
        'Temperature': np.random.normal(25, 3, n_samples),
        'Humidity': np.random.normal(60, 8, n_samples),
        'Date': pd.date_range('2024-01-01', periods=n_samples, freq='D'),
        'Pass_Fail': np.random.choice(['Pass', 'Fail'], n_samples, p=[0.85, 0.15])
    })
    csv_path = '/tf/active/smartstat/demo_data.csv'
    df.to_csv(csv_path, index=False)
    return csv_path, df
# Usage
file_path, data = create_sample_data()
print(f"Data saved to: {file_path}")
print(data.head())
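Since the dataset is meant to exercise group-based analyses, a quick per-group summary is a natural first check on the returned DataFrame. The sketch below is one possible way to do it; the helper name summarize_by_group and the tiny hand-made frame are assumptions for illustration and are not part of demo.py. With the real data, the call would be summarize_by_group(data).

```python
import pandas as pd


def summarize_by_group(df: pd.DataFrame) -> pd.DataFrame:
    """Per-group row count and means of the two group-dependent columns.

    Hypothetical helper; expects the Group, Measurement and Quality_Score
    columns produced by create_sample_data().
    """
    return df.groupby('Group').agg(
        n=('Measurement', 'size'),
        mean_measurement=('Measurement', 'mean'),
        mean_quality=('Quality_Score', 'mean'),
    )


# Tiny hand-made frame with the same column names, for illustration only.
toy = pd.DataFrame({
    'Group': ['Group_A', 'Group_A', 'Group_B'],
    'Measurement': [99.0, 101.0, 105.0],
    'Quality_Score': [84.0, 86.0, 90.0],
})

summary = summarize_by_group(toy)
print(summary)
```

On the generated data, the group means should land near the base values (100/85, 105/90, 95/80), though not exactly on them because of the added noise.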
Best Practices
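- The function uses a fixed random seed (42) for reproducibility, so the same data is generated on each run; note that np.random.seed(42) resets NumPy's global random state, which also affects any later np.random calls in the same process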
- Ensure the target directory '/tf/active/smartstat/' exists before calling this function, or modify the csv_path to a valid location
- The function creates 200 samples with a fixed group distribution: 68 samples for Group_A and 66 each for Group_B and Group_C, because the two rows left over after 200 // 3 = 66 full cycles are both assigned to Group_A
- Group_A has base measurement of 100 and quality score of 85, Group_B has 105/90, and Group_C has 95/80
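- Measurement and Quality_Score add normal noise (standard deviations 10 and 5) to the group base values; Temperature is drawn from a normal distribution with mean 25 and standard deviation 3, and Humidity with mean 60 and standard deviation 8
- The Pass_Fail column is sampled with an 85% pass probability, independently of all other columns, so the observed pass rate is close to but not exactly 85% and has no relationship to group or measurement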
- Consider parameterizing the file path, number of samples, and group characteristics for more flexible reuse
- The function prints status information to stdout, which may need to be suppressed in production environments
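The last few suggestions can be combined into a parameterized variant. The following is a minimal sketch, not the code in demo.py: the name make_sample_data, its parameters, and the DEFAULT_GROUP_BASES constant are all hypothetical. It uses a local np.random.default_rng generator instead of the global seed, so with seed 42 it is reproducible but does not produce the same values as the original function.

```python
from pathlib import Path

import numpy as np
import pandas as pd

# Hypothetical defaults mirroring the constants in the original function:
# group name -> (measurement base, quality-score base)
DEFAULT_GROUP_BASES = {'Group_A': (100, 85), 'Group_B': (105, 90), 'Group_C': (95, 80)}


def make_sample_data(csv_path=None, n_samples=200, group_bases=None, seed=42, verbose=False):
    """Hypothetical parameterized variant of create_sample_data()."""
    group_bases = group_bases or DEFAULT_GROUP_BASES
    rng = np.random.default_rng(seed)  # local generator; global NumPy state is untouched

    # As in the original, leftover rows all go to the first group.
    names = list(group_bases)
    k = len(names)
    groups = names * (n_samples // k) + names[:1] * (n_samples % k)
    meas_base = np.array([group_bases[g][0] for g in groups], dtype=float)
    qual_base = np.array([group_bases[g][1] for g in groups], dtype=float)

    df = pd.DataFrame({
        'Group': groups,
        'Measurement': meas_base + rng.normal(0, 10, n_samples),
        'Quality_Score': qual_base + rng.normal(0, 5, n_samples),
        'Temperature': rng.normal(25, 3, n_samples),
        'Humidity': rng.normal(60, 8, n_samples),
        'Date': pd.date_range('2024-01-01', periods=n_samples, freq='D'),
        'Pass_Fail': rng.choice(['Pass', 'Fail'], n_samples, p=[0.85, 0.15]),
    })

    # Writing the CSV is optional; the parent directory is created if missing.
    if csv_path is not None:
        csv_path = Path(csv_path)
        csv_path.parent.mkdir(parents=True, exist_ok=True)
        df.to_csv(csv_path, index=False)
        if verbose:
            print(f"Sample data created: {csv_path}")
    return csv_path, df


# In-memory usage: no file is written and nothing is printed.
_, demo_df = make_sample_data(n_samples=50)
print(demo_df['Group'].value_counts().to_dict())
```

Passing csv_path=None keeps the function free of side effects, which is convenient in tests; status output is opt-in through verbose.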
Similar Components
AI-powered semantic similarity - components with related functionality:
- function create_test_dataset (72.9% similar)
- function create_sample_data_v2 (70.7% similar)
- function demo_statistical_agent (56.6% similar)
- function demo_analysis_workflow (54.4% similar)
- function create_data_quality_dashboard_v1 (53.1% similar)