function create_sample_data_v2
Generates a synthetic dataset of 200 poultry research records with multiple treatment groups, challenge regimens, and performance metrics for demonstration purposes.
/tf/active/vicechatdev/vice_ai/smartstat_scripts/5a059cb7-3903-4020-8519-14198d1f39c9/analysis_1.py
36 - 66
simple
Purpose
Creates a reproducible sample dataset simulating a poultry coccidiosis challenge study with realistic correlations between challenge levels and bird performance metrics. Useful for testing data analysis pipelines, visualization tools, or statistical methods in veterinary/agricultural research contexts without requiring real experimental data.
Source Code
def create_sample_data():
    """Create sample dataset for demonstration"""
    np.random.seed(42)
    n = 200
    treatments = ['Control', 'Treatment_A', 'Treatment_B', 'Treatment_C']
    challenge_regimens = ['Non-challenged', 'Low_challenge', 'High_challenge']
    data = {
        'bird_id': range(1, n+1),
        'treatment': np.random.choice(treatments, n),
        'challenge_regimen': np.random.choice(challenge_regimens, n),
        'eimeria_oocyst_count': np.random.exponential(5000, n),
        'eimeria_lesion_score': np.random.randint(0, 5, n),
        'body_weight_gain': np.random.normal(2000, 300, n),
        'feed_conversion_ratio': np.random.normal(1.8, 0.3, n),
        'feed_intake': np.random.normal(3500, 400, n),
        'mortality_rate': np.random.uniform(0, 15, n),
        'weight_day21': np.random.normal(800, 150, n),
        'weight_day35': np.random.normal(2000, 300, n),
        'intestinal_health_score': np.random.randint(1, 11, n)
    }
    df = pd.DataFrame(data)
    # Add realistic correlations
    df.loc[df['challenge_regimen'] == 'High_challenge', 'eimeria_oocyst_count'] *= 2
    df.loc[df['challenge_regimen'] == 'High_challenge', 'body_weight_gain'] *= 0.8
    df.loc[df['challenge_regimen'] == 'High_challenge', 'feed_conversion_ratio'] *= 1.2
    return df
Return Value
Returns a pandas DataFrame with 200 rows and 12 columns:
- 'bird_id' (int, 1-200)
- 'treatment' (str, one of 4 treatment groups)
- 'challenge_regimen' (str, one of 3 challenge levels)
- 'eimeria_oocyst_count' (float, exponentially distributed with mean 5000, doubled for high challenge)
- 'eimeria_lesion_score' (int, 0-4)
- 'body_weight_gain' (float, normally distributed with mean 2000g and standard deviation 300g, reduced by 20% for high challenge)
- 'feed_conversion_ratio' (float, normally distributed with mean 1.8 and standard deviation 0.3, increased by 20% for high challenge)
- 'feed_intake' (float, normally distributed with mean 3500g and standard deviation 400g)
- 'mortality_rate' (float, uniformly distributed over 0-15%)
- 'weight_day21' (float, normally distributed with mean 800g and standard deviation 150g)
- 'weight_day35' (float, normally distributed with mean 2000g and standard deviation 300g)
- 'intestinal_health_score' (int, 1-10)
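The column contract described above can be checked mechanically before a frame is passed to downstream analysis code. The following is a minimal sketch, not part of the original script: the helper name check_sample_frame and the two-row demo frame are hypothetical, and the demo values are hand-written placeholders rather than output of create_sample_data().

```python
import pandas as pd

# Column names as listed in the Return Value section.
EXPECTED_COLUMNS = [
    'bird_id', 'treatment', 'challenge_regimen', 'eimeria_oocyst_count',
    'eimeria_lesion_score', 'body_weight_gain', 'feed_conversion_ratio',
    'feed_intake', 'mortality_rate', 'weight_day21', 'weight_day35',
    'intestinal_health_score',
]


def check_sample_frame(df: pd.DataFrame) -> list:
    """Hypothetical helper: return a list of problems (empty if none found)."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        # Range checks below need every column, so stop early.
        return [f"missing columns: {missing}"]

    problems = []
    # Bounded columns: inclusive ranges taken from the documented contract.
    bounds = {
        'eimeria_lesion_score': (0, 4),
        'intestinal_health_score': (1, 10),
        'mortality_rate': (0, 15),
    }
    for column, (low, high) in bounds.items():
        if not df[column].between(low, high).all():
            problems.append(f"{column} outside [{low}, {high}]")
    # An exponential draw (even after doubling) is never negative.
    if (df['eimeria_oocyst_count'] < 0).any():
        problems.append("eimeria_oocyst_count contains negative values")
    return problems


# Hand-written placeholder rows (assumption: illustrative values only).
demo = pd.DataFrame({
    'bird_id': [1, 2],
    'treatment': ['Control', 'Treatment_A'],
    'challenge_regimen': ['Non-challenged', 'High_challenge'],
    'eimeria_oocyst_count': [4200.0, 11000.0],
    'eimeria_lesion_score': [0, 4],
    'body_weight_gain': [2050.0, 1600.0],
    'feed_conversion_ratio': [1.75, 2.10],
    'feed_intake': [3400.0, 3600.0],
    'mortality_rate': [2.5, 9.0],
    'weight_day21': [810.0, 760.0],
    'weight_day35': [2010.0, 1900.0],
    'intestinal_health_score': [7, 3],
})

print(check_sample_frame(demo))
```

With the function from this page in scope, check_sample_frame(create_sample_data()) would be the intended call. The normally distributed columns are deliberately left unchecked, since a normal draw has no hard bounds.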
Dependencies
numpy
pandas
Required Imports
import numpy as np
import pandas as pd
Usage Example
import numpy as np
import pandas as pd

def create_sample_data():
    """Create sample dataset for demonstration"""
    np.random.seed(42)
    n = 200
    treatments = ['Control', 'Treatment_A', 'Treatment_B', 'Treatment_C']
    challenge_regimens = ['Non-challenged', 'Low_challenge', 'High_challenge']
    data = {
        'bird_id': range(1, n+1),
        'treatment': np.random.choice(treatments, n),
        'challenge_regimen': np.random.choice(challenge_regimens, n),
        'eimeria_oocyst_count': np.random.exponential(5000, n),
        'eimeria_lesion_score': np.random.randint(0, 5, n),
        'body_weight_gain': np.random.normal(2000, 300, n),
        'feed_conversion_ratio': np.random.normal(1.8, 0.3, n),
        'feed_intake': np.random.normal(3500, 400, n),
        'mortality_rate': np.random.uniform(0, 15, n),
        'weight_day21': np.random.normal(800, 150, n),
        'weight_day35': np.random.normal(2000, 300, n),
        'intestinal_health_score': np.random.randint(1, 11, n)
    }
    df = pd.DataFrame(data)
    df.loc[df['challenge_regimen'] == 'High_challenge', 'eimeria_oocyst_count'] *= 2
    df.loc[df['challenge_regimen'] == 'High_challenge', 'body_weight_gain'] *= 0.8
    df.loc[df['challenge_regimen'] == 'High_challenge', 'feed_conversion_ratio'] *= 1.2
    return df

# Generate sample data
df = create_sample_data()
print(df.head())
print(f"\nDataset shape: {df.shape}")
print(f"\nTreatment groups: {df['treatment'].unique()}")
print(f"Challenge regimens: {df['challenge_regimen'].unique()}")
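Beyond printing the head and shape, a quick way to confirm that the High_challenge adjustment is visible in the data is to compare group means. The sketch below is an illustration under stated assumptions: make_reduced_sample is a hypothetical, cut-down stand-in that generates only the columns touched by the adjustment (and uses np.random.default_rng, so its numbers differ from create_sample_data()), and challenge_effect_summary is likewise a hypothetical helper name.

```python
import numpy as np
import pandas as pd

METRICS = ['eimeria_oocyst_count', 'body_weight_gain', 'feed_conversion_ratio']


def make_reduced_sample(n: int = 200, seed: int = 42) -> pd.DataFrame:
    """Hypothetical stand-in: only the columns adjusted for High_challenge."""
    rng = np.random.default_rng(seed)
    regimens = ['Non-challenged', 'Low_challenge', 'High_challenge']
    df = pd.DataFrame({
        'challenge_regimen': rng.choice(regimens, n),
        'eimeria_oocyst_count': rng.exponential(5000, n),
        'body_weight_gain': rng.normal(2000, 300, n),
        'feed_conversion_ratio': rng.normal(1.8, 0.3, n),
    })
    # Same multipliers as the documented function.
    high = df['challenge_regimen'] == 'High_challenge'
    df.loc[high, 'eimeria_oocyst_count'] *= 2
    df.loc[high, 'body_weight_gain'] *= 0.8
    df.loc[high, 'feed_conversion_ratio'] *= 1.2
    return df


def challenge_effect_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Mean of each adjusted metric per challenge regimen."""
    return df.groupby('challenge_regimen')[METRICS].mean()


summary = challenge_effect_summary(make_reduced_sample())
print(summary.round(2))
```

The same challenge_effect_summary call should work unchanged on the full frame returned by create_sample_data(), since it only relies on the four columns used here.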
Best Practices
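- The function calls np.random.seed(42) for reproducibility, so the same data is generated on every call; note that this also resets NumPy's global random state for the calling code
- The function scales three metrics for High_challenge birds (oocyst count, weight gain, and feed conversion ratio), which gives statistical analysis pipelines a known effect to detect; Low_challenge birds are generated the same way as Non-challenged birds, and treatment group has no effect on any metric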
- Consider modifying the seed value if you need different random datasets for multiple test scenarios
- The dataset size (n=200) is hardcoded; consider parameterizing if you need different sample sizes
- High challenge birds show expected biological responses: increased oocyst counts, reduced weight gain, and poorer feed conversion
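- Numeric columns are drawn from plausible-looking ranges, but each column is sampled independently, so per-bird values (for example weight_day21, weight_day35, and body_weight_gain) are not mutually consistent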
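The suggestions above about the seed and the hardcoded sample size can be combined in one variant. This is a sketch of one possible approach, not the original implementation: the name create_sample_data_param and its n and seed parameters are hypothetical, and because it uses a local np.random.default_rng generator instead of np.random.seed, it does not reproduce the exact values of create_sample_data() even with seed=42.

```python
import numpy as np
import pandas as pd


def create_sample_data_param(n: int = 200, seed: int = 42) -> pd.DataFrame:
    """Hypothetical parameterized variant of create_sample_data()."""
    # A local generator leaves NumPy's global random state untouched.
    rng = np.random.default_rng(seed)
    treatments = ['Control', 'Treatment_A', 'Treatment_B', 'Treatment_C']
    challenge_regimens = ['Non-challenged', 'Low_challenge', 'High_challenge']
    df = pd.DataFrame({
        'bird_id': np.arange(1, n + 1),
        'treatment': rng.choice(treatments, n),
        'challenge_regimen': rng.choice(challenge_regimens, n),
        'eimeria_oocyst_count': rng.exponential(5000, n),
        'eimeria_lesion_score': rng.integers(0, 5, n),      # 0-4 inclusive
        'body_weight_gain': rng.normal(2000, 300, n),
        'feed_conversion_ratio': rng.normal(1.8, 0.3, n),
        'feed_intake': rng.normal(3500, 400, n),
        'mortality_rate': rng.uniform(0, 15, n),
        'weight_day21': rng.normal(800, 150, n),
        'weight_day35': rng.normal(2000, 300, n),
        'intestinal_health_score': rng.integers(1, 11, n),  # 1-10 inclusive
    })
    # Same High_challenge multipliers as the documented function.
    high = df['challenge_regimen'] == 'High_challenge'
    df.loc[high, 'eimeria_oocyst_count'] *= 2
    df.loc[high, 'body_weight_gain'] *= 0.8
    df.loc[high, 'feed_conversion_ratio'] *= 1.2
    return df


small = create_sample_data_param(n=50, seed=1)
print(small.shape)
```

Passing a different seed per test scenario then yields independent datasets, while repeating a seed reproduces the same frame.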
Similar Components
AI-powered semantic similarity - components with related functionality:
- function create_sample_data_v1 70.7% similar
- function main_v56 65.2% similar
- function create_data_quality_dashboard_v1 59.0% similar
- function create_data_quality_dashboard 58.3% similar
- function create_test_dataset 57.0% similar