function smart_read_csv
Automatically detects CSV file delimiters (comma, semicolon, tab) and handles regional decimal formats (European comma vs US/UK point) to reliably parse CSV files from different locales.
/tf/active/vicechatdev/vice_ai/smartstat_service.py
252 - 336
moderate
Purpose
This function provides robust CSV file reading that adapts to regional formatting differences. It samples the file to detect the delimiter type, identifies European decimal notation (comma as decimal separator), and converts data to a standardized pandas DataFrame. This is particularly useful when processing CSV files from unknown sources or international datasets where delimiter and decimal conventions vary.
Source Code
def smart_read_csv(file_path: str) -> pd.DataFrame:
"""
Smart CSV reader that automatically detects delimiter (comma, semicolon, tab)
and handles European decimal format (comma as decimal separator)
Handles regional differences:
- US/UK format: comma delimiter, point decimal (1,234.56)
- European format: semicolon delimiter, comma decimal (1.234,56 or 1234,56)
- Tab-delimited with either decimal format
"""
import csv
import re
try:
# Read sample to detect format
with open(file_path, 'r', encoding='utf-8') as f:
sample = f.read(8192) # Larger sample for better detection
# Detect delimiter
delimiter = ','
has_european_decimals = False
# Count potential delimiters in sample
comma_count = sample.count(',')
semicolon_count = sample.count(';')
tab_count = sample.count('\t')
# Look for European decimal pattern (numbers with comma as decimal separator)
# Pattern: digits, optional thousand separator (. or space), comma, 1-4 decimal digits
european_decimal_pattern = re.compile(r'\d+[.,]?\d*,\d{1,4}')
european_matches = len(european_decimal_pattern.findall(sample))
# Decision logic for delimiter and decimal format
if tab_count > 10: # Tab-delimited
delimiter = '\t'
logger.info(f"Detected tab delimiter")
# Check for European decimals in tab-delimited file
if european_matches > 5:
has_european_decimals = True
logger.info(f"Detected European decimal format (comma separator) in tab-delimited file")
elif semicolon_count > comma_count and semicolon_count > 10:
# Likely European format with semicolon delimiter
delimiter = ';'
has_european_decimals = True
logger.info(f"Detected semicolon delimiter (European format)")
else:
# Default to comma, but check if it's actually European with comma decimals
# This would be unusual (comma both as delimiter and decimal) but check anyway
delimiter = ','
# If we see patterns like "value1;value2;value3" it's probably semicolon-delimited
if semicolon_count > 5:
delimiter = ';'
has_european_decimals = True
logger.info(f"Detected semicolon delimiter based on frequency")
else:
logger.info(f"Using comma delimiter (default)")
# Read CSV file with detected delimiter
# Use dtype=str initially to preserve numbers for decimal detection
df = pd.read_csv(file_path, delimiter=delimiter, encoding='utf-8', dtype=str, keep_default_na=False)
# If we only got 1 column, try other delimiters
if len(df.columns) == 1 and delimiter == ',':
logger.warning(f"Only 1 column detected with comma, trying semicolon")
df = pd.read_csv(file_path, delimiter=';', encoding='utf-8', dtype=str, keep_default_na=False)
has_european_decimals = True
delimiter = ';'
if len(df.columns) == 1 and delimiter != '\t':
logger.warning(f"Still only 1 column, trying tab")
df = pd.read_csv(file_path, delimiter='\t', encoding='utf-8', dtype=str, keep_default_na=False)
delimiter = '\t'
# Convert European decimal format to standard format
df = convert_european_decimals(df)
# Restore NA handling
df = df.replace('', pd.NA)
logger.info(f"Successfully parsed CSV: {len(df)} rows × {len(df.columns)} columns (delimiter: '{delimiter}')")
return df
except Exception as e:
logger.error(f"Error reading CSV file {file_path}: {e}")
raise
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
file_path |
str | - | positional_or_keyword |
Parameter Details
file_path: String path to the CSV file to be read. Must be a valid file path accessible by the program. The file should be UTF-8 encoded and contain tabular data with consistent delimiters.
Return Value
Type: pd.DataFrame
Returns a pandas DataFrame containing the parsed CSV data with standardized decimal formatting (point as decimal separator). All European decimal commas are converted to points. Empty strings are replaced with pd.NA for proper missing value handling. The DataFrame preserves the original column structure detected from the CSV file.
Dependencies
pandascsvre
Required Imports
import pandas as pd
import csv
import re
Conditional/Optional Imports
These imports are only needed under specific conditions:
import csv
Condition: imported inside function but always used for CSV processing
Required (conditional)import re
Condition: imported inside function but always used for pattern matching European decimals
Required (conditional)Usage Example
import pandas as pd
import logging
# Setup logger (required dependency)
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Define convert_european_decimals helper (required dependency)
def convert_european_decimals(df):
for col in df.columns:
if df[col].dtype == 'object':
df[col] = df[col].str.replace(',', '.', regex=False)
return df
# Use the function
df = smart_read_csv('data.csv')
print(f"Loaded {len(df)} rows and {len(df.columns)} columns")
print(df.head())
# Works with various formats:
# US format: data_us.csv with comma delimiter and point decimals
# European format: data_eu.csv with semicolon delimiter and comma decimals
# Tab-delimited: data_tab.csv with tab delimiter
Best Practices
- Ensure the 'logger' object is properly configured in the module before calling this function
- The 'convert_european_decimals' helper function must be defined in the same module
- Files should be UTF-8 encoded; other encodings may cause errors
- The function reads an 8KB sample for detection, so ensure CSV files have consistent formatting throughout
- For very large files, consider the memory implications of loading the entire file into a DataFrame
- The function replaces empty strings with pd.NA, so be aware of this when checking for missing values
- If delimiter detection fails (resulting in a single column), the function automatically retries with alternative delimiters
- All numeric columns are initially read as strings to preserve decimal format during conversion, then converted appropriately
Tags
Similar Components
AI-powered semantic similarity - components with related functionality:
-
function test_tab_delimited_european 78.6% similar
-
function test_european_csv 77.0% similar
-
function test_us_csv 72.0% similar
-
function test_european_with_thousands 71.4% similar
-
function convert_european_decimals 70.9% similar