smart_read_csv - Code Extractor

function smart_read_csv

Maturity: 60

Automatically detects CSV file delimiters (comma, semicolon, tab) and handles regional decimal formats (European comma vs US/UK point) to reliably parse CSV files from different locales.

File:
/tf/active/vicechatdev/vice_ai/smartstat_service.py

Lines:
252 - 336

Complexity:
moderate

Purpose

This function provides robust CSV file reading that adapts to regional formatting differences. It samples the file to detect the delimiter type, identifies European decimal notation (comma as decimal separator), and converts data to a standardized pandas DataFrame. This is particularly useful when processing CSV files from unknown sources or international datasets where delimiter and decimal conventions vary.

Source Code

def smart_read_csv(file_path: str) -> pd.DataFrame:
    """
    Smart CSV reader that automatically detects delimiter (comma, semicolon, tab)
    and handles European decimal format (comma as decimal separator)
    
    Handles regional differences:
    - US/UK format: comma delimiter, point decimal (1,234.56)
    - European format: semicolon delimiter, comma decimal (1.234,56 or 1234,56)
    - Tab-delimited with either decimal format
    """
    import csv
    import re
    
    try:
        # Read sample to detect format
        with open(file_path, 'r', encoding='utf-8') as f:
            sample = f.read(8192)  # Larger sample for better detection
        
        # Detect delimiter
        delimiter = ','
        has_european_decimals = False
        
        # Count potential delimiters in sample
        comma_count = sample.count(',')
        semicolon_count = sample.count(';')
        tab_count = sample.count('\t')
        
        # Look for European decimal pattern (numbers with comma as decimal separator)
        # Pattern: digits, optional thousand separator (. or space), comma, 1-4 decimal digits
        european_decimal_pattern = re.compile(r'\d+[.,]?\d*,\d{1,4}')
        european_matches = len(european_decimal_pattern.findall(sample))
        
        # Decision logic for delimiter and decimal format
        if tab_count > 10:  # Tab-delimited
            delimiter = '\t'
            logger.info(f"Detected tab delimiter")
            # Check for European decimals in tab-delimited file
            if european_matches > 5:
                has_european_decimals = True
                logger.info(f"Detected European decimal format (comma separator) in tab-delimited file")
        elif semicolon_count > comma_count and semicolon_count > 10:
            # Likely European format with semicolon delimiter
            delimiter = ';'
            has_european_decimals = True
            logger.info(f"Detected semicolon delimiter (European format)")
        else:
            # Default to comma, but check if it's actually European with comma decimals
            # This would be unusual (comma both as delimiter and decimal) but check anyway
            delimiter = ','
            # If we see patterns like "value1;value2;value3" it's probably semicolon-delimited
            if semicolon_count > 5:
                delimiter = ';'
                has_european_decimals = True
                logger.info(f"Detected semicolon delimiter based on frequency")
            else:
                logger.info(f"Using comma delimiter (default)")
        
        # Read CSV file with detected delimiter
        # Use dtype=str initially to preserve numbers for decimal detection
        df = pd.read_csv(file_path, delimiter=delimiter, encoding='utf-8', dtype=str, keep_default_na=False)
        
        # If we only got 1 column, try other delimiters
        if len(df.columns) == 1 and delimiter == ',':
            logger.warning(f"Only 1 column detected with comma, trying semicolon")
            df = pd.read_csv(file_path, delimiter=';', encoding='utf-8', dtype=str, keep_default_na=False)
            has_european_decimals = True
            delimiter = ';'
            
        if len(df.columns) == 1 and delimiter != '\t':
            logger.warning(f"Still only 1 column, trying tab")
            df = pd.read_csv(file_path, delimiter='\t', encoding='utf-8', dtype=str, keep_default_na=False)
            delimiter = '\t'
        
        # Convert European decimal format to standard format
        df = convert_european_decimals(df)
        
        # Restore NA handling
        df = df.replace('', pd.NA)
        
        logger.info(f"Successfully parsed CSV: {len(df)} rows × {len(df.columns)} columns (delimiter: '{delimiter}')")
        return df
        
    except Exception as e:
        logger.error(f"Error reading CSV file {file_path}: {e}")
        raise

Parameters

Name	Type	Default	Kind
`file_path`	str	-	positional_or_keyword

Parameter Details

file_path: String path to the CSV file to be read. Must be a valid file path accessible by the program. The file should be UTF-8 encoded and contain tabular data with consistent delimiters.

Return Value

Type: pd.DataFrame

Returns a pandas DataFrame containing the parsed CSV data with standardized decimal formatting (point as decimal separator). All European decimal commas are converted to points. Empty strings are replaced with pd.NA for proper missing value handling. The DataFrame preserves the original column structure detected from the CSV file.

Dependencies

pandas
csv
re

Required Imports

import pandas as pd
import csv
import re

Conditional/Optional Imports

These imports are only needed under specific conditions:

import csv

Condition: imported inside function but always used for CSV processing

Required (conditional)

import re

Condition: imported inside function but always used for pattern matching European decimals

Required (conditional)

Usage Example

import pandas as pd
import logging

# Setup logger (required dependency)
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Define convert_european_decimals helper (required dependency)
def convert_european_decimals(df):
    for col in df.columns:
        if df[col].dtype == 'object':
            df[col] = df[col].str.replace(',', '.', regex=False)
    return df

# Use the function
df = smart_read_csv('data.csv')
print(f"Loaded {len(df)} rows and {len(df.columns)} columns")
print(df.head())

# Works with various formats:
# US format: data_us.csv with comma delimiter and point decimals
# European format: data_eu.csv with semicolon delimiter and comma decimals
# Tab-delimited: data_tab.csv with tab delimiter

Best Practices

Ensure the 'logger' object is properly configured in the module before calling this function
The 'convert_european_decimals' helper function must be defined in the same module
Files should be UTF-8 encoded; other encodings may cause errors
The function reads an 8KB sample for detection, so ensure CSV files have consistent formatting throughout
For very large files, consider the memory implications of loading the entire file into a DataFrame
The function replaces empty strings with pd.NA, so be aware of this when checking for missing values
If delimiter detection fails (resulting in a single column), the function automatically retries with alternative delimiters
All numeric columns are initially read as strings to preserve decimal format during conversion, then converted appropriately

Similar Components

AI-powered semantic similarity - components with related functionality:

function test_tab_delimited_european 78.6% similar

A unit test function that validates the smart_read_csv function's ability to correctly parse tab-delimited CSV files containing European-style decimal numbers (using commas instead of periods).
From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py
function test_european_csv 77.0% similar

A test function that validates the ability to read and parse European-formatted CSV files (semicolon delimiters, comma decimal separators) and convert them to proper numeric types.
From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py
function test_us_csv 72.0% similar

A unit test function that validates the smart_read_csv function's ability to correctly parse US-formatted CSV files with comma delimiters and point decimal separators.
From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py
function test_european_with_thousands 71.4% similar

A unit test function that validates the smart_read_csv function's ability to correctly parse European-formatted CSV files with thousand separators (dots) and decimal commas.
From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py
function convert_european_decimals 70.9% similar

Detects and converts numeric data with European decimal format (comma as decimal separator) to standard format (dot as decimal separator) in a pandas DataFrame, handling mixed formats and missing data patterns.
From: /tf/active/vicechatdev/vice_ai/smartstat_service.py

← Back to Browse

Assistant

Hi! I can help improve this code. Tell me what you'd like to enhance (e.g., "add error handling", "optimize performance", "improve readability", "add type hints").

Code Comparison

Original Code

                            def smart_read_csv(file_path: str) -> pd.DataFrame:
    """
    Smart CSV reader that automatically detects delimiter (comma, semicolon, tab)
    and handles European decimal format (comma as decimal separator)
    
    Handles regional differences:
    - US/UK format: comma delimiter, point decimal (1,234.56)
    - European format: semicolon delimiter, comma decimal (1.234,56 or 1234,56)
    - Tab-delimited with either decimal format
    """
    import csv
    import re
    
    try:
        # Read sample to detect format
        with open(file_path, 'r', encoding='utf-8') as f:
            sample = f.read(8192)  # Larger sample for better detection
        
        # Detect delimiter
        delimiter = ','
        has_european_decimals = False
        
        # Count potential delimiters in sample
        comma_count = sample.count(',')
        semicolon_count = sample.count(';')
        tab_count = sample.count('\t')
        
        # Look for European decimal pattern (numbers with comma as decimal separator)
        # Pattern: digits, optional thousand separator (. or space), comma, 1-4 decimal digits
        european_decimal_pattern = re.compile(r'\d+[.,]?\d*,\d{1,4}')
        european_matches = len(european_decimal_pattern.findall(sample))
        
        # Decision logic for delimiter and decimal format
        if tab_count > 10:  # Tab-delimited
            delimiter = '\t'
            logger.info(f"Detected tab delimiter")
            # Check for European decimals in tab-delimited file
            if european_matches > 5:
                has_european_decimals = True
                logger.info(f"Detected European decimal format (comma separator) in tab-delimited file")
        elif semicolon_count > comma_count and semicolon_count > 10:
            # Likely European format with semicolon delimiter
            delimiter = ';'
            has_european_decimals = True
            logger.info(f"Detected semicolon delimiter (European format)")
        else:
            # Default to comma, but check if it's actually European with comma decimals
            # This would be unusual (comma both as delimiter and decimal) but check anyway
            delimiter = ','
            # If we see patterns like "value1;value2;value3" it's probably semicolon-delimited
            if semicolon_count > 5:
                delimiter = ';'
                has_european_decimals = True
                logger.info(f"Detected semicolon delimiter based on frequency")
            else:
                logger.info(f"Using comma delimiter (default)")
        
        # Read CSV file with detected delimiter
        # Use dtype=str initially to preserve numbers for decimal detection
        df = pd.read_csv(file_path, delimiter=delimiter, encoding='utf-8', dtype=str, keep_default_na=False)
        
        # If we only got 1 column, try other delimiters
        if len(df.columns) == 1 and delimiter == ',':
            logger.warning(f"Only 1 column detected with comma, trying semicolon")
            df = pd.read_csv(file_path, delimiter=';', encoding='utf-8', dtype=str, keep_default_na=False)
            has_european_decimals = True
            delimiter = ';'
            
        if len(df.columns) == 1 and delimiter != '\t':
            logger.warning(f"Still only 1 column, trying tab")
            df = pd.read_csv(file_path, delimiter='\t', encoding='utf-8', dtype=str, keep_default_na=False)
            delimiter = '\t'
        
        # Convert European decimal format to standard format
        df = convert_european_decimals(df)
        
        # Restore NA handling
        df = df.replace('', pd.NA)
        
        logger.info(f"Successfully parsed CSV: {len(df)} rows × {len(df.columns)} columns (delimiter: '{delimiter}')")
        return df
        
    except Exception as e:
        logger.error(f"Error reading CSV file {file_path}: {e}")
        raise
                        

Improved Code

🔍 Code Extractor

function smart_read_csv

Purpose

Source Code

Parameters

Parameter Details

Return Value

Dependencies

Required Imports

Conditional/Optional Imports

Usage Example

Best Practices

Tags

Similar Components

function test_tab_delimited_european 78.6% similar

function test_european_csv 77.0% similar

function test_us_csv 72.0% similar

function test_european_with_thousands 71.4% similar

function convert_european_decimals 70.9% similar

function smart_read_csv

Purpose

Source Code

Parameters

Parameter Details

Return Value

Dependencies

Required Imports

Conditional/Optional Imports

Usage Example

Best Practices

Tags

Similar Components

function test_tab_delimited_european 78.6% similar

function test_european_csv 77.0% similar

function test_us_csv 72.0% similar

function test_european_with_thousands 71.4% similar

function convert_european_decimals 70.9% similar

✨ Improve Code: smart_read_csv

Code Comparison