
function convert_european_decimals

Maturity: 61

Detects and converts numeric data with European decimal format (comma as decimal separator) to standard format (dot as decimal separator) in a pandas DataFrame, handling mixed formats and missing data patterns.

File: /tf/active/vicechatdev/vice_ai/smartstat_service.py
Lines: 22-250
Complexity: complex

Purpose

This function is designed to normalize numeric data in DataFrames that may contain European-style decimal notation (e.g., '1.234,56') or US-style notation (e.g., '1,234.56'), or even mixed formats within the same file. It automatically detects the format used in each column, handles various missing data placeholders (like 'N/A', '-', empty strings), and converts all numeric columns to a standardized format suitable for analysis. This is particularly useful when processing CSV files from different locales or when dealing with data exported from European spreadsheet applications.
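A quick illustration of the problem this function solves: pandas' own `pd.to_numeric` cannot parse European-style decimal strings, so without this normalization step they would silently be lost as NaN.

```python
import pandas as pd

# European-style decimal strings are opaque to pd.to_numeric:
# with errors='coerce' every value silently becomes NaN.
s = pd.Series(['1.234,56', '999,99'])
coerced = pd.to_numeric(s, errors='coerce')
print(bool(coerced.isna().all()))  # True
```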

Source Code

def convert_european_decimals(df: pd.DataFrame) -> pd.DataFrame:
    """
    Helper function to detect and convert European decimal format (comma as decimal separator)
    to standard format (dot as decimal separator) in a DataFrame.
    
    Handles:
    - European format: "12,5", "1.234,56", "1234,56" 
    - US format: "12.5", "1,234.56", "1234.56"
    - Mixed formats in same file (with ';' delimiter)
    - Single '.' or ',' in empty cells (treated as missing data)
    - Automatic detection based on pattern matching
    - Properly handles missing/empty values
    
    Args:
        df: DataFrame to process
        
    Returns:
        DataFrame with numeric columns properly converted
    """
    import re
    
    for col in df.columns:
        # Skip if column is already numeric
        if pd.api.types.is_numeric_dtype(df[col]):
            continue
            
        # Skip empty columns
        if df[col].isna().all() or (df[col].astype(str) == '').all():
            continue
        
        # FIRST: Clean up invalid/placeholder values that indicate missing data
        # Common patterns users use to indicate "no value"
        missing_data_patterns = [
            # Empty/whitespace
            '', ' ', '  ', '   ',
            # Single separators
            '.', ',', ' .', '. ', ' , ', ', ', ' . ',
            # Dashes and hyphens
            '-', '--', '---', ' - ', ' -- ', ' --- ',
            # Text indicators (case insensitive)
            'na', 'n/a', 'n.a.', 'n.a', 'NA', 'N/A', 'N.A.', 'N.A',
            'none', 'None', 'NONE',
            'null', 'Null', 'NULL',
            'no value', 'No value', 'NO VALUE', 'no val', 'No Val',
            'novalue', 'NoValue', 'NOVALUE',
            'missing', 'Missing', 'MISSING',
            'not available', 'Not available', 'NOT AVAILABLE',
            'not applicable', 'Not applicable', 'NOT APPLICABLE',
            'n.d.', 'N.D.', 'nd', 'ND',  # Not determined
            'n.m.', 'N.M.', 'nm', 'NM',  # Not measured
            'empty', 'Empty', 'EMPTY',
            '#n/a', '#N/A', '#NA',  # Excel format
            '#DIV/0!', '#VALUE!', '#REF!',  # Excel errors
            '?', '??', '???',
            'x', 'X', 'xx', 'XX',
        ]
        
        # Track the non-null count before cleanup so replacements can be logged
        pre_clean_count = df[col].notna().sum()
        
        # Replace all missing data patterns with NaN
        df[col] = df[col].replace(missing_data_patterns, pd.NA)
        
        # Also strip whitespace so values that are ONLY whitespace become NA
        if df[col].dtype == 'object' or df[col].dtype == 'string':
            # Blank strings become NA; non-string values pass through unchanged
            df[col] = df[col].apply(
                lambda x: (x.strip() if x.strip() else pd.NA) if isinstance(x, str) else x
            )
        
        post_clean_count = df[col].notna().sum()
        cleaned_count = pre_clean_count - post_clean_count
        
        if cleaned_count > 0:
            logger.info(f"Column '{col}': Cleaned {cleaned_count} missing data placeholder(s)")
        
        # Get non-empty values for analysis
        # For pattern detection, check ALL values (not just sample) to catch mixed formats
        all_values = df[col].dropna().astype(str)
        all_values = all_values[all_values.str.strip() != '']
        
        # Use sample for regex matching (performance) but all values for separator detection
        sample_values = all_values.head(200) if len(all_values) > 200 else all_values
        if len(sample_values) == 0:
            continue
        
        # Check if this column contains numeric data with European decimal format
        # Patterns: "12,5", "1.234,56", "1234,56", "-12,5"
        european_number_pattern = r'^-?\d{1,3}(\.\d{3})*(,\d+)?$|^-?\d+(,\d+)$'
        us_number_pattern = r'^-?\d{1,3}(,\d{3})*(\.\d+)?$|^-?\d+(\.\d+)?$'
        
        # Count matches for each format
        european_matches = sample_values.str.match(european_number_pattern, na=False).sum()
        us_matches = sample_values.str.match(us_number_pattern, na=False).sum()
        
        # Calculate percentage of values matching each format
        total_samples = len(sample_values)
        euro_pct = european_matches / total_samples if total_samples > 0 else 0
        us_pct = us_matches / total_samples if total_samples > 0 else 0
        
        # Check for MIXED format scenario (common with ';' delimiter)
        # CRITICAL: Check ALL values in column, not just sample, to catch mixed formats
        has_comma_decimals = all_values.str.contains(r'\d+,\d+', regex=True, na=False).sum()
        has_dot_decimals = all_values.str.contains(r'\d+\.\d+', regex=True, na=False).sum()
        
        # Analyze the PATTERN to determine separator usage
        # Key insight: Decimal separators typically have 1-2 digits after them
        #              Thousand separators have exactly 3 digits after them
        comma_as_decimal = all_values.str.contains(r'\d+,\d{1,2}(?!\d)', regex=True, na=False).sum()
        comma_as_thousand = all_values.str.contains(r'\d+,\d{3}', regex=True, na=False).sum()
        dot_as_decimal = all_values.str.contains(r'\d+\.\d{1,2}(?!\d)', regex=True, na=False).sum()
        dot_as_thousand = all_values.str.contains(r'\d+\.\d{3}', regex=True, na=False).sum()
        
        logger.debug(f"Column '{col}' separator analysis: comma_decimal={comma_as_decimal}, comma_thousand={comma_as_thousand}, dot_decimal={dot_as_decimal}, dot_thousand={dot_as_thousand}")
        
        # MIXED FORMAT HANDLING: If we see both formats, analyze row by row
        # KEY FIX: Always use mixed handling when BOTH separators are present, regardless of percentages
        if has_comma_decimals > 0 and has_dot_decimals > 0:
            logger.warning(f"Column '{col}': Detected MIXED decimal formats (both '.' and ',' separators)")
            logger.info(f"  Comma values: {has_comma_decimals}, Dot values: {has_dot_decimals}")
            logger.info(f"  Likely comma as decimal: {comma_as_decimal}, comma as thousand: {comma_as_thousand}")
            logger.info(f"  Likely dot as decimal: {dot_as_decimal}, dot as thousand: {dot_as_thousand}")
            
            # Determine predominant usage patterns
            comma_is_decimal_separator = comma_as_decimal > comma_as_thousand
            dot_is_decimal_separator = dot_as_decimal > dot_as_thousand
            
            # Convert each value based on its format
            def smart_convert(val):
                if pd.isna(val) or val == '' or str(val).strip() == '':
                    return pd.NA
                
                val_str = str(val).strip()
                
                # If has BOTH comma and dot, determine which is decimal separator
                if ',' in val_str and '.' in val_str:
                    # Count digits after each separator to determine role
                    # Find rightmost comma and dot
                    comma_pos = val_str.rfind(',')
                    dot_pos = val_str.rfind('.')
                    
                    if comma_pos > dot_pos:
                        # Comma is rightmost - likely decimal separator
                        # Format: "1.234,56" (European)
                        digits_after_comma = len(val_str) - comma_pos - 1
                        if digits_after_comma <= 3:  # Decimal separator
                            val_str = val_str.replace('.', '').replace(',', '.')
                        else:  # More than 3 digits after the comma: treat commas as grouping
                            val_str = val_str.replace(',', '')
                    else:
                        # Dot is rightmost - likely decimal separator
                        # Format: "1,234.56" (US)
                        digits_after_dot = len(val_str) - dot_pos - 1
                        if digits_after_dot <= 3:  # Decimal separator
                            val_str = val_str.replace(',', '')
                        else:  # Unusual format
                            val_str = val_str.replace('.', '').replace(',', '.')
                
                # Only comma present
                elif ',' in val_str:
                    # Check digits after comma to determine if decimal or thousand separator
                    parts = val_str.split(',')
                    if len(parts) == 2:
                        digits_after = len(parts[1])
                        if digits_after <= 2 or (digits_after == 3 and comma_is_decimal_separator):
                            # Treat as decimal separator
                            val_str = val_str.replace(',', '.')
                        elif digits_after == 3 and not comma_is_decimal_separator:
                            # Treat as thousand separator (remove it)
                            val_str = val_str.replace(',', '')
                        else:
                            # Default: treat as decimal
                            val_str = val_str.replace(',', '.')
                
                # Only dot present
                elif '.' in val_str:
                    # Check digits after dot to determine if decimal or thousand separator
                    parts = val_str.split('.')
                    if len(parts) == 2:
                        digits_after = len(parts[1])
                        if digits_after <= 2 or (digits_after == 3 and dot_is_decimal_separator):
                            # Treat as decimal separator (already correct format)
                            pass
                        elif digits_after == 3 and not dot_is_decimal_separator:
                            # Treat as thousand separator (remove it)
                            val_str = val_str.replace('.', '')
                        # else: keep as is (decimal)
                
                # No separator - whole number
                
                try:
                    return float(val_str)
                except (TypeError, ValueError):
                    return pd.NA
            
            df[col] = df[col].apply(smart_convert)
            # Coerce to a numeric dtype (unparseable or missing entries become NaN)
            df[col] = pd.to_numeric(df[col], errors='coerce')
            logger.info(f"Column '{col}': Converted mixed format, {df[col].notna().sum()} valid values")
            
        # Determine which format this column uses
        elif euro_pct > 0.3:  # >30% match European format
            logger.info(f"Column '{col}': Converting from European decimal format (comma separator)")
            # Replace empty strings with NaN first to preserve missing values
            df[col] = df[col].replace('', pd.NA)
            # Remove thousand separators (dots or spaces), replace comma with dot
            # Only process non-null values
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].astype(str).str.replace('.', '', regex=False)  # Remove thousand separator
            df.loc[mask, col] = df.loc[mask, col].str.replace(' ', '', regex=False)  # Remove space thousand separator  
            df.loc[mask, col] = df.loc[mask, col].str.replace(',', '.', regex=False)  # Change decimal separator
            df[col] = pd.to_numeric(df[col], errors='coerce')
        elif us_pct > 0.3:  # >30% match US format
            logger.info(f"Column '{col}': Converting from US decimal format (point separator)")
            # Replace empty strings with NaN first
            df[col] = df[col].replace('', pd.NA)
            # Remove thousand separators (commas), keep dot as decimal
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].astype(str).str.replace(',', '', regex=False)  # Remove thousand separator
            df[col] = pd.to_numeric(df[col], errors='coerce')
        else:
            # Try to convert to numeric anyway, or leave as string
            # Replace empty strings first
            df[col] = df[col].replace('', pd.NA)
            numeric_converted = pd.to_numeric(df[col], errors='coerce')
            if numeric_converted.notna().sum() > len(all_values) * 0.5:
                df[col] = numeric_converted
            # else: leave as string (text column)
    
    return df
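The core of the mixed-format branch is a rightmost-separator heuristic: whichever of ',' and '.' appears last in a value is taken as its decimal separator, and the other character is stripped as a thousands separator. A minimal standalone sketch of that heuristic (the `normalize_number` name is hypothetical, not from the source):

```python
def normalize_number(val_str: str) -> float:
    # The rightmost of ',' and '.' is taken as the decimal separator;
    # the other character is stripped as a thousands separator.
    if ',' in val_str and '.' in val_str:
        if val_str.rfind(',') > val_str.rfind('.'):
            # European: "1.234,56" -> "1234.56"
            val_str = val_str.replace('.', '').replace(',', '.')
        else:
            # US: "1,234.56" -> "1234.56"
            val_str = val_str.replace(',', '')
    elif ',' in val_str:
        # Lone comma treated as decimal separator: "12,5" -> "12.5"
        val_str = val_str.replace(',', '.')
    return float(val_str)

print(normalize_number('1.234,56'))  # 1234.56
print(normalize_number('1,234.56'))  # 1234.56
print(normalize_number('12,5'))      # 12.5
```

The full `smart_convert` in the source adds digit-count checks and column-level statistics on top of this, so that a lone separator with exactly three trailing digits can still be classified as a thousands separator.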

Parameters

Name  Type          Default  Kind
df    pd.DataFrame  -        positional_or_keyword

Parameter Details

df: A pandas DataFrame containing columns that may have numeric data in European decimal format (comma as decimal separator), US format (dot as decimal separator), or mixed formats. The function will process all columns, skipping those already in numeric dtype or completely empty. The DataFrame is modified in place and returned.

Return Value

Type: pd.DataFrame

Returns the input DataFrame with detected numeric columns coerced to a float dtype (missing or unparseable entries become NaN). Columns in European format have their thousand separators removed and decimal commas replaced with dots. Columns with mixed formats are converted value by value based on pattern analysis. Non-numeric text columns are left unchanged. Missing data placeholders are replaced with pd.NA before conversion.
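As a quick sanity check of the missing-value behavior (this assumes plain `pd.to_numeric` coercion, as used in the source):

```python
import pandas as pd

# Unparseable or missing entries come back as NaN in a float column
s = pd.Series(['12.5', None, '7.25'])
converted = pd.to_numeric(s, errors='coerce')
print(converted.dtype)              # float64
print(int(converted.isna().sum()))  # 1
```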

Dependencies

  • pandas
  • re

Required Imports

import pandas as pd
import re

Conditional/Optional Imports

import re

Condition: imported inside the function body. Note that the pattern matching itself is performed through pandas string methods (str.match, str.contains), so the module is imported defensively rather than referenced directly.

Usage Example

import pandas as pd
import logging

# Setup logger (required by the function)
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Create sample DataFrame with European decimal format
df = pd.DataFrame({
    'price': ['1.234,56', '2.500,00', '999,99'],
    'quantity': ['10', '20', '15'],
    'discount': ['12,5', '15,0', 'N/A'],
    'us_format': ['1,234.56', '2,500.00', '999.99']
})

print('Before conversion:')
print(df)
print(df.dtypes)

# Convert European decimals
df_converted = convert_european_decimals(df)

print('\nAfter conversion:')
print(df_converted)
print(df_converted.dtypes)

Best Practices

  • Ensure a logger object is configured in the module scope before calling this function, as it uses logger.info(), logger.warning(), and logger.debug() for status reporting
  • The function modifies the DataFrame in place but also returns it, so you can use it as df = convert_european_decimals(df)
  • The function uses a 30% threshold for format detection - at least 30% of values must match a pattern to trigger conversion
  • For mixed format columns (containing both comma and dot decimals), the function analyzes ALL values in the column, not just a sample, to ensure accurate detection
  • The function handles extensive missing data patterns including Excel errors (#DIV/0!, #VALUE!), various 'N/A' formats, dashes, and empty strings
  • When dealing with ambiguous cases (e.g., '1.234' could be 1234 or 1.234), the function uses pattern analysis of digit counts after separators to make intelligent decisions
  • The function preserves missing values: placeholders become pd.NA during cleanup, and unparseable entries become NaN after pd.to_numeric coercion
  • For performance, regex pattern matching uses a sample of up to 200 values, but separator detection checks all values to catch mixed formats
  • The function skips columns that are already numeric or completely empty to avoid unnecessary processing
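The placeholder cleanup described above boils down to a single Series.replace call that maps every listed token to pd.NA in one pass; a small sketch:

```python
import pandas as pd

# Map every placeholder token to pd.NA in one pass; real values pass through
placeholders = ['N/A', '-', '#DIV/0!', '']
s = pd.Series(['12,5', 'N/A', '-', '#DIV/0!', '999,99'])
cleaned = s.replace(placeholders, pd.NA)
print(int(cleaned.isna().sum()))  # 3
```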

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function smart_read_csv 70.9% similar

    Automatically detects CSV file delimiters (comma, semicolon, tab) and handles regional decimal formats (European comma vs US/UK point) to reliably parse CSV files from different locales.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function test_tab_delimited_european 64.8% similar

    A unit test function that validates the smart_read_csv function's ability to correctly parse tab-delimited CSV files containing European-style decimal numbers (using commas instead of periods).

    From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py
  • function test_european_csv 62.8% similar

    A test function that validates the ability to read and parse European-formatted CSV files (semicolon delimiters, comma decimal separators) and convert them to proper numeric types.

    From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py
  • function test_european_with_thousands 62.3% similar

    A unit test function that validates the smart_read_csv function's ability to correctly parse European-formatted CSV files with thousand separators (dots) and decimal commas.

    From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py
  • function test_us_csv 55.5% similar

    A unit test function that validates the smart_read_csv function's ability to correctly parse US-formatted CSV files with comma delimiters and point decimal separators.

    From: /tf/active/vicechatdev/vice_ai/test_regional_formats.py