function convert_european_decimals
Detects and converts numeric data with European decimal format (comma as decimal separator) to standard format (dot as decimal separator) in a pandas DataFrame, handling mixed formats and missing data patterns.
File: /tf/active/vicechatdev/vice_ai/smartstat_service.py
Lines: 22-250
Complexity: complex
Purpose
This function is designed to normalize numeric data in DataFrames that may contain European-style decimal notation (e.g., '1.234,56') or US-style notation (e.g., '1,234.56'), or even mixed formats within the same file. It automatically detects the format used in each column, handles various missing data placeholders (like 'N/A', '-', empty strings), and converts all numeric columns to a standardized format suitable for analysis. This is particularly useful when processing CSV files from different locales or when dealing with data exported from European spreadsheet applications.
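When the locale of a file is known up front, pandas can parse European-style numbers directly via the `decimal` and `thousands` arguments of `read_csv`; `convert_european_decimals` exists for the harder case where the format is unknown or mixed. A minimal sketch of the known-locale path (column names are illustrative):

```python
import io
import pandas as pd

# A European-formatted CSV: ';' delimiter, '.' thousands, ',' decimals
csv_text = "price;qty\n1.234,56;10\n999,99;20\n"

# read_csv parses these directly when told the locale conventions
df = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",", thousands=".")
```

This only works when every numeric column follows the same convention; mixed-format files still need the detection logic below.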
Source Code
def convert_european_decimals(df: pd.DataFrame) -> pd.DataFrame:
    """
    Helper function to detect and convert European decimal format (comma as
    decimal separator) to standard format (dot as decimal separator) in a DataFrame.

    Handles:
    - European format: "12,5", "1.234,56", "1234,56"
    - US format: "12.5", "1,234.56", "1234.56"
    - Mixed formats in the same file (common with ';' delimiter)
    - Single '.' or ',' in otherwise empty cells (treated as missing data)
    - Automatic detection based on pattern matching
    - Missing/empty values, which are preserved as NA

    Args:
        df: DataFrame to process

    Returns:
        DataFrame with numeric columns properly converted
    """
    import re

    for col in df.columns:
        # Skip if column is already numeric
        if pd.api.types.is_numeric_dtype(df[col]):
            continue

        # Skip empty columns
        if df[col].isna().all() or (df[col].astype(str) == '').all():
            continue

        # Track how many values were present before cleaning, for logging
        pre_clean_count = df[col].notna().sum()

        # FIRST: Clean up invalid/placeholder values that indicate missing data.
        # Common patterns users use to indicate "no value".
        missing_data_patterns = [
            # Empty/whitespace
            '', ' ', '  ', '   ',
            # Single separators
            '.', ',', ' .', '. ', ' ,', ', ', ' . ', ' , ',
            # Dashes and hyphens
            '-', '--', '---', ' - ', ' -- ', ' --- ',
            # Text indicators (common casings)
            'na', 'n/a', 'n.a.', 'n.a', 'NA', 'N/A', 'N.A.', 'N.A',
            'none', 'None', 'NONE',
            'null', 'Null', 'NULL',
            'no value', 'No value', 'NO VALUE', 'no val', 'No Val',
            'novalue', 'NoValue', 'NOVALUE',
            'missing', 'Missing', 'MISSING',
            'not available', 'Not available', 'NOT AVAILABLE',
            'not applicable', 'Not applicable', 'NOT APPLICABLE',
            'n.d.', 'N.D.', 'nd', 'ND',  # Not determined
            'n.m.', 'N.M.', 'nm', 'NM',  # Not measured
            'empty', 'Empty', 'EMPTY',
            '#n/a', '#N/A', '#NA',  # Excel missing-value markers
            '#DIV/0!', '#VALUE!', '#REF!',  # Excel errors
            '?', '??', '???',
            'x', 'X', 'xx', 'XX',
        ]

        # Replace all missing data patterns with NA
        df[col] = df[col].replace(missing_data_patterns, pd.NA)

        # Also handle values that are only whitespace, while leaving any
        # non-string values in an object column untouched
        if df[col].dtype == 'object' or df[col].dtype == 'string':
            df[col] = df[col].apply(
                lambda x: x.strip() if isinstance(x, str) and x.strip()
                else (pd.NA if isinstance(x, str) else x)
            )

        post_clean_count = df[col].notna().sum()
        cleaned_count = pre_clean_count - post_clean_count
        if cleaned_count > 0:
            logger.info(f"Column '{col}': Cleaned {cleaned_count} missing data placeholder(s)")

        # Get non-empty values for analysis.
        # For separator detection, check ALL values (not just a sample) to catch mixed formats.
        all_values = df[col].dropna().astype(str)
        all_values = all_values[all_values.str.strip() != '']

        # Use a sample for regex matching (performance), but all values for separator detection
        sample_values = all_values.head(200) if len(all_values) > 200 else all_values
        if len(sample_values) == 0:
            continue

        # Check whether this column contains numeric data in European decimal format.
        # Patterns: "12,5", "1.234,56", "1234,56", "-12,5"
        european_number_pattern = r'^-?\d{1,3}(\.\d{3})*(,\d+)?$|^-?\d+(,\d+)$'
        us_number_pattern = r'^-?\d{1,3}(,\d{3})*(\.\d+)?$|^-?\d+(\.\d+)?$'

        # Count matches for each format
        european_matches = sample_values.str.match(european_number_pattern, na=False).sum()
        us_matches = sample_values.str.match(us_number_pattern, na=False).sum()

        # Calculate the fraction of values matching each format
        total_samples = len(sample_values)
        euro_pct = european_matches / total_samples if total_samples > 0 else 0
        us_pct = us_matches / total_samples if total_samples > 0 else 0

        # Check for the MIXED format scenario (common with ';' delimiter).
        # CRITICAL: check ALL values in the column, not just the sample.
        has_comma_decimals = all_values.str.contains(r'\d+,\d+', regex=True, na=False).sum()
        has_dot_decimals = all_values.str.contains(r'\d+\.\d+', regex=True, na=False).sum()

        # Analyze the pattern to determine separator usage.
        # Key insight: decimal separators typically have 1-2 digits after them;
        # thousands separators have exactly 3.
        comma_as_decimal = all_values.str.contains(r'\d+,\d{1,2}(?!\d)', regex=True, na=False).sum()
        comma_as_thousand = all_values.str.contains(r'\d+,\d{3}', regex=True, na=False).sum()
        dot_as_decimal = all_values.str.contains(r'\d+\.\d{1,2}(?!\d)', regex=True, na=False).sum()
        dot_as_thousand = all_values.str.contains(r'\d+\.\d{3}', regex=True, na=False).sum()

        logger.debug(
            f"Column '{col}' separator analysis: comma_decimal={comma_as_decimal}, "
            f"comma_thousand={comma_as_thousand}, dot_decimal={dot_as_decimal}, "
            f"dot_thousand={dot_as_thousand}"
        )

        # MIXED FORMAT HANDLING: if both separators appear, analyze row by row.
        # Mixed handling always takes precedence over the percentage thresholds.
        if has_comma_decimals > 0 and has_dot_decimals > 0:
            logger.warning(f"Column '{col}': Detected MIXED decimal formats (both '.' and ',' separators)")
            logger.info(f"  Comma values: {has_comma_decimals}, Dot values: {has_dot_decimals}")
            logger.info(f"  Likely comma as decimal: {comma_as_decimal}, comma as thousand: {comma_as_thousand}")
            logger.info(f"  Likely dot as decimal: {dot_as_decimal}, dot as thousand: {dot_as_thousand}")

            # Determine predominant usage patterns
            comma_is_decimal_separator = comma_as_decimal > comma_as_thousand
            dot_is_decimal_separator = dot_as_decimal > dot_as_thousand

            # Convert each value based on its own format
            def smart_convert(val):
                if pd.isna(val) or str(val).strip() == '':
                    return pd.NA
                val_str = str(val).strip()

                # If the value has BOTH comma and dot, the rightmost separator
                # is taken as the decimal separator
                if ',' in val_str and '.' in val_str:
                    comma_pos = val_str.rfind(',')
                    dot_pos = val_str.rfind('.')
                    if comma_pos > dot_pos:
                        # Comma is rightmost - European format, e.g. "1.234,56"
                        digits_after_comma = len(val_str) - comma_pos - 1
                        if digits_after_comma <= 3:  # decimal separator
                            val_str = val_str.replace('.', '').replace(',', '.')
                        else:  # unusual format: treat commas as thousands separators
                            val_str = val_str.replace(',', '')
                    else:
                        # Dot is rightmost - US format, e.g. "1,234.56"
                        digits_after_dot = len(val_str) - dot_pos - 1
                        if digits_after_dot <= 3:  # decimal separator
                            val_str = val_str.replace(',', '')
                        else:  # unusual format
                            val_str = val_str.replace('.', '').replace(',', '.')
                # Only comma present
                elif ',' in val_str:
                    # Digits after the comma decide decimal vs thousands separator
                    parts = val_str.split(',')
                    if len(parts) == 2:
                        digits_after = len(parts[1])
                        if digits_after <= 2 or (digits_after == 3 and comma_is_decimal_separator):
                            val_str = val_str.replace(',', '.')  # decimal separator
                        elif digits_after == 3 and not comma_is_decimal_separator:
                            val_str = val_str.replace(',', '')  # thousands separator
                        else:
                            val_str = val_str.replace(',', '.')  # default: decimal
                    else:
                        # Multiple commas can only be thousands separators
                        val_str = val_str.replace(',', '')
                # Only dot present
                elif '.' in val_str:
                    # Digits after the dot decide decimal vs thousands separator
                    parts = val_str.split('.')
                    if len(parts) == 2:
                        digits_after = len(parts[1])
                        if digits_after == 3 and not dot_is_decimal_separator:
                            val_str = val_str.replace('.', '')  # thousands separator
                        # else: already in standard decimal format, keep as is
                    else:
                        # Multiple dots can only be thousands separators
                        val_str = val_str.replace('.', '')
                # No separator: plain whole number

                try:
                    return float(val_str)
                except (ValueError, TypeError):
                    return pd.NA

            df[col] = df[col].apply(smart_convert)
            # Convert to a numeric dtype (coercing anything left over)
            df[col] = pd.to_numeric(df[col], errors='coerce')
            logger.info(f"Column '{col}': Converted mixed format, {df[col].notna().sum()} valid values")

        # Otherwise, determine which single format this column uses
        elif euro_pct > 0.3:  # >30% match European format
            logger.info(f"Column '{col}': Converting from European decimal format (comma separator)")
            # Replace empty strings with NA first to preserve missing values
            df[col] = df[col].replace('', pd.NA)
            # Remove thousands separators (dots or spaces), replace comma with dot.
            # Only process non-null values.
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].astype(str).str.replace('.', '', regex=False)  # remove thousands separator
            df.loc[mask, col] = df.loc[mask, col].str.replace(' ', '', regex=False)  # remove space thousands separator
            df.loc[mask, col] = df.loc[mask, col].str.replace(',', '.', regex=False)  # change decimal separator
            df[col] = pd.to_numeric(df[col], errors='coerce')

        elif us_pct > 0.3:  # >30% match US format
            logger.info(f"Column '{col}': Converting from US decimal format (point separator)")
            # Replace empty strings with NA first
            df[col] = df[col].replace('', pd.NA)
            # Remove thousands separators (commas); the dot is already the decimal separator
            mask = df[col].notna()
            df.loc[mask, col] = df.loc[mask, col].astype(str).str.replace(',', '', regex=False)
            df[col] = pd.to_numeric(df[col], errors='coerce')

        else:
            # Try a plain numeric conversion; otherwise leave the column as text
            df[col] = df[col].replace('', pd.NA)
            numeric_converted = pd.to_numeric(df[col], errors='coerce')
            if numeric_converted.notna().sum() > len(sample_values) * 0.5:
                df[col] = numeric_converted
            # else: leave as string (text column)

    return df
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| df | pd.DataFrame | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame containing columns that may have numeric data in European decimal format (comma as decimal separator), US format (dot as decimal separator), or mixed formats. The function will process all columns, skipping those already in numeric dtype or completely empty. The DataFrame is modified in place and returned.
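Because the function assigns back into the caller's columns, passing `df.copy()` is the way to keep the original strings. A minimal sketch of this mutation behavior, using a simplified stand-in converter (`mutating_convert` is hypothetical, not the real function):

```python
import pandas as pd

def mutating_convert(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for convert_european_decimals: like the real function, it
    # assigns back into df's columns, so the caller's frame changes too
    for col in df.columns:
        df[col] = pd.to_numeric(
            df[col].str.replace(",", ".", regex=False), errors="coerce"
        )
    return df

raw = pd.DataFrame({"x": ["1,5", "2,5"]})
out = mutating_convert(raw)   # out is raw: the original strings are gone
```

Call it as `mutating_convert(raw.copy())` when the raw text must be preserved.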
Return Value
Type: pd.DataFrame
Returns the input DataFrame with numeric columns converted to proper numeric dtypes (Float64 to support NA values). Columns that were in European format have their thousand separators removed and decimal commas replaced with dots. Columns with mixed formats are intelligently converted based on pattern analysis. Non-numeric columns are left unchanged. Missing data placeholders are replaced with pd.NA.
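The nullable-dtype behavior can be seen in isolation: `pd.to_numeric` alone produces `float64` with `NaN` for missing entries, while casting to the `"Float64"` extension dtype represents them as `pd.NA`. A small sketch:

```python
import pandas as pd

# One European-format value and one missing-data placeholder
s = pd.Series(["12,5", "N/A"]).replace("N/A", pd.NA)

# Normalize the decimal separator, coerce to numeric, then move to the
# nullable Float64 dtype so the missing entry stays pd.NA
s = pd.to_numeric(s.str.replace(",", ".", regex=False), errors="coerce")
s = s.astype("Float64")
```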
Dependencies
pandas, re
Required Imports
import pandas as pd
import re
Conditional/Optional Imports
These imports are only needed under specific conditions:
import re
Condition: imported inside the function for regex pattern matching
Required (conditional)
Usage Example
import pandas as pd
import logging
# Setup logger (required by the function)
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Create sample DataFrame with European decimal format
df = pd.DataFrame({
'price': ['1.234,56', '2.500,00', '999,99'],
'quantity': ['10', '20', '15'],
'discount': ['12,5', '15,0', 'N/A'],
'us_format': ['1,234.56', '2,500.00', '999.99']
})
print('Before conversion:')
print(df)
print(df.dtypes)
# Convert European decimals
df_converted = convert_european_decimals(df)
print('\nAfter conversion:')
print(df_converted)
print(df_converted.dtypes)
Best Practices
- Ensure a logger object is configured in the module scope before calling this function, as it uses logger.info(), logger.warning(), and logger.debug() for status reporting
- The function modifies the DataFrame in place but also returns it, so you can use it as df = convert_european_decimals(df)
- The function uses a 30% threshold for format detection - at least 30% of values must match a pattern to trigger conversion
- For mixed format columns (containing both comma and dot decimals), the function analyzes ALL values in the column, not just a sample, to ensure accurate detection
- The function handles extensive missing data patterns including Excel errors (#DIV/0!, #VALUE!), various 'N/A' formats, dashes, and empty strings
- When dealing with ambiguous cases (e.g., '1.234' could be 1234 or 1.234), the function uses pattern analysis of digit counts after separators to make intelligent decisions
- The function preserves missing values as pd.NA and converts columns to Float64 dtype to support nullable numeric types
- For performance, regex pattern matching uses a sample of up to 200 values, but separator detection checks all values to catch mixed formats
- The function skips columns that are already numeric or completely empty to avoid unnecessary processing
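The ambiguity rule from the practices above can be sketched as a standalone value parser (`parse_ambiguous` is a hypothetical simplification, not part of the real function):

```python
def parse_ambiguous(val: str) -> float:
    """Sketch of the digit-count heuristic: a lone separator followed by
    exactly 3 digits is read as a thousands separator; anything else is
    read as a decimal separator."""
    s = val.strip()
    if "," in s and "." in s:
        # Both present: the rightmost separator is the decimal one
        if s.rfind(",") > s.rfind("."):
            s = s.replace(".", "").replace(",", ".")  # European "1.234,56"
        else:
            s = s.replace(",", "")                    # US "1,234.56"
    elif "," in s:
        head, _, tail = s.rpartition(",")
        s = s.replace(",", "") if len(tail) == 3 and head else s.replace(",", ".")
    elif "." in s:
        head, _, tail = s.rpartition(".")
        if len(tail) == 3 and head:
            s = s.replace(".", "")                    # "1.234" -> 1234
    return float(s)
```

Note that the heuristic deliberately reads the ambiguous `'1.234'` as 1234; the real function additionally weighs column-wide separator statistics before deciding.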
Tags
Similar Components
AI-powered semantic similarity - components with related functionality:
- function smart_read_csv (70.9% similar)
- function test_tab_delimited_european (64.8% similar)
- function test_european_csv (62.8% similar)
- function test_european_with_thousands (62.3% similar)
- function test_us_csv (55.5% similar)