function validate_sheet_format
Analyzes Excel sheet structure using multiple heuristics to classify it as tabular data, information sheet, or mixed format, returning quality metrics and extraction recommendations.
File: /tf/active/vicechatdev/vice_ai/smartstat_service.py
Lines: 455-601
Complexity: moderate
Purpose
This function evaluates the structural characteristics of a pandas DataFrame (typically loaded from an Excel sheet) to determine its suitability for data analysis. It distinguishes between clean tabular data suitable for statistical analysis, information/metadata sheets containing documentation, and mixed-format sheets requiring preprocessing. The function uses seven distinct checks: the unnamed-column ratio, empty row/column patterns, first-column characteristics, data sparsity, the column-to-row ratio, sheet naming conventions, and text content analysis. It returns a comprehensive assessment with quality scores, identified issues, content hints, and recommended extraction methods to guide downstream processing decisions.
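To make the heuristics concrete, here is a minimal sketch (assuming validate_sheet_format is importable from smartstat_service; the DataFrames and sheet names are hypothetical) contrasting a clean table with a key-value metadata sheet:

import pandas as pd

# A clean numeric table: named columns, dense values, numeric first column.
tabular = pd.DataFrame({
    'subject_id': [1, 2, 3, 4],
    'dose_mg': [10, 20, 10, 20],
    'response': [0.8, 1.4, 0.9, 1.6],
})

# A key-value metadata sheet as pandas typically loads it: the header cells
# were blank (so columns become 'Unnamed: N') and the first column holds
# label-like text ending in colons.
info = pd.DataFrame({
    'Unnamed: 0': ['Study title:', 'Sponsor:', 'Protocol version:'],
    'Unnamed: 1': ['Phase II dose-ranging trial', 'Acme Pharma', '2.1'],
})

print(validate_sheet_format(tabular, 'Results')['sheet_type'])   # tabular_data
print(validate_sheet_format(info, 'Study Info')['sheet_type'])   # information_sheet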
Source Code
def validate_sheet_format(df: pd.DataFrame, sheet_name: str) -> Dict[str, Any]:
    """
    Analyze sheet structure to determine if it's tabular data or an information sheet

    This function uses multiple heuristics to classify Excel sheets:
    - Tabular data: Regular rows/columns with consistent structure
    - Information sheets: Metadata, study protocols, notes, documentation
    - Mixed format: Contains both structured and unstructured content

    Returns:
        Dict with 'sheet_type', 'quality_score', 'issues', 'recommendation', and 'content_hints'
    """
    issues = []
    quality_score = 100
    content_hints = []

    # Check 1: Unnamed columns (indicates non-tabular structure)
    unnamed_count = sum(1 for col in df.columns if 'Unnamed' in str(col))
    unnamed_ratio = unnamed_count / len(df.columns) if len(df.columns) > 0 else 0
    if unnamed_ratio > 0.5:
        issues.append(f"{unnamed_count} unnamed columns ({unnamed_ratio:.0%}) - likely info sheet with free-form content")
        content_hints.append("free_form_layout")
        quality_score -= 40
    elif unnamed_ratio > 0.2:
        issues.append(f"{unnamed_count} unnamed columns ({unnamed_ratio:.0%}) - mixed content or multiple sub-tables")
        content_hints.append("multi_section")
        quality_score -= 25

    # Check 2: Empty rows/columns (indicates mixed content or documentation)
    empty_rows = df.isna().all(axis=1).sum()
    empty_cols = df.isna().all(axis=0).sum()
    if empty_rows > len(df) * 0.3:
        issues.append(f"{empty_rows} empty rows ({empty_rows/len(df):.0%}) - contains section separators or documentation blocks")
        content_hints.append("sectioned_content")
        quality_score -= 25
    elif empty_rows > len(df) * 0.1:
        issues.append(f"{empty_rows} empty rows - might have grouped sections")
        content_hints.append("grouped_sections")
        quality_score -= 10
    if empty_cols > 0:
        issues.append(f"{empty_cols} completely empty columns - layout/formatting columns")
        content_hints.append("layout_spacing")
        quality_score -= 10

    # Check 3: First column characteristics (metadata sheets often have labels in first col)
    if len(df) > 0 and len(df.columns) > 0:
        first_col = df.iloc[:, 0]
        non_null_first = first_col.dropna()
        # If first column has mostly unique text values, might be labels/metadata
        if len(non_null_first) > 0:
            unique_ratio = len(non_null_first.unique()) / len(non_null_first)
            # Check if first column looks like labels (text with colons, short phrases)
            if non_null_first.dtype == 'object':
                label_patterns = non_null_first.astype(str).str.contains(r'[:]\s*$|^[A-Z][a-zA-Z\s]{2,30}$', regex=True, na=False)
                label_ratio = label_patterns.sum() / len(non_null_first)
                if label_ratio > 0.5:
                    issues.append("First column contains label-like text with colons - likely key-value metadata")
                    content_hints.append("key_value_pairs")
                    quality_score -= 20
                elif unique_ratio > 0.9:
                    issues.append("First column contains unique identifiers/labels - possibly metadata/info sheet")
                    content_hints.append("unique_labels")
                    quality_score -= 15

    # Check 4: Data sparsity (tabular data should be relatively dense)
    total_cells = df.shape[0] * df.shape[1]
    null_cells = df.isna().sum().sum()
    sparsity = null_cells / total_cells if total_cells > 0 else 0
    if sparsity > 0.6:
        issues.append(f"Very high sparsity ({sparsity:.0%} null values) - likely documentation or sparse metadata")
        content_hints.append("very_sparse")
        quality_score -= 20
    elif sparsity > 0.4:
        issues.append(f"High sparsity ({sparsity:.0%} null values) - unusual for clean data table")
        content_hints.append("sparse_data")
        quality_score -= 15

    # Check 5: Column count vs row count ratio (info sheets often have few columns, many rows)
    if len(df.columns) <= 3 and len(df) > 20:
        content_hints.append("vertical_list")
        quality_score -= 10

    # Check 6: Look for common information sheet keywords in sheet name
    info_keywords = ['info', 'metadata', 'notes', 'readme', 'protocol', 'schedule',
                     'description', 'legend', 'guide', 'instructions', 'summary']
    sheet_lower = sheet_name.lower()
    if any(keyword in sheet_lower for keyword in info_keywords):
        issues.append(f"Sheet name '{sheet_name}' suggests informational content")
        content_hints.append("info_sheet_name")
        quality_score -= 15

    # Check 7: Text-heavy content (long strings indicate documentation)
    if len(df) > 0:
        text_cols = df.select_dtypes(include=['object']).columns
        if len(text_cols) > 0:
            avg_text_length = df[text_cols].astype(str).applymap(len).mean().mean()
            if avg_text_length > 100:
                issues.append(f"Average text length is {avg_text_length:.0f} characters - contains long-form text")
                content_hints.append("long_text_content")
                quality_score -= 15

    # Determine sheet type based on quality score and hints
    if quality_score >= 75:
        sheet_type = 'tabular_data'
        recommendation = 'suitable_for_analysis'
        extraction_method = 'dataframe'
    elif quality_score >= 50:
        sheet_type = 'mixed_format'
        recommendation = 'requires_preprocessing'
        extraction_method = 'selective_extraction'
    else:
        sheet_type = 'information_sheet'
        recommendation = 'extract_as_context'
        # Determine best extraction method based on content hints
        if 'key_value_pairs' in content_hints:
            extraction_method = 'markdown_key_value'
        elif 'long_text_content' in content_hints:
            extraction_method = 'markdown_document'
        else:
            extraction_method = 'markdown_structured'

    logger.info(f"Sheet '{sheet_name}' validation: type={sheet_type}, score={quality_score}, "
                f"hints={content_hints}, extraction={extraction_method}")

    # Convert numpy types to Python native types for JSON serialization
    return {
        'sheet_type': sheet_type,
        'quality_score': int(quality_score),
        'issues': issues,
        'recommendation': recommendation,
        'extraction_method': extraction_method,
        'content_hints': content_hints,
        'stats': {
            'unnamed_columns': int(unnamed_count),
            'empty_rows': int(empty_rows),
            'empty_columns': int(empty_cols),
            'sparsity': float(round(sparsity, 2)),
            'shape': f"{len(df)} rows × {len(df.columns)} columns"
        }
    }
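A note on Check 1: pandas itself produces the 'Unnamed: N' column names whenever a header cell is blank, so the unnamed-column ratio is really a proxy for missing headers. A small sketch (using an in-memory CSV as a stand-in for an Excel sheet) shows where those names come from:

import io
import pandas as pd

# The first row is the header; its cells are all blank, so pandas
# auto-generates 'Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2' as column names.
csv_data = ",,\nStudy title:,Phase II trial,\nSponsor:,Acme Pharma,\n"
df = pd.read_csv(io.StringIO(csv_data))
print(list(df.columns))  # ['Unnamed: 0', 'Unnamed: 1', 'Unnamed: 2']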
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| df | pd.DataFrame | - | positional_or_keyword |
| sheet_name | str | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame containing the sheet data to be validated. Expected to be loaded from an Excel sheet. Can contain any data types (numeric, text, mixed). The function handles empty DataFrames gracefully.
sheet_name: String representing the name of the Excel sheet being analyzed. Used for contextual analysis (e.g., detecting keywords like 'info', 'metadata', 'notes' that suggest informational content). Should be the original sheet name from the Excel file.
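On the empty-DataFrame guarantee: with no cells to inspect, the size-guarded checks are skipped and no penalties accrue, so an empty sheet keeps the initial score of 100. A quick sanity check (a sketch, assuming the function is in scope):

import pandas as pd

empty = pd.DataFrame()
result = validate_sheet_format(empty, 'Blank')
print(result['quality_score'])    # 100 - nothing triggered a penalty
print(result['sheet_type'])       # tabular_data, despite having no data
print(result['stats']['shape'])   # 0 rows × 0 columns

As the best practices below note, a high score on an empty sheet reflects limited analysis, not genuine tabular quality.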
Return Value
Type: Dict[str, Any]
Returns a dictionary with the following keys: 'sheet_type' (str: 'tabular_data', 'mixed_format', or 'information_sheet'), 'quality_score' (int: starts at 100 and is reduced by per-check penalties; higher scores indicate better tabular structure, and heavily penalized sheets can fall below 0), 'issues' (list of str: human-readable descriptions of detected problems), 'recommendation' (str: 'suitable_for_analysis', 'requires_preprocessing', or 'extract_as_context'), 'extraction_method' (str: 'dataframe', 'selective_extraction', 'markdown_key_value', 'markdown_document', or 'markdown_structured'), 'content_hints' (list of str: tags describing content characteristics such as 'free_form_layout', 'key_value_pairs', and 'sparse_data'), and 'stats' (dict containing 'unnamed_columns', 'empty_rows', 'empty_columns', 'sparsity', and 'shape' metrics). All numeric values are converted to Python native types for JSON serialization.
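Because all numeric fields are cast to native Python types, the result can be passed straight to json.dumps without a custom encoder; a brief sketch (reusing the df and result names from the usage example below):

import json

result = validate_sheet_format(df, 'Sheet1')
print(json.dumps(result, indent=2))  # serializes without a TypeError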
Dependencies
pandas, logging
Required Imports
import pandas as pd
import logging
from typing import Dict, Any
Usage Example
import pandas as pd
import logging
from typing import Dict, Any

# Setup logger
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Load Excel sheet
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Validate sheet format
result = validate_sheet_format(df, 'Sheet1')

# Check results
print(f"Sheet Type: {result['sheet_type']}")
print(f"Quality Score: {result['quality_score']}")
print(f"Recommendation: {result['recommendation']}")
print(f"Extraction Method: {result['extraction_method']}")

# Handle based on sheet type
if result['sheet_type'] == 'tabular_data':
    # Process as regular data
    print("Suitable for analysis")
elif result['sheet_type'] == 'information_sheet':
    # Extract as context/metadata
    print(f"Extract using: {result['extraction_method']}")
else:
    # Mixed format - requires preprocessing
    print(f"Issues found: {result['issues']}")
Best Practices
- Define a module-level `logger` (e.g., `logger = logging.getLogger(__name__)`) before calling this function; the logging statement raises a NameError if `logger` is not in scope
- The function is designed for Excel sheets but works with any pandas DataFrame - consider the context when interpreting results
- Quality scores below 50 indicate the sheet is likely not suitable for direct statistical analysis
- Use the 'extraction_method' field to determine the appropriate parsing strategy for downstream processing (see the dispatch sketch after this list)
- The 'content_hints' list provides granular insights that can be used for custom processing logic
- Handle empty DataFrames appropriately - the function returns valid results but with limited analysis
- The function performs read-only analysis and does not modify the input DataFrame
- Consider the 'stats' dictionary for detailed metrics when debugging or logging sheet characteristics
- Sheet names are case-insensitive when checking for information keywords
- The function assumes standard Excel conventions (e.g., 'Unnamed' prefix for columns without headers)
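One way to act on the 'extraction_method' field, as referenced above, is a dispatch table. The handlers below are hypothetical placeholders (they are not part of smartstat_service.py); substitute your own parsing routines:

def extract_dataframe(df): ...    # hypothetical: use the DataFrame as-is
def extract_selective(df): ...    # hypothetical: pull out clean sub-tables
def extract_key_value(df): ...    # hypothetical: render key-value markdown
def extract_document(df): ...     # hypothetical: render long-form markdown
def extract_structured(df): ...   # hypothetical: render structured markdown

EXTRACTION_HANDLERS = {
    'dataframe': extract_dataframe,
    'selective_extraction': extract_selective,
    'markdown_key_value': extract_key_value,
    'markdown_document': extract_document,
    'markdown_structured': extract_structured,
}

result = validate_sheet_format(df, 'Sheet1')
processed = EXTRACTION_HANDLERS[result['extraction_method']](df)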
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_sheet_context (61.0% similar)
- function smartstat_select_sheet (59.3% similar)
- function read_excel_file (57.7% similar)
- function test_excel_output (57.1% similar)
- function validate_document_structure (54.7% similar)