🔍 Code Extractor

function validate_sheet_format

Maturity: 67

Analyzes Excel sheet structure using multiple heuristics to classify it as tabular data, information sheet, or mixed format, returning quality metrics and extraction recommendations.

File: /tf/active/vicechatdev/vice_ai/smartstat_service.py
Lines: 455 - 601
Complexity: moderate

Purpose

This function evaluates the structural characteristics of a pandas DataFrame (typically loaded from an Excel sheet) to determine its suitability for data analysis. It distinguishes between clean tabular data suitable for statistical analysis, information/metadata sheets containing documentation, and mixed-format sheets requiring preprocessing. The function applies seven distinct checks: unnamed column ratio, empty row/column patterns, first-column characteristics, data sparsity, dimensional ratio, sheet naming conventions, and text content analysis. It returns a comprehensive assessment with a quality score, identified issues, content hints, and a recommended extraction method to guide downstream processing decisions.
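
The contrast the checks are designed to capture can be illustrated with two small made-up frames. This sketch assumes validate_sheet_format is importable from smartstat_service; the expected outcomes follow from the scoring rules in the source below.

import pandas as pd

# Illustrative data only: a dense, well-named table vs. a key-value info sheet
# as pandas typically loads it (unnamed columns, labels ending in colons, gaps).
tabular = pd.DataFrame({
    'subject_id': [1, 2, 3, 4],
    'dose_mg': [10, 20, 10, 20],
    'response': [0.8, 1.1, 0.7, 1.3],
})
info = pd.DataFrame({
    'Unnamed: 0': ['Study title:', 'Sponsor:', None, 'Protocol version:'],
    'Unnamed: 1': ['Example trial', 'Acme Pharma', None, '2.1'],
})

validate_sheet_format(tabular, 'Results')    # no penalties -> 'tabular_data'
validate_sheet_format(info, 'Study Info')    # Checks 1, 2, 3 and 6 fire -> 'information_sheet'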

Source Code

def validate_sheet_format(df: pd.DataFrame, sheet_name: str) -> Dict[str, Any]:
    """
    Analyze sheet structure to determine if it's tabular data or an information sheet
    
    This function uses multiple heuristics to classify Excel sheets:
    - Tabular data: Regular rows/columns with consistent structure
    - Information sheets: Metadata, study protocols, notes, documentation
    - Mixed format: Contains both structured and unstructured content
    
    Returns:
        Dict with 'sheet_type', 'quality_score', 'issues', 'recommendation', and 'content_hints'
    """
    issues = []
    quality_score = 100
    content_hints = []
    
    # Check 1: Unnamed columns (indicates non-tabular structure)
    unnamed_count = sum(1 for col in df.columns if 'Unnamed' in str(col))
    unnamed_ratio = unnamed_count / len(df.columns) if len(df.columns) > 0 else 0
    
    if unnamed_ratio > 0.5:
        issues.append(f"{unnamed_count} unnamed columns ({unnamed_ratio:.0%}) - likely info sheet with free-form content")
        content_hints.append("free_form_layout")
        quality_score -= 40
    elif unnamed_ratio > 0.2:
        issues.append(f"{unnamed_count} unnamed columns ({unnamed_ratio:.0%}) - mixed content or multiple sub-tables")
        content_hints.append("multi_section")
        quality_score -= 25
    
    # Check 2: Empty rows/columns (indicates mixed content or documentation)
    empty_rows = df.isna().all(axis=1).sum()
    empty_cols = df.isna().all(axis=0).sum()
    
    if empty_rows > len(df) * 0.3:
        issues.append(f"{empty_rows} empty rows ({empty_rows/len(df):.0%}) - contains section separators or documentation blocks")
        content_hints.append("sectioned_content")
        quality_score -= 25
    elif empty_rows > len(df) * 0.1:
        issues.append(f"{empty_rows} empty rows - might have grouped sections")
        content_hints.append("grouped_sections")
        quality_score -= 10
    
    if empty_cols > 0:
        issues.append(f"{empty_cols} completely empty columns - layout/formatting columns")
        content_hints.append("layout_spacing")
        quality_score -= 10
    
    # Check 3: First column characteristics (metadata sheets often have labels in first col)
    if len(df) > 0 and len(df.columns) > 0:
        first_col = df.iloc[:, 0]
        non_null_first = first_col.dropna()
        
        # If first column has mostly unique text values, might be labels/metadata
        if len(non_null_first) > 0:
            unique_ratio = len(non_null_first.unique()) / len(non_null_first)
            
            # Check if first column looks like labels (text with colons, short phrases)
            if non_null_first.dtype == 'object':
                label_patterns = non_null_first.astype(str).str.contains(r'[:]\s*$|^[A-Z][a-zA-Z\s]{2,30}$', regex=True, na=False)
                label_ratio = label_patterns.sum() / len(non_null_first)
                
                if label_ratio > 0.5:
                    issues.append("First column contains label-like text with colons - likely key-value metadata")
                    content_hints.append("key_value_pairs")
                    quality_score -= 20
                elif unique_ratio > 0.9:
                    issues.append("First column contains unique identifiers/labels - possibly metadata/info sheet")
                    content_hints.append("unique_labels")
                    quality_score -= 15
    
    # Check 4: Data sparsity (tabular data should be relatively dense)
    total_cells = df.shape[0] * df.shape[1]
    null_cells = df.isna().sum().sum()
    sparsity = null_cells / total_cells if total_cells > 0 else 0
    
    if sparsity > 0.6:
        issues.append(f"Very high sparsity ({sparsity:.0%} null values) - likely documentation or sparse metadata")
        content_hints.append("very_sparse")
        quality_score -= 20
    elif sparsity > 0.4:
        issues.append(f"High sparsity ({sparsity:.0%} null values) - unusual for clean data table")
        content_hints.append("sparse_data")
        quality_score -= 15
    
    # Check 5: Column count vs row count ratio (info sheets often have few columns, many rows)
    if len(df.columns) <= 3 and len(df) > 20:
        content_hints.append("vertical_list")
        quality_score -= 10
    
    # Check 6: Look for common information sheet keywords in sheet name
    info_keywords = ['info', 'metadata', 'notes', 'readme', 'protocol', 'schedule', 
                     'description', 'legend', 'guide', 'instructions', 'summary']
    sheet_lower = sheet_name.lower()
    if any(keyword in sheet_lower for keyword in info_keywords):
        issues.append(f"Sheet name '{sheet_name}' suggests informational content")
        content_hints.append("info_sheet_name")
        quality_score -= 15
    
    # Check 7: Text-heavy content (long strings indicate documentation)
    if len(df) > 0:
        text_cols = df.select_dtypes(include=['object']).columns
        if len(text_cols) > 0:
            avg_text_length = df[text_cols].astype(str).applymap(len).mean().mean()
            if avg_text_length > 100:
                issues.append(f"Average text length is {avg_text_length:.0f} characters - contains long-form text")
                content_hints.append("long_text_content")
                quality_score -= 15
    
    # Determine sheet type based on quality score and hints
    if quality_score >= 75:
        sheet_type = 'tabular_data'
        recommendation = 'suitable_for_analysis'
        extraction_method = 'dataframe'
    elif quality_score >= 50:
        sheet_type = 'mixed_format'
        recommendation = 'requires_preprocessing'
        extraction_method = 'selective_extraction'
    else:
        sheet_type = 'information_sheet'
        recommendation = 'extract_as_context'
        # Determine best extraction method based on content hints
        if 'key_value_pairs' in content_hints:
            extraction_method = 'markdown_key_value'
        elif 'long_text_content' in content_hints:
            extraction_method = 'markdown_document'
        else:
            extraction_method = 'markdown_structured'
    
    logger.info(f"Sheet '{sheet_name}' validation: type={sheet_type}, score={quality_score}, "
                f"hints={content_hints}, extraction={extraction_method}")
    
    # Convert numpy types to Python native types for JSON serialization
    return {
        'sheet_type': sheet_type,
        'quality_score': int(quality_score),
        'issues': issues,
        'recommendation': recommendation,
        'extraction_method': extraction_method,
        'content_hints': content_hints,
        'stats': {
            'unnamed_columns': int(unnamed_count),
            'empty_rows': int(empty_rows),
            'empty_columns': int(empty_cols),
            'sparsity': float(round(sparsity, 2)),
            'shape': f"{len(df)} rows × {len(df.columns)} columns"
        }
    }

Parameters

Name        Type           Default   Kind
df          pd.DataFrame   -         positional_or_keyword
sheet_name  str            -         positional_or_keyword

Parameter Details

df: A pandas DataFrame containing the sheet data to be validated. Expected to be loaded from an Excel sheet. Can contain any data types (numeric, text, mixed). The function handles empty DataFrames gracefully.

sheet_name: String representing the name of the Excel sheet being analyzed. Used for contextual analysis (e.g., detecting keywords like 'info', 'metadata', 'notes' that suggest informational content). Should be the original sheet name from the Excel file.
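
As a quick illustration of Check 6, the sheet-name test is a plain case-insensitive substring match against the keyword list taken verbatim from the source above:

info_keywords = ['info', 'metadata', 'notes', 'readme', 'protocol', 'schedule',
                 'description', 'legend', 'guide', 'instructions', 'summary']

any(k in 'Study Protocol v2'.lower() for k in info_keywords)   # True  -> -15 points, 'info_sheet_name' hint
any(k in 'Dose Response Data'.lower() for k in info_keywords)  # False -> no penalty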

Return Value

Type: Dict[str, Any]

Returns a dictionary with the following keys:

  • 'sheet_type' (str): 'tabular_data', 'mixed_format', or 'information_sheet'
  • 'quality_score' (int): 0-100, where higher scores indicate better tabular structure
  • 'issues' (list of str): human-readable descriptions of detected problems
  • 'recommendation' (str): 'suitable_for_analysis', 'requires_preprocessing', or 'extract_as_context'
  • 'extraction_method' (str): 'dataframe', 'selective_extraction', 'markdown_key_value', 'markdown_document', or 'markdown_structured'
  • 'content_hints' (list of str): tags describing content characteristics, such as 'free_form_layout', 'key_value_pairs', 'sparse_data'
  • 'stats' (dict): 'unnamed_columns', 'empty_rows', 'empty_columns', 'sparsity', and 'shape' metrics

All numeric values are converted to Python native types for JSON serialization.
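
For a key-value metadata sheet, the returned structure looks roughly like the following (the numbers are illustrative; the issue strings reproduce the wording used in the source):

{
    'sheet_type': 'information_sheet',
    'quality_score': 40,
    'issues': [
        '2 unnamed columns (100%) - likely info sheet with free-form content',
        'First column contains label-like text with colons - likely key-value metadata'
    ],
    'recommendation': 'extract_as_context',
    'extraction_method': 'markdown_key_value',
    'content_hints': ['free_form_layout', 'key_value_pairs'],
    'stats': {
        'unnamed_columns': 2,
        'empty_rows': 0,
        'empty_columns': 0,
        'sparsity': 0.25,
        'shape': '4 rows × 2 columns'
    }
}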

Dependencies

  • pandas
  • logging

Required Imports

import pandas as pd
import logging
from typing import Dict, Any

Usage Example

import pandas as pd
import logging
# validate_sheet_format is assumed importable from the module documented above;
# adjust the import path to match your package layout.
from smartstat_service import validate_sheet_format

# Setup logger
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Load Excel sheet
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')

# Validate sheet format
result = validate_sheet_format(df, 'Sheet1')

# Check results
print(f"Sheet Type: {result['sheet_type']}")
print(f"Quality Score: {result['quality_score']}")
print(f"Recommendation: {result['recommendation']}")
print(f"Extraction Method: {result['extraction_method']}")

# Handle based on sheet type
if result['sheet_type'] == 'tabular_data':
    # Process as regular data
    print("Suitable for analysis")
elif result['sheet_type'] == 'information_sheet':
    # Extract as context/metadata
    print(f"Extract using: {result['extraction_method']}")
else:
    # Mixed format - requires preprocessing
    print(f"Issues found: {result['issues']}")
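
A common follow-up is to run the check across every sheet of a workbook and route each one by its recommendation. A possible pattern (the file name and downstream handling are assumptions, not part of the service):

excel = pd.ExcelFile('study_workbook.xlsx')   # assumed example file

datasets, context_sheets = {}, {}
for name in excel.sheet_names:
    sheet_df = excel.parse(name)
    check = validate_sheet_format(sheet_df, name)
    if check['recommendation'] == 'suitable_for_analysis':
        datasets[name] = sheet_df                          # analyze directly
    elif check['recommendation'] == 'extract_as_context':
        context_sheets[name] = check['extraction_method']  # e.g. 'markdown_key_value'
    else:  # 'requires_preprocessing'
        print(f"{name}: needs preprocessing - {check['issues']}")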

Best Practices

  • A module-level logger must be defined in the function's module (e.g., logger = logging.getLogger(__name__)); otherwise the final logging statement raises a NameError
  • The function is designed for Excel sheets but works with any pandas DataFrame - consider the context when interpreting results
  • Quality scores below 50 indicate the sheet is likely not suitable for direct statistical analysis
  • Use the 'extraction_method' field to determine the appropriate parsing strategy for downstream processing (see the dispatch sketch after this list)
  • The 'content_hints' list provides granular insights that can be used for custom processing logic
  • Handle empty DataFrames appropriately - the function returns valid results but with limited analysis
  • The function performs read-only analysis and does not modify the input DataFrame
  • Consider the 'stats' dictionary for detailed metrics when debugging or logging sheet characteristics
  • Sheet names are case-insensitive when checking for information keywords
  • The function assumes standard Excel conventions (e.g., 'Unnamed' prefix for columns without headers)
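
One way to act on 'extraction_method' is a simple dispatch table. The handler functions below are hypothetical placeholders, not part of smartstat_service.py; only the five method names come from the function itself.

# Hypothetical handlers - implement to suit your pipeline.
def extract_as_dataframe(df): ...
def extract_selectively(df): ...
def extract_key_value_markdown(df): ...
def extract_document_markdown(df): ...
def extract_structured_markdown(df): ...

EXTRACTORS = {
    'dataframe': extract_as_dataframe,
    'selective_extraction': extract_selectively,
    'markdown_key_value': extract_key_value_markdown,
    'markdown_document': extract_document_markdown,
    'markdown_structured': extract_structured_markdown,
}

result = validate_sheet_format(df, 'Sheet1')
extracted = EXTRACTORS[result['extraction_method']](df)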

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_sheet_context 61.0% similar

    Extracts comprehensive text context from Excel DataFrame sheets that contain mixed structured and unstructured content, converting them into markdown-formatted text while preserving table structures, key-value pairs, and section headers.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function smartstat_select_sheet 59.3% similar

    Flask API endpoint that processes one or more Excel sheets from an uploaded file, validates them, categorizes them as datasets or information sheets, and adds them to a SmartStat analysis session.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function read_excel_file 57.7% similar

    Reads Excel files and returns either metadata for all sheets or detailed data for a specific sheet, including format validation, European decimal conversion, and rich metadata extraction.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function test_excel_output 57.1% similar

    A test function that validates Excel output generation for contract analysis results, ensuring new fields (third_party_emails, third_party_tax_ids) are properly formatted and persisted.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_excel_output.py
  • function validate_document_structure 54.7% similar

    Validates the structural integrity of a DOCX document by checking if it contains all required sections specified in the document type template configuration.

    From: /tf/active/vicechatdev/CDocs/utils/document_processor.py