🔍 Code Extractor

function detect_table_boundaries

Maturity: 54

Detects distinct tables within a pandas DataFrame by identifying empty rows as table boundaries and returns metadata about each detected table region.

File:
/tf/active/vicechatdev/vice_ai/smartstat_service.py
Lines:
604 - 642
Complexity:
simple

Purpose

This function is designed to parse Excel sheets or DataFrames that contain multiple tables separated by empty rows. It identifies where each table starts and ends, enabling downstream processing of individual tables within a single sheet. Common use cases include processing complex Excel reports with multiple data sections, extracting structured data from multi-table layouts, and preparing data for statistical analysis where tables need to be processed independently.

Source Code

def detect_table_boundaries(df: pd.DataFrame) -> List[Dict]:
    """
    Detect distinct tables within an Excel sheet based on empty rows and headers
    
    Returns:
        List of table regions with start_row, end_row, and detected headers
    """
    tables = []
    current_table_start = None
    
    for idx in range(len(df)):
        row = df.iloc[idx]
        row_values = [str(v).strip() for v in row if pd.notna(v) and str(v).strip() != '']
        
        if len(row_values) == 0:
            # Empty row - potential table boundary
            if current_table_start is not None:
                # End current table
                tables.append({
                    'start_row': current_table_start,
                    'end_row': idx - 1,
                    'type': 'table'
                })
                current_table_start = None
        else:
            # Non-empty row
            if current_table_start is None:
                # Start new table
                current_table_start = idx
    
    # Close final table if exists
    if current_table_start is not None:
        tables.append({
            'start_row': current_table_start,
            'end_row': len(df) - 1,
            'type': 'table'
        })
    
    return tables

Parameters

Name Type Default Kind
df pd.DataFrame - positional_or_keyword

Parameter Details

df: A pandas DataFrame representing an Excel sheet or tabular data. The DataFrame should contain the raw data with potential empty rows separating distinct tables. The function will analyze each row to determine if it's empty (all NaN or whitespace values) or contains data. No specific column structure is required.

Return Value

Type: List[Dict]

Returns a List of dictionaries, where each dictionary represents a detected table region. Each dictionary contains: 'start_row' (int) - the zero-based index of the first row of the table, 'end_row' (int) - the zero-based index of the last row of the table (inclusive), and 'type' (str) - always set to 'table' to indicate the region type. If no tables are detected (all rows are empty), an empty list is returned. If the DataFrame has no empty rows, a single table spanning the entire DataFrame is returned.

Dependencies

  • pandas

Required Imports

import pandas as pd
from typing import List, Dict

Usage Example

import pandas as pd
from typing import List, Dict

# Create a sample DataFrame with multiple tables separated by empty rows
data = {
    'A': ['Header1', 'Data1', 'Data2', None, 'Header2', 'Data3', 'Data4'],
    'B': ['Header1B', 'Data1B', 'Data2B', None, 'Header2B', 'Data3B', 'Data4B']
}
df = pd.DataFrame(data)

# Detect table boundaries
tables = detect_table_boundaries(df)

# Process each detected table
for table in tables:
    print(f"Table found from row {table['start_row']} to {table['end_row']}")
    table_df = df.iloc[table['start_row']:table['end_row']+1]
    print(table_df)
    print()

# Expected output:
# Table found from row 0 to 2
# Table found from row 4 to 6

Best Practices

  • Ensure the input DataFrame has been read with header=None if you want to detect headers within the data, otherwise pandas may interpret the first row as column names
  • The function considers a row empty if all values are NaN or whitespace strings after stripping
  • Use the returned start_row and end_row indices with df.iloc[start:end+1] to extract each table (note the +1 since end_row is inclusive)
  • If your Excel file has merged cells or complex formatting, consider using openpyxl to read the file first to preserve structure
  • The function does not validate or detect headers within tables - it only identifies table boundaries based on empty rows
  • For DataFrames with no empty rows, the entire DataFrame will be returned as a single table
  • Consider preprocessing the DataFrame to handle special cases like rows with only whitespace or formatting characters

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_table_as_markdown 62.8% similar

    Extracts a specified row range from a pandas DataFrame and converts it into a properly formatted markdown table with automatic header detection and data cleaning.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function extract_sheet_context 49.4% similar

    Extracts comprehensive text context from Excel DataFrame sheets that contain mixed structured and unstructured content, converting them into markdown-formatted text while preserving table structures, key-value pairs, and section headers.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function validate_sheet_format 48.5% similar

    Analyzes Excel sheet structure using multiple heuristics to classify it as tabular data, information sheet, or mixed format, returning quality metrics and extraction recommendations.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function get_database_tables_columns 43.8% similar

    Flask route handler that retrieves database schema information including tables, columns, and relationships, filtered and sorted by relevance for data analysis workflows.

    From: /tf/active/vicechatdev/full_smartstat/app.py
  • function add_table_to_word_v1 43.7% similar

    Adds a formatted table to a Microsoft Word document using the python-docx library, with automatic column detection, header row styling, and debug logging.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
← Back to Browse