function detect_table_boundaries
Detects distinct tables within a pandas DataFrame by identifying empty rows as table boundaries and returns metadata about each detected table region.
/tf/active/vicechatdev/vice_ai/smartstat_service.py
604 - 642
simple
Purpose
This function is designed to parse Excel sheets or DataFrames that contain multiple tables separated by empty rows. It identifies where each table starts and ends, enabling downstream processing of individual tables within a single sheet. Common use cases include processing complex Excel reports with multiple data sections, extracting structured data from multi-table layouts, and preparing data for statistical analysis where tables need to be processed independently.
Source Code
def detect_table_boundaries(df: pd.DataFrame) -> List[Dict]:
"""
Detect distinct tables within an Excel sheet based on empty rows and headers
Returns:
List of table regions with start_row, end_row, and detected headers
"""
tables = []
current_table_start = None
for idx in range(len(df)):
row = df.iloc[idx]
row_values = [str(v).strip() for v in row if pd.notna(v) and str(v).strip() != '']
if len(row_values) == 0:
# Empty row - potential table boundary
if current_table_start is not None:
# End current table
tables.append({
'start_row': current_table_start,
'end_row': idx - 1,
'type': 'table'
})
current_table_start = None
else:
# Non-empty row
if current_table_start is None:
# Start new table
current_table_start = idx
# Close final table if exists
if current_table_start is not None:
tables.append({
'start_row': current_table_start,
'end_row': len(df) - 1,
'type': 'table'
})
return tables
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
df |
pd.DataFrame | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame representing an Excel sheet or tabular data. The DataFrame should contain the raw data with potential empty rows separating distinct tables. The function will analyze each row to determine if it's empty (all NaN or whitespace values) or contains data. No specific column structure is required.
Return Value
Type: List[Dict]
Returns a List of dictionaries, where each dictionary represents a detected table region. Each dictionary contains: 'start_row' (int) - the zero-based index of the first row of the table, 'end_row' (int) - the zero-based index of the last row of the table (inclusive), and 'type' (str) - always set to 'table' to indicate the region type. If no tables are detected (all rows are empty), an empty list is returned. If the DataFrame has no empty rows, a single table spanning the entire DataFrame is returned.
Dependencies
pandas
Required Imports
import pandas as pd
from typing import List, Dict
Usage Example
import pandas as pd
from typing import List, Dict
# Create a sample DataFrame with multiple tables separated by empty rows
data = {
'A': ['Header1', 'Data1', 'Data2', None, 'Header2', 'Data3', 'Data4'],
'B': ['Header1B', 'Data1B', 'Data2B', None, 'Header2B', 'Data3B', 'Data4B']
}
df = pd.DataFrame(data)
# Detect table boundaries
tables = detect_table_boundaries(df)
# Process each detected table
for table in tables:
print(f"Table found from row {table['start_row']} to {table['end_row']}")
table_df = df.iloc[table['start_row']:table['end_row']+1]
print(table_df)
print()
# Expected output:
# Table found from row 0 to 2
# Table found from row 4 to 6
Best Practices
- Ensure the input DataFrame has been read with header=None if you want to detect headers within the data, otherwise pandas may interpret the first row as column names
- The function considers a row empty if all values are NaN or whitespace strings after stripping
- Use the returned start_row and end_row indices with df.iloc[start:end+1] to extract each table (note the +1 since end_row is inclusive)
- If your Excel file has merged cells or complex formatting, consider using openpyxl to read the file first to preserve structure
- The function does not validate or detect headers within tables - it only identifies table boundaries based on empty rows
- For DataFrames with no empty rows, the entire DataFrame will be returned as a single table
- Consider preprocessing the DataFrame to handle special cases like rows with only whitespace or formatting characters
Tags
Similar Components
AI-powered semantic similarity - components with related functionality:
-
function extract_table_as_markdown 62.8% similar
-
function extract_sheet_context 49.4% similar
-
function validate_sheet_format 48.5% similar
-
function get_database_tables_columns 43.8% similar
-
function add_table_to_word_v1 43.7% similar