function extract_sheet_context
Extracts comprehensive text context from Excel DataFrame sheets that contain mixed structured and unstructured content, converting them into markdown-formatted text while preserving table structures, key-value pairs, and section headers.
/tf/active/vicechatdev/vice_ai/smartstat_service.py
724 - 864
complex
Purpose
This function is designed to intelligently parse Excel sheets that don't follow standard tabular formats (like information sheets, metadata sheets, or mixed-content documents). It detects multiple tables within a single sheet, identifies key-value pairs, preserves section titles, and formats everything as markdown. This is particularly useful for extracting context from study information sheets, documentation pages, or metadata blocks that need to be processed by LLMs or stored as searchable text.
Source Code
def extract_sheet_context(df: pd.DataFrame, sheet_name: str) -> str:
"""
Extract comprehensive text context from information sheets preserving table structures
This function intelligently extracts structured and unstructured content from Excel sheets
that don't follow standard tabular data formats. It handles:
- Multiple distinct tables with proper markdown formatting
- Key-value pairs (e.g., "Study ID: 12345")
- Headers and section titles
- Mixed content with metadata blocks
Returns:
Rich markdown-formatted text with preserved table structures
"""
context_lines = [f"# Information Sheet: {sheet_name}\n"]
# Step 1: Detect table boundaries
tables = detect_table_boundaries(df)
if len(tables) > 1:
# Multiple tables detected - extract each separately
context_lines.append("*This sheet contains multiple structured tables:*\n")
for i, table in enumerate(tables):
start_row = table['start_row']
end_row = table['end_row']
# Look for section title above the table
section_title = None
if start_row > 0:
# Check if row before table start might be a title
title_row = df.iloc[start_row - 1]
title_values = [str(v).strip() for v in title_row if pd.notna(v) and str(v).strip() != '']
if len(title_values) == 1 and len(title_values[0]) < 100:
section_title = title_values[0]
elif len(title_values) > 0:
# Multiple values, might be a compound title
section_title = " - ".join(title_values)
# Add section header
if section_title:
context_lines.append(f"\n## {section_title}\n")
else:
context_lines.append(f"\n## Table {i + 1}\n")
# Extract table as markdown
table_markdown = extract_table_as_markdown(df, start_row, end_row)
if table_markdown:
context_lines.append(table_markdown)
context_lines.append("") # Empty line after table
else:
# Fallback to text extraction for this section
section_df = df.iloc[start_row:end_row + 1]
for _, row in section_df.iterrows():
row_values = [str(v).strip() for v in row if pd.notna(v) and str(v).strip() != '']
if row_values:
context_lines.append("- " + " | ".join(row_values))
context_lines.append("")
else:
# Single table or mixed content - use enhanced extraction
# Strategy 1: Try to extract as a single table
if len(tables) == 1:
table = tables[0]
table_markdown = extract_table_as_markdown(df, table['start_row'], table['end_row'])
if table_markdown and len(table_markdown.split('\n')) > 3: # Has meaningful content
context_lines.append("\n## Study Information\n")
context_lines.append(table_markdown)
context_lines.append("")
else:
# Table extraction failed, fall back to key-value extraction
tables = []
# Strategy 2: Key-value pairs and text blocks (if not a clean table)
if len(tables) == 0:
kv_pairs = []
text_blocks = []
for idx, row in df.iterrows():
row_values = [str(v).strip() for v in row if pd.notna(v) and str(v).strip() != '']
if len(row_values) == 0:
if len(text_blocks) > 0 and text_blocks[-1] != '\n':
text_blocks.append('\n')
continue
# Check if this row is a key-value pair
if len(row_values) == 2:
key, value = row_values[0], row_values[1]
if len(key) < 100 and not key.replace('.', '').replace(',', '').isdigit():
kv_pairs.append(f"**{key}**: {value}")
continue
# Check if first value is a header/label
if len(row_values[0]) < 50 and len(row_values) > 1:
label = row_values[0]
values = ', '.join(row_values[1:])
text_blocks.append(f"**{label}**: {values}")
else:
line = ' | '.join(row_values)
text_blocks.append(line)
# Add key-value pairs
if kv_pairs:
context_lines.append("\n## Key Information\n")
context_lines.extend(kv_pairs[:100])
# Add other content
if text_blocks:
context_lines.append("\n## Additional Details\n")
paragraph = []
for line in text_blocks[:200]:
if line == '\n':
if paragraph:
context_lines.append(' '.join(paragraph))
paragraph = []
context_lines.append('\n')
else:
if len(line) < 150 and not line.endswith(('.', '!', '?', ':')):
paragraph.append(line)
else:
if paragraph:
paragraph.append(line)
context_lines.append(' '.join(paragraph))
paragraph = []
else:
context_lines.append(line)
if paragraph:
context_lines.append(' '.join(paragraph))
full_context = '\n'.join(context_lines)
# Add metadata footer
full_context += f"\n\n---\n*Extracted from {len(df)} rows × {len(df.columns)} columns*"
logger.info(f"Extracted {len(full_context)} characters of context from sheet '{sheet_name}' "
f"with {len(tables)} tables detected")
return full_context
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
df |
pd.DataFrame | - | positional_or_keyword |
sheet_name |
str | - | positional_or_keyword |
Parameter Details
df: A pandas DataFrame representing the Excel sheet content. Should contain the raw data from the sheet including headers, metadata, tables, and text blocks. Can have mixed content types and irregular structures.
sheet_name: A string representing the name of the Excel sheet being processed. Used for labeling the output markdown and logging purposes. Should be a descriptive name that identifies the source sheet.
Return Value
Type: str
Returns a string containing markdown-formatted text that represents the extracted content from the DataFrame. The output includes: a main header with the sheet name, section headers for different content blocks, markdown tables for structured data, key-value pairs formatted with bold labels, additional text content organized into paragraphs, and a metadata footer showing the original dimensions (rows × columns). The length and structure vary based on the input content but is limited to prevent excessive output (max 100 key-value pairs, 200 text blocks).
Dependencies
pandaslogging
Required Imports
import pandas as pd
import logging
Usage Example
import pandas as pd
import logging
# Setup logger
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)
# Assume helper functions are defined
def detect_table_boundaries(df):
# Mock implementation
return [{'start_row': 0, 'end_row': len(df)-1}]
def extract_table_as_markdown(df, start_row, end_row):
# Mock implementation
return df.iloc[start_row:end_row+1].to_markdown(index=False)
# Load Excel sheet into DataFrame
df = pd.read_excel('study_info.xlsx', sheet_name='Metadata')
# Extract context
context = extract_sheet_context(df, 'Metadata')
# Use the extracted context
print(context)
# Output will be markdown-formatted text with tables, key-value pairs, and sections
# Can be used for LLM prompts, documentation generation, or search indexing
with open('extracted_context.md', 'w') as f:
f.write(context)
Best Practices
- Ensure the DataFrame is properly loaded from Excel with appropriate data types and no pre-processing that might remove important structural information
- The function depends on 'detect_table_boundaries' and 'extract_table_as_markdown' helper functions - ensure these are implemented and available in scope
- Configure logging before calling this function to capture extraction statistics and debugging information
- Be aware of the output size limits: maximum 100 key-value pairs and 200 text blocks are extracted to prevent excessive output
- The function works best with sheets that have clear structural patterns (tables, key-value pairs, sections) rather than completely unstructured text
- For sheets with multiple tables, ensure there are clear boundaries (empty rows or distinct headers) between tables for accurate detection
- The returned markdown can be quite large for complex sheets - consider implementing additional filtering or summarization if needed for LLM context windows
- Test with sample sheets first to verify the extraction quality matches your expectations, as the heuristics may need adjustment for specific Excel formats
Tags
Similar Components
AI-powered semantic similarity - components with related functionality:
-
function extract_table_as_markdown 63.3% similar
-
function validate_sheet_format 61.0% similar
-
function extract_excel_metadata 59.2% similar
-
function process_markdown_content_v1 56.0% similar
-
function process_markdown_content 56.0% similar