🔍 Code Extractor

function extract_sheet_context

Maturity: 64

Extracts comprehensive text context from Excel DataFrame sheets that contain mixed structured and unstructured content, converting them into markdown-formatted text while preserving table structures, key-value pairs, and section headers.

File:
/tf/active/vicechatdev/vice_ai/smartstat_service.py
Lines:
724 - 864
Complexity:
complex

Purpose

This function is designed to intelligently parse Excel sheets that don't follow standard tabular formats (like information sheets, metadata sheets, or mixed-content documents). It detects multiple tables within a single sheet, identifies key-value pairs, preserves section titles, and formats everything as markdown. This is particularly useful for extracting context from study information sheets, documentation pages, or metadata blocks that need to be processed by LLMs or stored as searchable text.

Source Code

def extract_sheet_context(df: pd.DataFrame, sheet_name: str) -> str:
    """
    Extract comprehensive text context from information sheets preserving table structures
    
    This function intelligently extracts structured and unstructured content from Excel sheets
    that don't follow standard tabular data formats. It handles:
    - Multiple distinct tables with proper markdown formatting
    - Key-value pairs (e.g., "Study ID: 12345") 
    - Headers and section titles
    - Mixed content with metadata blocks
    
    Returns:
        Rich markdown-formatted text with preserved table structures
    """
    context_lines = [f"# Information Sheet: {sheet_name}\n"]
    
    # Step 1: Detect table boundaries
    tables = detect_table_boundaries(df)
    
    if len(tables) > 1:
        # Multiple tables detected - extract each separately
        context_lines.append("*This sheet contains multiple structured tables:*\n")
        
        for i, table in enumerate(tables):
            start_row = table['start_row']
            end_row = table['end_row']
            
            # Look for section title above the table
            section_title = None
            if start_row > 0:
                # Check if row before table start might be a title
                title_row = df.iloc[start_row - 1]
                title_values = [str(v).strip() for v in title_row if pd.notna(v) and str(v).strip() != '']
                if len(title_values) == 1 and len(title_values[0]) < 100:
                    section_title = title_values[0]
                elif len(title_values) > 0:
                    # Multiple values, might be a compound title
                    section_title = " - ".join(title_values)
            
            # Add section header
            if section_title:
                context_lines.append(f"\n## {section_title}\n")
            else:
                context_lines.append(f"\n## Table {i + 1}\n")
            
            # Extract table as markdown
            table_markdown = extract_table_as_markdown(df, start_row, end_row)
            if table_markdown:
                context_lines.append(table_markdown)
                context_lines.append("")  # Empty line after table
            else:
                # Fallback to text extraction for this section
                section_df = df.iloc[start_row:end_row + 1]
                for _, row in section_df.iterrows():
                    row_values = [str(v).strip() for v in row if pd.notna(v) and str(v).strip() != '']
                    if row_values:
                        context_lines.append("- " + " | ".join(row_values))
                context_lines.append("")
    
    else:
        # Single table or mixed content - use enhanced extraction
        
        # Strategy 1: Try to extract as a single table
        if len(tables) == 1:
            table = tables[0]
            table_markdown = extract_table_as_markdown(df, table['start_row'], table['end_row'])
            if table_markdown and len(table_markdown.split('\n')) > 3:  # Has meaningful content
                context_lines.append("\n## Study Information\n")
                context_lines.append(table_markdown)
                context_lines.append("")
            else:
                # Table extraction failed, fall back to key-value extraction
                tables = []
        
        # Strategy 2: Key-value pairs and text blocks (if not a clean table)
        if len(tables) == 0:
            kv_pairs = []
            text_blocks = []
            
            for idx, row in df.iterrows():
                row_values = [str(v).strip() for v in row if pd.notna(v) and str(v).strip() != '']
                
                if len(row_values) == 0:
                    if len(text_blocks) > 0 and text_blocks[-1] != '\n':
                        text_blocks.append('\n')
                    continue
                
                # Check if this row is a key-value pair
                if len(row_values) == 2:
                    key, value = row_values[0], row_values[1]
                    if len(key) < 100 and not key.replace('.', '').replace(',', '').isdigit():
                        kv_pairs.append(f"**{key}**: {value}")
                        continue
                
                # Check if first value is a header/label
                if len(row_values[0]) < 50 and len(row_values) > 1:
                    label = row_values[0]
                    values = ', '.join(row_values[1:])
                    text_blocks.append(f"**{label}**: {values}")
                else:
                    line = ' | '.join(row_values)
                    text_blocks.append(line)
            
            # Add key-value pairs
            if kv_pairs:
                context_lines.append("\n## Key Information\n")
                context_lines.extend(kv_pairs[:100])
            
            # Add other content
            if text_blocks:
                context_lines.append("\n## Additional Details\n")
                paragraph = []
                for line in text_blocks[:200]:
                    if line == '\n':
                        if paragraph:
                            context_lines.append(' '.join(paragraph))
                            paragraph = []
                        context_lines.append('\n')
                    else:
                        if len(line) < 150 and not line.endswith(('.', '!', '?', ':')):
                            paragraph.append(line)
                        else:
                            if paragraph:
                                paragraph.append(line)
                                context_lines.append(' '.join(paragraph))
                                paragraph = []
                            else:
                                context_lines.append(line)
                
                if paragraph:
                    context_lines.append(' '.join(paragraph))
    
    full_context = '\n'.join(context_lines)
    
    # Add metadata footer
    full_context += f"\n\n---\n*Extracted from {len(df)} rows × {len(df.columns)} columns*"
    
    logger.info(f"Extracted {len(full_context)} characters of context from sheet '{sheet_name}' "
                f"with {len(tables)} tables detected")
    
    return full_context

Parameters

Name Type Default Kind
df pd.DataFrame - positional_or_keyword
sheet_name str - positional_or_keyword

Parameter Details

df: A pandas DataFrame representing the Excel sheet content. Should contain the raw data from the sheet including headers, metadata, tables, and text blocks. Can have mixed content types and irregular structures.

sheet_name: A string representing the name of the Excel sheet being processed. Used for labeling the output markdown and logging purposes. Should be a descriptive name that identifies the source sheet.

Return Value

Type: str

Returns a string containing markdown-formatted text that represents the extracted content from the DataFrame. The output includes: a main header with the sheet name, section headers for different content blocks, markdown tables for structured data, key-value pairs formatted with bold labels, additional text content organized into paragraphs, and a metadata footer showing the original dimensions (rows × columns). The length and structure vary based on the input content but is limited to prevent excessive output (max 100 key-value pairs, 200 text blocks).

Dependencies

  • pandas
  • logging

Required Imports

import pandas as pd
import logging

Usage Example

import pandas as pd
import logging

# Setup logger
logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

# Assume helper functions are defined
def detect_table_boundaries(df):
    # Mock implementation
    return [{'start_row': 0, 'end_row': len(df)-1}]

def extract_table_as_markdown(df, start_row, end_row):
    # Mock implementation
    return df.iloc[start_row:end_row+1].to_markdown(index=False)

# Load Excel sheet into DataFrame
df = pd.read_excel('study_info.xlsx', sheet_name='Metadata')

# Extract context
context = extract_sheet_context(df, 'Metadata')

# Use the extracted context
print(context)
# Output will be markdown-formatted text with tables, key-value pairs, and sections

# Can be used for LLM prompts, documentation generation, or search indexing
with open('extracted_context.md', 'w') as f:
    f.write(context)

Best Practices

  • Ensure the DataFrame is properly loaded from Excel with appropriate data types and no pre-processing that might remove important structural information
  • The function depends on 'detect_table_boundaries' and 'extract_table_as_markdown' helper functions - ensure these are implemented and available in scope
  • Configure logging before calling this function to capture extraction statistics and debugging information
  • Be aware of the output size limits: maximum 100 key-value pairs and 200 text blocks are extracted to prevent excessive output
  • The function works best with sheets that have clear structural patterns (tables, key-value pairs, sections) rather than completely unstructured text
  • For sheets with multiple tables, ensure there are clear boundaries (empty rows or distinct headers) between tables for accurate detection
  • The returned markdown can be quite large for complex sheets - consider implementing additional filtering or summarization if needed for LLM context windows
  • Test with sample sheets first to verify the extraction quality matches your expectations, as the heuristics may need adjustment for specific Excel formats

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_table_as_markdown 63.3% similar

    Extracts a specified row range from a pandas DataFrame and converts it into a properly formatted markdown table with automatic header detection and data cleaning.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function validate_sheet_format 61.0% similar

    Analyzes Excel sheet structure using multiple heuristics to classify it as tabular data, information sheet, or mixed format, returning quality metrics and extraction recommendations.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function extract_excel_metadata 59.2% similar

    Extracts comprehensive metadata from Excel files including cell comments, merged regions, named ranges, document properties, and sheet-level information that standard pandas operations miss.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function process_markdown_content_v1 56.0% similar

    Parses markdown-formatted text content and converts it into a structured list of document elements (headers, paragraphs, lists, tables, code blocks) with their types and formatting preserved in original order.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function process_markdown_content 56.0% similar

    Parses markdown-formatted text content and converts it into a structured list of content elements with type annotations and formatting metadata suitable for document export.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
← Back to Browse