🔍 Code Extractor

function extract_table_as_markdown

Maturity: 56

Extracts a specified row range from a pandas DataFrame and converts it into a properly formatted markdown table with automatic header detection and data cleaning.

File:
/tf/active/vicechatdev/vice_ai/smartstat_service.py
Lines:
645 - 721
Complexity:
moderate

Purpose

This function is designed to convert tabular data from a DataFrame into markdown format, making it suitable for documentation, reports, or display in markdown-compatible environments. It intelligently detects header rows using heuristics, removes empty columns, cleans data by escaping special characters, and ensures proper markdown table formatting with aligned columns and separators.

Source Code

def extract_table_as_markdown(df: pd.DataFrame, start_row: int, end_row: int) -> str:
    """
    Extract a table region and format as proper markdown table
    """
    # Extract the table region
    table_df = df.iloc[start_row:end_row + 1].copy()
    
    # Remove completely empty columns
    table_df = table_df.dropna(axis=1, how='all')
    
    if len(table_df) == 0 or len(table_df.columns) == 0:
        return ""
    
    # Clean up the data
    table_df = table_df.fillna('')
    
    # Convert all columns to string and clean
    for col in table_df.columns:
        table_df[col] = table_df[col].astype(str).str.strip()
    
    # Detect header row(s) - look for rows that look like headers
    header_candidates = []
    for idx in range(min(3, len(table_df))):  # Check first 3 rows max
        row_values = table_df.iloc[idx].values
        non_empty = [v for v in row_values if v != '']
        
        # Heuristics for header detection
        if len(non_empty) >= 2:  # Has multiple columns with content
            # Check if values look like headers (short, descriptive text)
            header_like = sum(1 for v in non_empty 
                            if len(v) < 50 and not v.replace('.', '').replace(',', '').isdigit())
            if header_like >= len(non_empty) * 0.7:  # 70% of values look like headers
                header_candidates.append(idx)
    
    # Use first detected header, or row 0 as fallback
    header_row = header_candidates[0] if header_candidates else 0
    
    # Split into header and data
    if header_row < len(table_df):
        headers = table_df.iloc[header_row].values.tolist()
        data_rows = table_df.iloc[header_row + 1:] if header_row + 1 < len(table_df) else pd.DataFrame()
    else:
        # No clear header, use column indices
        headers = [f"Col{i+1}" for i in range(len(table_df.columns))]
        data_rows = table_df
    
    # Clean headers
    headers = [str(h).strip() if str(h).strip() else f"Col{i+1}" for i, h in enumerate(headers)]
    
    # Build markdown table
    markdown_lines = []
    
    # Header row
    markdown_lines.append("| " + " | ".join(headers) + " |")
    
    # Separator row
    markdown_lines.append("| " + " | ".join(["---"] * len(headers)) + " |")
    
    # Data rows
    for _, row in data_rows.iterrows():
        values = row.values.tolist()
        # Pad or truncate to match header count
        while len(values) < len(headers):
            values.append("")
        values = values[:len(headers)]
        
        # Clean values for markdown
        cleaned_values = []
        for v in values:
            v_str = str(v).strip()
            # Escape markdown special characters
            v_str = v_str.replace("|", "\\|").replace("\n", " ").replace("\r", "")
            cleaned_values.append(v_str)
        
        markdown_lines.append("| " + " | ".join(cleaned_values) + " |")
    
    return "\n".join(markdown_lines)

Parameters

Name Type Default Kind
df pd.DataFrame - positional_or_keyword
start_row int - positional_or_keyword
end_row int - positional_or_keyword

Parameter Details

df: A pandas DataFrame containing the source data from which the table will be extracted. Can contain any data types as they will be converted to strings.

start_row: Integer index (0-based) indicating the first row to include in the extracted table. Must be a valid row index within the DataFrame.

end_row: Integer index (0-based) indicating the last row to include in the extracted table (inclusive). Must be greater than or equal to start_row and be a valid row index within the DataFrame.

Return Value

Type: str

Returns a string containing the markdown-formatted table with pipe-delimited columns, header row, separator row with dashes, and data rows. Returns an empty string if the extracted region has no data or columns after cleaning. Special characters like pipes, newlines, and carriage returns are escaped or removed to maintain markdown integrity.

Dependencies

  • pandas

Required Imports

import pandas as pd

Usage Example

import pandas as pd

# Create sample DataFrame
data = {
    'Name': ['Product', 'Item A', 'Item B', 'Item C'],
    'Price': ['Cost', '10.50', '25.00', '15.75'],
    'Quantity': ['Qty', '100', '50', '75']
}
df = pd.DataFrame(data)

# Extract rows 0-3 as markdown table
markdown_table = extract_table_as_markdown(df, start_row=0, end_row=3)
print(markdown_table)

# Output:
# | Product | Cost | Qty |
# | --- | --- | --- |
# | Item A | 10.50 | 100 |
# | Item B | 25.00 | 50 |
# | Item C | 15.75 | 75 |

Best Practices

  • Ensure start_row and end_row are valid indices within the DataFrame bounds to avoid IndexError
  • The function automatically detects headers by analyzing the first 3 rows; if your data has headers beyond row 3, they may not be detected
  • Empty columns are automatically removed, which may result in fewer columns than expected
  • The function escapes pipe characters (|) in cell values to prevent breaking the markdown table structure
  • Numeric data is converted to strings, so formatting control should be done before passing the DataFrame
  • For large tables, consider the readability of the resulting markdown as very wide tables may not display well
  • The header detection heuristic assumes headers are short text (< 50 characters) and not purely numeric; adjust data accordingly if this doesn't match your use case

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_sheet_context 63.3% similar

    Extracts comprehensive text context from Excel DataFrame sheets that contain mixed structured and unstructured content, converting them into markdown-formatted text while preserving table structures, key-value pairs, and section headers.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function detect_table_boundaries 62.8% similar

    Detects distinct tables within a pandas DataFrame by identifying empty rows as table boundaries and returns metadata about each detected table region.

    From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
  • function simple_markdown_to_html 58.3% similar

    Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function main_v16 57.9% similar

    Converts a markdown file containing warranty disclosure data into multiple tabular formats (CSV, Excel, Word) with timestamped output files.

    From: /tf/active/vicechatdev/convert_disclosures_to_table.py
  • function markdown_to_html 57.6% similar

    Converts Markdown formatted text to HTML using the python-markdown library with multiple extensions, falling back to basic conversion if the library is unavailable.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
← Back to Browse