extract_table_as_markdown - Code Extractor

function extract_table_as_markdown

Maturity: 56

Extracts a specified row range from a pandas DataFrame and converts it into a properly formatted markdown table with automatic header detection and data cleaning.

File:
/tf/active/vicechatdev/vice_ai/smartstat_service.py

Lines:
645 - 721

Complexity:
moderate

Purpose

This function is designed to convert tabular data from a DataFrame into markdown format, making it suitable for documentation, reports, or display in markdown-compatible environments. It intelligently detects header rows using heuristics, removes empty columns, cleans data by escaping special characters, and ensures proper markdown table formatting with aligned columns and separators.

Source Code

def extract_table_as_markdown(df: pd.DataFrame, start_row: int, end_row: int) -> str:
    """
    Extract a table region and format as proper markdown table
    """
    # Extract the table region
    table_df = df.iloc[start_row:end_row + 1].copy()
    
    # Remove completely empty columns
    table_df = table_df.dropna(axis=1, how='all')
    
    if len(table_df) == 0 or len(table_df.columns) == 0:
        return ""
    
    # Clean up the data
    table_df = table_df.fillna('')
    
    # Convert all columns to string and clean
    for col in table_df.columns:
        table_df[col] = table_df[col].astype(str).str.strip()
    
    # Detect header row(s) - look for rows that look like headers
    header_candidates = []
    for idx in range(min(3, len(table_df))):  # Check first 3 rows max
        row_values = table_df.iloc[idx].values
        non_empty = [v for v in row_values if v != '']
        
        # Heuristics for header detection
        if len(non_empty) >= 2:  # Has multiple columns with content
            # Check if values look like headers (short, descriptive text)
            header_like = sum(1 for v in non_empty 
                            if len(v) < 50 and not v.replace('.', '').replace(',', '').isdigit())
            if header_like >= len(non_empty) * 0.7:  # 70% of values look like headers
                header_candidates.append(idx)
    
    # Use first detected header, or row 0 as fallback
    header_row = header_candidates[0] if header_candidates else 0
    
    # Split into header and data
    if header_row < len(table_df):
        headers = table_df.iloc[header_row].values.tolist()
        data_rows = table_df.iloc[header_row + 1:] if header_row + 1 < len(table_df) else pd.DataFrame()
    else:
        # No clear header, use column indices
        headers = [f"Col{i+1}" for i in range(len(table_df.columns))]
        data_rows = table_df
    
    # Clean headers
    headers = [str(h).strip() if str(h).strip() else f"Col{i+1}" for i, h in enumerate(headers)]
    
    # Build markdown table
    markdown_lines = []
    
    # Header row
    markdown_lines.append("| " + " | ".join(headers) + " |")
    
    # Separator row
    markdown_lines.append("| " + " | ".join(["---"] * len(headers)) + " |")
    
    # Data rows
    for _, row in data_rows.iterrows():
        values = row.values.tolist()
        # Pad or truncate to match header count
        while len(values) < len(headers):
            values.append("")
        values = values[:len(headers)]
        
        # Clean values for markdown
        cleaned_values = []
        for v in values:
            v_str = str(v).strip()
            # Escape markdown special characters
            v_str = v_str.replace("|", "\\|").replace("\n", " ").replace("\r", "")
            cleaned_values.append(v_str)
        
        markdown_lines.append("| " + " | ".join(cleaned_values) + " |")
    
    return "\n".join(markdown_lines)

Parameters

Name	Type	Default	Kind
`df`	pd.DataFrame	-	positional_or_keyword
`start_row`	int	-	positional_or_keyword
`end_row`	int	-	positional_or_keyword

Parameter Details

df: A pandas DataFrame containing the source data from which the table will be extracted. Can contain any data types as they will be converted to strings.

start_row: Integer index (0-based) indicating the first row to include in the extracted table. Must be a valid row index within the DataFrame.

end_row: Integer index (0-based) indicating the last row to include in the extracted table (inclusive). Must be greater than or equal to start_row and be a valid row index within the DataFrame.

Return Value

Type: str

Returns a string containing the markdown-formatted table with pipe-delimited columns, header row, separator row with dashes, and data rows. Returns an empty string if the extracted region has no data or columns after cleaning. Special characters like pipes, newlines, and carriage returns are escaped or removed to maintain markdown integrity.

Dependencies

pandas

Required Imports

import pandas as pd

Usage Example

import pandas as pd

# Create sample DataFrame
data = {
    'Name': ['Product', 'Item A', 'Item B', 'Item C'],
    'Price': ['Cost', '10.50', '25.00', '15.75'],
    'Quantity': ['Qty', '100', '50', '75']
}
df = pd.DataFrame(data)

# Extract rows 0-3 as markdown table
markdown_table = extract_table_as_markdown(df, start_row=0, end_row=3)
print(markdown_table)

# Output:
# | Product | Cost | Qty |
# | --- | --- | --- |
# | Item A | 10.50 | 100 |
# | Item B | 25.00 | 50 |
# | Item C | 15.75 | 75 |

Best Practices

Ensure start_row and end_row are valid indices within the DataFrame bounds to avoid IndexError
The function automatically detects headers by analyzing the first 3 rows; if your data has headers beyond row 3, they may not be detected
Empty columns are automatically removed, which may result in fewer columns than expected
The function escapes pipe characters (|) in cell values to prevent breaking the markdown table structure
Numeric data is converted to strings, so formatting control should be done before passing the DataFrame
For large tables, consider the readability of the resulting markdown as very wide tables may not display well
The header detection heuristic assumes headers are short text (< 50 characters) and not purely numeric; adjust data accordingly if this doesn't match your use case

Similar Components

AI-powered semantic similarity - components with related functionality:

function extract_sheet_context 63.3% similar

Extracts comprehensive text context from Excel DataFrame sheets that contain mixed structured and unstructured content, converting them into markdown-formatted text while preserving table structures, key-value pairs, and section headers.
From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
function detect_table_boundaries 62.8% similar

Detects distinct tables within a pandas DataFrame by identifying empty rows as table boundaries and returns metadata about each detected table region.
From: /tf/active/vicechatdev/vice_ai/smartstat_service.py
function simple_markdown_to_html 58.3% similar

Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.
From: /tf/active/vicechatdev/vice_ai/new_app.py
function main_v16 57.9% similar

Converts a markdown file containing warranty disclosure data into multiple tabular formats (CSV, Excel, Word) with timestamped output files.
From: /tf/active/vicechatdev/convert_disclosures_to_table.py
function markdown_to_html 57.6% similar

Converts Markdown formatted text to HTML using the python-markdown library with multiple extensions, falling back to basic conversion if the library is unavailable.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

← Back to Browse

Assistant

Hi! I can help improve this code. Tell me what you'd like to enhance (e.g., "add error handling", "optimize performance", "improve readability", "add type hints").

Code Comparison

Original Code

                            def extract_table_as_markdown(df: pd.DataFrame, start_row: int, end_row: int) -> str:
    """
    Extract a table region and format as proper markdown table
    """
    # Extract the table region
    table_df = df.iloc[start_row:end_row + 1].copy()
    
    # Remove completely empty columns
    table_df = table_df.dropna(axis=1, how='all')
    
    if len(table_df) == 0 or len(table_df.columns) == 0:
        return ""
    
    # Clean up the data
    table_df = table_df.fillna('')
    
    # Convert all columns to string and clean
    for col in table_df.columns:
        table_df[col] = table_df[col].astype(str).str.strip()
    
    # Detect header row(s) - look for rows that look like headers
    header_candidates = []
    for idx in range(min(3, len(table_df))):  # Check first 3 rows max
        row_values = table_df.iloc[idx].values
        non_empty = [v for v in row_values if v != '']
        
        # Heuristics for header detection
        if len(non_empty) >= 2:  # Has multiple columns with content
            # Check if values look like headers (short, descriptive text)
            header_like = sum(1 for v in non_empty 
                            if len(v) < 50 and not v.replace('.', '').replace(',', '').isdigit())
            if header_like >= len(non_empty) * 0.7:  # 70% of values look like headers
                header_candidates.append(idx)
    
    # Use first detected header, or row 0 as fallback
    header_row = header_candidates[0] if header_candidates else 0
    
    # Split into header and data
    if header_row < len(table_df):
        headers = table_df.iloc[header_row].values.tolist()
        data_rows = table_df.iloc[header_row + 1:] if header_row + 1 < len(table_df) else pd.DataFrame()
    else:
        # No clear header, use column indices
        headers = [f"Col{i+1}" for i in range(len(table_df.columns))]
        data_rows = table_df
    
    # Clean headers
    headers = [str(h).strip() if str(h).strip() else f"Col{i+1}" for i, h in enumerate(headers)]
    
    # Build markdown table
    markdown_lines = []
    
    # Header row
    markdown_lines.append("| " + " | ".join(headers) + " |")
    
    # Separator row
    markdown_lines.append("| " + " | ".join(["---"] * len(headers)) + " |")
    
    # Data rows
    for _, row in data_rows.iterrows():
        values = row.values.tolist()
        # Pad or truncate to match header count
        while len(values) < len(headers):
            values.append("")
        values = values[:len(headers)]
        
        # Clean values for markdown
        cleaned_values = []
        for v in values:
            v_str = str(v).strip()
            # Escape markdown special characters
            v_str = v_str.replace("|", "\\|").replace("\n", " ").replace("\r", "")
            cleaned_values.append(v_str)
        
        markdown_lines.append("| " + " | ".join(cleaned_values) + " |")
    
    return "\n".join(markdown_lines)
                        

Improved Code

🔍 Code Extractor

function extract_table_as_markdown

Purpose

Source Code

Parameters

Parameter Details

Return Value

Dependencies

Required Imports

Usage Example

Best Practices

Tags

Similar Components

function extract_sheet_context 63.3% similar

function detect_table_boundaries 62.8% similar

function simple_markdown_to_html 58.3% similar

function main_v16 57.9% similar

function markdown_to_html 57.6% similar

function extract_table_as_markdown

Purpose

Source Code

Parameters

Parameter Details

Return Value

Dependencies

Required Imports

Usage Example

Best Practices

Tags

Similar Components

function extract_sheet_context 63.3% similar

function detect_table_boundaries 62.8% similar

function simple_markdown_to_html 58.3% similar

function main_v16 57.9% similar

function markdown_to_html 57.6% similar

✨ Improve Code: extract_table_as_markdown

Code Comparison