🔍 Code Extractor

function process_markdown_content

Maturity: 50

Parses markdown-formatted text content and converts it into a structured list of content elements with type annotations and formatting metadata suitable for document export.

File: /tf/active/vicechatdev/vice_ai/complex_app.py
Lines: 1306 - 1477
Complexity: complex

Purpose

This function is a line-oriented markdown parser that transforms raw markdown text into a flat, ordered list of elements for document generation systems. It categorizes headers, paragraphs, bullet and numbered list items, and pipe-delimited tables, and delegates inline formatting to process_inline_markdown. The output is designed to simplify conversion to formats such as DOCX or PDF by giving each element an explicit type. Consecutive text lines are merged into a single paragraph, and tables are returned with separate header and data rows. Two limits are visible in the source: every line is stripped before matching, so list nesting by indentation is not preserved, and code fences are only flagged with a 'code_block_start' element (emitted for opening and closing fences alike), while the lines between fences are parsed like any other text.

Source Code

def process_markdown_content(content):
    """
    Convert markdown-formatted content to structured format for document export
    Returns a list of content elements with their types and formatting
    """
    if not content:
        return []
    
    # Import html parser for markdown conversion
    import html
    import re
    
    elements = []
    lines = content.split('\n')
    current_paragraph = []
    in_table = False
    table_rows = []
    
    for i, line in enumerate(lines):
        line = line.strip()
        
        if not line:
            # Empty line - end current paragraph/table if any
            if in_table:
                # End table
                if table_rows:
                    elements.append({
                        'type': 'table',
                        'content': table_rows,
                        'formatting': []
                    })
                    table_rows = []
                in_table = False
            elif current_paragraph:
                elements.append({
                    'type': 'paragraph',
                    'content': ' '.join(current_paragraph),
                    'formatting': []
                })
                current_paragraph = []
            continue
        
        # Check for table rows (contains |)
        if '|' in line and not in_table:
            # Start of table - check if next line has separators
            if i + 1 < len(lines) and re.match(r'^\s*\|?\s*[-:]+\s*\|', lines[i + 1].strip()):
                print(f"DEBUG: Starting table at line {i}: '{line}'")
                # End current paragraph if any
                if current_paragraph:
                    elements.append({
                        'type': 'paragraph',
                        'content': ' '.join(current_paragraph),
                        'formatting': []
                    })
                    current_paragraph = []
                
                in_table = True
                # Parse header row
                cells = [cell.strip() for cell in line.split('|') if cell.strip()]
                print(f"DEBUG: Header cells: {cells}")
                if cells:
                    table_rows.append({
                        'type': 'header',
                        'cells': cells
                    })
                continue
        
        # Skip table separator line
        if in_table and re.match(r'^\s*\|?\s*[-:]+\s*\|', line):
            continue
        
        # Table data row
        if in_table and '|' in line:
            cells = [cell.strip() for cell in line.split('|') if cell.strip()]
            print(f"DEBUG: Data row cells: {cells}")
            if cells:
                table_rows.append({
                    'type': 'data',
                    'cells': cells
                })
            continue
        
        # End table if we're in one but line doesn't contain |
        if in_table:
            if table_rows:
                elements.append({
                    'type': 'table',
                    'content': table_rows,
                    'formatting': []
                })
                table_rows = []
            in_table = False
        
        # Check for headers
        header_match = re.match(r'^(#{1,6})\s+(.+)$', line)
        if header_match:
            # End current paragraph if any
            if current_paragraph:
                elements.append({
                    'type': 'paragraph',
                    'content': ' '.join(current_paragraph),
                    'formatting': []
                })
                current_paragraph = []
            
            # Add header
            level = len(header_match.group(1))
            elements.append({
                'type': 'header',
                'level': level,
                'content': header_match.group(2),
                'formatting': []
            })
            continue
        
        # Check for list items
        list_match = re.match(r'^[-*+]\s+(.+)$', line)
        if list_match:
            elements.append({
                'type': 'list_item',
                'content': list_match.group(1),
                'formatting': []
            })
            continue
        
        # Check for numbered list items
        num_list_match = re.match(r'^\d+\.\s+(.+)$', line)
        if num_list_match:
            elements.append({
                'type': 'numbered_list_item',
                'content': num_list_match.group(1),
                'formatting': []
            })
            continue
        
        # Check for code blocks
        if line.startswith('```'):
            elements.append({
                'type': 'code_block_start',
                'content': line[3:],  # Language if specified
                'formatting': []
            })
            continue
        
        # Regular text line - add to current paragraph
        current_paragraph.append(line)
    
    # Handle end of content
    if in_table and table_rows:
        print(f"DEBUG: End of content - adding table with {len(table_rows)} rows")
        elements.append({
            'type': 'table',
            'content': table_rows,
            'formatting': []
        })
    elif current_paragraph:
        elements.append({
            'type': 'paragraph',
            'content': ' '.join(current_paragraph),
            'formatting': []
        })
    
    # Process inline formatting for text elements
    for element in elements:
        if element['type'] in ['paragraph', 'list_item', 'numbered_list_item']:
            element['content'] = process_inline_markdown(element['content'])
        elif element['type'] == 'table':
            # Process inline formatting in table cells
            for row in element['content']:
                row['cells'] = [process_inline_markdown(cell) for cell in row['cells']]
    
    return elements

Parameters

Name     Type  Default  Kind
content  -     -        positional_or_keyword

Parameter Details

content: A string of markdown-formatted text to parse. It may include headers (# syntax), bullet lists (-, *, + prefixes), numbered lists (1. syntax), pipe-delimited tables, code fences (``` delimiters), and inline formatting. None or an empty string returns an empty list. Lines should be separated by newline characters (\n); each line is stripped of leading and trailing whitespace before it is matched.
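
The per-line matching described above can be tried in isolation. The sketch below copies the header, bullet, and numbered-list patterns from the source into a hypothetical classify_line helper (the helper name and the 'blank' label are assumptions made for this illustration, and table detection is deliberately left out because it needs a look-ahead to the next line).

```python
import re

# Patterns copied from the function's source. classify_line itself is a
# hypothetical helper that only mirrors the order in which a non-table
# line is tested.
HEADER = re.compile(r'^(#{1,6})\s+(.+)$')
BULLET = re.compile(r'^[-*+]\s+(.+)$')
NUMBERED = re.compile(r'^\d+\.\s+(.+)$')
FENCE = '`' * 3  # three backticks, built this way to keep the example readable


def classify_line(line):
    """Return the element type a single non-table line would map to."""
    line = line.strip()  # the parser strips every line before matching
    if not line:
        return 'blank'
    if HEADER.match(line):
        return 'header'
    if BULLET.match(line):
        return 'list_item'
    if NUMBERED.match(line):
        return 'numbered_list_item'
    if line.startswith(FENCE):
        return 'code_block_start'
    return 'paragraph'


for sample in ['## Title', '#Title', '  - nested item', '**bold** start', '10. ten']:
    print(repr(sample), '->', classify_line(sample))
```

Two details follow from the patterns: a header needs whitespace after the # characters, and an indented bullet is classified the same as a top-level one because the indentation is stripped first.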

Return Value

Returns a list of dictionaries, one per parsed markdown element, in the order they are emitted. Each element has a 'type' key ('paragraph', 'header', 'list_item', 'numbered_list_item', 'table', or 'code_block_start'), a 'content' key (a string for most types, a list of row dictionaries for tables), and a 'formatting' key (a list that is currently always empty and reserved for future formatting metadata). Header elements also carry a 'level' key (1-6). For 'code_block_start' elements, 'content' is the text after the three backticks, typically the language name. Each table row dictionary has 'type' ('header' or 'data') and 'cells' (list of strings). The content of paragraphs, list items, and table cells is passed through process_inline_markdown; header and code_block_start content is not. Returns an empty list if content is None or empty.
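
To make the shape concrete, the sketch below writes out by hand the list the source would produce for a three-block input, assuming process_inline_markdown returns its argument unchanged. The render_plain_text consumer is hypothetical and is not part of complex_app.py; it only shows how a caller might dispatch on the 'type' key.

```python
# Hand-written result for the input
#   "# Title\n\nSome text.\n\n| A | B |\n|---|---|\n| 1 | 2 |"
# under the assumption that process_inline_markdown is the identity function.
elements = [
    {'type': 'header', 'level': 1, 'content': 'Title', 'formatting': []},
    {'type': 'paragraph', 'content': 'Some text.', 'formatting': []},
    {
        'type': 'table',
        'content': [
            {'type': 'header', 'cells': ['A', 'B']},
            {'type': 'data', 'cells': ['1', '2']},
        ],
        'formatting': [],
    },
]


def render_plain_text(elements):
    """Hypothetical consumer: turn parsed elements into plain-text lines."""
    out = []
    number = 0  # the parser drops the original list numbers, so recount here
    for element in elements:
        kind = element['type']
        number = number + 1 if kind == 'numbered_list_item' else 0
        if kind == 'header':
            out.append('#' * element['level'] + ' ' + element['content'])
        elif kind == 'table':
            for row in element['content']:
                out.append(' | '.join(row['cells']))
        elif kind == 'list_item':
            out.append('- ' + element['content'])
        elif kind == 'numbered_list_item':
            out.append(f'{number}. {element["content"]}')
        elif kind == 'code_block_start':
            out.append('[code fence] ' + element['content'])
        else:  # 'paragraph'
            out.append(element['content'])
    return out


print(render_plain_text(elements))
# ['# Title', 'Some text.', 'A | B', '1 | 2']
```

A real exporter would replace each branch with the corresponding DOCX or PDF call; the dispatch on element['type'] is the part that carries over.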

Dependencies

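  • html (imported, but not called in the source shown)
  • re
  • process_inline_markdown (helper function applied to text content; defined outside this function)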

Required Imports

import html
import re

Conditional/Optional Imports

These imports are made inside the function body rather than at module level, so they run each time the function is called with non-empty content:

import html

Condition: imported inside the function; no call into html appears in the source listed above, so this import looks unused there
Required (conditional)

import re

Condition: imported inside the function for regex pattern matching of markdown syntax
Required (conditional)

Usage Example

import html
import re

# Define the helper function (required dependency)
def process_inline_markdown(text):
    # Simplified version - replace with actual implementation
    return text

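def process_markdown_content(content):
    # Placeholder: paste the full body from the "Source Code" section here.
    # Raising keeps an unreplaced stub from returning None and breaking the loop below.
    raise NotImplementedError("replace this stub with the body from Source Code")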

# Example usage
markdown_text = '''# Main Header

This is a paragraph with some text.

## Subheader

- First bullet point
- Second bullet point

| Column 1 | Column 2 |
|----------|----------|
| Data 1   | Data 2   |
| Data 3   | Data 4   |

1. First numbered item
2. Second numbered item
'''

structured_content = process_markdown_content(markdown_text)

# Output structure:
for element in structured_content:
    print(f"Type: {element['type']}")
    if element['type'] == 'header':
        print(f"  Level: {element['level']}, Content: {element['content']}")
    elif element['type'] == 'table':
        print(f"  Rows: {len(element['content'])}")
        for row in element['content']:
            print(f"    {row['type']}: {row['cells']}")
    else:
        print(f"  Content: {element['content']}")

Best Practices

  • Ensure the process_inline_markdown function is defined before calling this function, as it's a required dependency for processing inline formatting
  • Input content should use standard markdown syntax with newline characters (\n) as line separators
  • The function includes debug print statements that should be removed or replaced with proper logging in production environments
  • Table detection requires a separator row (with dashes and pipes) immediately following the header row
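  • Empty lines delimit content blocks (paragraphs, tables). Put a blank line between a paragraph and a following list item or code fence: unlike the header and table branches, those branches do not flush the pending paragraph, so the paragraph would be emitted after them
  • The function does not currently populate the 'formatting' list in returned elements - this is reserved for future enhancements
  • The function holds all lines and all elements in memory; if a large document is processed in chunks, split only at blank lines so that paragraphs and tables are not cut in half
  • None and empty strings return an empty list, but any other non-string input raises an exception at content.split('\n'), so callers should make sure they pass a str
  • Table parsing expects pipe-delimited rows with the same number of non-empty cells: empty cells are dropped when a row is split, so a row with a blank cell comes back shorter and its remaining values shift left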
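
The table rules above can be explored in isolation. The sketch below copies the separator pattern and the cell-splitting expression from the source into two hypothetical helpers, is_separator and split_cells (both names are assumptions made for this illustration), to show which rows count as separators and what happens to empty cells.

```python
import re

# Pattern and list comprehension copied from the function's source; the
# wrapper functions are hypothetical and exist only for this demonstration.
SEPARATOR = re.compile(r'^\s*\|?\s*[-:]+\s*\|')


def is_separator(line):
    """True if the parser would treat this line as a table separator row."""
    return bool(SEPARATOR.match(line.strip()))


def split_cells(line):
    """Split a table row the way the parser does, dropping empty cells."""
    return [cell.strip() for cell in line.split('|') if cell.strip()]


print(is_separator('|----------|----------|'))  # True
print(is_separator('| :--- | ---: |'))          # True (alignment colons are accepted)
print(is_separator('---'))                      # False (no pipe after the dashes)
print(is_separator('| - | 5 |'))                # True (a data row starting with a dash cell)

print(split_cells('| Data 1   | Data 2   |'))   # ['Data 1', 'Data 2']
print(split_cells('| a |  | c |'))              # ['a', 'c'] (the empty cell is gone)
```

Two consequences are worth keeping in mind when preparing input: a data row whose first cell contains only dashes or colons matches the separator pattern and is skipped, and a blank cell cannot be represented, so a placeholder such as N/A keeps columns aligned.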

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function process_markdown_content_v1 95.0% similar

    Parses markdown-formatted text content and converts it into a structured list of document elements (headers, paragraphs, lists, tables, code blocks) with their types and formatting preserved in original order.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function test_markdown_processing 74.2% similar

    A test function that validates markdown processing capabilities by testing content parsing, element extraction, and HTML conversion functionality.

    From: /tf/active/vicechatdev/vice_ai/test_markdown.py
  • function simple_markdown_to_html 72.5% similar

    Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function html_to_markdown_v1 71.8% similar

    Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function add_formatted_content_to_word_v1 71.2% similar

    Converts processed markdown elements into formatted content within a Microsoft Word document, handling headers, paragraphs, lists, tables, and code blocks with appropriate styling.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py