🔍 Code Extractor

function html_to_markdown

Maturity: 48

Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.

File: /tf/active/vicechatdev/vice_ai/complex_app.py
Lines: 1933 - 1997
Complexity: moderate

Purpose

This function converts HTML back to Markdown, the reverse half of a Markdown-to-HTML round trip, so that users can edit or export HTML content in Markdown format. It is particularly useful in applications where content is stored or displayed as HTML but needs to be edited in the more human-readable Markdown syntax. The function handles common HTML elements including headers (h1-h6), code blocks (pre/code), text formatting (bold, italic), links, ordered/unordered lists, paragraphs, line breaks, and a small set of HTML entities.

Source Code

def html_to_markdown(html_text):
    """Convert HTML back to Markdown for editing/export"""
    if not html_text:
        return ""
    
    # Basic HTML to Markdown conversion with improved spacing
    text = html_text.strip()
    
    # Convert headers with proper spacing
    text = re.sub(r'<h1[^>]*>(.*?)</h1>', r'\n# \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h2[^>]*>(.*?)</h2>', r'\n## \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h3[^>]*>(.*?)</h3>', r'\n### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h4[^>]*>(.*?)</h4>', r'\n#### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h5[^>]*>(.*?)</h5>', r'\n##### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h6[^>]*>(.*?)</h6>', r'\n###### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Convert code blocks first (before other formatting)
    text = re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<div[^>]*class=["\']highlight["\'][^>]*><pre[^>]*>(.*?)</pre></div>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Bold and italic
    text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<b[^>]*>(.*?)</b>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<em[^>]*>(.*?)</em>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<i[^>]*>(.*?)</i>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Code (inline)
    text = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Links
    text = re.sub(r'<a[^>]*href=["\']([^"\']*)["\'][^>]*>(.*?)</a>', r'[\2](\1)', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Lists with proper formatting
    def convert_ul(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'- {item.strip()}' for item in items) + '\n\n'
    
    def convert_ol(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'{i+1}. {item.strip()}' for i, item in enumerate(items)) + '\n\n'
    
    text = re.sub(r'<ul[^>]*>(.*?)</ul>', convert_ul, text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<ol[^>]*>(.*?)</ol>', convert_ol, text, flags=re.IGNORECASE | re.DOTALL)
    
    # Paragraphs with proper spacing
    text = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Line breaks
    text = re.sub(r'<br[^>]*/?>', '\n', text, flags=re.IGNORECASE)
    
    # Remove any remaining HTML tags and spans (like syntax highlighting)
    text = re.sub(r'<[^>]+>', '', text)
    
    # Decode HTML entities
    text = text.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>').replace('&quot;', '"').replace('&#39;', "'")
    
    # Clean up whitespace and newlines
    text = re.sub(r'\n{3,}', '\n\n', text)  # Multiple newlines to double newlines
    text = re.sub(r'[ \t]+', ' ', text)  # Normalize spaces
    text = re.sub(r'[ \t]*\n', '\n', text)  # Remove trailing spaces
    text = text.strip()
    
    return text

Parameters

Name       Type  Default  Kind
html_text  -     -        positional_or_keyword

Parameter Details

html_text: A string containing HTML markup to be converted to Markdown. Can be None or empty string, which will return an empty string. Should contain valid HTML tags that correspond to Markdown syntax elements (headers, lists, code blocks, formatting tags, etc.). The function is case-insensitive and handles multi-line HTML content.
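
The case-insensitive, multi-line behaviour described above comes from the re.IGNORECASE and re.DOTALL flags passed to every substitution. The sketch below isolates the h1 rule to show their effect; convert_h1 is a hypothetical helper used only for illustration and is not part of complex_app.py.

```python
import re

# Same pattern shape as the h1 rule in html_to_markdown.
H1_PATTERN = r'<h1[^>]*>(.*?)</h1>'
FLAGS = re.IGNORECASE | re.DOTALL

def convert_h1(html_text):
    """Apply only the h1 rule (hypothetical helper for illustration)."""
    return re.sub(H1_PATTERN, r'\n# \1\n\n', html_text, flags=FLAGS)

# IGNORECASE: upper-case tags match, and [^>]* skips the class attribute.
upper = convert_h1('<H1 class="title">Title</H1>')

# DOTALL: the captured group may span a newline.
multi = convert_h1('<h1>First\nSecond</h1>')

# Without DOTALL, '.' stops at the newline and the tag is left untouched.
no_dotall = re.sub(H1_PATTERN, r'\n# \1\n\n', '<h1>First\nSecond</h1>',
                   flags=re.IGNORECASE)

print(repr(upper))      # '\n# Title\n\n'
print(repr(multi))      # '\n# First\nSecond\n\n'
print(repr(no_dotall))  # '<h1>First\nSecond</h1>'
```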

Return Value

Returns a string containing the Markdown representation of the input HTML, or an empty string if the input is None or empty. The output follows standard conventions: # for headers, ** for bold, * for italic, ` for inline code, ``` for code blocks, - for unordered lists, numbers for ordered lists, and [text](url) for links. Runs of three or more newlines are reduced to two, runs of spaces and tabs are collapsed to a single space, and trailing whitespace is removed.
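
The whitespace normalization can be looked at in isolation. The sketch below copies the final clean-up steps from the source into a stand-alone helper; the name normalize_whitespace and the sample string are assumptions made for this illustration.

```python
import re

def normalize_whitespace(text):
    """Replicate the final clean-up steps of html_to_markdown (illustrative copy)."""
    text = re.sub(r'\n{3,}', '\n\n', text)   # three or more newlines become two
    text = re.sub(r'[ \t]+', ' ', text)      # runs of spaces/tabs become one space
    text = re.sub(r'[ \t]*\n', '\n', text)   # drop whitespace before a newline
    return text.strip()

sample = '\n# Title\n\n\n\nSome   text  \n```\nif x:\n    y = 1\n```\n\n'
result = normalize_whitespace(sample)
print(repr(result))
# '# Title\n\nSome text\n```\nif x:\n y = 1\n```'
```

Note that the second step applies to the whole string, so the four-space indent inside the fenced block is reduced to one space.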

Dependencies

  • re

Required Imports

import re

Usage Example

import re

# Paste the html_to_markdown definition from the Source Code section here;
# the examples below call that function.

# Example 1: Convert simple HTML with headers and paragraphs
html = '<h1>Title</h1><p>This is a paragraph with <strong>bold</strong> text.</p>'
markdown = html_to_markdown(html)
print(markdown)
# Output: '# Title\n\nThis is a paragraph with **bold** text.'

# Example 2: Convert HTML with lists
html_list = '<ul><li>Item 1</li><li>Item 2</li></ul>'
markdown_list = html_to_markdown(html_list)
print(markdown_list)
# Output: '- Item 1\n- Item 2'

# Example 3: Convert HTML with code blocks
html_code = '<pre><code>def hello():\n    print("Hello")</code></pre>'
markdown_code = html_to_markdown(html_code)
print(markdown_code)
# Output: '```\ndef hello():\n print("Hello")\n```'
# Note: the whitespace clean-up collapses the four-space indent to one space.

# Example 4: Convert HTML with links
html_link = '<a href="https://example.com">Click here</a>'
markdown_link = html_to_markdown(html_link)
print(markdown_link)
# Output: '[Click here](https://example.com)'

# Example 5: Handle empty input
result = html_to_markdown(None)
print(result)  # Output: ''
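
Example 3 loses its indentation because the space-collapsing step runs over the whole string, including fenced code. One possible workaround, sketched below, is to normalize only the text outside fences. normalize_outside_code is a hypothetical helper, not something present in complex_app.py, and it assumes fences are plain triple backticks that are always closed.

```python
import re

def normalize_outside_code(markdown_text):
    """Collapse spaces and tabs only outside fenced code blocks (hypothetical helper)."""
    # The capturing group makes re.split keep the fenced blocks;
    # they land at the odd indices of the resulting list.
    parts = re.split(r'(```.*?```)', markdown_text, flags=re.DOTALL)
    cleaned = []
    for index, part in enumerate(parts):
        if index % 2 == 1:
            cleaned.append(part)  # fenced code: leave untouched
        else:
            part = re.sub(r'[ \t]+', ' ', part)
            part = re.sub(r'[ \t]*\n', '\n', part)
            cleaned.append(part)
    return ''.join(cleaned).strip()

text = 'Intro   text\n\n```\ndef hello():\n    print("Hello")\n```\n\nDone  now'
result = normalize_outside_code(text)
print(repr(result))
# 'Intro text\n\n```\ndef hello():\n    print("Hello")\n```\n\nDone now'
```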

Best Practices

  • This function uses regex-based parsing which works well for simple HTML but may not handle deeply nested or complex HTML structures perfectly. For production use with complex HTML, consider using a proper HTML parser like BeautifulSoup.
  • The function processes conversions in a specific order (headers first, then code blocks, then formatting) to avoid conflicts. Modifying the order may produce unexpected results.
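  • HTML entities are only partially decoded (&amp;, &lt;, &gt;, &quot;, &#39;). Because &amp; is replaced first, double-escaped text such as &amp;lt; is decoded twice and ends up as <. For comprehensive, single-pass entity decoding, consider using html.unescape().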
  • The function assumes well-formed HTML. Malformed HTML may produce unexpected Markdown output.
  • Code blocks are converted before inline code to prevent nested code tag conflicts.
  • The function strips all HTML attributes except href in anchor tags, so styling information is lost in conversion.
  • Runs of three or more newlines are normalized to double newlines for cleaner Markdown output. Runs of spaces and tabs are also collapsed to a single space, which flattens indentation inside code blocks.
  • In a round trip (Markdown to HTML and back), some information loss may occur because HTML attributes and styling are stripped.
  • The function is case-insensitive for HTML tags, making it more robust against inconsistent HTML formatting.
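
To illustrate the entity-decoding point above, the sketch below compares the chained replace from the source with html.unescape() from the standard library. decode_common_entities is a hypothetical wrapper around the source's replace chain, and the sample string is made up for the comparison.

```python
import html

def decode_common_entities(text):
    """The chained replace used by html_to_markdown, wrapped for comparison."""
    return (text.replace('&amp;', '&').replace('&lt;', '<')
                .replace('&gt;', '>').replace('&quot;', '"')
                .replace('&#39;', "'"))

escaped = 'Write &amp;lt; for &lt; &copy; 2024'

chained = decode_common_entities(escaped)
unescaped = html.unescape(escaped)

print(chained)    # Write < for < &copy; 2024
print(unescaped)  # Write &lt; for < © 2024
```

The chained version decodes &amp;lt; twice and leaves &copy; untouched, while html.unescape() decodes each entity exactly once and knows the full set of named entities.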

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function html_to_markdown_v1 94.0% similar

    Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function simple_markdown_to_html 83.4% similar

    Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function basic_markdown_to_html 83.2% similar

    Converts basic Markdown syntax to HTML without using external Markdown libraries, handling headers, lists, code blocks, and inline formatting.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function format_inline_markdown 79.1% similar

    Converts inline Markdown syntax (bold, italic, code, links) to HTML tags while escaping HTML entities for safe rendering.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function convert_markdown_to_html_v1 76.5% similar

    Converts basic Markdown syntax to HTML markup compatible with ReportLab PDF generation, including support for clickable links, bold, italic, and inline code formatting.

    From: /tf/active/vicechatdev/vice_ai/new_app.py