function html_to_markdown_v1

Maturity: 48

Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.

File: /tf/active/vicechatdev/vice_ai/new_app.py
Lines: 3616 - 3674
Complexity: moderate

Purpose

This function performs reverse conversion from HTML to Markdown format, useful for editing HTML content in Markdown editors or exporting HTML documents to Markdown format. It handles common HTML elements including headers (h1-h6), code blocks (pre/code), text formatting (bold, italic), inline code, hyperlinks, ordered/unordered lists, and paragraphs. The function preserves content structure while converting HTML tags to their Markdown equivalents and ensures proper spacing between elements.

Source Code

def html_to_markdown(html_text):
    """Convert HTML back to Markdown for editing/export"""
    if not html_text:
        return ""
    
    # Basic HTML to Markdown conversion with improved spacing
    text = html_text.strip()
    
    # Convert headers with proper spacing
    text = re.sub(r'<h1[^>]*>(.*?)</h1>', r'\n# \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h2[^>]*>(.*?)</h2>', r'\n## \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h3[^>]*>(.*?)</h3>', r'\n### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h4[^>]*>(.*?)</h4>', r'\n#### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h5[^>]*>(.*?)</h5>', r'\n##### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h6[^>]*>(.*?)</h6>', r'\n###### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Convert code blocks first (before other formatting)
    text = re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<div[^>]*class=["\']highlight["\'][^>]*><pre[^>]*>(.*?)</pre></div>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Bold and italic
    text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<b[^>]*>(.*?)</b>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<em[^>]*>(.*?)</em>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<i[^>]*>(.*?)</i>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Code (inline)
    text = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Links
    text = re.sub(r'<a[^>]*href=["\']([^"\']*)["\'][^>]*>(.*?)</a>', r'[\2](\1)', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Lists with proper formatting
    def convert_ul(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'- {item.strip()}' for item in items) + '\n\n'
    
    def convert_ol(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'{i+1}. {item.strip()}' for i, item in enumerate(items)) + '\n\n'
    
    text = re.sub(r'<ul[^>]*>(.*?)</ul>', convert_ul, text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<ol[^>]*>(.*?)</ol>', convert_ol, text, flags=re.IGNORECASE | re.DOTALL)
    
    # Paragraphs with proper spacing
    text = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Remove remaining HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    
    # Clean up multiple newlines
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Decode HTML entities
    text = html.unescape(text)
    
    return text.strip()
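
The six header substitutions above differ only in the heading level, so they could be collapsed into a single substitution with a callback. The following is a minimal sketch of that idea, not part of the original file; `HEADER_RE` and `convert_headers` are hypothetical names introduced here for illustration.

```python
import re

# Hypothetical helper (not in new_app.py): one pattern covers <h1> through <h6>.
# The backreference \1 requires the closing tag to use the same level digit.
HEADER_RE = re.compile(r'<h([1-6])[^>]*>(.*?)</h\1>', re.IGNORECASE | re.DOTALL)

def convert_headers(text):
    """Replace each <hN>...</hN> with N '#' characters, keeping the original spacing."""
    def repl(match):
        level = int(match.group(1))
        return '\n' + '#' * level + ' ' + match.group(2) + '\n\n'
    return HEADER_RE.sub(repl, text)

print(repr(convert_headers('<h2 class="x">Sub</h2>')))
# prints '\n## Sub\n\n'
```

The output spacing matches the per-level substitutions in the source, so the later newline cleanup step would behave the same way.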

Parameters

Name       Type  Default  Kind
html_text  -     -        positional_or_keyword

Parameter Details

html_text: String containing HTML markup to convert to Markdown. None or an empty string returns an empty string. Tags may carry attributes; because matching is regex-based, nested structures such as nested lists are not converted reliably.

Return Value

Returns a string containing the Markdown-formatted version of the input HTML. The output is stripped of leading/trailing whitespace and has normalized spacing (no more than two consecutive newlines). HTML entities are decoded to their character equivalents. Returns an empty string if the input is None or empty.
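
The cleanup steps that shape the return value can be looked at in isolation. The sketch below reproduces only the last four steps of the function (tag removal, newline collapsing, entity decoding, stripping); `finalize` is a hypothetical name used here for illustration and does not exist in the source file.

```python
import html
import re

def finalize(text):
    """Hypothetical helper mirroring the function's final cleanup steps."""
    text = re.sub(r'<[^>]+>', '', text)      # drop any tags that were not converted
    text = re.sub(r'\n{3,}', '\n\n', text)   # collapse runs of 3+ newlines to 2
    text = html.unescape(text)               # decode entities such as &amp; and &lt;
    return text.strip()

print(repr(finalize('<span>a &amp; b</span>\n\n\n\nc  ')))
# prints 'a & b\n\nc'
print(repr(finalize('x &lt;div&gt; y')))
# prints 'x <div> y'
```

Because entities are decoded after tag removal, escaped markup such as `&lt;div&gt;` survives as literal text in the output. Decoding first would turn it into a real tag that the tag-removal step would then delete.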

Dependencies

  • re
  • html

Required Imports

import re
import html

Usage Example

import re
import html

# Assumes html_to_markdown is defined as shown in the Source Code section.

# Example usage
html_content = '<h1>Title</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li></ul>'
markdown_output = html_to_markdown(html_content)
print(markdown_output)
# Output:
# # Title
# 
# This is **bold** and *italic* text.
# 
# - Item 1
# - Item 2

# Example with code blocks
html_with_code = '<pre><code>def hello():\n    print("Hello")</code></pre>'
markdown_code = html_to_markdown(html_with_code)
print(markdown_code)
# Output:
# ```
# def hello():
#     print("Hello")
# ```

# Example with links
html_with_link = '<a href="https://example.com">Example Link</a>'
markdown_link = html_to_markdown(html_with_link)
print(markdown_link)
# Output: [Example Link](https://example.com)

Best Practices

  • The function uses regex-based parsing which works for simple to moderately complex HTML but may not handle deeply nested or malformed HTML perfectly
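  • Code blocks are converted after headers but before inline formatting, so the pre/code wrapper is matched before the inline code rule can consume it; the later inline rules still run over the code block's contents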
  • The function is case-insensitive for HTML tags and handles attributes on tags
  • Runs of three or more newlines are collapsed to two for clean output
  • HTML entities are decoded at the end to ensure proper character representation
  • For production use with complex HTML, consider a dedicated HTML parser such as BeautifulSoup or a purpose-built converter such as html2text
  • The function strips leading/trailing whitespace from the final output
  • List items are automatically numbered for ordered lists and prefixed with dashes for unordered lists
  • Empty or None input is safely handled and returns an empty string
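
The nesting limitation noted in the first bullet can be made concrete with nested lists. The sketch below copies only the unordered-list rule and the tag-removal step from the source; `lists_only` is a hypothetical wrapper introduced here so the effect can be seen in isolation.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

def convert_ul(match):
    # Same logic as the nested helper in the source function.
    items = re.findall(r'<li[^>]*>(.*?)</li>', match.group(1), flags=FLAGS)
    return '\n' + '\n'.join(f'- {item.strip()}' for item in items) + '\n\n'

def lists_only(html_text):
    """Hypothetical wrapper: applies only the <ul> rule, then strips leftover tags."""
    text = re.sub(r'<ul[^>]*>(.*?)</ul>', convert_ul, html_text, flags=FLAGS)
    text = re.sub(r'<[^>]+>', '', text)
    return text.strip()

nested = '<ul><li>A<ul><li>B</li></ul></li><li>C</li></ul>'
print(repr(lists_only(nested)))
# prints '- AB\n\nC'
```

The non-greedy pattern stops at the first closing `</ul>`, which belongs to the inner list, so the outer list is cut short: "A" and "B" merge into one item and "C" falls out of the list entirely. A parser that tracks open and close tags avoids this.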

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function html_to_markdown 94.0% similar

    Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function simple_markdown_to_html 87.1% similar

    Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function basic_markdown_to_html 85.0% similar

    Converts basic Markdown syntax to HTML without using external Markdown libraries, handling headers, lists, code blocks, and inline formatting.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function format_inline_markdown 78.8% similar

    Converts inline Markdown syntax (bold, italic, code, links) to HTML tags while escaping HTML entities for safe rendering.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function convert_markdown_to_html_v1 78.2% similar

    Converts basic Markdown syntax to HTML markup compatible with ReportLab PDF generation, including support for clickable links, bold, italic, and inline code formatting.

    From: /tf/active/vicechatdev/vice_ai/new_app.py