html_to_markdown_v1 - Code Extractor

function html_to_markdown_v1

Maturity: 48

Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.

File:
/tf/active/vicechatdev/vice_ai/new_app.py

Lines:
3616 - 3674

Complexity:
moderate

Purpose

This function performs reverse conversion from HTML to Markdown format, useful for editing HTML content in Markdown editors or exporting HTML documents to Markdown format. It handles common HTML elements including headers (h1-h6), code blocks (pre/code), text formatting (bold, italic), inline code, hyperlinks, ordered/unordered lists, and paragraphs. The function preserves content structure while converting HTML tags to their Markdown equivalents and ensures proper spacing between elements.

Source Code

def html_to_markdown(html_text):
    """Convert HTML back to Markdown for editing/export"""
    if not html_text:
        return ""
    
    # Basic HTML to Markdown conversion with improved spacing
    text = html_text.strip()
    
    # Convert headers with proper spacing
    text = re.sub(r'<h1[^>]*>(.*?)</h1>', r'\n# \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h2[^>]*>(.*?)</h2>', r'\n## \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h3[^>]*>(.*?)</h3>', r'\n### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h4[^>]*>(.*?)</h4>', r'\n#### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h5[^>]*>(.*?)</h5>', r'\n##### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h6[^>]*>(.*?)</h6>', r'\n###### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Convert code blocks before inline formatting (headers are already handled above)
    text = re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<div[^>]*class=["\']highlight["\'][^>]*><pre[^>]*>(.*?)</pre></div>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Bold and italic
    text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<b[^>]*>(.*?)</b>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<em[^>]*>(.*?)</em>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<i[^>]*>(.*?)</i>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Code (inline)
    text = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Links
    text = re.sub(r'<a[^>]*href=["\']([^"\']*)["\'][^>]*>(.*?)</a>', r'[\2](\1)', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Lists with proper formatting
    def convert_ul(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'- {item.strip()}' for item in items) + '\n\n'
    
    def convert_ol(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'{i+1}. {item.strip()}' for i, item in enumerate(items)) + '\n\n'
    
    text = re.sub(r'<ul[^>]*>(.*?)</ul>', convert_ul, text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<ol[^>]*>(.*?)</ol>', convert_ol, text, flags=re.IGNORECASE | re.DOTALL)
    
    # Paragraphs with proper spacing
    text = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    
    # Remove remaining HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    
    # Clean up multiple newlines
    text = re.sub(r'\n{3,}', '\n\n', text)
    
    # Decode HTML entities
    text = html.unescape(text)
    
    return text.strip()
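
The list handling relies on `re.sub` accepting a function as the replacement argument: the function receives each match object and returns the replacement string. The standalone sketch below isolates that step for ordered lists. The helper name `ol_to_markdown` and the sample input are illustrative and do not appear in the original file.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

def ol_to_markdown(html_text):
    """Illustrative helper (not in the original file): convert <ol> blocks only."""
    def convert_ol(match):
        # group(1) is everything between the opening <ol ...> and the first </ol>
        items = re.findall(r'<li[^>]*>(.*?)</li>', match.group(1), flags=FLAGS)
        numbered = '\n'.join(
            f'{i}. {item.strip()}' for i, item in enumerate(items, start=1)
        )
        return '\n' + numbered + '\n\n'

    # re.sub calls convert_ol once per <ol>...</ol> match
    return re.sub(r'<ol[^>]*>(.*?)</ol>', convert_ol, html_text, flags=FLAGS)

sample = '<ol class="steps">\n  <li>Install</li>\n  <li> Configure </li>\n</ol>'
result = ol_to_markdown(sample)
print(repr(result))
# '\n1. Install\n2. Configure\n\n'
```

Because `(.*?)</ol>` stops at the first closing tag, a list nested inside another list of the same type is cut off at the inner `</ol>`, which is one reason nested lists are not converted reliably.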

Parameters

Name	Type	Default	Kind
`html_text`	-	-	positional_or_keyword

Parameter Details

html_text: String containing the HTML markup to convert. None or an empty string returns an empty string. Tags may carry attributes, which are ignored except for href on links. Because matching is regex-based, nested or malformed markup may not convert cleanly.

Return Value

Returns a string containing the Markdown-formatted version of the input HTML. The output is stripped of leading/trailing whitespace and has normalized spacing (no more than two consecutive newlines). HTML entities are decoded to their character equivalents. Returns empty string if input is None or empty.
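
The shape of the return value comes from the last four steps of the function. The sketch below applies only those steps to an intermediate string, assuming the earlier tag conversions have already run. The helper name `finalize` and the sample text are illustrative.

```python
import html
import re

def finalize(text):
    """Illustrative helper mirroring the final cleanup steps of the function."""
    text = re.sub(r'<[^>]+>', '', text)      # drop any tags that were not converted
    text = re.sub(r'\n{3,}', '\n\n', text)   # collapse runs of 3+ newlines to exactly 2
    text = html.unescape(text)               # decode entities such as &amp; and &lt;
    return text.strip()

intermediate = '\n# Title\n\n\n\n<span>Use &lt;b&gt; for bold &amp; more</span>\n\n'
result = finalize(intermediate)
print(repr(result))
# '# Title\n\nUse <b> for bold & more'
```

Decoding entities after the tag-stripping step means escaped markup such as `&lt;b&gt;` survives as the literal text `<b>` instead of being removed as a tag.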

Dependencies

re
html

Required Imports

import re
import html

Usage Example

import re
import html

# html_to_markdown is assumed to be defined as shown in the Source Code section above.

# Example usage
html_content = '<h1>Title</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li></ul>'
markdown_output = html_to_markdown(html_content)
print(markdown_output)
# Output:
# # Title
# 
# This is **bold** and *italic* text.
# 
# - Item 1
# - Item 2

# Example with code blocks
html_with_code = '<pre><code>def hello():\n    print("Hello")</code></pre>'
markdown_code = html_to_markdown(html_with_code)
print(markdown_code)
# Output:
# ```
# def hello():
#     print("Hello")
# ```

# Example with links
html_with_link = '<a href="https://example.com">Example Link</a>'
markdown_link = html_to_markdown(html_with_link)
print(markdown_link)
# Output: [Example Link](https://example.com)

Best Practices

The function uses regex-based parsing which works for simple to moderately complex HTML but may not handle deeply nested or malformed HTML perfectly
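Code blocks are converted before inline formatting (bold, italic, inline code) so that the inline-code pattern does not consume the contents of pre/code blocks; headers are converted earlier still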
The function is case-insensitive for HTML tags and handles attributes on tags
Multiple consecutive newlines are normalized to maximum two newlines for clean output
HTML entities are decoded at the end to ensure proper character representation
For production use with complex HTML, consider using a dedicated HTML parser library like BeautifulSoup or html2text
The function strips leading/trailing whitespace from the final output
List items are automatically numbered for ordered lists and prefixed with dashes for unordered lists
Empty or None input is safely handled and returns an empty string
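
One concrete case of the regex limitation: a pattern such as `<b[^>]*>` also matches any tag whose name merely starts with "b", for example `<br>`. The sketch below contrasts that pattern with a hypothetical stricter variant that adds a word boundary (`\b`) after the tag name. The stricter pattern is a suggestion for illustration, not something taken from the original file.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

sample = 'line one<br>line two <b>bold</b>'

# Pattern as used in the function: "<b" followed by any characters other than ">".
# It treats <br> as an opening bold tag and captures up to the next </b>.
loose = re.sub(r'<b[^>]*>(.*?)</b>', r'**\1**', sample, flags=FLAGS)

# Hypothetical stricter variant: \b requires the tag name to end right after "b".
strict = re.sub(r'<b\b[^>]*>(.*?)</b>', r'**\1**', sample, flags=FLAGS)

print(loose)   # line one**line two <b>bold**
print(strict)  # line one<br>line two **bold**
```

The same prefix effect applies to other short tag names in the function, for example `<i` against `<img>`, `<p` against `<pre>`, and `<em` against `<embed>`.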

Similar Components

AI-powered semantic similarity - components with related functionality:

function html_to_markdown 94.0% similar

Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function simple_markdown_to_html 87.1% similar

Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.
From: /tf/active/vicechatdev/vice_ai/new_app.py

function basic_markdown_to_html 85.0% similar

Converts basic Markdown syntax to HTML without using external Markdown libraries, handling headers, lists, code blocks, and inline formatting.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function format_inline_markdown 78.8% similar

Converts inline Markdown syntax (bold, italic, code, links) to HTML tags while escaping HTML entities for safe rendering.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function convert_markdown_to_html_v1 78.2% similar

Converts basic Markdown syntax to HTML markup compatible with ReportLab PDF generation, including support for clickable links, bold, italic, and inline code formatting.
From: /tf/active/vicechatdev/vice_ai/new_app.py
