function html_to_markdown
Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.
/tf/active/vicechatdev/vice_ai/complex_app.py
1933 - 1997
moderate
Purpose
This function converts HTML back to Markdown so that users can edit or export HTML content in Markdown format. It is particularly useful in applications where content is stored or displayed as HTML but needs to be edited in the more human-readable Markdown syntax. The function handles common HTML elements: headers (h1-h6), code blocks (pre/code), text formatting (bold, italic), inline code, links, ordered and unordered lists, paragraphs, line breaks, and a small set of HTML entities.
Source Code
def html_to_markdown(html_text):
    """Convert HTML back to Markdown for editing/export"""
    if not html_text:
        return ""

    # Basic HTML to Markdown conversion with improved spacing
    text = html_text.strip()

    # Convert headers with proper spacing
    text = re.sub(r'<h1[^>]*>(.*?)</h1>', r'\n# \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h2[^>]*>(.*?)</h2>', r'\n## \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h3[^>]*>(.*?)</h3>', r'\n### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h4[^>]*>(.*?)</h4>', r'\n#### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h5[^>]*>(.*?)</h5>', r'\n##### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h6[^>]*>(.*?)</h6>', r'\n###### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)

    # Convert code blocks first (before other formatting)
    text = re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<div[^>]*class=["\']highlight["\'][^>]*><pre[^>]*>(.*?)</pre></div>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)

    # Bold and italic
    text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<b[^>]*>(.*?)</b>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<em[^>]*>(.*?)</em>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<i[^>]*>(.*?)</i>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)

    # Code (inline)
    text = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', text, flags=re.IGNORECASE | re.DOTALL)

    # Links
    text = re.sub(r'<a[^>]*href=["\']([^"\']*)["\'][^>]*>(.*?)</a>', r'[\2](\1)', text, flags=re.IGNORECASE | re.DOTALL)

    # Lists with proper formatting
    def convert_ul(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'- {item.strip()}' for item in items) + '\n\n'

    def convert_ol(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'{i+1}. {item.strip()}' for i, item in enumerate(items)) + '\n\n'

    text = re.sub(r'<ul[^>]*>(.*?)</ul>', convert_ul, text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<ol[^>]*>(.*?)</ol>', convert_ol, text, flags=re.IGNORECASE | re.DOTALL)

    # Paragraphs with proper spacing
    text = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', text, flags=re.IGNORECASE | re.DOTALL)

    # Line breaks
    text = re.sub(r'<br[^>]*/?>', '\n', text, flags=re.IGNORECASE)

    # Remove any remaining HTML tags and spans (like syntax highlighting)
    text = re.sub(r'<[^>]+>', '', text)

    # Decode HTML entities
    text = text.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>').replace('&quot;', '"').replace('&#39;', "'")

    # Clean up whitespace and newlines
    text = re.sub(r'\n{3,}', '\n\n', text)  # Multiple newlines to double newlines
    text = re.sub(r'[ \t]+', ' ', text)  # Normalize spaces
    text = re.sub(r'[ \t]*\n', '\n', text)  # Remove trailing spaces
    text = text.strip()

    return text
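One thing to watch in the patterns above: `<b[^>]*>` also matches any tag whose name merely starts with `b`, such as `<br>`, and `<i[^>]*>` likewise matches `<img ...>`. Because line breaks are converted after bold text, a `<br>` that precedes a `<b>` can be picked up as the opening bold tag. The sketch below illustrates the effect and one possible fix, a `\b` word boundary after the tag name. The helper names `bold_loose` and `bold_strict` are hypothetical and are not part of `complex_app.py`.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL

def bold_loose(text):
    # Same pattern as the documented function: "<b" followed by any non-">" characters.
    return re.sub(r'<b[^>]*>(.*?)</b>', r'**\1**', text, flags=FLAGS)

def bold_strict(text):
    # Assumed fix: \b requires the tag name to end right after "b",
    # so "<br>" and "<body>" no longer count as an opening bold tag.
    return re.sub(r'<b\b[^>]*>(.*?)</b>', r'**\1**', text, flags=FLAGS)

sample = 'line one<br>line two with <b>bold</b>'
print(bold_loose(sample))   # the match starts at "<br>" and swallows "<b>"
print(bold_strict(sample))  # only "<b>bold</b>" is converted
```

The same boundary could plausibly be applied to the `<i`, `<p`, and `<a` patterns, which share the prefix issue.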
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| html_text | - | - | positional_or_keyword |
Parameter Details
html_text: A string containing HTML markup to be converted to Markdown. May be None or an empty string, in which case an empty string is returned. The input should consist of HTML tags that have a Markdown counterpart (headers, lists, code blocks, formatting tags, and so on); any other tags are stripped. Tag matching is case-insensitive and spans multiple lines.
Return Value
Returns a string containing the Markdown representation of the input HTML. Returns an empty string if the input is None or empty. The output follows standard Markdown conventions: # for headers, ** for bold, * for italic, ` for inline code, ``` for code blocks, - for unordered lists, numbers for ordered lists, and [text](url) for links. Runs of three or more newlines are reduced to two, runs of spaces and tabs are collapsed to a single space, and leading and trailing whitespace is removed.
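The whitespace cleanup described above can be reproduced in isolation. The sketch below copies the final three substitutions from the source into a hypothetical helper, `normalize_whitespace`, to show two side effects that may be surprising: leading indentation (for example inside a code block) collapses to a single space, and lines that contain only spaces can leave three consecutive newlines, because newline collapsing runs before trailing spaces are removed.

```python
import re

def normalize_whitespace(text):
    # The same three substitutions, in the same order, as the end of the documented function.
    text = re.sub(r'\n{3,}', '\n\n', text)   # runs of 3+ newlines -> one blank line
    text = re.sub(r'[ \t]+', ' ', text)      # runs of spaces/tabs -> single space
    text = re.sub(r'[ \t]*\n', '\n', text)   # drop spaces before a newline
    return text.strip()

print(repr(normalize_whitespace('a  \n\n\n\nb\t\tc ')))  # 'a\n\nb c'
print(repr(normalize_whitespace('x\n    indented')))     # 'x\n indented'
print(repr(normalize_whitespace('a\n \n \nb')))          # 'a\n\n\nb'
```

If preserving code indentation matters, one option is to skip the space-collapsing step for text inside fenced blocks; that would be a change to the documented behavior, not something the current function does.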
Dependencies
re
Required Imports
import re
Usage Example
import re
def html_to_markdown(html_text):
    # ... (function code) ...
    pass
# Example 1: Convert simple HTML with headers and paragraphs
html = '<h1>Title</h1><p>This is a paragraph with <strong>bold</strong> text.</p>'
markdown = html_to_markdown(html)
print(markdown)
# Output: '# Title\n\nThis is a paragraph with **bold** text.'
# Example 2: Convert HTML with lists
html_list = '<ul><li>Item 1</li><li>Item 2</li></ul>'
markdown_list = html_to_markdown(html_list)
print(markdown_list)
# Output: '- Item 1\n- Item 2'
# Example 3: Convert HTML with code blocks
html_code = '<pre><code>def hello():\n print("Hello")</code></pre>'
markdown_code = html_to_markdown(html_code)
print(markdown_code)
# Output: '```\ndef hello():\n print("Hello")\n```'
# Example 4: Convert HTML with links
html_link = '<a href="https://example.com">Click here</a>'
markdown_link = html_to_markdown(html_link)
print(markdown_link)
# Output: '[Click here](https://example.com)'
# Example 5: Handle empty input
result = html_to_markdown(None)
print(result) # Output: ''
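Example 2 uses a flat list. With nested lists, the non-greedy `<ul[^>]*>(.*?)</ul>` pattern stops at the first closing `</ul>`, so the nesting is not preserved. As a rough sketch of a parser-based alternative, the code below uses the standard library's `html.parser.HTMLParser` to track list depth. The class `ListToMarkdown` and the function `lists_to_markdown` are hypothetical names introduced for illustration; the sketch handles only plain-text list items, ignores text outside lists, and indents nested items by four spaces.

```python
from html.parser import HTMLParser

class ListToMarkdown(HTMLParser):
    """Hypothetical helper: renders (possibly nested) <ul>/<ol> lists as Markdown."""

    def __init__(self):
        super().__init__()
        self.open_lists = []   # one [tag, item_count] entry per currently open list
        self.lines = []        # output lines, one per list item

    def handle_starttag(self, tag, attrs):
        if tag in ('ul', 'ol'):
            self.open_lists.append([tag, 0])
        elif tag == 'li' and self.open_lists:
            current = self.open_lists[-1]
            current[1] += 1
            marker = '-' if current[0] == 'ul' else f'{current[1]}.'
            indent = '    ' * (len(self.open_lists) - 1)
            self.lines.append(f'{indent}{marker} ')

    def handle_endtag(self, tag):
        if tag in ('ul', 'ol') and self.open_lists:
            self.open_lists.pop()

    def handle_data(self, data):
        # Text inside a list is appended to the most recently started item.
        if self.open_lists and self.lines and data.strip():
            self.lines[-1] += data.strip()

def lists_to_markdown(html_text):
    parser = ListToMarkdown()
    parser.feed(html_text)
    parser.close()
    return '\n'.join(parser.lines)

nested = '<ul><li>A<ul><li>B</li></ul></li><li>C</li></ul>'
print(lists_to_markdown(nested))
```

For the `nested` input this prints three lines, with `B` indented under `A`. Inline formatting inside list items would need additional handling.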
Best Practices
- This function uses regex-based parsing which works well for simple HTML but may not handle deeply nested or complex HTML structures perfectly. For production use with complex HTML, consider using a proper HTML parser like BeautifulSoup.
- The function processes conversions in a specific order (headers first, then code blocks, then formatting) to avoid conflicts. Modifying the order may produce unexpected results.
- HTML entities are only partially decoded: just `&amp;`, `&lt;`, `&gt;`, `&quot;`, and `&#39;` are replaced. For comprehensive entity decoding, consider using html.unescape().
- The function assumes well-formed HTML. Malformed HTML may produce unexpected Markdown output.
- Code blocks are converted before inline code to prevent nested code tag conflicts.
- The function strips all HTML attributes except href in anchor tags, so styling information is lost in conversion.
- Multiple consecutive newlines are normalized to double newlines for cleaner Markdown output.
- For bidirectional conversion (Markdown to HTML and back), some information loss may occur due to HTML attributes and styling being stripped.
- The function is case-insensitive for HTML tags, making it more robust against inconsistent HTML formatting.
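To illustrate the entity-decoding point above, the sketch below compares a chained `str.replace` decoder with the standard library's `html.unescape()`. The helper `decode_chained` is hypothetical, and the exact entity names it replaces are an assumption about the original source. Two differences show up: entities outside the short list are left untouched by the chained version, and an escaped entity such as `&amp;lt;` is decoded twice because `&amp;` is replaced first.

```python
import html

def decode_chained(text):
    # Mirrors the chained str.replace approach; the entity names here are assumed.
    return (text.replace('&amp;', '&').replace('&lt;', '<').replace('&gt;', '>')
                .replace('&quot;', '"').replace('&#39;', "'"))

samples = ['caf&eacute; &copy; 2024', '&amp;lt;b&amp;gt;']
for s in samples:
    # Left: chained replaces. Right: html.unescape, which decodes each entity once.
    print(repr(decode_chained(s)), repr(html.unescape(s)))
```

Replacing the chained call with `html.unescape(text)` would be one way to cover named and numeric entities in a single pass.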
Similar Components
AI-powered semantic similarity - components with related functionality:
- function html_to_markdown_v1 94.0% similar
- function simple_markdown_to_html 83.4% similar
- function basic_markdown_to_html 83.2% similar
- function format_inline_markdown 79.1% similar
- function convert_markdown_to_html_v1 76.5% similar