function html_to_markdown_v1
Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.
File: /tf/active/vicechatdev/vice_ai/new_app.py
Lines: 3616 - 3674
Complexity: moderate
Purpose
This function performs reverse conversion from HTML to Markdown format, useful for editing HTML content in Markdown editors or exporting HTML documents to Markdown format. It handles common HTML elements including headers (h1-h6), code blocks (pre/code), text formatting (bold, italic), inline code, hyperlinks, ordered/unordered lists, and paragraphs. The function preserves content structure while converting HTML tags to their Markdown equivalents and ensures proper spacing between elements.
Source Code
def html_to_markdown(html_text):
    """Convert HTML back to Markdown for editing/export"""
    if not html_text:
        return ""
    # Basic HTML to Markdown conversion with improved spacing
    text = html_text.strip()
    # Convert headers with proper spacing
    text = re.sub(r'<h1[^>]*>(.*?)</h1>', r'\n# \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h2[^>]*>(.*?)</h2>', r'\n## \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h3[^>]*>(.*?)</h3>', r'\n### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h4[^>]*>(.*?)</h4>', r'\n#### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h5[^>]*>(.*?)</h5>', r'\n##### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<h6[^>]*>(.*?)</h6>', r'\n###### \1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    # Convert code blocks first (before other formatting)
    text = re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<div[^>]*class=["\']highlight["\'][^>]*><pre[^>]*>(.*?)</pre></div>', r'\n```\n\1\n```\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    # Bold and italic
    text = re.sub(r'<strong[^>]*>(.*?)</strong>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<b[^>]*>(.*?)</b>', r'**\1**', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<em[^>]*>(.*?)</em>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<i[^>]*>(.*?)</i>', r'*\1*', text, flags=re.IGNORECASE | re.DOTALL)
    # Code (inline)
    text = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', text, flags=re.IGNORECASE | re.DOTALL)
    # Links
    text = re.sub(r'<a[^>]*href=["\']([^"\']*)["\'][^>]*>(.*?)</a>', r'[\2](\1)', text, flags=re.IGNORECASE | re.DOTALL)
    # Lists with proper formatting
    def convert_ul(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'- {item.strip()}' for item in items) + '\n\n'
    def convert_ol(match):
        list_content = match.group(1)
        items = re.findall(r'<li[^>]*>(.*?)</li>', list_content, flags=re.IGNORECASE | re.DOTALL)
        return '\n' + '\n'.join(f'{i+1}. {item.strip()}' for i, item in enumerate(items)) + '\n\n'
    text = re.sub(r'<ul[^>]*>(.*?)</ul>', convert_ul, text, flags=re.IGNORECASE | re.DOTALL)
    text = re.sub(r'<ol[^>]*>(.*?)</ol>', convert_ol, text, flags=re.IGNORECASE | re.DOTALL)
    # Paragraphs with proper spacing
    text = re.sub(r'<p[^>]*>(.*?)</p>', r'\1\n\n', text, flags=re.IGNORECASE | re.DOTALL)
    # Remove remaining HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Clean up multiple newlines
    text = re.sub(r'\n{3,}', '\n\n', text)
    # Decode HTML entities
    text = html.unescape(text)
    return text.strip()
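The order of the two code rules in the listing above matters: the `<pre><code>` rule has to run before the inline `<code>` rule. The following is a minimal sketch of that interaction, not part of the original module. The names `fence_block`, `blocks_then_inline`, and `inline_then_blocks` are hypothetical, and the fence is produced by a callback rather than the template string used in the source.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL
FENCE = '`' * 3  # built at runtime so the example holds no literal fence


def fence_block(match):
    # Hypothetical helper: wrap the captured code in a fenced block.
    return f'\n{FENCE}\n{match.group(1)}\n{FENCE}\n\n'


def blocks_then_inline(text):
    # Same order as the documented function: block rule, then inline rule.
    text = re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', fence_block, text, flags=FLAGS)
    return re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', text, flags=FLAGS)


def inline_then_blocks(text):
    # Reversed order, shown only for contrast.
    text = re.sub(r'<code[^>]*>(.*?)</code>', r'`\1`', text, flags=FLAGS)
    return re.sub(r'<pre[^>]*><code[^>]*>(.*?)</code></pre>', fence_block, text, flags=FLAGS)


sample = '<pre><code>x = 1</code></pre>'
print(repr(blocks_then_inline(sample)))  # fenced block
print(repr(inline_then_blocks(sample)))  # '<pre>`x = 1`</pre>'
```

In the reversed order the inline rule consumes the inner `<code>` pair, so the block rule no longer matches and the content ends up as inline code once the leftover `<pre>` tags are stripped.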
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| html_text | - | - | positional_or_keyword |
Parameter Details
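html_text: String containing the HTML markup to convert. Any falsy value (None or an empty string) returns an empty string. Tags may carry attributes and may contain other tags, but because matching is regex-based, a tag nested inside another tag of the same type (for example, a list inside a list) is not converted reliably.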
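Same-type nesting is the case most worth checking before relying on the regex approach. The sketch below reuses the unordered-list pattern and callback from the source on a nested list, then applies the tag-stripping step. The sample input and the variable names are illustrative assumptions.

```python
import re

FLAGS = re.IGNORECASE | re.DOTALL


def convert_ul(match):
    # Same callback shape as in the function: one "- " line per <li> found.
    items = re.findall(r'<li[^>]*>(.*?)</li>', match.group(1), flags=FLAGS)
    return '\n' + '\n'.join(f'- {item.strip()}' for item in items) + '\n\n'


nested = '<ul><li>A<ul><li>A1</li></ul></li><li>B</li></ul>'

# The non-greedy group stops at the FIRST </ul>, which closes the inner list.
first_pass = re.sub(r'<ul[^>]*>(.*?)</ul>', convert_ul, nested, flags=FLAGS)

# Leftover tags are then stripped, as in the documented function.
stripped = re.sub(r'<[^>]+>', '', first_pass).strip()
print(repr(stripped))  # '- AA1\n\nB'
```

The inner item is merged into its parent and the second outer item loses its bullet, because the outer `<ul>` is matched against the inner closing tag.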
Return Value
Returns a string containing the Markdown-formatted version of the input HTML. The output is stripped of leading/trailing whitespace and has normalized spacing (no more than two consecutive newlines). HTML entities are decoded to their character equivalents. Returns empty string if input is None or empty.
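The last three steps (tag stripping, newline normalization, entity decoding) can be checked in isolation. This is a small sketch with a made-up input string. It shows that escaped markup such as `&lt;b&gt;` survives as literal text because entities are decoded only after tags have been removed.

```python
import html
import re

raw = 'a\n\n\n\nb &amp; c &lt;b&gt;'

no_tags = re.sub(r'<[^>]+>', '', raw)            # "&lt;b&gt;" is still encoded, so nothing is stripped
collapsed = re.sub(r'\n{3,}', '\n\n', no_tags)   # runs of 3+ newlines become exactly 2
result = html.unescape(collapsed).strip()        # entities decoded last

print(repr(result))  # 'a\n\nb & c <b>'
```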
Dependencies
- re
- html
Required Imports
import re
import html
Usage Example
import re
import html

def html_to_markdown(html_text):
    # ... (function code) ...
    pass

# Example usage
html_content = '<h1>Title</h1><p>This is <strong>bold</strong> and <em>italic</em> text.</p><ul><li>Item 1</li><li>Item 2</li></ul>'
markdown_output = html_to_markdown(html_content)
print(markdown_output)
# Output:
# # Title
#
# This is **bold** and *italic* text.
#
# - Item 1
# - Item 2

# Example with code blocks
html_with_code = '<pre><code>def hello():\n    print("Hello")</code></pre>'
markdown_code = html_to_markdown(html_with_code)
print(markdown_code)
# Output:
# ```
# def hello():
#     print("Hello")
# ```

# Example with links
html_with_link = '<a href="https://example.com">Example Link</a>'
markdown_link = html_to_markdown(html_with_link)
print(markdown_link)
# Output: [Example Link](https://example.com)
Best Practices
- The function uses regex-based parsing with non-greedy patterns, which works for flat, well-formed HTML but not for malformed markup or for a tag nested inside another tag of the same type (for example, a list inside a list)
- Code blocks are converted before inline formatting (bold, italic, inline code) so that the inline code rule does not consume the `<code>` inside a `<pre>` block; headers are converted earlier still
- The function is case-insensitive for HTML tags and handles attributes on tags
- Runs of three or more consecutive newlines are collapsed to two for clean output
- HTML entities are decoded at the end, after leftover tags are stripped, so escaped markup such as `&lt;b&gt;` is kept as literal text rather than removed
- For production use with complex HTML, consider a dedicated library such as BeautifulSoup (HTML parsing) or html2text (HTML-to-Markdown conversion)
- The function strips leading/trailing whitespace from the final output
- List items are automatically numbered for ordered lists and prefixed with dashes for unordered lists
- Empty or None input is safely handled and returns an empty string
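For nested lists, a parser-based pass avoids the matching problem that regular expressions have with same-type nesting. The sketch below uses the standard library's `html.parser.HTMLParser` instead of BeautifulSoup or html2text, so that it runs without extra dependencies. The class `ListToMarkdown` and the function `lists_to_markdown` are hypothetical names, and the sketch handles only list structure: inline tags are dropped and their text is kept.

```python
from html.parser import HTMLParser


class ListToMarkdown(HTMLParser):
    """Hypothetical helper: renders nested <ul>/<ol> as indented Markdown lists."""

    def __init__(self):
        super().__init__()
        self.stack = []  # one [tag, item_counter] entry per open list
        self.items = []  # one [prefix, text] entry per list item

    def handle_starttag(self, tag, attrs):
        if tag in ('ul', 'ol'):
            self.stack.append([tag, 0])
        elif tag == 'li' and self.stack:
            current = self.stack[-1]
            current[1] += 1
            marker = f'{current[1]}.' if current[0] == 'ol' else '-'
            indent = '  ' * (len(self.stack) - 1)
            self.items.append([f'{indent}{marker} ', ''])

    def handle_endtag(self, tag):
        if tag in ('ul', 'ol') and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        # Text is attached to the most recently opened item.
        if self.stack and self.items:
            self.items[-1][1] += data


def lists_to_markdown(html_text):
    parser = ListToMarkdown()
    parser.feed(html_text)
    parser.close()
    # Collapse whitespace inside each item, keep the indent prefix intact.
    return '\n'.join(prefix + ' '.join(text.split()) for prefix, text in parser.items)


result = lists_to_markdown('<ul><li>A<ul><li>A1</li></ul></li><li>B</li></ul>')
print(result)
# - A
#   - A1
# - B
```

The two-space indent suits `-` bullets. Items nested under a numbered parent would need a wider indent to line up with the parent's text, and text that follows a nested list inside the same item is attached to the last nested item, so this remains a starting point rather than a complete converter.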
Similar Components
AI-powered semantic similarity - components with related functionality:
- function html_to_markdown 94.0% similar
- function simple_markdown_to_html 87.1% similar
- function basic_markdown_to_html 85.0% similar
- function format_inline_markdown 78.8% similar
- function convert_markdown_to_html_v1 78.2% similar