function html_to_plain_text_with_formatting
Parses HTML content and converts it to plain text while preserving formatting information, returning a list of text segments with their associated format types (headers, bold, or normal).
/tf/active/vicechatdev/vice_ai/new_app.py
2608 - 2669
moderate
Purpose
This function is designed to extract text content from HTML while maintaining awareness of semantic formatting. It's useful for document processing workflows where you need to preserve the structure and emphasis of HTML content when converting to other formats (like Word documents or PDFs). The function identifies headers (h1-h6), bold/strong text, and normal paragraphs, making it ideal for document export services or content transformation pipelines.
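As a rough illustration of the export use case described above, the sketch below renders a list of (text, format_type) tuples as Markdown-style lines. The name segments_to_markdown and the sample data are hypothetical and are not part of new_app.py; the sample only mirrors the tuple shape this function returns.

```python
# Hypothetical downstream consumer; not part of the documented module.
# SAMPLE_SEGMENTS mimics the (text, format_type) tuples the function returns.
SAMPLE_SEGMENTS = [
    ("Main Title", "h1"),
    ("This is a normal paragraph.", "normal"),
    ("bold text", "bold"),
]

HEADER_FORMATS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def segments_to_markdown(segments):
    """Render each segment on its own line using Markdown-style markers."""
    lines = []
    for text, format_type in segments:
        if format_type in HEADER_FORMATS:
            level = int(format_type[1])  # 'h2' -> 2
            lines.append("#" * level + " " + text)
        elif format_type == "bold":
            lines.append("**" + text + "**")
        else:
            lines.append(text)
    return "\n".join(lines)


print(segments_to_markdown(SAMPLE_SEGMENTS))
```

A Word or PDF exporter would follow the same dispatch on format_type, swapping the string markers for the target library's heading and bold styles.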
Source Code
def html_to_plain_text_with_formatting(html_content):
    """Convert HTML to plain text while preserving basic formatting markers
    Returns a list of tuples: (text, format_type) where format_type is 'h1', 'h2', 'h3', 'bold', 'normal'
    """
    from html.parser import HTMLParser

    class HTMLFormatter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.elements = []
            self.current_text = []
            self.current_format = 'normal'
            self.tag_stack = []

        def handle_starttag(self, tag, attrs):
            # Flush current text before changing format
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
                self.current_text = []
            self.tag_stack.append(tag)
            # Set format based on tag
            if tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                self.current_format = tag
            elif tag in ['b', 'strong']:
                self.current_format = 'bold'
            elif tag == 'p':
                self.current_format = 'normal'

        def handle_endtag(self, tag):
            # Flush current text
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
                self.current_text = []
            # Pop tag from stack
            if self.tag_stack and self.tag_stack[-1] == tag:
                self.tag_stack.pop()
            # Reset format to normal after closing header or bold
            if tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'b', 'strong']:
                self.current_format = 'normal'

        def handle_data(self, data):
            self.current_text.append(data)

        def get_elements(self):
            # Flush any remaining text
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
            return self.elements

    parser = HTMLFormatter()
    parser.feed(html_content)
    return parser.get_elements()
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| html_content | - | - | positional_or_keyword |
Parameter Details
html_content: A string of HTML markup. It does not need to be well-formed or complete; html.parser.HTMLParser tolerates unclosed and mismatched tags. The function recognizes h1-h6 (headers), b/strong (bold text), and p (which resets the format to 'normal'). Other tags do not change the format, but their text content is still extracted, and every tag boundary still ends the current text segment.
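To make the tolerance for incomplete markup concrete, here is a small sketch that logs what html.parser.HTMLParser reports for input with unclosed tags. EventLogger is a hypothetical helper written for this illustration only; it is not used by the documented function.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records parser callbacks in the order they fire."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))


logger = EventLogger()
logger.feed("<h2>Unclosed header<p>Body text")  # neither tag is ever closed
logger.close()  # force any buffered text through handle_data

start_tags = [value for kind, value in logger.events if kind == "start"]
end_tags = [value for kind, value in logger.events if kind == "end"]
text = "".join(value for kind, value in logger.events if kind == "data")
print(start_tags, end_tags, text)
```

No end-tag callback fires for the unclosed elements. Reading the source above, the documented function would therefore rely on the next start tag (here p) to flush the header text and switch the format.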
Return Value
Returns a list of tuples where each tuple contains two elements: (text: str, format_type: str). The 'text' is the extracted plain text content (whitespace-stripped), and 'format_type' is one of: 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' (for headers), 'bold' (for bold/strong tags), or 'normal' (for paragraphs or unformatted text). Empty text segments are filtered out. The list preserves the order of elements as they appear in the HTML.
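Because the result is an ordered list of plain tuples, it can be filtered with ordinary Python. The sketch below builds an indented outline from the header segments; build_outline and the sample list are hypothetical names used only for illustration.

```python
# Hypothetical post-processing of the returned list; not part of the module.
SAMPLE_RESULT = [
    ("Main Title", "h1"),
    ("This is a normal paragraph.", "normal"),
    ("Subtitle", "h2"),
    ("bold text", "bold"),
]

HEADER_FORMATS = ("h1", "h2", "h3", "h4", "h5", "h6")


def build_outline(segments):
    """Return header texts in document order, indented two spaces per level."""
    outline = []
    for text, format_type in segments:
        if format_type in HEADER_FORMATS:
            level = int(format_type[1])  # 'h2' -> 2
            outline.append("  " * (level - 1) + text)
    return outline


for line in build_outline(SAMPLE_RESULT):
    print(line)
```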
Dependencies
html.parser
Required Imports
from html.parser import HTMLParser
Usage Example
# Assumes html_to_plain_text_with_formatting (see Source Code) is defined or imported.
# The caller needs no extra import: the function imports HTMLParser itself.

# Example usage
html_input = """
<h1>Main Title</h1>
<p>This is a normal paragraph.</p>
<h2>Subtitle</h2>
<p>Another paragraph with <b>bold text</b> inside.</p>
<strong>Strong emphasis</strong>
"""

result = html_to_plain_text_with_formatting(html_input)
for text, format_type in result:
    print(f"{format_type}: {text}")

# Output:
# h1: Main Title
# normal: This is a normal paragraph.
# h2: Subtitle
# normal: Another paragraph with
# bold: bold text
# normal: inside.
# bold: Strong emphasis
Best Practices
- Each extracted segment is whitespace-stripped, so spaces at segment boundaries are lost; add your own separators when rejoining segments
- Nested formatting is not preserved: the most recently opened recognized tag sets the format, and closing a header or bold tag resets it to 'normal' rather than to the enclosing format. For <h1>Title <b>x</b> rest</h1>, the text 'rest' is returned as 'normal', not 'h1'
- Only h1-h6, b, strong, and p tags are recognized for formatting; other tags are ignored but their content is extracted
- Text is flushed at every start and end tag, including unrecognized ones such as span or a, so a paragraph with inline markup yields several segments
- Empty text segments are automatically filtered out from the results
- The tag_stack is updated on every start and end tag but is never consulted when choosing a format, so it does not help with nested structures
- For malformed HTML, the HTMLParser will attempt to handle it gracefully, but results may vary
- Consider sanitizing or validating HTML input if it comes from untrusted sources
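The nesting limitation noted above could be addressed by keeping a stack of open formatting tags and falling back to the enclosing format when a tag closes. The following is a minimal sketch under that assumption, not the implementation in new_app.py; StackFormatter and parse_with_stack are hypothetical names, and unlike the original, p does not reset the format here.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}


class StackFormatter(HTMLParser):
    """Hypothetical variant that restores the enclosing format on close."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.current_text = []
        self.format_stack = []  # (tag, format_type) per open formatting tag

    def _current_format(self):
        return self.format_stack[-1][1] if self.format_stack else "normal"

    def _flush(self):
        # Emit pending text, if any, under the format currently in effect.
        text = "".join(self.current_text).strip()
        if text:
            self.elements.append((text, self._current_format()))
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        self._flush()
        if tag in HEADER_TAGS:
            self.format_stack.append((tag, tag))
        elif tag in BOLD_TAGS:
            self.format_stack.append((tag, "bold"))

    def handle_endtag(self, tag):
        self._flush()
        # Pop only when the closing tag matches the innermost formatting tag.
        if self.format_stack and self.format_stack[-1][0] == tag:
            self.format_stack.pop()

    def handle_data(self, data):
        self.current_text.append(data)

    def get_elements(self):
        self._flush()
        return self.elements


def parse_with_stack(html_content):
    parser = StackFormatter()
    parser.feed(html_content)
    parser.close()
    return parser.get_elements()


print(parse_with_stack("<h1>Release <b>notes</b> for v2</h1>"))
# [('Release', 'h1'), ('notes', 'bold'), ('for v2', 'h1')]
```

With the documented function, the trailing 'for v2' segment would come back as 'normal', because closing b resets the format unconditionally.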
Similar Components
AI-powered semantic similarity - components with related functionality:
- function html_to_text (70.6% similar)
- function html_to_markdown_v1 (70.3% similar)
- function simple_markdown_to_html (69.8% similar)
- function html_to_markdown (67.7% similar)
- function process_markdown_content_v1 (67.4% similar)