
function html_to_plain_text_with_formatting

Maturity: 49

Parses HTML content and converts it to plain text while preserving formatting information, returning a list of text segments with their associated format types (headers, bold, or normal).

File: /tf/active/vicechatdev/vice_ai/new_app.py
Lines: 2608 - 2669
Complexity: moderate

Purpose

This function is designed to extract text content from HTML while maintaining awareness of semantic formatting. It's useful for document processing workflows where you need to preserve the structure and emphasis of HTML content when converting to other formats (like Word documents or PDFs). The function identifies headers (h1-h6), bold/strong text, and normal paragraphs, making it ideal for document export services or content transformation pipelines.

Source Code

def html_to_plain_text_with_formatting(html_content):
    """Convert HTML to plain text while preserving basic formatting markers
    Returns a list of tuples: (text, format_type) where format_type is 'h1'-'h6', 'bold', or 'normal'
    """
    from html.parser import HTMLParser
    
    class HTMLFormatter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.elements = []
            self.current_text = []
            self.current_format = 'normal'
            self.tag_stack = []
            
        def handle_starttag(self, tag, attrs):
            # Flush current text before changing format
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
                self.current_text = []
            
            self.tag_stack.append(tag)
            
            # Set format based on tag
            if tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                self.current_format = tag
            elif tag in ['b', 'strong']:
                self.current_format = 'bold'
            elif tag == 'p':
                self.current_format = 'normal'
                
        def handle_endtag(self, tag):
            # Flush current text
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
                self.current_text = []
            
            # Pop tag from stack
            if self.tag_stack and self.tag_stack[-1] == tag:
                self.tag_stack.pop()
            
            # Reset format to normal after closing header or bold
            if tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'b', 'strong']:
                self.current_format = 'normal'
                
        def handle_data(self, data):
            self.current_text.append(data)
            
        def get_elements(self):
            # Flush any remaining text
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
            return self.elements
    
    parser = HTMLFormatter()
    parser.feed(html_content)
    parser.close()  # process any text the parser is still buffering
    return parser.get_elements()

Parameters

Name Type Default Kind
html_content - - positional_or_keyword

Parameter Details

html_content: A string of HTML markup. The markup does not need to be well-formed or complete, because HTMLParser is lenient about unclosed or mismatched tags. Any tags may appear, but only h1-h6 (headers), b/strong (bold text), and p (paragraphs) affect the reported format; other tags do not change the format, and their text content is still extracted.

Return Value

Returns a list of tuples where each tuple contains two elements: (text: str, format_type: str). The 'text' is the extracted plain text content (whitespace-stripped), and 'format_type' is one of: 'h1', 'h2', 'h3', 'h4', 'h5', 'h6' (for headers), 'bold' (for bold/strong tags), or 'normal' (for paragraphs or unformatted text). Empty text segments are filtered out. The list preserves the order of elements as they appear in the HTML.
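
The returned list can be consumed directly by an export step. The following is a minimal sketch, assuming a hypothetical helper named render_segments (it is not part of new_app.py) and Markdown-style markers chosen purely for illustration:

```python
from typing import List, Tuple

# (text, format_type), the tuple shape described under "Return Value"
Segment = Tuple[str, str]


def render_segments(segments: List[Segment]) -> str:
    """Render (text, format_type) tuples as Markdown-style plain text.

    Hypothetical helper for illustration only; not part of new_app.py.
    """
    lines: List[str] = []
    for text, format_type in segments:
        if format_type.startswith("h") and format_type[1:].isdigit():
            # 'h1'..'h6' become one to six leading '#' characters
            level = int(format_type[1:])
            lines.append("#" * level + " " + text)
        elif format_type == "bold":
            lines.append("**" + text + "**")
        else:
            # 'normal' (and any unexpected value) is emitted unchanged
            lines.append(text)
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        ("Main Title", "h1"),
        ("This is a normal paragraph.", "normal"),
        ("bold text", "bold"),
    ]
    print(render_segments(sample))
    # Output:
    # # Main Title
    # This is a normal paragraph.
    # **bold text**
```

Each segment is placed on its own line here; a real exporter would probably want to merge adjacent 'normal' and 'bold' segments that came from the same paragraph.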

Dependencies

  • html.parser

Required Imports

from html.parser import HTMLParser

Usage Example

# Assumes html_to_plain_text_with_formatting (see Source Code above) is already
# defined or imported; no extra import is needed because the function imports
# HTMLParser itself.

# Example usage
html_input = """
<h1>Main Title</h1>
<p>This is a normal paragraph.</p>
<h2>Subtitle</h2>
<p>Another paragraph with <b>bold text</b> inside.</p>
<strong>Strong emphasis</strong>
"""

result = html_to_plain_text_with_formatting(html_input)
for text, format_type in result:
    print(f"{format_type}: {text}")

# Output:
# h1: Main Title
# normal: This is a normal paragraph.
# h2: Subtitle
# normal: Another paragraph with
# bold: bold text
# normal: inside.
# bold: Strong emphasis

Best Practices

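  • The function strips leading and trailing whitespace from every segment, so the space between "paragraph with" and a following bold segment is lost; re-insert separators when joining segments back together
  • Nested formatting is not preserved: the most recently opened recognized tag sets the format, and closing any header or bold tag resets it to 'normal', so in <h1>Title <b>x</b> rest</h1> the trailing "rest" is reported as 'normal' rather than 'h1'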
  • Only h1-h6, b, strong, and p tags are recognized for formatting; other tags are ignored but their content is extracted
  • Buffered text is flushed at every start and end tag, including unrecognized ones such as span or i, so a single paragraph that contains inline tags is returned as several segments
  • Empty text segments are automatically filtered out from the results
  • The tag_stack is maintained but never consulted when choosing a format; the format simply resets to 'normal' after a closing header or bold tag, so nested structures are not handled correctly
  • For malformed HTML, the HTMLParser will attempt to handle it gracefully, but results may vary
  • Consider sanitizing or validating HTML input if it comes from untrusted sources
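
If the nesting limitation noted in the list above matters for your export, one possible approach is to derive the format from the stack of open tags instead of a single current_format field. The sketch below is a hypothetical variant, not the implementation in new_app.py; the names StackAwareFormatter and html_to_segments_stack_aware are assumptions made for illustration.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}


class StackAwareFormatter(HTMLParser):
    """Hypothetical variant: the format is derived from the open-tag stack."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.current_text = []
        self.tag_stack = []

    def _current_format(self):
        # The innermost open formatting tag wins; the default is 'normal'.
        for tag in reversed(self.tag_stack):
            if tag in HEADER_TAGS:
                return tag
            if tag in BOLD_TAGS:
                return "bold"
        return "normal"

    def _flush(self):
        text = "".join(self.current_text).strip()
        if text:
            self.elements.append((text, self._current_format()))
        self.current_text = []

    def handle_starttag(self, tag, attrs):
        self._flush()  # buffered text belongs to the format before this tag opens
        self.tag_stack.append(tag)

    def handle_endtag(self, tag):
        self._flush()  # buffered text belongs to the format before this tag closes
        # Drop the most recent matching open tag and anything left unclosed above it.
        for i in range(len(self.tag_stack) - 1, -1, -1):
            if self.tag_stack[i] == tag:
                del self.tag_stack[i:]
                break

    def handle_data(self, data):
        self.current_text.append(data)

    def close(self):
        super().close()  # let the parser deliver any text it is still holding
        self._flush()


def html_to_segments_stack_aware(html_content):
    parser = StackAwareFormatter()
    parser.feed(html_content)
    parser.close()
    return parser.elements


if __name__ == "__main__":
    print(html_to_segments_stack_aware("<h1>Title <b>x</b> rest</h1><p>body</p>"))
    # [('Title', 'h1'), ('x', 'bold'), ('rest', 'h1'), ('body', 'normal')]
```

With this variant, "rest" regains the 'h1' format after the nested b tag closes. The trade-off is that bold text inside a header is still reported only as 'bold', since each segment carries a single format_type.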

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function html_to_text 70.6% similar

    Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.

    From: /tf/active/vicechatdev/CDocs/utils/notifications.py
  • function html_to_markdown_v1 70.3% similar

    Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function simple_markdown_to_html 69.8% similar

    Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function html_to_markdown 67.7% similar

    Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function process_markdown_content_v1 67.4% similar

    Parses markdown-formatted text content and converts it into a structured list of document elements (headers, paragraphs, lists, tables, code blocks) with their types and formatting preserved in original order.

    From: /tf/active/vicechatdev/vice_ai/new_app.py