html_to_plain_text_with_formatting

function html_to_plain_text_with_formatting

Maturity: 49

Parses HTML content and converts it to plain text while preserving formatting information, returning a list of text segments with their associated format types (headers, bold, or normal).

File:
/tf/active/vicechatdev/vice_ai/new_app.py

Lines:
2608 - 2669

Complexity:
moderate

Purpose

This function is designed to extract text content from HTML while maintaining awareness of semantic formatting. It's useful for document processing workflows where you need to preserve the structure and emphasis of HTML content when converting to other formats (like Word documents or PDFs). The function identifies headers (h1-h6), bold/strong text, and normal paragraphs, making it ideal for document export services or content transformation pipelines.
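As a rough illustration of the export use case, the sketch below renders a list of (text, format_type) segments as Markdown-style lines. The helper name `segments_to_markdown` and the Markdown target are assumptions made for this example and are not part of the original module; the input list is hard-coded in the shape the function returns.

```python
from typing import List, Tuple

# (text, format_type), the shape returned by html_to_plain_text_with_formatting
Segment = Tuple[str, str]


def segments_to_markdown(segments: List[Segment]) -> str:
    """Render segments as Markdown-style lines (hypothetical helper)."""
    lines = []
    for text, format_type in segments:
        if format_type.startswith("h") and format_type[1:].isdigit():
            # 'h1'..'h6' become '#'..'######' prefixes
            level = int(format_type[1:])
            lines.append("#" * level + " " + text)
        elif format_type == "bold":
            lines.append(f"**{text}**")
        else:
            # 'normal' (or any unknown format) is emitted unchanged
            lines.append(text)
    return "\n".join(lines)


segments = [
    ("Main Title", "h1"),
    ("This is a normal paragraph.", "normal"),
    ("bold text", "bold"),
]
print(segments_to_markdown(segments))
```

A Word or PDF exporter would follow the same pattern, dispatching on format_type to pick a style instead of a Markdown prefix.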

Source Code

def html_to_plain_text_with_formatting(html_content):
    """Convert HTML to plain text while preserving basic formatting markers
    Returns a list of tuples: (text, format_type) where format_type is 'h1'-'h6', 'bold', or 'normal'
    """
    from html.parser import HTMLParser
    
    class HTMLFormatter(HTMLParser):
        def __init__(self):
            super().__init__()
            self.elements = []
            self.current_text = []
            self.current_format = 'normal'
            self.tag_stack = []
            
        def handle_starttag(self, tag, attrs):
            # Flush current text before changing format
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
                self.current_text = []
            
            self.tag_stack.append(tag)
            
            # Set format based on tag
            if tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6']:
                self.current_format = tag
            elif tag in ['b', 'strong']:
                self.current_format = 'bold'
            elif tag == 'p':
                self.current_format = 'normal'
                
        def handle_endtag(self, tag):
            # Flush current text
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
                self.current_text = []
            
            # Pop tag from stack
            if self.tag_stack and self.tag_stack[-1] == tag:
                self.tag_stack.pop()
            
            # Reset format to normal after closing header or bold
            if tag in ['h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'b', 'strong']:
                self.current_format = 'normal'
                
        def handle_data(self, data):
            self.current_text.append(data)
            
        def get_elements(self):
            # Flush any remaining text
            if self.current_text:
                text = ''.join(self.current_text).strip()
                if text:
                    self.elements.append((text, self.current_format))
            return self.elements
    
    parser = HTMLFormatter()
    parser.feed(html_content)
    return parser.get_elements()

Parameters

Name	Type	Default	Kind
`html_content`	-	-	positional_or_keyword

Parameter Details

html_content: A string of HTML markup. It does not need to be well-formed or complete, since HTMLParser tolerates unclosed or stray tags. The function recognizes h1-h6 (headers), b/strong (bold text), and p (paragraphs) for formatting; other tags do not change the format, but their text content is still extracted.
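To make the tolerance for incomplete markup concrete, here is a minimal sketch using a bare HTMLParser subclass rather than the documented function. `TextCollector` and `collect_text` are hypothetical names for this example. It shows that an unclosed paragraph is still parsed, and that text inside an otherwise ignored tag such as script is delivered as data too.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect every non-empty text chunk, ignoring all tags (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        stripped = data.strip()
        if stripped:
            self.chunks.append(stripped)


def collect_text(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()  # force processing of any buffered input
    return parser.chunks


# The <p> is never closed, and the script body is not visible page text,
# yet both pieces of text reach handle_data.
print(collect_text("<p>Unclosed paragraph<script>var x = 1;</script>"))
# ['Unclosed paragraph', 'var x = 1;']
```

If script or style content should not appear in the output, it would need to be removed before or during parsing; the documented function does not do this.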

Return Value

Returns a list of (text, format_type) tuples, both strings. 'text' is the extracted plain text with leading and trailing whitespace stripped (internal whitespace is left as-is), and 'format_type' is one of 'h1' through 'h6' (headers), 'bold' (b/strong tags), or 'normal' (paragraphs and unformatted text). Segments that are empty after stripping are omitted. The list preserves the order in which the text appears in the HTML.
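Because the result is a flat list, callers that need document structure have to rebuild it. The sketch below groups segments under the most recent header. `group_by_header` is a hypothetical helper, not something defined in the source file, and the input list is written out by hand in the documented return shape.

```python
from typing import Dict, List, Tuple

HEADER_FORMATS = {"h1", "h2", "h3", "h4", "h5", "h6"}


def group_by_header(segments: List[Tuple[str, str]]) -> List[Dict]:
    """Group non-header segments under the header that precedes them."""
    sections = []
    current = {"header": None, "body": []}
    for text, format_type in segments:
        if format_type in HEADER_FORMATS:
            # Close the previous section if it holds anything
            if current["header"] is not None or current["body"]:
                sections.append(current)
            current = {"header": text, "body": []}
        else:
            current["body"].append(text)
    if current["header"] is not None or current["body"]:
        sections.append(current)
    return sections


segments = [
    ("Main Title", "h1"),
    ("This is a normal paragraph.", "normal"),
    ("Subtitle", "h2"),
    ("Another paragraph with", "normal"),
    ("bold text", "bold"),
]
for section in group_by_header(segments):
    print(section["header"], "->", section["body"])
```

Text that appears before the first header ends up in a section whose header is None.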

Dependencies

html.parser

Required Imports

from html.parser import HTMLParser

Usage Example

# Assumes html_to_plain_text_with_formatting (see Source Code above) is defined
# or imported in the current scope; it imports HTMLParser internally.

# Example usage
html_input = """
<h1>Main Title</h1>
<p>This is a normal paragraph.</p>
<h2>Subtitle</h2>
<p>Another paragraph with <b>bold text</b> inside.</p>
<strong>Strong emphasis</strong>
"""

result = html_to_plain_text_with_formatting(html_input)
for text, format_type in result:
    print(f"{format_type}: {text}")

# Output:
# h1: Main Title
# normal: This is a normal paragraph.
# h2: Subtitle
# normal: Another paragraph with
# bold: bold text
# normal: inside.
# bold: Strong emphasis

Best Practices

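Each segment is stripped of leading and trailing whitespace, so spaces between adjacent inline segments (for example around a b element) are lost; internal whitespace, including newlines, is kept as-is
Nested formatting is not preserved: the most recently opened recognized tag sets the format, and closing a header or bold tag resets it to 'normal', so in <h1>Title <b>x</b> rest</h1> the text after </b> is labeled 'normal' rather than 'h1'
Only h1-h6, b, strong, and p affect the format; other tags (such as em, span, or a) leave it unchanged and their text is still extracted, including the contents of script and style elements
Text is flushed at every start and end tag, recognized or not, so a paragraph containing inline tags is returned as several segments rather than one
Segments that are empty after stripping are dropped from the results
tag_stack is maintained but never consulted when choosing a format, so it has no effect on the output
HTMLParser tolerates malformed HTML such as unclosed tags, but the resulting segments may vary; the function also never calls parser.close(), so input that ends mid-tag, or whose trailing text contains a bare '&', may leave data unprocessed
Consider sanitizing or validating HTML input if it comes from untrusted sources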
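The nesting limitation noted above could be addressed by keeping a stack of open formatting tags and restoring the enclosing format when an inner tag closes. The following is a sketch of that idea, not the implementation in new_app.py; `StackFormatter` and `parse_with_nesting` are hypothetical names. It also calls close() so that buffered input is processed.

```python
from html.parser import HTMLParser

HEADER_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
BOLD_TAGS = {"b", "strong"}


class StackFormatter(HTMLParser):
    """Variant that restores the enclosing format after a nested tag closes."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self.buffer = []
        self.format_stack = []  # (tag, format_type) for open formatting tags

    def _current_format(self):
        return self.format_stack[-1][1] if self.format_stack else "normal"

    def _flush(self):
        text = "".join(self.buffer).strip()
        if text:
            self.elements.append((text, self._current_format()))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        self._flush()  # flush at every tag boundary, like the original
        if tag in HEADER_TAGS:
            self.format_stack.append((tag, tag))
        elif tag in BOLD_TAGS:
            self.format_stack.append((tag, "bold"))

    def handle_endtag(self, tag):
        self._flush()  # text before the end tag keeps the inner format
        if tag in HEADER_TAGS or tag in BOLD_TAGS:
            # Drop the most recent matching open tag and anything opened after it
            for i in range(len(self.format_stack) - 1, -1, -1):
                if self.format_stack[i][0] == tag:
                    del self.format_stack[i:]
                    break

    def handle_data(self, data):
        self.buffer.append(data)

    def close(self):
        super().close()  # process any input still buffered by the parser
        self._flush()


def parse_with_nesting(html):
    parser = StackFormatter()
    parser.feed(html)
    parser.close()
    return parser.elements


print(parse_with_nesting("<h1>Report <b>2024</b> summary</h1>"))
# [('Report', 'h1'), ('2024', 'bold'), ('summary', 'h1')]
```

With the documented function, the last segment of this input would be labeled 'normal' instead of 'h1'. The sketch still reports a single format per segment, so bold text inside a header is labeled 'bold' rather than some combined format.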

Similar Components

AI-powered semantic similarity - components with related functionality:

function html_to_text 70.6% similar
Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.
From: /tf/active/vicechatdev/CDocs/utils/notifications.py

function html_to_markdown_v1 70.3% similar
Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.
From: /tf/active/vicechatdev/vice_ai/new_app.py

function simple_markdown_to_html 69.8% similar
Converts a subset of Markdown syntax to clean HTML, supporting headers, bold text, unordered lists, and paragraphs.
From: /tf/active/vicechatdev/vice_ai/new_app.py

function html_to_markdown 67.7% similar
Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function process_markdown_content_v1 67.4% similar
Parses markdown-formatted text content and converts it into a structured list of document elements (headers, paragraphs, lists, tables, code blocks) with their types and formatting preserved in original order.
From: /tf/active/vicechatdev/vice_ai/new_app.py
