🔍 Code Extractor

function html_to_text

Maturity: 50

Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.

File: /tf/active/vicechatdev/CDocs/utils/notifications.py
Lines: 852 - 878
Complexity: simple

Purpose

This function provides a simple HTML-to-text conversion utility for extracting readable plain text from HTML content. It strips all HTML tags, converts five common HTML entities (&nbsp;, &lt;, &gt;, &amp;, &quot;) to their text equivalents, and normalizes whitespace. It is suitable for basic use cases; a comment in the source notes that production applications should consider a more robust library such as BeautifulSoup for complex HTML parsing.

Source Code

def html_to_text(html: str) -> str:
    """
    Convert HTML to plain text.
    
    Args:
        html: HTML content
        
    Returns:
        Plain text version
    """
    # Simple HTML to text conversion - for production use consider a library like BeautifulSoup
    text = html
    
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    
    # Handle some common entities
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

Parameters

Name    Type    Default    Kind
html    str     -          positional_or_keyword

Parameter Details

html: A string containing HTML content to be converted. Can include any valid HTML markup with tags, entities, and whitespace. Empty strings are acceptable and will return an empty string.

Return Value

Type: str

Returns a string containing the plain text version of the input HTML. All HTML tags are removed, common HTML entities are decoded to their character equivalents, multiple whitespace characters are collapsed to single spaces, and leading/trailing whitespace is stripped. Returns an empty string if the input is empty or contains only HTML tags and whitespace.

Dependencies

  • re

Required Imports

import re

Usage Example

import re

def html_to_text(html: str) -> str:
    text = html
    text = re.sub(r'<[^>]*>', '', text)
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Example usage
html_content = '<p>Hello &amp; welcome to <strong>our site</strong>!</p>'
plain_text = html_to_text(html_content)
print(plain_text)  # Output: Hello & welcome to our site!

# Example with entities and whitespace
html_with_entities = '<div>Price:&nbsp;&lt;$100&gt;&nbsp;&nbsp;&quot;Sale&quot;</div>'
result = html_to_text(html_with_entities)
print(result)  # Output: Price: <$100> "Sale"
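
Beyond the two examples above, a few edge cases are worth checking before relying on the function. The following sketch repeats the same function body so that it runs on its own, then shows what the regex-and-replace approach returns for inputs it was not designed for. The sample strings are illustrative and do not come from the CDocs code base.

```python
import re

def html_to_text(html: str) -> str:
    # Same logic as the documented function, repeated so this block is self-contained.
    text = re.sub(r'<[^>]*>', '', html)
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Tags are replaced with an empty string, so adjacent elements run together.
print(html_to_text('<p>First</p><p>Second</p>'))      # FirstSecond

# Only the tags are removed; the script body stays in the output.
print(html_to_text('<script>var x = 1;</script>Hi'))  # var x = 1;Hi

# Entities outside the five handled ones pass through unchanged.
print(html_to_text('&copy; 2024'))                    # &copy; 2024

# &amp; is replaced before &quot;, so an escaped "&quot;" is decoded twice.
print(html_to_text('&amp;quot;'))                     # "

# Anything between a "<" and a later ">" is treated as a tag and removed.
print(html_to_text('a < b and c > d'))                # a d
```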

Best Practices

  • This function provides basic HTML-to-text conversion suitable for simple use cases. For production applications with complex HTML, consider using BeautifulSoup or html2text libraries for more robust parsing.
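  • The function only handles five common HTML entities (&nbsp;, &lt;, &gt;, &amp;, &quot;). Other named entities (e.g., &copy;, &euro;) and numeric character references (e.g., &#169;) are left in the output unchanged. Python's standard-library html.unescape decodes the full set.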
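  • Tag removal relies on the regex <[^>]*>, which is not an HTML parser. It keeps the contents of <script> and <style> elements, leaves a stray < with no closing > in place, and deletes any text that happens to sit between a < and a later > (for example, 'a < b and c > d' becomes 'a d').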
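  • Entity replacement order matters. &amp; is replaced after &lt; and &gt;, so &amp;lt; correctly decodes to the literal text &lt;. However, &quot; is replaced after &amp;, so &amp;quot; is double-decoded to " instead of &quot;. Replacing &amp; last would avoid this.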
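  • The function does not preserve HTML structure such as paragraphs or line breaks: all content is flattened to a single line with normalized spacing. Because tags are replaced with an empty string, text from adjacent elements is joined with no separator (<p>First</p><p>Second</p> becomes FirstSecond).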
  • There is no input validation. Passing a non-string value such as None or bytes raises a TypeError from re.sub, so ensure the input is a str before calling.
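
Several of the limitations listed above can be addressed without third-party dependencies. The following is a minimal sketch, not the CDocs implementation, built on the standard-library html.parser.HTMLParser. The names html_to_text_stdlib, _TextExtractor, BLOCK_TAGS, and SKIPPED_TAGS are hypothetical, and the choice of which tags count as block-level is an assumption made for illustration.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping script/style bodies (hypothetical helper)."""

    # Assumed set of tags that should separate words; extend as needed.
    BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes named and numeric entities in text nodes.
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append(" ")

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS:
            if self._skip_depth > 0:
                self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self._parts.append(" ")

    def handle_data(self, data):
        # Ignore text that sits inside <script> or <style>.
        if self._skip_depth == 0:
            self._parts.append(data)

    def text(self):
        # Collapse runs of whitespace (including non-breaking spaces) to one space.
        return re.sub(r"\s+", " ", "".join(self._parts)).strip()


def html_to_text_stdlib(html: str) -> str:
    """Parser-based variant of html_to_text (illustrative sketch)."""
    if not isinstance(html, str):
        raise TypeError(f"html must be str, got {type(html).__name__}")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(html_to_text_stdlib("<p>First</p><p>Second</p>"))      # First Second
print(html_to_text_stdlib("<script>var x = 1;</script>Hi"))  # Hi
print(html_to_text_stdlib("&amp;quot;"))                     # &quot;
```

Inline tags such as <strong> add no space, so punctuation that follows them stays attached to the preceding word. For heavily malformed markup, a dedicated library such as BeautifulSoup remains the safer choice.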

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function html_to_plain_text_with_formatting 70.6% similar

    Parses HTML content and converts it to plain text while preserving formatting information, returning a list of text segments with their associated format types (headers, bold, or normal).

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function clean_html_tags 66.0% similar

    Removes HTML tags and entities from text strings, returning clean plain text suitable for PDF display or other formatted output.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function html_to_markdown 64.9% similar

    Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function html_to_markdown_v1 64.5% similar

    Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function clean_text 63.4% similar

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py