html_to_text - Code Extractor

function html_to_text

Maturity: 50

Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.

File: /tf/active/vicechatdev/CDocs/utils/notifications.py

Lines: 852 - 878

Complexity: simple

Purpose

This function is a simple HTML-to-text conversion utility for extracting readable plain text from HTML content. It strips all HTML tags, converts five common HTML entities (&nbsp;, &lt;, &gt;, &amp;, &quot;) to their text equivalents, and normalizes whitespace. It is suitable for basic use cases; a comment in the source notes that production applications should consider a more robust library such as BeautifulSoup for complex HTML parsing.

Source Code

def html_to_text(html: str) -> str:
    """
    Convert HTML to plain text.
    
    Args:
        html: HTML content
        
    Returns:
        Plain text version
    """
    # Simple HTML to text conversion - for production use consider a library like BeautifulSoup
    text = html
    
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    
    # Handle some common entities
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()

Parameters

Name	Type	Default	Kind
`html`	str	-	positional_or_keyword

Parameter Details

html: A string containing HTML content to be converted. Can include any valid HTML markup with tags, entities, and whitespace. Empty strings are acceptable and will return an empty string.

Return Value

Type: str

Returns a string containing the plain text version of the input HTML. All HTML tags are removed, common HTML entities are decoded to their character equivalents, multiple whitespace characters are collapsed to single spaces, and leading/trailing whitespace is stripped. Returns an empty string if the input is empty or contains only HTML tags and whitespace.

Dependencies

re

Required Imports

import re

Usage Example

import re

def html_to_text(html: str) -> str:
    text = html
    text = re.sub(r'<[^>]*>', '', text)
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Example usage
html_content = '<p>Hello &amp; welcome to <strong>our site</strong>!</p>'
plain_text = html_to_text(html_content)
print(plain_text)  # Output: 'Hello & welcome to our site!'

# Example with entities and whitespace
html_with_entities = '<div>Price:&nbsp;&lt;$100&gt;&nbsp;&nbsp;&quot;Sale&quot;</div>'
result = html_to_text(html_with_entities)
print(result)  # Output: 'Price: <$100> "Sale"'
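
The calls below are a small sketch of how the function, as listed above, behaves on inputs the two examples do not cover. The function body is repeated (condensed, same logic) only so the snippet runs on its own; the sample strings are invented for illustration.

```python
import re

def html_to_text(html: str) -> str:
    # Same logic as the documented function, repeated so this runs standalone
    text = re.sub(r'<[^>]*>', '', html)
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Adjacent block elements are joined with no separator
print(html_to_text('<p>One</p><p>Two</p>'))            # OneTwo

# Script bodies are kept, because only the tags themselves are removed
print(html_to_text('<script>var x = 1;</script>Hi'))   # var x = 1;Hi

# Entities outside the five handled ones pass through untouched
print(html_to_text('&copy; 2024'))                     # &copy; 2024

# An escaped entity is decoded twice: &amp;quot; -> &quot; -> "
print(html_to_text('&amp;quot;'))                      # "

# Input with only tags and whitespace yields an empty string
print(repr(html_to_text('<br/>  <hr>')))               # ''
```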

Best Practices

This function provides basic HTML-to-text conversion suitable for simple use cases. For production applications with complex HTML, consider using BeautifulSoup or html2text libraries for more robust parsing.
The function only handles five common HTML entities (&nbsp;, &lt;, &gt;, &amp;, &quot;). If your HTML contains other entities (e.g., &copy;, &euro;), they will not be decoded.
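The tag-removal regex (<[^>]*>) deletes only the tags themselves, so the contents of <script> and <style> elements remain in the output. It also mishandles malformed HTML: an unclosed tag with no closing > is left in the text, a > inside an attribute value or comment ends the match early, and ordinary text between a literal < and a later > is removed.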
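Entity replacement order matters: &amp; is decoded after &lt; and &gt;, so &amp;lt; correctly becomes the literal text &lt;. It is decoded before &quot;, however, so &amp;quot; is double-decoded to ". Decoding &amp; last would avoid this.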
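The function does not preserve any HTML structure such as paragraphs or line breaks: all content is flattened to a single line with normalized spacing, and text from adjacent block elements is joined without a separator (<p>One</p><p>Two</p> becomes OneTwo).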
Input validation is minimal: passing a non-string such as None or bytes raises a TypeError from the first re.sub call, so check that the input is a str before calling.
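
As a rough illustration of the more robust approach the notes above point toward, the sketch below uses only the standard library (html.parser.HTMLParser) rather than BeautifulSoup or html2text. It is not part of CDocs: the names _TextExtractor, html_to_text_robust, _BLOCK_TAGS, and _SKIP_TAGS are hypothetical, and the choice of which tags count as block-level is an assumption. Unlike the documented function, it puts each block on its own line instead of flattening everything to a single line.

```python
import re
from html.parser import HTMLParser

# Assumed set of tags that should start a new line in the output
_BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
# Elements whose content is not reader-visible text
_SKIP_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping script/style content."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode named and numeric
        # character references once, before handle_data is called
        super().__init__(convert_charrefs=True)
        self._parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in _SKIP_TAGS:
            self._skip_depth += 1
        elif tag in _BLOCK_TAGS:
            self._parts.append("\n")

    def handle_endtag(self, tag):
        if tag in _SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in _BLOCK_TAGS:
            self._parts.append("\n")

    def handle_data(self, data):
        if not self._skip_depth:
            # Source newlines are not structure; only block tags add breaks
            self._parts.append(re.sub(r"\s+", " ", data))

    def text(self) -> str:
        # Collapse spaces within each line and drop empty lines
        lines = (" ".join(line.split()) for line in "".join(self._parts).split("\n"))
        return "\n".join(line for line in lines if line)


def html_to_text_robust(html: str) -> str:
    # Fail early with a clear message instead of deep inside the parser
    if not isinstance(html, str):
        raise TypeError(f"html must be str, got {type(html).__name__}")
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

With this sketch, html_to_text_robust('<p>One</p><p>Two</p>') returns the two words on separate lines, script content is dropped, and &amp;quot; is decoded once to the literal text &quot;. Whether line breaks should be kept or flattened depends on how the plain text is used.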

Similar Components

AI-powered semantic similarity - components with related functionality:

function html_to_plain_text_with_formatting 70.6% similar

Parses HTML content and converts it to plain text while preserving formatting information, returning a list of text segments with their associated format types (headers, bold, or normal).
From: /tf/active/vicechatdev/vice_ai/new_app.py

function clean_html_tags 66.0% similar

Removes HTML tags and entities from text strings, returning clean plain text suitable for PDF display or other formatted output.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function html_to_markdown 64.9% similar

Converts HTML text back to Markdown format using regex-based pattern matching and replacement, handling headers, code blocks, formatting, links, lists, and HTML entities.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function html_to_markdown_v1 64.5% similar

Converts HTML markup to Markdown syntax, handling headers, code blocks, text formatting, links, lists, and paragraphs with proper spacing.
From: /tf/active/vicechatdev/vice_ai/new_app.py

function clean_text 63.4% similar

Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.
From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
