function html_to_text
Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.
/tf/active/vicechatdev/CDocs/utils/notifications.py
852 - 878
simple
Purpose
This function provides a simple HTML-to-text conversion utility for extracting readable plain text from HTML content. It strips all HTML tags, converts five common HTML entities (&nbsp;, &lt;, &gt;, &amp;, &quot;) to their text equivalents, and normalizes whitespace. It is suitable for basic use cases; a comment in the source suggests considering a more robust library such as BeautifulSoup for production use.
Source Code
def html_to_text(html: str) -> str:
    """
    Convert HTML to plain text.

    Args:
        html: HTML content

    Returns:
        Plain text version
    """
    # Simple HTML to text conversion - for production use consider a library like BeautifulSoup
    text = html

    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)

    # Handle some common entities
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')

    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text)

    return text.strip()
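The three stages of the function can be traced one at a time. The snippet below is a minimal sketch: the sample string and the variable names (sample, no_tags, decoded, normalized) are illustrative assumptions, and the loop is only a compact way of writing the five replace calls in the same order as the source.

```python
import re

# Hypothetical sample input, chosen to exercise all three stages.
sample = '<p>Fish &amp;   chips:&nbsp;&lt;5&gt;</p>\n<p>&quot;Fresh&quot;</p>'

# Stage 1: drop anything that looks like a tag.
no_tags = re.sub(r'<[^>]*>', '', sample)

# Stage 2: decode the five entities in the same order as the function.
decoded = no_tags
for entity, char in [('&nbsp;', ' '), ('&lt;', '<'), ('&gt;', '>'),
                     ('&amp;', '&'), ('&quot;', '"')]:
    decoded = decoded.replace(entity, char)

# Stage 3: collapse whitespace runs (including the newline) and trim the ends.
normalized = re.sub(r'\s+', ' ', decoded).strip()

print(repr(no_tags))     # entities still encoded, newline still present
print(repr(decoded))     # entities decoded, spacing untouched
print(repr(normalized))  # 'Fish & chips: <5> "Fresh"'
```

Because tags are removed before entities are decoded, the literal <5> produced by &lt;5&gt; survives in the output instead of being stripped as a tag.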
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| html | str | - | positional_or_keyword |
Parameter Details
html: A string containing HTML content to be converted. Can include any valid HTML markup with tags, entities, and whitespace. Empty strings are acceptable and will return an empty string.
Return Value
Type: str
Returns a string containing the plain text version of the input HTML. All HTML tags are removed, common HTML entities are decoded to their character equivalents, multiple whitespace characters are collapsed to single spaces, and leading/trailing whitespace is stripped. Returns an empty string if the input is empty or contains only HTML tags and whitespace.
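The return-value rules above can be checked with a few assertions. This is a sketch that restates the function compactly so the snippet runs on its own; the loop over entity pairs is a condensation for brevity, not the layout used in the source file, and the sample inputs are made up for illustration.

```python
import re

def html_to_text(html: str) -> str:
    # Same steps and same entity order as the documented function.
    text = re.sub(r'<[^>]*>', '', html)
    for entity, char in [('&nbsp;', ' '), ('&lt;', '<'), ('&gt;', '>'),
                         ('&amp;', '&'), ('&quot;', '"')]:
        text = text.replace(entity, char)
    return re.sub(r'\s+', ' ', text).strip()

# Empty input, and input made only of tags and whitespace, both give ''.
assert html_to_text('') == ''
assert html_to_text('<div>  <br/> \n </div>') == ''

# Runs of mixed whitespace collapse to one space; the ends are trimmed.
assert html_to_text('  a \t\n b  ') == 'a b'

# Tags are removed without inserting a separator, so adjacent blocks run together.
assert html_to_text('<p>one</p><p>two</p>') == 'onetwo'

# Entities outside the five handled ones pass through unchanged.
assert html_to_text('&copy; 2024') == '&copy; 2024'
```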
Dependencies
re
Required Imports
import re
Usage Example
import re

def html_to_text(html: str) -> str:
    text = html
    text = re.sub(r'<[^>]*>', '', text)
    text = text.replace('&nbsp;', ' ')
    text = text.replace('&lt;', '<')
    text = text.replace('&gt;', '>')
    text = text.replace('&amp;', '&')
    text = text.replace('&quot;', '"')
    text = re.sub(r'\s+', ' ', text)
    return text.strip()

# Example usage
html_content = '<p>Hello &amp; welcome to <strong>our site</strong>!</p>'
plain_text = html_to_text(html_content)
print(plain_text)  # Output: 'Hello & welcome to our site!'

# Example with entities and whitespace
html_with_entities = '<div>Price: &lt;$100&gt; &quot;Sale&quot;</div>'
result = html_to_text(html_with_entities)
print(result)  # Output: 'Price: <$100> "Sale"'
Best Practices
- This function provides basic HTML-to-text conversion suitable for simple use cases. For production applications with complex HTML, consider using BeautifulSoup or html2text libraries for more robust parsing.
- The function only handles five common HTML entities (&nbsp;, &lt;, &gt;, &amp;, &quot;). If your HTML contains other entities (e.g., &copy;, &euro;), they will not be decoded.
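- The regex pattern for tag removal is simple and does not parse HTML. It can mishandle malformed markup and edge cases: an unclosed < is left in the output, a > inside an attribute value ends the match early, and the contents of script and style elements are kept as text.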
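- Entity replacement order matters: &amp; is replaced after &nbsp;, &lt;, and &gt;, so an escaped sequence such as &amp;lt; correctly becomes the literal text &lt;. It is replaced before &quot;, however, so &amp;quot; is double-decoded to a quotation mark.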
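- The function does not preserve any HTML structure like paragraphs or line breaks - all content is flattened to a single line with normalized spacing. Because tags are replaced with an empty string, text from adjacent elements can run together (for example, <p>one</p><p>two</p> becomes onetwo).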
- Input validation is minimal: the function does not check the argument type. Passing a non-string such as None raises a TypeError from re.sub, so validate or convert the input before calling it.
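Several of the limitations listed above can be addressed with the Python standard library alone. The following is a minimal sketch, not the implementation used in notifications.py: the names _TextExtractor, html_to_text_stdlib, and keep_line_breaks, as well as the choice of block-level tags, are assumptions made for illustration. It relies on html.parser.HTMLParser, which decodes all named and numeric character references in a single pass when convert_charrefs is enabled.

```python
import re
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and skips script/style bodies (hypothetical helper)."""

    # Assumed set of tags that should act as line separators.
    BLOCK_TAGS = {'p', 'div', 'br', 'li', 'tr', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6'}
    SKIP_TAGS = {'script', 'style'}

    def __init__(self):
        # convert_charrefs=True hands entities to handle_data already decoded.
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append('\n')

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append('\n')

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def html_to_text_stdlib(markup: str, keep_line_breaks: bool = False) -> str:
    """Hypothetical alternative to html_to_text built on the standard library."""
    if not isinstance(markup, str):
        raise TypeError(f'expected str, got {type(markup).__name__}')

    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text the parser is still buffering
    text = ''.join(parser.parts)

    if keep_line_breaks:
        # Normalize spacing within each line and drop empty lines.
        lines = (re.sub(r'\s+', ' ', line).strip() for line in text.split('\n'))
        return '\n'.join(line for line in lines if line)
    return re.sub(r'\s+', ' ', text).strip()


print(html_to_text_stdlib('<p>one</p><p>two</p>'))         # one two
print(html_to_text_stdlib('&copy; 2024 &amp;quot;'))       # © 2024 &quot;
print(html_to_text_stdlib('<script>x = 1;</script>Hi'))    # Hi
```

With keep_line_breaks=True the same input keeps one line per block element, which may suit plain-text email bodies better than a single flattened line. For heavily malformed markup, a dedicated library such as BeautifulSoup or html2text is still likely to be the safer choice.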
Similar Components
AI-powered semantic similarity - components with related functionality:
- function html_to_plain_text_with_formatting 70.6% similar
- function clean_html_tags 66.0% similar
- function html_to_markdown 64.9% similar
- function html_to_markdown_v1 64.5% similar
- function clean_text 63.4% similar