function clean_html_tags
Strips HTML tags and decodes HTML entities in a text string, returning clean plain text suitable for PDF display or other formatted output.
/tf/active/vicechatdev/vice_ai/complex_app.py
1775 - 1792
simple
Purpose
This function cleans HTML-formatted text in three steps: it unescapes HTML entities (like &amp;, &lt;, etc.), strips all HTML tags, and then normalizes whitespace. It is designed for preparing text content for PDF generation or other contexts where HTML markup needs to be removed while preserving the actual text content.
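The three steps can be traced one at a time. The following is a minimal sketch that applies the same standard-library calls the function uses to a made-up sample string; the variable names are illustrative and not part of the original module.

```python
import html
import re

# Hypothetical sample input with tags, entities, and a newline.
sample = "<p>Fish &amp; chips:&nbsp;&pound;5</p>\n<p>Daily</p>"

# Step 1: decode entities into the characters they stand for.
decoded = html.unescape(sample)

# Step 2: drop anything that looks like a tag.
stripped = re.sub(r"<[^>]+>", "", decoded)

# Step 3: collapse every run of whitespace to a single space.
# str.split() treats the non-breaking space from &nbsp; as whitespace too.
normalized = " ".join(stripped.split())

print(normalized)  # Fish & chips: £5 Daily
```

Note that the non-breaking space produced by &nbsp; does not survive step 3; it is replaced by an ordinary space.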
Source Code
def clean_html_tags(text):
    """Remove HTML tags from text for clean PDF display"""
    if not text:
        return text
    import re
    import html
    # First unescape HTML entities
    text = html.unescape(text)
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    # Clean up extra whitespace
    text = ' '.join(text.split())
    return text
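For code that calls this function many times, one possible variant hoists the imports to module level and compiles the pattern once. This is a sketch only: the names `_TAG_PATTERN` and `clean_html_tags_fast` are hypothetical and do not appear in the original module.

```python
import html
import re

# Compiled once at import time instead of being looked up on every call.
_TAG_PATTERN = re.compile(r"<[^>]+>")


def clean_html_tags_fast(text):
    """Hypothetical variant of clean_html_tags with module-level imports."""
    if not text:
        return text
    text = html.unescape(text)          # decode entities first
    text = _TAG_PATTERN.sub("", text)   # then strip tags
    return " ".join(text.split())       # finally collapse whitespace


print(clean_html_tags_fast("<p>Hello &amp; welcome</p>"))  # Hello & welcome
```

The gain is likely to be small in practice, because Python caches imported modules and the `re` module caches recently compiled patterns; the behavior is intended to be identical to the original.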
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| text | - | - | positional_or_keyword |
Parameter Details
text: A string containing HTML markup to be cleaned. Can be None or an empty string, which will be returned as-is. Expected to contain HTML tags (like <p>, <div>, <span>) and/or HTML entities (like &nbsp;, &quot;) that need to be removed or converted to plain text.
Return Value
Returns a string with all HTML tags removed, HTML entities unescaped to their character equivalents, and whitespace normalized (multiple spaces/newlines collapsed to single spaces). Returns the original value unchanged if input is None or empty. Return type is str or None.
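The return behavior at the edges can be checked directly. The sketch below repeats the function so it runs on its own and adds a hypothetical wrapper, `clean_html_tags_str`, for callers that always need a `str`; the wrapper is an illustration, not part of the original module.

```python
import html
import re


def clean_html_tags(text):
    """Copy of the documented function, repeated so this sketch is self-contained."""
    if not text:
        return text
    text = html.unescape(text)
    text = re.sub(r'<[^>]+>', '', text)
    return ' '.join(text.split())


def clean_html_tags_str(text):
    """Hypothetical wrapper: always returns a str, mapping None to ''."""
    return clean_html_tags(text) or ""


print(repr(clean_html_tags(None)))      # None  (falsy input is returned unchanged)
print(repr(clean_html_tags("")))        # ''
print(repr(clean_html_tags("   ")))     # ''    (whitespace-only input collapses to empty)
print(repr(clean_html_tags("<br>")))    # ''    (tag-only input leaves no text)
print(repr(clean_html_tags_str(None)))  # ''
```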
Dependencies
re
html
Required Imports
import re
import html
Conditional/Optional Imports
These imports are only needed under specific conditions:
import re
Condition: imported inside function body, always needed when function executes
Required (conditional)
import html
Condition: imported inside function body, always needed when function executes
Required (conditional)
Usage Example
import re
import html

def clean_html_tags(text):
    if not text:
        return text
    text = html.unescape(text)
    text = re.sub(r'<[^>]+>', '', text)
    text = ' '.join(text.split())
    return text

# Example usage
html_text = '<p>Hello &amp; welcome to <strong>our site</strong>!</p>'
cleaned = clean_html_tags(html_text)
print(cleaned)  # Output: 'Hello & welcome to our site!'

# With HTML entities (escaped markup)
entity_text = '&lt;div&gt;Price: &pound;50&lt;/div&gt;'
cleaned = clean_html_tags(entity_text)
print(cleaned)  # Unescaped to '<div>Price: £50</div>', then tags removed: 'Price: £50'

# With None or empty
print(clean_html_tags(None))  # Output: None
print(clean_html_tags(''))  # Output: ''

# With extra whitespace
whitespace_text = '<p>Multiple   spaces\n\nand   newlines</p>'
cleaned = clean_html_tags(whitespace_text)
print(cleaned)  # Output: 'Multiple spaces and newlines'
Best Practices
- The function uses lazy imports (re and html imported inside the function body). After the first call these are cached module lookups, so the per-call overhead is small; for performance-critical applications with many calls, consider moving the imports to module level.
- The regex pattern r'<[^>]+>' removes anything between a < and the next >. It does not parse HTML, so it leaves the contents of <script> and <style> elements in the output and can delete ordinary text that happens to sit between a literal < and >.
- HTML entities are unescaped before tag removal, so escaped markup such as &lt;b&gt; is first converted to <b> and then stripped like any other tag.
- The function collapses all whitespace (spaces, tabs, newlines) into single spaces, which may not preserve intentional formatting like line breaks.
- Returns None/empty input unchanged, so always check return value if you need to ensure a string type.
- Does not validate or sanitize for security purposes - this is for display formatting only, not XSS prevention.
- For more robust HTML parsing and cleaning, consider using libraries like BeautifulSoup or bleach for production applications.
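Two of the caveats above can be reproduced in a few lines, together with one possible parser-based alternative. The sketch below uses the standard library's `html.parser` rather than BeautifulSoup or bleach so that it runs without extra dependencies; `_TextExtractor` and `html_to_plain_text` are hypothetical names, and like the original function this is a display helper, not a security sanitizer.

```python
import html
import re
from html.parser import HTMLParser


def clean_html_tags(text):
    """Copy of the documented function, repeated so this sketch is self-contained."""
    if not text:
        return text
    text = html.unescape(text)
    text = re.sub(r'<[^>]+>', '', text)
    return ' '.join(text.split())


# Caveat 1: ordinary text between a literal '<' and '>' is treated as a tag.
print(clean_html_tags("if a < b and c > d"))           # if a d

# Caveat 2: the body of a <script> element is kept as text.
print(clean_html_tags("<script>alert(1)</script>Hi"))  # alert(1)Hi


class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text and skips <script>/<style> bodies."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # entities are decoded for us
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Same whitespace normalization as the original function.
        return " ".join("".join(self._chunks).split())


def html_to_plain_text(markup):
    """Hypothetical parser-based alternative to clean_html_tags."""
    if not markup:
        return markup
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered trailing text
    return parser.text()


print(html_to_plain_text("<p>Hi <b>there</b></p><script>x()</script>"))  # Hi there
```

The parser-based version still collapses line breaks, so it shares the whitespace caveat; it differs only in how it decides what counts as a tag.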
Similar Components
AI-powered semantic similarity - components with related functionality:
- function clean_html_tags_v1 79.9% similar
- function clean_text 76.6% similar
- function html_to_text 66.0% similar
- function process_inline_markdown 64.9% similar
- function clean_text_for_xml 64.4% similar