function clean_html_tags

Maturity: 44

Removes HTML tags and decodes HTML entities in text strings, returning clean plain text suitable for PDF display or other formatted output.

File: /tf/active/vicechatdev/vice_ai/complex_app.py
Lines: 1775 - 1792
Complexity: simple

Purpose

This function cleans HTML-formatted text in three steps: it unescapes HTML entities (such as &amp; and &lt;), strips all HTML tags, and normalizes whitespace. It is designed for preparing text for PDF generation or other contexts where HTML markup must be removed while the text content is kept.

Source Code

def clean_html_tags(text):
    """Remove HTML tags from text for clean PDF display"""
    if not text:
        return text
    
    import re
    import html
    
    # First unescape HTML entities
    text = html.unescape(text)
    
    # Remove HTML tags
    text = re.sub(r'<[^>]+>', '', text)
    
    # Clean up extra whitespace
    text = ' '.join(text.split())
    
    return text

Parameters

Name  Type  Default  Kind
text  -     -        positional_or_keyword

Parameter Details

text: A string containing HTML markup to be cleaned. Can be None or empty string, which will be returned as-is. Expected to contain HTML tags (like <p>, <div>, <span>) and/or HTML entities (like &nbsp;, &quot;) that need to be removed or converted to plain text.

Return Value

Returns a string with all HTML tags removed, HTML entities unescaped to their character equivalents, and whitespace normalized (multiple spaces/newlines collapsed to single spaces). Returns the original value unchanged if input is None or empty. Return type is str or None.
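Because falsy input is passed through unchanged, a caller that needs a guaranteed str has to handle None itself. The sketch below shows one way to do that; clean_html_tags_str is a hypothetical helper name that does not exist in complex_app.py, and the function body is repeated only so the snippet runs on its own.

```python
import html
import re


def clean_html_tags(text):
    """Copy of the documented function so this sketch is self-contained."""
    if not text:
        return text
    text = html.unescape(text)
    text = re.sub(r'<[^>]+>', '', text)
    return ' '.join(text.split())


def clean_html_tags_str(text):
    """Hypothetical wrapper: always return a str, mapping None to ''."""
    cleaned = clean_html_tags(text)
    return cleaned if isinstance(cleaned, str) else ''


print(repr(clean_html_tags(None)))      # None is passed through
print(repr(clean_html_tags_str(None)))  # '' instead of None
print(repr(clean_html_tags('   ')))     # whitespace-only input collapses to ''
```

Note that a whitespace-only string is truthy, so it goes through the full pipeline and comes back as an empty string rather than unchanged.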

Dependencies

  • re
  • html

Required Imports

import re
import html

Conditional/Optional Imports

Both imports sit inside the function body, after the early return. They run on every call that receives non-empty input and are skipped for None or empty input:

import re

Required (conditional)
Condition: imported inside function body, needed whenever the function processes non-empty input

import html

Required (conditional)
Condition: imported inside function body, needed whenever the function processes non-empty input
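
If the in-function imports are not needed for a specific reason, they can be hoisted to module level. The following is a minimal sketch of that variant, not the code in complex_app.py; the _TAG_RE name is an assumption introduced here for illustration.

```python
import html
import re

# Hypothetical module-level constant: the pattern is compiled once at import time.
_TAG_RE = re.compile(r'<[^>]+>')


def clean_html_tags(text):
    """Same behavior as the documented function, with imports at module level."""
    if not text:
        return text
    text = html.unescape(text)          # decode entities
    text = _TAG_RE.sub('', text)        # strip anything that looks like a tag
    return ' '.join(text.split())       # collapse whitespace


print(clean_html_tags('<p>Hello &amp; welcome</p>'))
```

The re module already caches recently compiled patterns, so the precompiled pattern is mostly a readability choice; the main effect of this variant is that import errors surface at module load rather than at first call.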

Usage Example

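import re
import html

def clean_html_tags(text):
    """Remove HTML tags from text for clean PDF display"""
    # Falsy input (None, '') is returned unchanged
    if not text:
        return text
    # Decode entities first, then strip tags, then collapse whitespace
    text = html.unescape(text)
    text = re.sub(r'<[^>]+>', '', text)
    text = ' '.join(text.split())
    return text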

# Example usage
html_text = '<p>Hello &amp; welcome to <strong>our site</strong>!</p>'
cleaned = clean_html_tags(html_text)
print(cleaned)  # Output: 'Hello & welcome to our site!'

# With HTML entities
entity_text = '&lt;div&gt;Price: &pound;50&lt;/div&gt;'
cleaned = clean_html_tags(entity_text)
print(cleaned)  # Output: 'Price: £50' (entities decode to '<div>Price: £50</div>', then the tags are stripped)

# With None or empty
print(clean_html_tags(None))  # Output: None
print(clean_html_tags(''))    # Output: ''

# With extra whitespace
whitespace_text = '<p>Multiple    spaces\n\nand   newlines</p>'
cleaned = clean_html_tags(whitespace_text)
print(cleaned)  # Output: 'Multiple spaces and newlines'
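
The examples above cover well-formed input. The sketch below runs the same function on a few less tidy inputs to show what the unescape-then-strip order and the simple regex do in practice. The input strings are made up for illustration.

```python
import html
import re


def clean_html_tags(text):
    """Copy of the documented function so this sketch is self-contained."""
    if not text:
        return text
    text = html.unescape(text)
    text = re.sub(r'<[^>]+>', '', text)
    return ' '.join(text.split())


# Escaped markup is unescaped first, so it is then stripped like a real tag.
escaped = clean_html_tags('Use the &lt;b&gt; tag for bold text')
print(escaped)      # 'Use the tag for bold text'

# A "<" and a later ">" in plain text are treated as one tag.
comparison = clean_html_tags('if a < b and c > d')
print(comparison)   # 'if a d'

# &nbsp; decodes to U+00A0, which str.split() treats as whitespace.
nbsp = clean_html_tags('10&nbsp;kg')
print(nbsp)         # '10 kg'

# Only the tags are removed; the script body stays in the output.
script = clean_html_tags('<script>alert(1)</script>Hello')
print(script)       # 'alert(1)Hello'
```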

Best Practices

  • The function uses lazy imports (re and html imported inside function body), which adds slight overhead on each call. For performance-critical applications with many calls, consider moving imports to module level.
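  • The regex pattern r'<[^>]+>' removes any run of characters between a < and the next >. It does not parse HTML, so it also deletes plain text that happens to contain both characters (for example, a comparison such as a < b and c > d), and it keeps the text inside <script> and <style> elements because only the tags themselves are removed.
  • HTML entities are unescaped before tag removal, so &lt; and &gt; become < and > and are then treated as tag delimiters. Escaped markup that was meant to be displayed literally, such as &lt;b&gt;, is therefore stripped along with real tags.
  • The function collapses all whitespace (spaces, tabs, newlines) into single spaces, which does not preserve intentional formatting such as line breaks. This includes non-breaking spaces produced by &nbsp;, since str.split() treats U+00A0 as whitespace.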
  • Returns None/empty input unchanged, so always check return value if you need to ensure a string type.
  • Does not validate or sanitize for security purposes - this is for display formatting only, not XSS prevention.
  • For more robust HTML parsing and cleaning, consider using libraries like BeautifulSoup or bleach for production applications.
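
As a dependency-free illustration of the parser-based approach, the sketch below uses the standard library's html.parser.HTMLParser instead of BeautifulSoup or bleach. It is an alternative technique, not the implementation in complex_app.py; the _TextExtractor and strip_html names are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    _SKIPPED = {'script', 'style'}

    def __init__(self):
        # convert_charrefs=True makes the parser decode entities in text nodes.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self._SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Same whitespace normalization as clean_html_tags.
        return ' '.join(''.join(self._chunks).split())


def strip_html(text):
    """Hypothetical parser-based counterpart to clean_html_tags."""
    if not text:
        return text
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return parser.text()


print(strip_html('<p>Hello &amp; welcome to <strong>our site</strong>!</p>'))
print(strip_html('<script>alert(1)</script>Hello'))
print(strip_html('Use the &lt;b&gt; tag'))
```

Because the parser separates markup from text before entities are decoded, escaped markup such as &lt;b&gt; survives as literal text, and script bodies can be dropped explicitly. Like the original function, this is for display formatting and is not a substitute for a security-focused sanitizer.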

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function clean_html_tags_v1 79.9% similar

    Removes all HTML tags from a given text string using regular expression pattern matching, returning clean text without markup.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function clean_text 76.6% similar

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function html_to_text 66.0% similar

    Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.

    From: /tf/active/vicechatdev/CDocs/utils/notifications.py
  • function process_inline_markdown 64.9% similar

    Processes inline markdown formatting by unescaping HTML entities in text. Currently performs basic cleanup while preserving markdown syntax for downstream processing.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function clean_text_for_xml 64.4% similar

    Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py