function clean_html_tags_v1

Maturity: 34

Removes all HTML tags from a given text string using regular expression pattern matching, returning clean text without markup.

File:
/tf/active/vicechatdev/vice_ai/new_app.py
Lines:
3676 - 3680
Complexity:
simple

Purpose

This utility function strips HTML tags from text. It is useful for cleaning user input, preparing text for display in non-HTML contexts, and extracting plain text from HTML content before storage or further processing. None, an empty string, and any other falsy input all return an empty string.

Source Code

def clean_html_tags(text):
    """Remove HTML tags from text"""
    if not text:
        return ""
    return re.sub(r'<[^>]+>', '', text)

Parameters

Name    Type    Default    Kind
text    -       -          positional_or_keyword

Parameter Details

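text: The input string, which may contain HTML tags. Can be None, an empty string, or any string with or without HTML markup. Because the guard is `if not text`, None, an empty string, and any other falsy value (for example 0 or an empty list) return an empty string without running the regex. A truthy non-string value reaches re.sub and raises TypeError.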
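The input handling described above can be checked with a short sketch. The function body is copied from the Source Code section; the names falsy_results and non_string_error are introduced here only for illustration.

```python
import re

def clean_html_tags(text):
    """Remove HTML tags from text (copied from the Source Code section)."""
    if not text:
        return ""
    return re.sub(r'<[^>]+>', '', text)

# Every falsy value takes the early-return branch, not only None and "".
falsy_results = [clean_html_tags(value) for value in (None, "", 0, [], False)]
print(falsy_results)

# A truthy non-string value reaches re.sub, which rejects it.
try:
    clean_html_tags(123)
    non_string_error = None
except TypeError as exc:
    non_string_error = type(exc).__name__
print(non_string_error)
```

If stricter behavior is wanted, an isinstance(text, str) check before the regex call is one option, though the original function does not include one.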

Return Value

Returns a string with every substring matching the pattern '<[^>]+>' removed, that is, a '<', one or more characters other than '>', and the closing '>'. Text between tags is preserved unchanged. If the input is None, empty, or otherwise falsy, returns an empty string (''). Whenever the function returns, the return type is str.
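To make the matching rule concrete, here is a minimal sketch that applies the same pattern to a few inputs. TAG_PATTERN and strip are hypothetical names used only in this example; the pattern itself is the one from the source.

```python
import re

# Same pattern as in clean_html_tags, precompiled for reuse in this sketch.
TAG_PATTERN = re.compile(r'<[^>]+>')

def strip(text):
    """Apply the tag pattern without the falsy-input guard."""
    return TAG_PATTERN.sub('', text)

# Ordinary markup: tags go, text stays.
print(repr(strip("<p>Hello</p>")))

# A literal '<' followed later by '>' is treated as one "tag".
print(repr(strip("a < b and c > d")))

# '<>' has nothing between the brackets, so it does not match.
print(repr(strip("x <> y")))

# A '>' inside a quoted attribute value ends the match early.
print(repr(strip("<a title='1 > 0'>link</a>")))

# Element content is kept, including script bodies.
print(repr(strip("<script>alert(1)</script>")))
```

The second and fourth cases show why the pattern is best described as removing '<...>' spans, not as parsing HTML.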

Dependencies

  • re

Required Imports

import re

Usage Example

import re

def clean_html_tags(text):
    """Remove HTML tags from text"""
    if not text:
        return ""
    return re.sub(r'<[^>]+>', '', text)

# Example usage
html_text = "<p>Hello <strong>world</strong>!</p>"
clean_text = clean_html_tags(html_text)
print(clean_text)  # Output: Hello world!

# Handle None input
result = clean_html_tags(None)
print(repr(result))  # Output: ''

# Handle empty string
result = clean_html_tags("")
print(repr(result))  # Output: ''

# Complex HTML: text from adjacent elements is joined without a space
complex_html = "<div class='container'><h1>Title</h1><p>Paragraph with <a href='#'>link</a></p></div>"
clean = clean_html_tags(complex_html)
print(clean)  # Output: TitleParagraph with link
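The last example joins "Title" and "Paragraph" because tags are replaced with nothing. One possible workaround, sketched here under the assumption that word separation matters more than exact spacing, is to replace each tag with a space and then collapse whitespace. clean_html_tags_spaced is a hypothetical variant, not part of the original file.

```python
import re

def clean_html_tags_spaced(text):
    """Hypothetical variant: replace each tag with a space, then collapse whitespace."""
    if not text:
        return ""
    without_tags = re.sub(r'<[^>]+>', ' ', text)
    # Collapse runs of whitespace left behind by the substitution.
    return re.sub(r'\s+', ' ', without_tags).strip()

complex_html = "<div class='container'><h1>Title</h1><p>Paragraph with <a href='#'>link</a></p></div>"
print(clean_html_tags_spaced(complex_html))  # Output: Title Paragraph with link

# Trade-off: inline tags also become spaces, so punctuation can drift.
print(clean_html_tags_spaced("<p>Hello <strong>world</strong>!</p>"))  # Output: Hello world !
```

The second call shows the cost of this approach: a space appears before the "!" because the closing inline tag is also turned into a space.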

Best Practices

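  • The regex '<[^>]+>' is a simple pattern, not an HTML parser. It can mishandle a '>' inside a quoted attribute value, it keeps the text inside <script> and <style> elements, and it removes plain text that sits between a literal '<' and a later '>' (for example, "a < b and c > d" becomes "a  d")
  • For more robust HTML parsing and cleaning, consider the standard-library html.parser module or a third-party library such as BeautifulSoup
  • The function does not decode HTML entities (e.g., '&amp;' remains '&amp;'). Use html.unescape() if entity decoding is needed
  • Tags are replaced with an empty string, not a space, so text from adjacent elements is joined (e.g., '<h1>Title</h1><p>Paragraph</p>' becomes 'TitleParagraph')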
  • Always validate and sanitize user input even after tag removal to prevent other security issues
  • Consider using this in combination with other sanitization methods for comprehensive text cleaning
  • The function handles None and other falsy inputs, but it does not check that the input is a string. A truthy non-string value (e.g., 123) raises TypeError in re.sub
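As one illustration of the parser-based alternative mentioned above, the following is a minimal sketch built on the standard-library html.parser module. TextExtractor and strip_tags_with_parser are hypothetical names, and skipping script and style content is a design choice made for this example, not something the original function does.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Hypothetical helper: collect text nodes, skipping script/style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data.
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)

def strip_tags_with_parser(markup):
    """Return the text content of markup, or "" for falsy input."""
    if not markup:
        return ""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered at end of input
    return parser.text()

print(strip_tags_with_parser("<p>Fish &amp; chips</p>"))
print(strip_tags_with_parser("<a title='1 > 0'>link</a>"))
print(strip_tags_with_parser("<script>alert(1)</script>Hi"))
```

Compared with the regex, this version decodes entities and copes with a '>' inside a quoted attribute, at the cost of more code. Like the original, it does not insert spaces between adjacent elements.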

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function clean_html_tags 79.9% similar

    Removes HTML tags and entities from text strings, returning clean plain text suitable for PDF display or other formatted output.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function clean_text 74.7% similar

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function clean_text_for_xml_v1 63.9% similar

    Sanitizes text strings to ensure XML 1.0 compatibility by removing or replacing invalid control characters and ensuring all characters meet XML specification requirements for Word document generation.

    From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
  • function clean_text_for_xml 60.3% similar

    Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function html_to_text 58.7% similar

    Converts HTML content to plain text by removing HTML tags, decoding common HTML entities, and normalizing whitespace.

    From: /tf/active/vicechatdev/CDocs/utils/notifications.py