function clean_html_tags_v1
Removes all HTML tags from a given text string using regular expression pattern matching, returning clean text without markup.
/tf/active/vicechatdev/vice_ai/new_app.py
3676 - 3680
simple
Purpose
This utility function strips HTML tags from text. It is useful for cleaning user input, preparing text for display in non-HTML contexts, extracting plain text from HTML content, or removing markup from data before storage. For None, an empty string, or any other falsy input, it returns an empty string.
Source Code
def clean_html_tags(text):
    """Remove HTML tags from text"""
    if not text:
        return ""
    return re.sub(r'<[^>]+>', '', text)
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| text | - | - | positional_or_keyword |
Parameter Details
text: The input string that may contain HTML tags. Can be None, an empty string, or any string with or without HTML markup. If None or empty, the function returns an empty string without attempting regex processing.
Return Value
Returns a string with all HTML tags removed. If the input is None or empty, returns an empty string (''). The function preserves all text content between tags but removes the tags themselves (anything matching the pattern '<...>'). Return type is always str.
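The return behavior described above can be checked with a short sketch. The function body is copied from the Source Code section; the sample inputs and the variable names (`falsy_results`, `raised`, `kept`, `mangled`) are illustrative assumptions, not part of the original module.

```python
import re


def clean_html_tags(text):
    """Remove HTML tags from text (copied from the Source Code section)."""
    if not text:
        return ""
    return re.sub(r'<[^>]+>', '', text)


# Any falsy input short-circuits to "" before the regex runs.
falsy_results = [clean_html_tags(value) for value in (None, "", 0, [])]

# A truthy non-string value reaches re.sub and raises TypeError.
try:
    clean_html_tags(123)
    raised = False
except TypeError:
    raised = True

# Text between tags is kept verbatim, so entities are not decoded.
kept = clean_html_tags("<p>Tom &amp; Jerry</p>")

# Anything shaped like '<...>' is removed, even when it is not a tag.
mangled = clean_html_tags("if a < b and c > d")

print(falsy_results)  # ['', '', '', '']
print(raised)         # True
print(repr(kept))     # 'Tom &amp; Jerry'
print(repr(mangled))  # 'if a  d'
```

The last case shows that the pattern is purely textual: it has no notion of what a valid tag is, so comparison operators in ordinary prose can be swallowed.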
Dependencies
re
Required Imports
import re
Usage Example
import re

def clean_html_tags(text):
    """Remove HTML tags from text"""
    if not text:
        return ""
    return re.sub(r'<[^>]+>', '', text)

# Example usage
html_text = "<p>Hello <strong>world</strong>!</p>"
clean_text = clean_html_tags(html_text)
print(clean_text)  # Output: "Hello world!"

# Handle None input
result = clean_html_tags(None)
print(result)  # Output: ""

# Handle empty string
result = clean_html_tags("")
print(result)  # Output: ""

# Complex HTML
complex_html = "<div class='container'><h1>Title</h1><p>Paragraph with <a href='#'>link</a></p></div>"
clean = clean_html_tags(complex_html)
print(clean)  # Output: "TitleParagraph with link"
Best Practices
- This function uses a simple regex pattern that does not handle every edge case: a '>' inside an attribute value ends the match early, the contents of <script> and <style> elements are kept as text, and a '<' followed later by a '>' in ordinary text (as in 'a < b and c > d') is removed as if it were a tag
- For more robust HTML parsing and cleaning, consider using a library such as BeautifulSoup or the standard-library html.parser module
- The function does not decode HTML entities (e.g., '&amp;' remains as '&amp;'); use html.unescape() if entity decoding is needed
- The regex pattern '<[^>]+>' removes tags but does not add spaces, so text from adjacent elements may be concatenated (e.g., 'TitleParagraph')
- Tag removal alone is not a security measure; always validate and sanitize user input after tag removal as well
- Consider using this in combination with other sanitization methods for comprehensive text cleaning
- The function is safe for None inputs but does not check that the input is a string; a truthy non-string value (e.g., an integer) raises TypeError inside re.sub
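The bullets above mention html.parser and entity decoding as more robust options. The following is a minimal sketch of that approach using only the standard library. The names `strip_html_robust`, `_TextExtractor`, and `BLOCK_TAGS` are hypothetical and do not exist in the original module, and the set of block-level tags is an assumption chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Assumption: tags treated as word boundaries so adjacent blocks do not run together.
BLOCK_TAGS = {"p", "div", "br", "li", "tr", "h1", "h2", "h3", "h4", "h5", "h6"}
SKIPPED_TAGS = {"script", "style"}


class _TextExtractor(HTMLParser):
    """Collects text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True (the default) decodes entities such as &amp;
        super().__init__(convert_charrefs=True)
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS:
            if self._skip_depth:
                self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self._chunks.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace introduced at block boundaries.
        return re.sub(r"\s+", " ", "".join(self._chunks)).strip()


def strip_html_robust(text):
    """Hypothetical alternative to clean_html_tags built on html.parser."""
    if not text:
        return ""
    parser = _TextExtractor()
    parser.feed(str(text))
    parser.close()
    return parser.text()


print(strip_html_robust("<div><h1>Title</h1><p>Paragraph with <a href='#'>link</a></p></div>"))
# Title Paragraph with link
print(strip_html_robust("<p>Tom &amp; Jerry</p>"))
# Tom & Jerry
```

Unlike the regex version, this sketch also collapses runs of whitespace, so it is not a drop-in replacement where the original spacing must be preserved.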
Similar Components
AI-powered semantic similarity - components with related functionality:
- function clean_html_tags 79.9% similar
- function clean_text 74.7% similar
- function clean_text_for_xml_v1 63.9% similar
- function clean_text_for_xml 60.3% similar
- function html_to_text 58.7% similar