function clean_text
Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.
/tf/active/vicechatdev/improved_convert_disclosures_to_table.py
25 - 42
simple
Purpose
This function is designed to sanitize and standardize text input by removing various formatting artifacts. It's particularly useful for preprocessing text data before analysis, storage, or display in plain text format. Common use cases include cleaning scraped web content, normalizing user input, preparing text for NLP tasks, or converting formatted documents to plain text.
Source Code
def clean_text(text):
    """Clean and normalize text content."""
    if not text:
        return ""
    # Remove HTML tags if any
    text = re.sub(r'<[^>]+>', '', text)
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Remove markdown formatting
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Bold
    text = re.sub(r'\*([^*]+)\*', r'\1', text)  # Italic
    text = re.sub(r'`([^`]+)`', r'\1', text)  # Code
    text = re.sub(r'#{1,6}\s*', '', text)  # Headers
    return text
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| text | - | - | positional_or_keyword |
Parameter Details
text: The input text string to be cleaned. Can be None, empty string, or any string containing HTML tags, markdown formatting, or irregular whitespace. If None or empty, the function returns an empty string.
Return Value
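Returns a cleaned string with HTML tags removed, every run of whitespace (including newlines and tabs) collapsed to a single space, leading/trailing whitespace removed, and markdown markers for bold, italic, inline code, and headers stripped. The header pattern is not anchored to the start of a line, so every run of one to six # characters is removed wherever it appears. Returns an empty string if input is None or empty.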
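The return behavior described above can be checked with a few edge cases. This is a minimal sketch: the function body is copied from the Source Code section so the snippet runs on its own, and the sample inputs are made up for illustration.

```python
import re

def clean_text(text):
    """Copy of the documented function, repeated so this snippet is self-contained."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    text = re.sub(r'#{1,6}\s*', '', text)
    return text

# Newlines and tabs collapse to single spaces, so paragraph breaks are lost.
multi_line = clean_text("First  line\n\n\tSecond line  ")
print(multi_line)   # First line Second line

# The header pattern is not anchored, so a '#' in running text is removed too.
issue_ref = clean_text("See issue #42")
print(issue_ref)    # See issue 42

# Anything between a '<' and a later '>' is treated as a tag and deleted.
comparison = clean_text("if a < b and c > d")
print(comparison)   # if a d
```

The last two cases are worth keeping in mind when the input may contain issue references, language names such as C#, or comparison operators.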
Dependencies
re
Required Imports
import re
Usage Example
import re

def clean_text(text):
    """Clean and normalize text content."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    text = re.sub(r'#{1,6}\s*', '', text)
    return text

# Example usage
raw_text = "<p>This is **bold** and *italic* text with `code` and ### headers</p>"
cleaned = clean_text(raw_text)
print(cleaned)  # Output: "This is bold and italic text with code and headers"

# Handle None or empty input
print(clean_text(None))  # Output: ""
print(clean_text(""))  # Output: ""
Best Practices
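- The function already returns an empty string for None, an empty string, or any other falsy value, so callers do not need a separate guard; a non-empty value that is not a string (for example, a list) raises a TypeError inside re.sub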
- Be aware that this function removes ALL HTML and markdown formatting, which may not be desired if you need to preserve some structure
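- The regex patterns are simple and do not handle nested or malformed markdown/HTML reliably: the tag pattern also deletes plain text that sits between a < and a later > (such as comparison operators), and the bold and italic patterns skip emphasized text that itself contains an asterisk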
- For complex HTML parsing needs, consider using a dedicated HTML parser like BeautifulSoup instead of regex
- The markdown removal patterns are basic and may not cover all markdown syntax variations (e.g., links, images, lists)
- Whitespace normalization converts all consecutive whitespace (including newlines) to single spaces, which removes paragraph structure
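- Consider the order of operations: HTML is removed first, then whitespace is normalized, then markdown is stripped. Because newlines are already collapsed when the header pattern runs, and that pattern is not anchored to the start of a line, it removes every run of one to six # characters anywhere in the text (for example, "issue #42" becomes "issue 42")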
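The notes above on order of operations, lost paragraph structure, and dedicated HTML parsing can be combined into an alternative. The following is only a sketch under stated assumptions, not part of the documented module: strip_tags, _TextExtractor, and clean_text_keep_paragraphs are hypothetical names, and the standard library's html.parser stands in for BeautifulSoup so the snippet has no third-party dependency.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Hypothetical helper: collects text nodes and discards tags."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html):
    """Return the text content of an HTML fragment using a real parser."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return "".join(parser.parts)

def clean_text_keep_paragraphs(text):
    """Hypothetical variant of clean_text that keeps paragraph breaks."""
    if not text:
        return ""
    text = strip_tags(text)
    # Strip header markers only at the start of a line, and do it
    # before whitespace is collapsed so line starts still exist.
    text = re.sub(r'^[ \t]*#{1,6}[ \t]+', '', text, flags=re.MULTILINE)
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Bold
    text = re.sub(r'\*([^*]+)\*', r'\1', text)      # Italic
    text = re.sub(r'`([^`]+)`', r'\1', text)        # Inline code
    # Collapse whitespace inside each paragraph; keep blank-line breaks.
    paragraphs = re.split(r'\n\s*\n', text)
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)

sample = "## Title\n\nSee <b>issue</b> #42 and **bold**   text.\n\nC# is `fine`"
result = clean_text_keep_paragraphs(sample)
print(result)
```

With this ordering, the leading "## " is removed while "#42" and "C#" survive, and the three paragraphs stay separated by blank lines. Whether that trade-off is wanted depends on the caller; the original function's single-line output may be preferable for table cells or short fields.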
Similar Components
AI-powered semantic similarity - components with related functionality:
- function clean_html_tags 76.6% similar
- function clean_html_tags_v1 74.7% similar
- function clean_text_for_xml 71.0% similar
- function clean_text_for_xml_v1 70.4% similar
- function process_inline_markdown 66.8% similar