clean_text - Code Extractor

function clean_text

Maturity: 44

Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

File:
/tf/active/vicechatdev/improved_convert_disclosures_to_table.py

Lines:
25 - 42

Complexity:
simple

Purpose

This function is designed to sanitize and standardize text input by removing various formatting artifacts. It's particularly useful for preprocessing text data before analysis, storage, or display in plain text format. Common use cases include cleaning scraped web content, normalizing user input, preparing text for NLP tasks, or converting formatted documents to plain text.

Source Code

def clean_text(text):
    """Clean and normalize text content."""
    if not text:
        return ""
    
    # Remove HTML tags if any
    text = re.sub(r'<[^>]+>', '', text)
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove markdown formatting
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Bold
    text = re.sub(r'\*([^*]+)\*', r'\1', text)      # Italic
    text = re.sub(r'`([^`]+)`', r'\1', text)        # Code
    text = re.sub(r'#{1,6}\s*', '', text)           # Headers
    
    return text

Parameters

Name	Type	Default	Kind
`text`	-	-	positional_or_keyword

Parameter Details

text: The input text string to be cleaned. Can be None, empty string, or any string containing HTML tags, markdown formatting, or irregular whitespace. If None or empty, the function returns an empty string.

Return Value

Returns a string with HTML tags removed, runs of whitespace (including newlines) collapsed to single spaces, leading/trailing whitespace stripped, and markdown markers for bold, italic, inline code, and headers removed. The header step removes runs of `#` characters wherever they appear, not only at the start of a line. Returns an empty string if the input is None or empty.
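
To make the return value easier to reason about, the following sketch applies the same six substitutions one at a time and records the text after each stage. The names `STAGES` and `trace_clean_text` are hypothetical helpers introduced here for illustration; they are not part of the source file.

```python
import re

# The same substitutions clean_text performs, in the same order.
STAGES = [
    ("strip HTML tags", r'<[^>]+>', ''),
    ("collapse whitespace", r'\s+', ' '),
    ("unwrap bold", r'\*\*([^*]+)\*\*', r'\1'),
    ("unwrap italic", r'\*([^*]+)\*', r'\1'),
    ("unwrap inline code", r'`([^`]+)`', r'\1'),
    ("drop header markers", r'#{1,6}\s*', ''),
]

def trace_clean_text(text):
    """Return a list of (stage name, text after that stage) pairs."""
    if not text:
        return []
    steps = []
    for name, pattern, replacement in STAGES:
        text = re.sub(pattern, replacement, text)
        if name == "collapse whitespace":
            # clean_text strips the ends right after collapsing whitespace
            text = text.strip()
        steps.append((name, text))
    return steps

for name, value in trace_clean_text("<p>## Title</p>\n\n<p>Some **bold**   text</p>"):
    print(f"{name:>20}: {value!r}")
```

The last entry of the trace is what `clean_text` itself would return for the same input; here the heading and the paragraph end up on one line because the newlines are collapsed before the header markers are dropped.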

Dependencies

re

Required Imports

import re

Usage Example

import re

def clean_text(text):
    """Clean and normalize text content."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    text = re.sub(r'#{1,6}\s*', '', text)
    return text

# Example usage
raw_text = "<p>This is **bold** and *italic* text with `code` and ### headers</p>"
cleaned = clean_text(raw_text)
print(cleaned)  # Output: This is bold and italic text with code and headers

# None or empty input returns "", so print shows an empty line
print(clean_text(None))
print(clean_text(""))
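
Beyond the basic case above, a few inputs show where the patterns reach their limits. This is a small sketch that repeats the function so it runs on its own; the sample strings are made up for illustration.

```python
import re

def clean_text(text):
    """Clean and normalize text content (same logic as the documented function)."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    text = re.sub(r'#{1,6}\s*', '', text)
    return text

samples = [
    "a < b and c > d",         # "< b and c >" is matched as if it were a tag
    "**bold *italic* text**",  # nested emphasis defeats the bold pattern
    "C# and F# notes",         # '#' that is not a header marker is still removed
]

for sample in samples:
    print(repr(sample), "->", repr(clean_text(sample)))

# 'a < b and c > d'        -> 'a d'
# '**bold *italic* text**' -> '*bold italic text*'
# 'C# and F# notes'        -> 'Cand Fnotes'
```

If inputs like these are expected, it may be worth tightening the patterns or validating the text before passing it in.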

Best Practices

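The function already returns an empty string for None, empty strings, and other falsy values, so callers need no separate check for those. A truthy non-string input (for example a non-zero integer or a non-empty list) is passed straight to re.sub and raises TypeError, so convert or validate such values first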
Be aware that this function removes ALL HTML and markdown formatting, which may not be desired if you need to preserve some structure
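The regex patterns are simple and do not handle nested or malformed markup: nested emphasis such as **bold *italic* text** leaves stray asterisks behind, and a literal < followed later by > is removed as if it were a tag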
For complex HTML parsing needs, consider using a dedicated HTML parser like BeautifulSoup instead of regex
The markdown removal patterns are basic and may not cover all markdown syntax variations (e.g., links, images, lists)
Whitespace normalization converts all consecutive whitespace (including newlines) to single spaces, which removes paragraph structure
Consider the order of operations: HTML is removed first, then whitespace is normalized, then markdown is stripped
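
The suggestion to use a dedicated HTML parser can be sketched without third-party packages. The example below swaps in the standard library's `html.parser.HTMLParser` in place of BeautifulSoup, purely so that it runs without extra installs; `_TextCollector` and `strip_tags` are hypothetical names, not functions from the source file.

```python
from html.parser import HTMLParser

class _TextCollector(HTMLParser):
    """Collects text nodes and ignores tags (illustrative helper)."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; before handle_data
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def strip_tags(html_text):
    """Return the text content of html_text with tags removed."""
    parser = _TextCollector()
    parser.feed(html_text)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.parts)

print(strip_tags('<a title="x > y">link</a> &amp; more'))  # link & more
```

Unlike the `<[^>]+>` pattern, the parser copes with a `>` inside a quoted attribute value, and it also decodes entities, which `clean_text` does not do. A function like this could replace only the HTML step, with the whitespace and markdown steps left as they are.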

Similar Components

AI-powered semantic similarity - components with related functionality:

function clean_html_tags 76.6% similar

Removes HTML tags and entities from text strings, returning clean plain text suitable for PDF display or other formatted output.
From: /tf/active/vicechatdev/vice_ai/complex_app.py

function clean_html_tags_v1 74.7% similar

Removes all HTML tags from a given text string using regular expression pattern matching, returning clean text without markup.
From: /tf/active/vicechatdev/vice_ai/new_app.py

function clean_text_for_xml 71.0% similar

Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.
From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py

function clean_text_for_xml_v1 70.4% similar

Sanitizes text strings to ensure XML 1.0 compatibility by removing or replacing invalid control characters and ensuring all characters meet XML specification requirements for Word document generation.
From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py

function process_inline_markdown 66.8% similar

Processes inline markdown formatting by unescaping HTML entities in text. Currently performs basic cleanup while preserving markdown syntax for downstream processing.
From: /tf/active/vicechatdev/vice_ai/complex_app.py
