🔍 Code Extractor

function clean_text

Maturity: 44

Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

File: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
Lines: 25 - 42
Complexity: simple

Purpose

This function is designed to sanitize and standardize text input by removing various formatting artifacts. It's particularly useful for preprocessing text data before analysis, storage, or display in plain text format. Common use cases include cleaning scraped web content, normalizing user input, preparing text for NLP tasks, or converting formatted documents to plain text.

Source Code

def clean_text(text):
    """Clean and normalize text content."""
    if not text:
        return ""
    
    # Remove HTML tags if any
    text = re.sub(r'<[^>]+>', '', text)
    
    # Normalize whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    # Remove markdown formatting
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)  # Bold
    text = re.sub(r'\*([^*]+)\*', r'\1', text)      # Italic
    text = re.sub(r'`([^`]+)`', r'\1', text)        # Code
    text = re.sub(r'#{1,6}\s*', '', text)           # Headers
    
    return text

Parameters

Name    Type    Default    Kind
text    -       -          positional_or_keyword

Parameter Details

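text: The input text string to be cleaned. May contain HTML tags, markdown formatting, or irregular whitespace. Any falsy value (None, an empty string) short-circuits and the function returns an empty string. A non-empty value that is not a str (for example an integer or bytes) is not handled and raises TypeError from re.sub.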
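The input handling described above can be checked with a short script. This is a minimal sketch: the function body is copied from Source Code so the snippet runs on its own, and the sample values (0, 42) are illustrative assumptions, not inputs taken from the original project.

```python
import re

def clean_text(text):
    """Copy of the documented function, repeated so the snippet is self-contained."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    text = re.sub(r'#{1,6}\s*', '', text)
    return text

# Falsy inputs never reach the regexes; each one yields an empty string.
results = [clean_text(None), clean_text(""), clean_text(0)]

# A truthy non-string passes the guard and fails inside re.sub.
try:
    clean_text(42)
    raised = False
except TypeError:
    raised = True

print(results, raised)
```

If callers may pass non-string values, converting with str() before the call is one option; whether that is appropriate depends on the calling code.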

Return Value

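Returns a cleaned string with HTML tags removed, runs of whitespace (including newlines) collapsed to single spaces, leading/trailing whitespace stripped, and the markdown markers for bold, italic, inline code (single backticks), and headers removed. Header removal deletes runs of "#" wherever they occur in the text, not only at the start of a line. Returns an empty string if the input is None or empty.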
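A few edge cases follow directly from the regexes in Source Code and are easy to miss. The sketch below repeats the function so it runs on its own; the sample strings are illustrative assumptions chosen to show each effect.

```python
import re

def clean_text(text):
    """Copy of the documented function, repeated so the snippet is self-contained."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    text = re.sub(r'#{1,6}\s*', '', text)
    return text

# Newlines count as whitespace, so paragraph breaks collapse to single spaces.
multi = clean_text("First line.\n\nSecond   line.")

# The header pattern is not anchored to a line start, so any '#' is removed.
hashed = clean_text("See issue #42")

# Everything between '<' and the next '>' is treated as a tag, even in prose.
compared = clean_text("if a < b and c > d")

print(repr(multi))     # 'First line. Second line.'
print(repr(hashed))    # 'See issue 42'
print(repr(compared))  # 'if a d'
```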

Dependencies

  • re

Required Imports

import re

Usage Example

import re

def clean_text(text):
    """Clean and normalize text content."""
    if not text:
        return ""
    text = re.sub(r'<[^>]+>', '', text)
    text = re.sub(r'\s+', ' ', text).strip()
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)
    text = re.sub(r'#{1,6}\s*', '', text)
    return text

# Example usage
raw_text = "<p>This is **bold** and *italic* text with `code` and ### headers</p>"
cleaned = clean_text(raw_text)
print(cleaned)  # Output: "This is bold and italic text with code and headers"

# Handle None or empty input
print(clean_text(None))  # Output: ""
print(clean_text(""))    # Output: ""

Best Practices

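  • The function already returns an empty string for None or empty input, so callers need no extra guard for those cases; make sure any other value passed in is a str, because non-string input raises TypeError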
  • Be aware that this function removes ALL HTML and markdown formatting, which may not be desired if you need to preserve some structure
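  • The regex patterns rely on negated character classes (e.g. [^>]+ and [^*]+), so each match stops at the first closing delimiter; nested or malformed markdown/HTML may not be handled correctly, and a literal "<" followed later by ">" in plain prose is removed as if it were a tag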
  • For complex HTML parsing needs, consider using a dedicated HTML parser like BeautifulSoup instead of regex
  • The markdown removal patterns are basic and may not cover all markdown syntax variations (e.g., links, images, lists)
  • Whitespace normalization converts all consecutive whitespace (including newlines) to single spaces, which removes paragraph structure
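  • Mind the order of operations: HTML is removed first, then whitespace is normalized, then markdown is stripped. Because newlines are already collapsed when the header pattern runs, it cannot be anchored to line starts and removes every run of one to six "#" characters plus any following whitespace, so "C# code" becomes "Ccode"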
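Several of the points above (regex-based tag stripping, lost paragraph structure, unanchored header removal) can be addressed together. The following is a minimal sketch, not part of the original module: clean_text_keep_paragraphs and _TagStripper are hypothetical names, and the standard-library html.parser is swapped in here as a dependency-free stand-in for a dedicated parser such as BeautifulSoup.

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Hypothetical helper: collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def clean_text_keep_paragraphs(text):
    """Hypothetical variant of clean_text that keeps paragraph breaks."""
    if not text:
        return ""

    # Strip tags with the standard-library parser instead of a regex.
    stripper = _TagStripper()
    stripper.feed(text)
    stripper.close()
    text = "".join(stripper.parts)

    # Remove header markers only at the start of a line, while newlines still exist.
    text = re.sub(r'^[ \t]{0,3}#{1,6}[ \t]+', '', text, flags=re.MULTILINE)

    # Same inline patterns as the original function.
    text = re.sub(r'\*\*([^*]+)\*\*', r'\1', text)
    text = re.sub(r'\*([^*]+)\*', r'\1', text)
    text = re.sub(r'`([^`]+)`', r'\1', text)

    # Collapse whitespace inside each paragraph; blank lines separate paragraphs.
    paragraphs = re.split(r'\n\s*\n', text)
    cleaned = [re.sub(r'\s+', ' ', p).strip() for p in paragraphs]
    return "\n\n".join(p for p in cleaned if p)


sample = "## Title\n\n<p>Issue #42 is **fixed**.</p>\n\nSecond   paragraph."
result = clean_text_keep_paragraphs(sample)
print(result)
```

In this sketch the header marker on the first line is removed while "#42" in the body is left alone, and the three paragraphs stay separated by blank lines. The inline patterns are unchanged from the original, so the limitations around nested markdown still apply.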

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function clean_html_tags 76.6% similar

    Removes HTML tags and entities from text strings, returning clean plain text suitable for PDF display or other formatted output.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function clean_html_tags_v1 74.7% similar

    Removes all HTML tags from a given text string using regular expression pattern matching, returning clean text without markup.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function clean_text_for_xml 71.0% similar

    Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function clean_text_for_xml_v1 70.4% similar

    Sanitizes text strings to ensure XML 1.0 compatibility by removing or replacing invalid control characters and ensuring all characters meet XML specification requirements for Word document generation.

    From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
  • function process_inline_markdown 66.8% similar

    Processes inline markdown formatting by unescaping HTML entities in text. Currently performs basic cleanup while preserving markdown syntax for downstream processing.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py