
function clean_text_for_xml_v1

Maturity: 43

Sanitizes text strings for XML 1.0 compatibility in Word document generation. Control characters other than tab, newline, and carriage return are removed, and any other character outside the XML 1.0 valid ranges is replaced with a space.

File: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
Lines: 22 - 49
Complexity: simple

Purpose

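This function prepares text for safe insertion into XML-based Word documents (.docx format) by filtering out characters that would cause XML errors. It works in two passes. The first pass removes null bytes and all other control characters below 0x20 except tab, newline, and carriage return. The second pass replaces any remaining character outside the XML 1.0 valid ranges (for example lone surrogates, 0xFFFE, and 0xFFFF) with a space. Because the first pass already removes vertical tab (0x0B) and form feed (0x0C), the explicit replacements for those characters and for null bytes in the source never take effect. This cleaning is essential when processing user-generated content or data from external sources before inserting it into Word documents with the python-docx library.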

Source Code

def clean_text_for_xml(text):
    """Clean text to be XML compatible for Word documents."""
    if not text:
        return ""
    
    # Remove or replace XML-incompatible characters
    # Remove null bytes and control characters except tab, newline, carriage return
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')
    
    # Replace any remaining problematic characters
    text = text.replace('\x00', '')  # Remove null bytes
    text = text.replace('\x0b', ' ')  # Replace vertical tab with space
    text = text.replace('\x0c', ' ')  # Replace form feed with space
    
    # Ensure only valid XML characters (XML 1.0 specification)
    # Valid characters: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or 
            (0x20 <= code <= 0xD7FF) or 
            (0xE000 <= code <= 0xFFFD) or 
            (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '  # Replace invalid characters with space
    
    return cleaned
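
The range check in the final loop can also be written as a single regular expression. The following is a minimal sketch of that alternative, not the code in enhanced_word_converter_fixed.py; the name clean_text_for_xml_regex and the replacement parameter are hypothetical. Its behavior differs in one respect: it replaces every invalid character, including control characters, whereas the function above removes control characters below 0x20 in its first pass.

```python
import re

# Matches any character outside the XML 1.0 valid ranges:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_INVALID_XML_CHARS = re.compile(
    "[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def clean_text_for_xml_regex(text, replacement=" "):
    """Hypothetical one-pass variant: substitute every invalid character."""
    if not text:
        return ""
    # Pass replacement="" to drop invalid characters instead of spacing them out.
    return _INVALID_XML_CHARS.sub(replacement, text)
```

Compiling the pattern once at module level avoids rebuilding it on every call, and a single substitution avoids the repeated string concatenation of the character-by-character loop.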

Parameters

Name   Type   Default   Kind
text   -      -         positional_or_keyword

Parameter Details

text: Input string to be cleaned, such as user input, file contents, or data from external sources. None and empty strings are accepted and return an empty string. The function expects a str: it iterates over the value and calls ord() on each item, so other truthy types such as bytes or int raise a TypeError. The string may contain control characters, null bytes, or other XML-incompatible characters that need sanitization.
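
When the input type is not guaranteed to be str, a small conversion step before cleaning avoids type errors. The helper below is a sketch under that assumption; coerce_to_text and its encoding parameter are hypothetical and not part of the documented module.

```python
def coerce_to_text(value, encoding="utf-8"):
    """Hypothetical helper: convert common input types to str before cleaning."""
    if value is None:
        return ""
    if isinstance(value, bytes):
        # Undecodable bytes become U+FFFD, which is a valid XML 1.0 character.
        return value.decode(encoding, errors="replace")
    return str(value)
```

The result can then be passed to clean_text_for_xml, for example clean_text_for_xml(coerce_to_text(raw_value)).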

Return Value

Returns a string containing only XML 1.0 compatible characters. Control characters below 0x20 (other than tab, newline, and carriage return) are removed; any other invalid character is replaced with a space. Returns an empty string if the input is None, empty, or otherwise falsy. The returned string can be inserted into the XML structures used by Word documents without causing parsing errors.
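
To make the removed-versus-replaced distinction concrete, the sketch below mirrors the two passes of the function as a per-character classifier. The name classify_char is hypothetical and exists only for illustration.

```python
def classify_char(char):
    """Hypothetical helper: report what clean_text_for_xml does to one character."""
    code = ord(char)
    # First pass: control characters below 0x20 are dropped, except tab, LF, CR.
    if code < 32 and char not in "\t\n\r":
        return "removed"
    # Second pass: characters inside the XML 1.0 valid ranges are kept.
    if (code in (0x09, 0x0A, 0x0D)
            or 0x20 <= code <= 0xD7FF
            or 0xE000 <= code <= 0xFFFD
            or 0x10000 <= code <= 0x10FFFF):
        return "kept"
    # Anything else (lone surrogates, 0xFFFE, 0xFFFF) becomes a space.
    return "replaced"
```

For example, classify_char("\x0b") returns "removed", classify_char("\t") returns "kept", and classify_char("\uffff") returns "replaced".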

Usage Example

# Basic usage
text_with_control_chars = "Hello\x00World\x0b\x0cTest\t\nValid"
cleaned = clean_text_for_xml(text_with_control_chars)
print(repr(cleaned))  # Output: 'HelloWorldTest\t\nValid' (control characters are removed, not replaced)

# Handle None or empty input
result = clean_text_for_xml(None)
print(result)  # Output: ""

# Use with python-docx for Word document generation
from docx import Document

user_input = "User data with \x00 null bytes and \x0b control chars"
safe_text = clean_text_for_xml(user_input)

doc = Document()
doc.add_paragraph(safe_text)
doc.save('output.docx')

Best Practices

  • Always use this function when inserting user-generated content or external data into Word documents to prevent XML parsing errors
  • Call this function before adding text to python-docx Document objects to ensure document integrity
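  • Control characters below 0x20 (other than tab, newline, and carriage return) are removed outright, so the text can become shorter and adjacent words can run together; only other invalid code points, such as lone surrogates, 0xFFFE, and 0xFFFF, are replaced with spaces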
  • Valid XML 1.0 characters include: tab (0x09), newline (0x0A), carriage return (0x0D), and characters in ranges 0x20-0xD7FF, 0xE000-0xFFFD, and 0x10000-0x10FFFF
  • This function is safe to call multiple times on the same text (idempotent operation)
  • Consider using this function as part of a data validation pipeline when processing text from databases, APIs, or file uploads
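
In a validation pipeline it can help to log which characters were invalid before they are cleaned away. The following is a minimal sketch of such a check using the same XML 1.0 ranges listed above; find_invalid_xml_chars is a hypothetical helper, not part of the documented module.

```python
def find_invalid_xml_chars(text):
    """Hypothetical helper: list (index, code point) for each invalid character."""

    def is_valid(code):
        # Same ranges as the XML 1.0 check in clean_text_for_xml.
        return (code in (0x09, 0x0A, 0x0D)
                or 0x20 <= code <= 0xD7FF
                or 0xE000 <= code <= 0xFFFD
                or 0x10000 <= code <= 0x10FFFF)

    return [
        (index, f"U+{ord(char):04X}")
        for index, char in enumerate(text)
        if not is_valid(ord(char))
    ]
```

An empty result means the text already satisfies the XML 1.0 character rules, in which case cleaning leaves it unchanged.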

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function clean_text_for_xml 93.8% similar

    Sanitizes text by removing or replacing XML-incompatible characters to ensure compatibility with Word document XML structure.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function clean_text 70.4% similar

    Cleans and normalizes text content by removing HTML tags, normalizing whitespace, and stripping markdown formatting elements.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function clean_html_tags_v1 63.9% similar

    Removes all HTML tags from a given text string using regular expression pattern matching, returning clean text without markup.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function clean_html_tags 62.0% similar

    Removes HTML tags and entities from text strings, returning clean plain text suitable for PDF display or other formatted output.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function test_docx_file 54.5% similar

    Tests the ability to open and read a Microsoft Word (.docx) document file, validating file existence, size, and content extraction capabilities.

    From: /tf/active/vicechatdev/docchat/test_problematic_files.py