function clean_text_for_xml_v1
Sanitizes text strings to ensure XML 1.0 compatibility by removing or replacing invalid control characters and ensuring all characters meet XML specification requirements for Word document generation.
/tf/active/vicechatdev/enhanced_word_converter_fixed.py
22 - 49
simple
Purpose
This function prepares text content for safe insertion into XML-based Word documents (.docx format) by filtering out characters that would cause XML parsing errors. It removes null bytes, control characters (except tab, newline, and carriage return), and ensures all characters fall within valid XML 1.0 character ranges. This is essential when processing user-generated content or data from external sources that may contain problematic characters before inserting into Word documents using the python-docx library.
Source Code
def clean_text_for_xml(text):
    """Clean text to be XML compatible for Word documents."""
    if not text:
        return ""

    # Remove or replace XML-incompatible characters
    # Remove null bytes and control characters except tab, newline, carriage return
    text = ''.join(char for char in text if ord(char) >= 32 or char in '\t\n\r')

    # Replace any remaining problematic characters
    text = text.replace('\x00', '')   # Remove null bytes
    text = text.replace('\x0b', ' ')  # Replace vertical tab with space
    text = text.replace('\x0c', ' ')  # Replace form feed with space

    # Ensure only valid XML characters (XML 1.0 specification)
    # Valid characters: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    cleaned = ''
    for char in text:
        code = ord(char)
        if (code == 0x09 or code == 0x0A or code == 0x0D or
                (0x20 <= code <= 0xD7FF) or
                (0xE000 <= code <= 0xFFFD) or
                (0x10000 <= code <= 0x10FFFF)):
            cleaned += char
        else:
            cleaned += ' '  # Replace invalid characters with space

    return cleaned
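The two passes in the listing can also be expressed with precompiled regular expressions. The following is a minimal sketch of that alternative, not the implementation in the source file; the name `clean_text_for_xml_regex` and the two pattern constants are hypothetical and introduced only for illustration. It assumes the same behaviour as the listing: control characters below 0x20 (other than tab, newline, and carriage return) are dropped, and any remaining code point outside the XML 1.0 ranges becomes a space.

```python
import re

# Control characters below 0x20, excluding tab (0x09), newline (0x0A),
# and carriage return (0x0D). These are removed outright.
_REMOVE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

# Code points at or above 0x20 that XML 1.0 still forbids:
# surrogates (0xD800-0xDFFF), 0xFFFE, and 0xFFFF. These become spaces.
_REPLACE = re.compile(r"[\ud800-\udfff\ufffe\uffff]")


def clean_text_for_xml_regex(text):
    """Hypothetical regex-based variant of clean_text_for_xml."""
    if not text:
        return ""
    return _REPLACE.sub(" ", _REMOVE.sub("", text))


print(repr(clean_text_for_xml_regex("Hello\x00World\x0b\x0cTest\t\nValid")))
# 'HelloWorldTest\t\nValid'
```

Compiling the patterns once at module level avoids rebuilding them on every call, which may matter when cleaning many short strings.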
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| text | - | - | positional_or_keyword |
Parameter Details
text: Input string to be cleaned. Can be any text content including user input, file contents, or data from external sources. Accepts None or empty strings, which will return an empty string. May contain control characters, null bytes, or other XML-incompatible characters that need sanitization.
Return Value
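Returns a string containing only XML 1.0 compatible characters. Control characters below 0x20 (other than tab, newline, and carriage return) are removed, so the result may be shorter than the input. Any remaining invalid code points (surrogates 0xD800-0xDFFF, 0xFFFE, and 0xFFFF) are replaced with spaces. Returns an empty string if the input is None, empty, or otherwise falsy. The returned string can be inserted into the XML structures used by Word documents without causing invalid-character parsing errors.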
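The difference between removed and replaced characters is easiest to see on small inputs. The sketch below uses a condensed restatement of the function from the Source Code section; the name `clean` is a placeholder for illustration and does not appear in the source file.

```python
def clean(text):
    """Condensed restatement of clean_text_for_xml, for illustration only."""
    if not text:
        return ""
    # Pass 1: drop control characters below 0x20 except tab, newline, carriage return.
    text = "".join(ch for ch in text if ord(ch) >= 32 or ch in "\t\n\r")
    # Pass 2: replace anything still outside the XML 1.0 ranges with a space.
    return "".join(
        ch if (ord(ch) <= 0xD7FF or 0xE000 <= ord(ch) <= 0xFFFD or ord(ch) >= 0x10000)
        else " "
        for ch in text
    )


removed = clean("a\x00b")     # null byte is dropped in pass 1
replaced = clean("a\ufffeb")  # U+FFFE survives pass 1 and becomes a space in pass 2

print(repr(removed), len(removed))    # 'ab' 2
print(repr(replaced), len(replaced))  # 'a b' 3
print(repr(clean(None)))              # ''
```

Callers that depend on character offsets into the original text should therefore not assume the cleaned string has the same length.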
Usage Example
# Basic usage
text_with_control_chars = "Hello\x00World\x0b\x0cTest\t\nValid"
cleaned = clean_text_for_xml(text_with_control_chars)
print(repr(cleaned))  # Output: 'HelloWorldTest\t\nValid' (control characters are removed, not replaced)
# Handle None or empty input
result = clean_text_for_xml(None)
print(result) # Output: ""
# Use with python-docx for Word document generation
from docx import Document
user_input = "User data with \x00 null bytes and \x0b control chars"
safe_text = clean_text_for_xml(user_input)
doc = Document()
doc.add_paragraph(safe_text)
doc.save('output.docx')
Best Practices
- Always use this function when inserting user-generated content or external data into Word documents to prevent XML parsing errors
- Call this function before adding text to python-docx Document objects to ensure document integrity
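- The function removes control characters below 0x20 (other than tab, newline, and carriage return), so the output may be shorter than the input; only the remaining invalid code points (surrogates, 0xFFFE, 0xFFFF) are replaced with spaces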
- Valid XML 1.0 characters include: tab (0x09), newline (0x0A), carriage return (0x0D), and characters in ranges 0x20-0xD7FF, 0xE000-0xFFFD, and 0x10000-0x10FFFF
- This function is safe to call multiple times on the same text (idempotent operation)
- Consider using this function as part of a data validation pipeline when processing text from databases, APIs, or file uploads
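When the function is used inside a validation pipeline, it can help to report which characters would be affected before cleaning them. The following is a small sketch under that assumption; `is_xml10_char` and `find_invalid_xml_chars` are hypothetical helpers that are not part of the documented module, and they apply the same XML 1.0 ranges listed above.

```python
def is_xml10_char(ch):
    """Return True if ch falls inside the XML 1.0 character ranges."""
    code = ord(ch)
    return (
        code in (0x09, 0x0A, 0x0D)
        or 0x20 <= code <= 0xD7FF
        or 0xE000 <= code <= 0xFFFD
        or 0x10000 <= code <= 0x10FFFF
    )


def find_invalid_xml_chars(text):
    """List (index, hex code point) for every character XML 1.0 rejects."""
    return [(i, hex(ord(ch))) for i, ch in enumerate(text) if not is_xml10_char(ch)]


print(find_invalid_xml_chars("ab\x00c\ufffe"))
# [(2, '0x0'), (4, '0xfffe')]
```

An empty result means the text would pass through the cleaning function unchanged, which is also a quick way to confirm idempotence on already-cleaned output.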
Similar Components
AI-powered semantic similarity - components with related functionality:
- function clean_text_for_xml 93.8% similar
- function clean_text 70.4% similar
- function clean_html_tags_v1 63.9% similar
- function clean_html_tags 62.0% similar
- function test_docx_file 54.5% similar