function export_to_docx_v1
Exports a document object to Microsoft Word DOCX format, converting sections, content, and references into a formatted Word document with proper styling and structure.
/tf/active/vicechatdev/vice_ai/complex_app.py
2017 - 2110
complex
Purpose
This function takes a document object with a title, author, creation date, sections, and references, and returns the bytes of a formatted DOCX file. It writes a centered title followed by author and creation-date lines, renders 'header' sections as Word headings, and renders 'text' sections as formatted content (content that looks like HTML is converted to Markdown first, then the Markdown is processed into Word elements). It also lists per-section references and appends a deduplicated global References section on a new page. It is designed for document export in web applications or content management systems.
Source Code
def export_to_docx(document):
    """Export document to DOCX format"""
    if not DOCX_AVAILABLE:
        raise ImportError("python-docx not available")
    doc = Document()
    # Set document title
    title = doc.add_heading(document.title, 0)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    # Add author and metadata
    if document.author:
        author_para = doc.add_paragraph()
        author_para.add_run(f"Author: {document.author}").bold = True
        author_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    date_para = doc.add_paragraph()
    date_para.add_run(f"Created: {document.created_at.strftime('%Y-%m-%d %H:%M')}").italic = True
    date_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    doc.add_paragraph()  # Empty line
    # Add sections
    for section in document.sections:
        if section.type == 'header':
            # Add header
            level = min(section.level, 9)  # Word supports up to 9 heading levels
            heading = doc.add_heading(section.title, level)
        elif section.type == 'text':
            # Add text content
            if section.title:
                doc.add_heading(section.title, 3)
            if section.content:
                # Check if content is HTML or Markdown and process accordingly
                content_to_process = section.content
                # If content looks like HTML, convert to Markdown first
                if '<' in content_to_process and '>' in content_to_process:
                    # Content appears to be HTML, convert to Markdown
                    content_to_process = html_to_markdown(content_to_process)
                # Process markdown content for proper formatting
                try:
                    elements = process_markdown_content(content_to_process)
                    add_formatted_content_to_word(doc, elements)
                except Exception as e:
                    logger.warning(f"Error processing content for section {section.id}: {e}")
                    # Fallback to simple paragraph splitting
                    # Clean HTML tags if present for fallback
                    clean_content = clean_html_tags(content_to_process)
                    paragraphs = clean_content.split('\n\n')
                    for para_text in paragraphs:
                        if para_text.strip():
                            para = doc.add_paragraph()
                            para.add_run(para_text.strip())
            # Add section references if any
            if section.references:
                ref_heading = doc.add_heading("References for this section:", 4)
                for i, ref in enumerate(section.references, 1):
                    ref_para = doc.add_paragraph()
                    ref_para.add_run(f"[{i}] ").bold = True
                    ref_para.add_run(ref.get('title', 'Untitled Reference'))
                doc.add_paragraph()  # Empty line after references
    # Add global references
    all_references = document.get_all_references()
    if all_references:
        doc.add_page_break()
        doc.add_heading("References", 1)
        unique_refs = {}
        ref_counter = 1
        for ref in all_references:
            ref_key = ref.get('title', 'Untitled') + ref.get('source', '')
            if ref_key not in unique_refs:
                unique_refs[ref_key] = ref_counter
                ref_para = doc.add_paragraph()
                ref_para.add_run(f"[{ref_counter}] ").bold = True
                ref_para.add_run(ref.get('title', 'Untitled Reference'))
                if ref.get('source'):
                    ref_para.add_run(f" ({ref['source']})")
                ref_counter += 1
    # Save to BytesIO
    buffer = BytesIO()
    doc.save(buffer)
    buffer.seek(0)
    return buffer.getvalue()
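The HTML check in the source treats any text that contains both '<' and '>' as HTML, so a plain sentence such as "a < b and c > d" would also be sent through html_to_markdown. A stricter heuristic could look for at least one tag-like token instead. The sketch below is only an illustration of that idea: `looks_like_html` and `naive_check` are hypothetical names that do not exist in the original module, and the regular expression is an assumption about what counts as a tag.

```python
import re

# Matches an opening, closing, or self-closing tag such as <p>, </div>, <br/>.
_TAG_RE = re.compile(r"</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?/?>")


def naive_check(text: str) -> bool:
    """The check used in the source: any '<' and any '>' anywhere in the text."""
    return '<' in text and '>' in text


def looks_like_html(text: str) -> bool:
    """Hypothetical stricter check: require at least one tag-like token."""
    return bool(_TAG_RE.search(text))
```

With this variant, comparison operators in prose no longer trigger the HTML path, while ordinary markup such as `<p>Hello</p>` still does.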
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| document | - | - | positional_or_keyword |
Parameter Details
document: A document object that must have the following attributes: 'title' (string), 'author' (string, optional), 'created_at' (datetime object), 'sections' (list of section objects), and a 'get_all_references()' method. Each section object should have 'type' (string: 'header' or 'text'), 'title' (string), 'content' (string, may contain HTML or Markdown), 'level' (integer for headers), 'id' (identifier), and 'references' (list of reference dictionaries with 'title' and 'source' keys).
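Because the function relies on duck typing, a missing attribute only surfaces as an AttributeError partway through the export. A small pre-flight check can report such gaps up front. The following is a minimal sketch, assuming the attribute names listed above; `missing_attributes` and `validate_document` are hypothetical helpers and not part of the original module.

```python
from types import SimpleNamespace
from typing import Any, Iterable, List

# Attribute names taken from the parameter description above.
DOCUMENT_ATTRS = ("title", "author", "created_at", "sections", "get_all_references")
SECTION_ATTRS = ("type", "title", "content", "level", "id", "references")


def missing_attributes(obj: Any, names: Iterable[str]) -> List[str]:
    """Return the names from `names` that `obj` does not expose."""
    return [name for name in names if not hasattr(obj, name)]


def validate_document(document: Any) -> List[str]:
    """Collect dotted paths of all attributes the export would fail to find."""
    problems = [f"document.{n}" for n in missing_attributes(document, DOCUMENT_ATTRS)]
    for index, section in enumerate(getattr(document, "sections", None) or []):
        problems += [f"sections[{index}].{n}" for n in missing_attributes(section, SECTION_ATTRS)]
    return problems


# Example: the second section lacks 'references' and is reported by index.
good = SimpleNamespace(type="text", title="Overview", content="Body", level=1, id="s1", references=[])
bad = SimpleNamespace(type="header", title="Intro", content="", level=1, id="s2")
doc = SimpleNamespace(title="Report", author=None, created_at=None,
                      sections=[good, bad], get_all_references=lambda: [])
problems = validate_document(doc)
```

An empty list means every attribute the function reads is present; it does not check the attribute types.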
Return Value
Returns bytes representing the complete DOCX file content. The bytes can be written to a file, sent as an HTTP response, or stored in memory. If python-docx is not available (DOCX_AVAILABLE is false), the function raises ImportError and returns nothing.
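A DOCX file is a ZIP container, so the returned bytes can be sanity-checked with the standard library before they are written or served. The sketch below assumes the bytes come from export_to_docx; `is_zip_container` and `save_docx_bytes` are hypothetical helper names introduced only for illustration.

```python
import zipfile
from io import BytesIO
from pathlib import Path


def is_zip_container(data: bytes) -> bool:
    """DOCX files are ZIP archives, so valid output should pass this check."""
    return zipfile.is_zipfile(BytesIO(data))


def save_docx_bytes(data: bytes, path: str) -> int:
    """Write exported bytes to disk and return the number of bytes written."""
    if not is_zip_container(data):
        raise ValueError("data does not look like a DOCX (ZIP) archive")
    return Path(path).write_bytes(data)
```

The check only confirms the container format; it does not verify that the archive holds a well-formed Word document.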
Dependencies
- python-docx
- io
Required Imports
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from io import BytesIO
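The source does not show how DOCX_AVAILABLE is set. A common pattern for such a flag is a guarded import at module level, sketched here as an assumption rather than as the module's actual code:

```python
from io import BytesIO

try:
    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    DOCX_AVAILABLE = True
except ImportError:
    # python-docx is not installed; export_to_docx would raise ImportError when called.
    Document = None
    WD_ALIGN_PARAGRAPH = None
    DOCX_AVAILABLE = False
```

With this arrangement the module still imports cleanly without python-docx, and the failure is deferred until an export is actually requested.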
Conditional/Optional Imports
These imports are only needed under specific conditions:
Helper functions: html_to_markdown, process_markdown_content, add_formatted_content_to_word, clean_html_tags
Condition: These functions must be defined in the same module or imported separately. They handle content conversion from HTML/Markdown to Word format.
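The helper implementations are not shown in the source. To illustrate what the fallback path needs from clean_html_tags, here is a hypothetical stand-in together with the paragraph splitting that the function applies after it. Both the regular expression and the `fallback_paragraphs` name are assumptions; the real helper may behave differently.

```python
import html
import re
from typing import List

# Anything between '<' and the next '>' is treated as a tag.
_TAG_RE = re.compile(r"<[^>]+>")


def clean_html_tags(text: str) -> str:
    """Hypothetical stand-in: drop tag-like tokens, then unescape HTML entities."""
    return html.unescape(_TAG_RE.sub("", text))


def fallback_paragraphs(content: str) -> List[str]:
    """Mirror the fallback path: strip tags, split on blank lines, skip empty chunks."""
    cleaned = clean_html_tags(content)
    return [chunk.strip() for chunk in cleaned.split("\n\n") if chunk.strip()]
```

Each string returned by `fallback_paragraphs` corresponds to one plain paragraph that the function would add to the Word document.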
Required (conditional)
Usage Example
from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from io import BytesIO
from datetime import datetime
# Assuming DOCX_AVAILABLE is True and helper functions are defined
DOCX_AVAILABLE = True
class MockSection:
    def __init__(self, type, title, content='', level=1, references=None):
        self.type = type
        self.title = title
        self.content = content
        self.level = level
        self.id = 'section-1'
        self.references = references or []
class MockDocument:
    def __init__(self):
        self.title = 'Sample Report'
        self.author = 'John Doe'
        self.created_at = datetime.now()
        self.sections = [
            MockSection('header', 'Introduction', level=1),
            MockSection(
                'text', 'Overview', 'This is the overview content.',
                references=[{'title': 'Reference 1', 'source': 'Source A'}]
            )
        ]
    def get_all_references(self):
        return [{'title': 'Reference 1', 'source': 'Source A'}]
doc = MockDocument()
docx_bytes = export_to_docx(doc)
# Save to file
with open('output.docx', 'wb') as f:
    f.write(docx_bytes)
# Or send as HTTP response in Flask
# return send_file(BytesIO(docx_bytes), mimetype='application/vnd.openxmlformats-officedocument.wordprocessingml.document', as_attachment=True, download_name='document.docx')
Best Practices
- Ensure the DOCX_AVAILABLE flag is properly set before calling this function to avoid ImportError
- The document object must implement all required attributes and methods (title, author, created_at, sections, get_all_references())
- Helper functions (html_to_markdown, process_markdown_content, add_formatted_content_to_word, clean_html_tags) must be defined or imported
- Handle the returned bytes appropriately - either write to file or stream to HTTP response
- Word supports up to 9 heading levels; the function automatically caps section levels at 9
- If Markdown processing raises an exception, the function logs a warning and falls back to plain paragraphs split on blank lines, so inline formatting is lost for that section
- Consider memory usage when processing large documents as the entire DOCX is built in memory
- The creation date is printed with strftime('%Y-%m-%d %H:%M') and no timezone; convert created_at to the intended timezone before exporting if that matters
- The function deduplicates references in the bibliography section based on title and source combination
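The deduplication of the global reference list can be sketched in isolation, assuming references are dictionaries with optional 'title' and 'source' keys as described above. `dedupe_references` is a hypothetical name; the key construction mirrors the source, which concatenates title and source without a separator.

```python
from typing import Dict, List, Tuple


def dedupe_references(references: List[Dict[str, str]]) -> List[Tuple[int, Dict[str, str]]]:
    """Number references in first-seen order, skipping repeats of the same key."""
    seen: Dict[str, int] = {}
    numbered: List[Tuple[int, Dict[str, str]]] = []
    for ref in references:
        # Same key construction as the source: title and source concatenated.
        key = ref.get('title', 'Untitled') + ref.get('source', '')
        if key not in seen:
            seen[key] = len(seen) + 1
            numbered.append((seen[key], ref))
    return numbered
```

Because the key is a plain concatenation, two different references such as title 'AB' with source 'C' and title 'A' with source 'BC' share a key and would be merged. Using a tuple `(title, source)` as the key would avoid that, at the cost of diverging from the source behavior.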
Similar Components
AI-powered semantic similarity - components with related functionality:
- function export_to_docx 90.0% similar
- function export_to_pdf_v1 70.8% similar
- function add_formatted_content_to_word 67.2% similar
- function create_enhanced_word_document_v1 67.1% similar
- function add_formatted_content_to_word_v1 66.9% similar