function export_to_docx_v1

Maturity: 48

Exports a document object to Microsoft Word DOCX format, converting sections, content, and references into a formatted Word document with proper styling and structure.

File: /tf/active/vicechatdev/vice_ai/complex_app.py
Lines: 2017 - 2110
Complexity: complex

Purpose

This function takes a document object containing a title, author, sections, and references, and generates a formatted DOCX file. It writes the document metadata (title, author, creation date), hierarchical sections with headings, text content (HTML is first converted to Markdown, then Markdown is rendered into Word formatting), and references. References appear twice: as a short list after each text section that has them, and as a deduplicated "References" section on a new page at the end. The function is designed for document export in web applications or content management systems.
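
As a small illustration of the heading-level handling: the source caps the level with min(section.level, 9), which has no lower bound, while python-docx's add_heading accepts levels in the range 0-9. The helper below is a hypothetical sketch (the name clamp_heading_level is not part of the original module) showing one way to bound the level on both sides.

```python
def clamp_heading_level(level: int) -> int:
    """Clamp a section level into the 0-9 range accepted by add_heading.

    Hypothetical helper, not part of the original module.
    """
    return max(0, min(level, 9))


print(clamp_heading_level(12))  # 9
print(clamp_heading_level(-1))  # 0
print(clamp_heading_level(3))   # 3
```

Whether negative levels can occur depends on how section objects are created, which the source does not show.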

Source Code

def export_to_docx(document):
    """Export document to DOCX format"""
    if not DOCX_AVAILABLE:
        raise ImportError("python-docx not available")
    
    doc = Document()
    
    # Set document title
    title = doc.add_heading(document.title, 0)
    title.alignment = WD_ALIGN_PARAGRAPH.CENTER
    
    # Add author and metadata
    if document.author:
        author_para = doc.add_paragraph()
        author_para.add_run(f"Author: {document.author}").bold = True
        author_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    
    date_para = doc.add_paragraph()
    date_para.add_run(f"Created: {document.created_at.strftime('%Y-%m-%d %H:%M')}").italic = True
    date_para.alignment = WD_ALIGN_PARAGRAPH.CENTER
    
    doc.add_paragraph()  # Empty line
    
    # Add sections
    for section in document.sections:
        if section.type == 'header':
            # Add header
            level = min(section.level, 9)  # Word supports up to 9 heading levels
            heading = doc.add_heading(section.title, level)
            
        elif section.type == 'text':
            # Add text content
            if section.title:
                doc.add_heading(section.title, 3)
            
            if section.content:
                # Check if content is HTML or Markdown and process accordingly
                content_to_process = section.content
                
                # If content looks like HTML, convert to Markdown first
                if '<' in content_to_process and '>' in content_to_process:
                    # Content appears to be HTML, convert to Markdown
                    content_to_process = html_to_markdown(content_to_process)
                
                # Process markdown content for proper formatting
                try:
                    elements = process_markdown_content(content_to_process)
                    add_formatted_content_to_word(doc, elements)
                except Exception as e:
                    logger.warning(f"Error processing content for section {section.id}: {e}")
                    # Fallback to simple paragraph splitting
                    # Clean HTML tags if present for fallback
                    clean_content = clean_html_tags(content_to_process)
                    paragraphs = clean_content.split('\n\n')
                    for para_text in paragraphs:
                        if para_text.strip():
                            para = doc.add_paragraph()
                            para.add_run(para_text.strip())
            
            # Add section references if any
            if section.references:
                ref_heading = doc.add_heading("References for this section:", 4)
                for i, ref in enumerate(section.references, 1):
                    ref_para = doc.add_paragraph()
                    ref_para.add_run(f"[{i}] ").bold = True
                    ref_para.add_run(ref.get('title', 'Untitled Reference'))
                
                doc.add_paragraph()  # Empty line after references
    
    # Add global references
    all_references = document.get_all_references()
    if all_references:
        doc.add_page_break()
        doc.add_heading("References", 1)
        
        unique_refs = {}
        ref_counter = 1
        
        for ref in all_references:
            ref_key = ref.get('title', 'Untitled') + ref.get('source', '')
            if ref_key not in unique_refs:
                unique_refs[ref_key] = ref_counter
                ref_para = doc.add_paragraph()
                ref_para.add_run(f"[{ref_counter}] ").bold = True
                ref_para.add_run(ref.get('title', 'Untitled Reference'))
                if ref.get('source'):
                    ref_para.add_run(f" ({ref['source']})")
                ref_counter += 1
    
    # Save to BytesIO
    buffer = BytesIO()
    doc.save(buffer)
    buffer.seek(0)
    return buffer.getvalue()
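
The content branch above relies on two simple mechanisms: a character-based HTML check and a blank-line fallback split. The sketch below mirrors both in isolation. It assumes a regex-based strip_tags as a stand-in for clean_html_tags, whose real implementation is not shown in the source; all three function names are hypothetical.

```python
import re
from typing import List


def looks_like_html(text: str) -> bool:
    """Same heuristic as the source: a '<' and a '>' anywhere in the text."""
    return '<' in text and '>' in text


def strip_tags(text: str) -> str:
    """Hypothetical stand-in for clean_html_tags: drop anything shaped like <...>."""
    return re.sub(r'<[^>]+>', '', text)


def fallback_paragraphs(text: str) -> List[str]:
    """Split on blank lines and drop empty chunks, as the except branch does."""
    return [p.strip() for p in strip_tags(text).split('\n\n') if p.strip()]


print(looks_like_html('<p>Hello</p>'))               # True
print(looks_like_html('a < b and c > d'))            # True (a false positive)
print(fallback_paragraphs('<b>One</b>\n\n\n\nTwo'))  # ['One', 'Two']
```

The second call shows the main limitation of the heuristic: plain text or Markdown containing comparison operators is also routed through html_to_markdown.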

Parameters

Name      Type  Default  Kind
document  -     -        positional_or_keyword

Parameter Details

document: A document object with the following attributes:

  • title (string)
  • author (string, optional; the author line is skipped when it is empty)
  • created_at (datetime object)
  • sections (list of section objects)
  • get_all_references() (method returning a list of reference dictionaries)

Each section object should have:

  • type (string: 'header' or 'text'; sections of any other type are skipped)
  • title (string; optional for 'text' sections)
  • content (string, may contain HTML or Markdown)
  • level (integer, used for 'header' sections)
  • id (identifier, used in log messages)
  • references (list of reference dictionaries; 'title' and 'source' keys are read with defaults, so both are optional)
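
The expected shape can be sketched with dataclasses. This is an illustration of the interface described above, not the application's actual model: the class names are hypothetical, and the get_all_references implementation (flattening the per-section lists) is an assumption, since the real method is not shown in the source.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class SectionSketch:
    """Hypothetical stand-in for a section object."""
    type: str                      # 'header' or 'text'
    title: str = ''
    content: str = ''
    level: int = 1
    id: str = ''
    references: List[Dict[str, Any]] = field(default_factory=list)


@dataclass
class DocumentSketch:
    """Hypothetical stand-in for the document object."""
    title: str
    created_at: datetime
    author: Optional[str] = None
    sections: List[SectionSketch] = field(default_factory=list)

    def get_all_references(self) -> List[Dict[str, Any]]:
        # Assumption: collect references from every section, duplicates kept.
        return [ref for s in self.sections for ref in s.references]


doc = DocumentSketch(
    title='Sample Report',
    created_at=datetime(2024, 1, 1, 9, 30),
    sections=[
        SectionSketch('header', 'Introduction'),
        SectionSketch('text', 'Overview', 'Body text.',
                      references=[{'title': 'Reference 1', 'source': 'Source A'}]),
    ],
)
print(len(doc.get_all_references()))  # 1
```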

Return Value

Returns bytes representing the complete DOCX file content. The bytes can be written to a file, sent as an HTTP response, or stored in memory. If python-docx is not available (DOCX_AVAILABLE is false), the function raises ImportError instead of returning a value.
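
Because a DOCX file is a ZIP package, the returned bytes can be sanity-checked with the standard library alone. The sketch below assumes the main document part lives at word/document.xml, which is the conventional location; looks_like_docx is a hypothetical helper, and the sample archive is a stand-in built by hand rather than real output of export_to_docx.

```python
import zipfile
from io import BytesIO


def looks_like_docx(data: bytes) -> bool:
    """Return True if data is a ZIP archive containing word/document.xml."""
    buffer = BytesIO(data)
    if not zipfile.is_zipfile(buffer):
        return False
    buffer.seek(0)
    with zipfile.ZipFile(buffer) as archive:
        return 'word/document.xml' in archive.namelist()


# Build a tiny stand-in archive to exercise the check.
sample = BytesIO()
with zipfile.ZipFile(sample, 'w') as archive:
    archive.writestr('word/document.xml', '<w:document/>')

print(looks_like_docx(sample.getvalue()))  # True
print(looks_like_docx(b'plain text'))      # False
```

A check like this only confirms the container layout; it does not validate that Word can open the file.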

Dependencies

  • python-docx
  • io

Required Imports

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from io import BytesIO

Conditional/Optional Imports

These names are not imported by the function itself but must be available in its module:

Helper functions: html_to_markdown, process_markdown_content, add_formatted_content_to_word, clean_html_tags

Condition: These functions must be defined in the same module or imported separately. They handle content conversion from HTML/Markdown to Word format and are called only for text sections that have content. The function also reads the module-level names DOCX_AVAILABLE and logger.

Required (conditional)
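
The source does not show how DOCX_AVAILABLE or logger are defined. A common pattern, sketched here as an assumption rather than the module's actual code, is a guarded import that records whether python-docx could be loaded.

```python
import logging

# Assumption: the flag reflects whether python-docx imported successfully.
try:
    from docx import Document
    from docx.enum.text import WD_ALIGN_PARAGRAPH
    DOCX_AVAILABLE = True
except ImportError:
    Document = None
    WD_ALIGN_PARAGRAPH = None
    DOCX_AVAILABLE = False

# Assumption: a standard-library logger; the original module may configure it differently.
logger = logging.getLogger(__name__)

print(f"python-docx available: {DOCX_AVAILABLE}")
```

With this pattern the module still imports on systems without python-docx, and export_to_docx fails with a clear ImportError only when it is actually called.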

Usage Example

from docx import Document
from docx.enum.text import WD_ALIGN_PARAGRAPH
from io import BytesIO
from datetime import datetime

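# Assumes export_to_docx, its helper functions (html_to_markdown,
# process_markdown_content, add_formatted_content_to_word, clean_html_tags)
# and a module-level logger are defined in or imported into this module.
DOCX_AVAILABLE = True  # module-level flag read by export_to_docx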

class MockSection:
    def __init__(self, type, title, content='', level=1, references=None):
        self.type = type
        self.title = title
        self.content = content
        self.level = level
        self.id = 'section-1'
        self.references = references or []

class MockDocument:
    def __init__(self):
        self.title = 'Sample Report'
        self.author = 'John Doe'
        self.created_at = datetime.now()
        self.sections = [
            MockSection('header', 'Introduction', level=1),
            MockSection('text', 'Overview', 'This is the overview content.', references=[{'title': 'Reference 1', 'source': 'Source A'}])
        ]
    
    def get_all_references(self):
        return [{'title': 'Reference 1', 'source': 'Source A'}]

doc = MockDocument()
docx_bytes = export_to_docx(doc)

# Save to file
with open('output.docx', 'wb') as f:
    f.write(docx_bytes)

# Or send as HTTP response in Flask
# return send_file(BytesIO(docx_bytes), mimetype='application/vnd.openxmlformats-officedocument.wordprocessingml.document', as_attachment=True, download_name='document.docx')

Best Practices

  • Set the DOCX_AVAILABLE flag according to whether python-docx imported successfully; when it is false the function raises ImportError, so callers should be prepared to catch it
  • The document object must implement all required attributes and methods (title, author, created_at, sections, get_all_references())
  • Helper functions (html_to_markdown, process_markdown_content, add_formatted_content_to_word, clean_html_tags) must be defined or imported
  • Handle the returned bytes appropriately - either write to file or stream to HTTP response
  • Word supports up to 9 heading levels; the function caps section levels at 9 but applies no lower bound, so section levels should not be negative
  • If content processing raises an exception, the function logs a warning and falls back to plain paragraphs split on blank lines, with HTML tags stripped
  • Consider memory usage when processing large documents as the entire DOCX is built in memory
  • Ensure datetime objects in the document have proper timezone information if needed
  • The function deduplicates references in the bibliography section based on title and source combination
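
The deduplication in the last point can be sketched on its own. The source builds its key by concatenating title and source, so two different references can collide (for example 'AB' + 'C' and 'A' + 'BC'). The variation below uses a tuple key instead; dedupe_references is a hypothetical helper, not part of the original module.

```python
from typing import Any, Dict, List, Set, Tuple


def dedupe_references(references: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep the first occurrence of each (title, source) pair, preserving order."""
    seen: Set[Tuple[Any, Any]] = set()
    unique: List[Dict[str, Any]] = []
    for ref in references:
        # A tuple key keeps title and source separate, unlike string concatenation.
        key = (ref.get('title', 'Untitled'), ref.get('source', ''))
        if key not in seen:
            seen.add(key)
            unique.append(ref)
    return unique


refs = [
    {'title': 'AB', 'source': 'C'},
    {'title': 'A', 'source': 'BC'},   # collides with the first under concatenation
    {'title': 'AB', 'source': 'C'},   # true duplicate
]
print(len(dedupe_references(refs)))  # 2
```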

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function export_to_docx 90.0% similar

    Exports a document with text and data sections to Microsoft Word DOCX format, preserving formatting, structure, and metadata.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function export_to_pdf_v1 70.8% similar

    Converts a document object with sections and references into a formatted PDF file using ReportLab, supporting multiple heading levels, text content with markdown/HTML processing, and reference management.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function add_formatted_content_to_word 67.2% similar

    Converts processed markdown elements into formatted content within a Word document, handling headers, paragraphs, lists, tables, and code blocks with appropriate styling.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function create_enhanced_word_document_v1 67.1% similar

    Converts markdown content into a formatted Microsoft Word document with proper styling, table of contents, warranty sections, and reference handling for Project Victoria warranty disclosures.

    From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
  • function add_formatted_content_to_word_v1 66.9% similar

    Converts processed markdown elements into formatted content within a Microsoft Word document, handling headers, paragraphs, lists, tables, and code blocks with appropriate styling.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py