
function export_to_pdf_v1

Maturity: 48

Converts a document object with sections and references into a PDF using ReportLab and returns it as bytes. Supports three heading levels, text content supplied as Markdown or HTML, and both per-section and document-wide reference lists.

File: /tf/active/vicechatdev/vice_ai/complex_app.py
Lines: 2112 - 2243
Complexity: complex

Purpose

This function builds a PDF in memory from a structured document object. It writes the document metadata (title, optional author, creation timestamp), then walks the sections in order: 'header' sections become headings styled by level, and 'text' sections contribute an optional title, their content, and a per-section reference list. Content that looks like HTML is converted to Markdown before rendering, and a deduplicated references page is appended when the document has any references. The function is intended for document export in web applications or content management systems.

Source Code

def export_to_pdf(document):
    """Export document to PDF format"""
    if not PDF_AVAILABLE:
        raise ImportError("reportlab not available")
    
    buffer = BytesIO()
    doc = SimpleDocTemplate(buffer, pagesize=A4)
    styles = getSampleStyleSheet()
    story = []
    
    # Custom styles
    title_style = ParagraphStyle(
        'CustomTitle',
        parent=styles['Title'],
        fontSize=24,
        spaceAfter=30,
        alignment=1  # Center
    )
    
    heading1_style = ParagraphStyle(
        'CustomHeading1',
        parent=styles['Heading1'],
        fontSize=18,
        spaceAfter=12,
        spaceBefore=20
    )
    
    heading2_style = ParagraphStyle(
        'CustomHeading2',
        parent=styles['Heading2'],
        fontSize=16,
        spaceAfter=10,
        spaceBefore=15
    )
    
    heading3_style = ParagraphStyle(
        'CustomHeading3',
        parent=styles['Heading3'],
        fontSize=14,
        spaceAfter=8,
        spaceBefore=12
    )
    
    # Add custom styles to styles dictionary for use in add_formatted_content_to_pdf
    styles.add(heading1_style)
    styles.add(heading2_style) 
    styles.add(heading3_style)
    
    # Document title
    story.append(Paragraph(document.title, title_style))
    
    # Author and metadata
    if document.author:
        story.append(Paragraph(f"<b>Author:</b> {document.author}", styles['Normal']))
    
    story.append(Paragraph(f"<i>Created: {document.created_at.strftime('%Y-%m-%d %H:%M')}</i>", styles['Normal']))
    story.append(Spacer(1, 20))
    
    # Add sections
    for section in document.sections:
        if section.type == 'header':
            # Choose heading style based on level
            if section.level == 1:
                style = heading1_style
            elif section.level == 2:
                style = heading2_style
            else:
                style = heading3_style
            
            story.append(Paragraph(section.title, style))
            
        elif section.type == 'text':
            # Add section title if present
            if section.title:
                story.append(Paragraph(section.title, heading3_style))
            
            # Add content
            if section.content:
                # Check if content is HTML or Markdown and process accordingly
                content_to_process = section.content
                
                # If content looks like HTML, convert to Markdown first
                if '<' in content_to_process and '>' in content_to_process:
                    # Content appears to be HTML, convert to Markdown
                    content_to_process = html_to_markdown(content_to_process)
                
                try:
                    # Process markdown content for proper formatting
                    elements = process_markdown_content(content_to_process)
                    add_formatted_content_to_pdf(story, elements, styles)
                except Exception as e:
                    logger.warning(f"Error processing content for section {section.id}: {e}")
                    # Fallback to simple paragraph splitting with basic formatting
                    # Clean HTML tags if present for fallback
                    clean_content = clean_html_tags(content_to_process)
                    paragraphs = clean_content.split('\n\n')
                    for para_text in paragraphs:
                        if para_text.strip():
                            story.append(Paragraph(para_text.strip(), styles['Normal']))
                            story.append(Spacer(1, 6))
            
            # Add section references
            if section.references:
                story.append(Paragraph("<b>References for this section:</b>", styles['Normal']))
                for i, ref in enumerate(section.references, 1):
                    ref_text = f"[{i}] {ref.get('title', 'Untitled Reference')}"
                    story.append(Paragraph(ref_text, styles['Normal']))
                story.append(Spacer(1, 12))
    
    # Add global references
    all_references = document.get_all_references()
    if all_references:
        story.append(PageBreak())
        story.append(Paragraph("References", heading1_style))
        
        unique_refs = {}
        ref_counter = 1
        
        for ref in all_references:
            ref_key = ref.get('title', 'Untitled') + ref.get('source', '')
            if ref_key not in unique_refs:
                unique_refs[ref_key] = ref_counter
                ref_text = f"[{ref_counter}] {ref.get('title', 'Untitled Reference')}"
                if ref.get('source'):
                    ref_text += f" ({ref['source']})"
                story.append(Paragraph(ref_text, styles['Normal']))
                ref_counter += 1
    
    # Build PDF
    doc.build(story)
    buffer.seek(0)
    return buffer.getvalue()
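
The global-references step at the end of the source above can be isolated as a small pure function. This is a minimal sketch, not code from complex_app.py: the name dedupe_references is hypothetical, and it returns the formatted lines instead of appending Paragraph objects to a story. It uses the same key as the source (title concatenated with source), so two references are merged only when both fields match.

```python
def dedupe_references(all_references):
    """Return numbered reference lines, keeping the first occurrence of each
    (title + source) key. Hypothetical helper mirroring the loop in export_to_pdf."""
    seen = set()
    lines = []
    for ref in all_references:
        # Same key as the source: title and source concatenated.
        key = ref.get('title', 'Untitled') + ref.get('source', '')
        if key in seen:
            continue
        seen.add(key)
        text = f"[{len(lines) + 1}] {ref.get('title', 'Untitled Reference')}"
        if ref.get('source'):
            text += f" ({ref['source']})"
        lines.append(text)
    return lines


refs = [
    {'title': 'Reference 1', 'source': 'Source A'},
    {'title': 'Reference 1', 'source': 'Source A'},  # exact duplicate, dropped
    {'title': 'Reference 1', 'source': 'Source B'},  # same title, other source, kept
    {'source': 'Source C'},                          # no title
]
numbered = dedupe_references(refs)
```

Plain concatenation can collide (title 'AB' with source 'C' yields the same key as title 'A' with source 'BC'); a tuple key such as (title, source) would avoid that, at the cost of diverging from the source's behavior.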

Parameters

Name      Type  Default  Kind
document  -     -        positional_or_keyword

Parameter Details

document: A document object with the following attributes: 'title' (string), 'author' (string or None), 'created_at' (datetime object), 'sections' (list of section objects), and a 'get_all_references()' method that returns a list of reference dictionaries with 'title' and 'source' keys. Each section object needs 'type' ('header' or 'text'), 'level' (1 or 2 select the first two heading styles; any other value, including None, falls back to the level-3 style), 'title' (string), 'content' (string with HTML or Markdown), 'references' (list of reference dictionaries), and 'id' (used only in the warning logged when content processing fails). Sections of any other type are skipped, and per-section references are rendered only for 'text' sections.
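
The attribute contract above can be checked before export. The following is a sketch under stated assumptions: missing_attributes and the two attribute tuples are hypothetical names, not part of the source, and the check only tests for presence (it requires every listed attribute on every section, which is slightly stricter than what the function reads for 'header' sections).

```python
from types import SimpleNamespace

# Attribute names taken from the parameter description; the tuples themselves
# are illustrative and do not exist in complex_app.py.
DOCUMENT_ATTRS = ('title', 'author', 'created_at', 'sections', 'get_all_references')
SECTION_ATTRS = ('id', 'type', 'level', 'title', 'content', 'references')


def missing_attributes(document):
    """List the attributes export_to_pdf would expect but cannot find."""
    problems = [f"document.{name}" for name in DOCUMENT_ATTRS
                if not hasattr(document, name)]
    for index, section in enumerate(getattr(document, 'sections', [])):
        problems += [f"sections[{index}].{name}" for name in SECTION_ATTRS
                     if not hasattr(section, name)]
    return problems


# A deliberately incomplete document: no created_at, and a section
# without a references attribute.
section = SimpleNamespace(id=1, type='text', level=None,
                          title='Overview', content='Hello')
document = SimpleNamespace(title='My Report', author=None,
                           sections=[section],
                           get_all_references=lambda: [])
problems = missing_attributes(document)
```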

Return Value

Returns bytes representing the complete PDF file content. This binary data can be written directly to a file, sent as an HTTP response, or stored in a BytesIO buffer. The PDF includes formatted title, metadata, all document sections with appropriate styling, and a references section if applicable.
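
Because the return value is plain bytes, a caller can sanity-check and persist it with the standard library alone. The helper below is a hypothetical convenience, not something the source provides; it relies only on the fact that PDF files begin with the %PDF- header. The bytes used in the demonstration are a stand-in, not real output of export_to_pdf.

```python
import tempfile
from pathlib import Path


def save_pdf(pdf_bytes, path):
    """Write PDF bytes to `path` after a basic header check (hypothetical helper)."""
    if not pdf_bytes.startswith(b'%PDF-'):
        raise ValueError('data does not start with a PDF header')
    target = Path(path)
    target.write_bytes(pdf_bytes)
    return target


# Stand-in payload; in real use this would be the result of export_to_pdf(doc).
fake_pdf = b'%PDF-1.4\n% placeholder bytes for illustration\n'

with tempfile.TemporaryDirectory() as tmp:
    saved = save_pdf(fake_pdf, Path(tmp) / 'output.pdf')
    size = saved.stat().st_size
```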

Dependencies

  • reportlab
  • flask

Required Imports

from io import BytesIO
from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle

Conditional/Optional Imports

These imports are only needed under specific conditions:

  • from reportlab.lib.pagesizes import A4 (required, conditional)
    Condition: ReportLab must be installed and the PDF_AVAILABLE flag must be True.
  • html_to_markdown function (optional)
    Condition: needed when section content contains HTML that has to be converted.
  • process_markdown_content function (optional)
    Condition: needed to turn Markdown content into PDF elements.
  • add_formatted_content_to_pdf function (optional)
    Condition: needed to add the processed Markdown elements to the PDF story.
  • clean_html_tags function (optional)
    Condition: used as the fallback when Markdown processing fails.
  • logger object (optional)
    Condition: needed to log a warning when content processing fails.
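
The source only shows PDF_AVAILABLE being read, not where it is set. A common way to define such a flag is an import guard at module level; the sketch below assumes that pattern and is not taken from complex_app.py. The name require_pdf_support is hypothetical.

```python
try:
    # These are the ReportLab names export_to_pdf uses.
    from reportlab.lib.pagesizes import A4
    from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
    from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle
    PDF_AVAILABLE = True
except ImportError:
    # ReportLab is not installed; PDF export must be refused at call time.
    PDF_AVAILABLE = False


def require_pdf_support(available=None):
    """Raise the same ImportError as export_to_pdf when ReportLab is missing.

    `available` overrides the module flag, which makes the check easy to test.
    """
    if available is None:
        available = PDF_AVAILABLE
    if not available:
        raise ImportError("reportlab not available")
```

Calling a guard like this early, for example when a request for a PDF export arrives, lets the application report the missing dependency before any document is loaded.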

Usage Example

from io import BytesIO
from datetime import datetime
from reportlab.lib.pagesizes import A4
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, PageBreak
from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle

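# Assumes export_to_pdf is defined in (or imported into) this module, PDF_AVAILABLE = True,
# and the helpers it calls (html_to_markdown, process_markdown_content,
# add_formatted_content_to_pdf, clean_html_tags, logger) are available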

class Section:
    def __init__(self, type, level, title, content, references):
        self.type = type
        self.level = level
        self.title = title
        self.content = content
        self.references = references
        self.id = 1

class Document:
    def __init__(self, title, author, created_at, sections):
        self.title = title
        self.author = author
        self.created_at = created_at
        self.sections = sections
    
    def get_all_references(self):
        refs = []
        for section in self.sections:
            if section.references:
                refs.extend(section.references)
        return refs

# Create document
sections = [
    Section('header', 1, 'Introduction', '', []),
    Section('text', None, 'Overview', 'This is the introduction text.', [{'title': 'Reference 1', 'source': 'Source A'}])
]

doc = Document('My Report', 'John Doe', datetime.now(), sections)

# Export to PDF
pdf_bytes = export_to_pdf(doc)

# Save to file
with open('output.pdf', 'wb') as f:
    f.write(pdf_bytes)
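
One caveat when preparing inputs like the ones above: ReportLab's Paragraph interprets its text as XML-like markup (which is how the <b> and <i> tags in the source take effect), and export_to_pdf passes document.title, document.author, and section titles through unescaped. A title containing & or < may therefore be misrendered or rejected. A minimal sketch of escaping such fields beforehand with the standard library follows; safe_text is a hypothetical helper, not part of the source.

```python
from xml.sax.saxutils import escape


def safe_text(value):
    """Escape &, < and > so the value is treated as literal text.

    Hypothetical helper; None and empty strings are returned unchanged so an
    optional field such as author keeps its falsy value.
    """
    return escape(value) if value else value


raw_title = 'R&D <draft> results'
escaped_title = safe_text(raw_title)
```

Content fields are a different case: they are meant to carry HTML or Markdown and go through the conversion helpers, so escaping them up front would change what the function renders.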

Best Practices

  • Check the PDF_AVAILABLE flag before calling this function; it raises ImportError when the flag is false
  • The document object must have all required attributes (title, author, created_at, sections) and the get_all_references() method
  • Section content can be HTML or Markdown; any content containing both '<' and '>' is treated as HTML and converted with html_to_markdown before Markdown processing
  • Helper functions (html_to_markdown, process_markdown_content, add_formatted_content_to_pdf, clean_html_tags) must be defined in the same module or imported
  • The function includes error handling for content processing failures with fallback to simple paragraph formatting
  • References in the final references section are deduplicated by the concatenation of their title and source; per-section reference lists are not deduplicated
  • The page size is fixed at A4 inside the function; to produce a different size, change the pagesize argument passed to SimpleDocTemplate
  • The returned bytes can be directly written to a file or sent as an HTTP response with appropriate content-type header (application/pdf)
  • Consider memory usage for large documents as the entire PDF is built in memory before returning
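
The HTML detection in the source is a plain substring test, so Markdown that merely contains comparison operators is also routed through html_to_markdown. The sketch below reproduces that test and contrasts it with a stricter, hypothetical alternative that looks for something shaped like a tag. The function looks_like_html_strict and its regular expression are assumptions for illustration, not the source's behavior, and the stricter check is still a heuristic.

```python
import re


def looks_like_html(content):
    """The test used in export_to_pdf: both angle brackets appear somewhere."""
    return '<' in content and '>' in content


# Hypothetical stricter check: an opening or closing tag such as <p>, </div>, <br/>.
_TAG_RE = re.compile(r'</?[A-Za-z][A-Za-z0-9]*(\s[^<>]*)?/?>')


def looks_like_html_strict(content):
    return _TAG_RE.search(content) is not None


samples = {
    'html': '<p>Hello <b>world</b></p>',
    'markdown_math': 'Keep rows where a < b and c > d.',
    'plain': 'No markup here.',
}
loose = {name: looks_like_html(text) for name, text in samples.items()}
strict = {name: looks_like_html_strict(text) for name, text in samples.items()}
```

Here the two checks agree on the HTML and plain samples and differ only on the Markdown sentence with comparison operators, which the substring test classifies as HTML.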

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function export_to_pdf 85.3% similar

    Exports a document with text and data sections to a PDF file using ReportLab, handling custom styling, section ordering, and content formatting including Quill Delta to HTML/Markdown conversion.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function add_formatted_content_to_pdf_v1 73.2% similar

    Converts processed markdown elements into formatted PDF content by adding paragraphs, headers, lists, and tables to a ReportLab story object with appropriate styling.

    From: /tf/active/vicechatdev/vice_ai/new_app.py
  • function html_to_pdf 72.4% similar

    Converts HTML content to a PDF file using ReportLab with intelligent parsing of email-formatted HTML, including metadata extraction, body content processing, and attachment information.

    From: /tf/active/vicechatdev/msg_to_eml.py
  • function add_formatted_content_to_pdf 72.0% similar

    Processes markdown elements and adds them to a PDF document story with appropriate formatting, handling headers, paragraphs, lists, and tables.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py
  • function export_to_docx_v1 70.8% similar

    Exports a document object to Microsoft Word DOCX format, converting sections, content, and references into a formatted Word document with proper styling and structure.

    From: /tf/active/vicechatdev/vice_ai/complex_app.py