class PageAnalysis
A dataclass that encapsulates the analysis results for a single PDF page, including its image representation, text content, dimensions, and optional analysis metadata.
/tf/active/vicechatdev/e-ink-llm/multi_page_processor.py
18 - 26
simple
Purpose
PageAnalysis serves as a structured data container for storing comprehensive information about a single PDF page after processing and analysis. It holds the page's visual representation (as base64-encoded image), extracted text content, page dimensions, and optional analysis results such as content type classification and key elements identification. This class is typically used in PDF processing pipelines where pages need to be analyzed individually and their results stored in a structured format for further processing or reporting.
Source Code
class PageAnalysis:
    """Analysis result for a single PDF page"""
    page_number: int
    image_b64: str
    text_content: str
    dimensions: Tuple[int, int]
    analysis_result: Optional[str] = None
    content_type: Optional[str] = None
    key_elements: Optional[List[str]] = None
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| bases | - | - | |
Parameter Details
page_number: The sequential number of the page within the PDF document (typically 1-indexed). Used to identify and order pages within a document.
image_b64: Base64-encoded string representation of the page rendered as an image. This allows the visual content of the page to be stored and transmitted as text.
text_content: The extracted text content from the PDF page. Contains all readable text elements found on the page.
dimensions: A tuple of two integers (width, height) representing the pixel dimensions of the page image.
analysis_result: Optional string containing the results of any analysis performed on the page (e.g., summary, classification results, or structured analysis output). Defaults to None if no analysis has been performed.
content_type: Optional string indicating the type or category of content on the page (e.g., 'table', 'text', 'image', 'mixed'). Defaults to None if not classified.
key_elements: Optional list of strings identifying important elements or features found on the page (e.g., ['header', 'table', 'chart']). Defaults to None if not analyzed.
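The relationship between image_b64 and the underlying image bytes can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the make_page helper and the placeholder bytes are hypothetical, and the class is redefined here only so the snippet runs on its own.

```python
import base64
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageAnalysis:
    page_number: int
    image_b64: str
    text_content: str
    dimensions: Tuple[int, int]
    analysis_result: Optional[str] = None
    content_type: Optional[str] = None
    key_elements: Optional[List[str]] = None


def make_page(page_number: int, image_bytes: bytes, text: str,
              size: Tuple[int, int]) -> PageAnalysis:
    """Hypothetical helper: wrap raw image bytes in a PageAnalysis."""
    # b64encode returns bytes; decode to str to match the field type.
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return PageAnalysis(page_number, encoded, text, size)


# Stand-in bytes; a real pipeline would pass rendered image data here.
raw = b"\x89PNG\r\n\x1a\n-placeholder-"
page = make_page(1, raw, "Sample text", (800, 1100))

# Decoding the field recovers the original bytes unchanged.
decoded = base64.b64decode(page.image_b64)

# dimensions unpacks as (width, height).
width, height = page.dimensions
```

Optional fields that are not passed (analysis_result, content_type, key_elements) stay None until an analysis step fills them in.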
Return Value
Instantiation returns a PageAnalysis object holding the specified attributes. The dataclass decorator generates __init__, __repr__, and __eq__ automatically. The usage example applies the decorator without frozen=True, so instances are mutable; the object is a data container that is immutable only by convention.
Class Interface
Methods
__init__(page_number: int, image_b64: str, text_content: str, dimensions: Tuple[int, int], analysis_result: Optional[str] = None, content_type: Optional[str] = None, key_elements: Optional[List[str]] = None) -> None
Purpose: Initializes a new PageAnalysis instance with the provided page data and optional analysis results. Auto-generated by the dataclass decorator.
Parameters:
page_number: The page number within the PDF document
image_b64: Base64-encoded image representation of the page
text_content: Extracted text from the page
dimensions: Tuple of (width, height) in pixels
analysis_result: Optional analysis results or summary
content_type: Optional content type classification
key_elements: Optional list of identified key elements
Returns: None (constructor)
__repr__() -> str
Purpose: Returns a string representation of the PageAnalysis object showing all field values. Auto-generated by the dataclass decorator.
Returns: String representation of the object in the format 'PageAnalysis(page_number=..., image_b64=..., ...)'
__eq__(other: object) -> bool
Purpose: Compares two PageAnalysis objects for equality based on all field values. Auto-generated by the dataclass decorator.
Parameters:
other: Another object to compare with
Returns: True if all fields are equal, False otherwise
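The behavior of the generated methods can be checked with a short sketch. It assumes the class is decorated with a plain @dataclass (no extra arguments), as in the usage example; the field values are arbitrary.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class PageAnalysis:
    page_number: int
    image_b64: str
    text_content: str
    dimensions: Tuple[int, int]
    analysis_result: Optional[str] = None
    content_type: Optional[str] = None
    key_elements: Optional[List[str]] = None


a = PageAnalysis(1, "aGVsbG8=", "hello", (800, 1100))
b = PageAnalysis(1, "aGVsbG8=", "hello", (800, 1100))
c = PageAnalysis(2, "aGVsbG8=", "hello", (800, 1100))

# __eq__ compares all fields in declaration order.
same = a == b
different = a == c

# __repr__ lists every field, including the full image_b64 string.
text = repr(a)

# With eq=True and no frozen=True, the decorator sets __hash__ to None,
# so instances cannot be used as dict keys or set members.
hashable = PageAnalysis.__hash__ is not None
```

Because __repr__ includes the entire image_b64 value, printing or logging an instance of a real page can produce very long output.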
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| page_number | int | The sequential number of the page within the PDF document | instance |
| image_b64 | str | Base64-encoded string representation of the page rendered as an image | instance |
| text_content | str | The extracted text content from the PDF page | instance |
| dimensions | Tuple[int, int] | A tuple containing the width and height of the page image in pixels | instance |
| analysis_result | Optional[str] | Optional string containing analysis results, summaries, or structured output from page analysis | instance |
| content_type | Optional[str] | Optional classification of the page content type (e.g., 'table', 'text', 'image', 'mixed') | instance |
| key_elements | Optional[List[str]] | Optional list of identified key elements or features on the page (e.g., headers, tables, charts) | instance |
Dependencies
- dataclasses
- typing
Required Imports
from dataclasses import dataclass
from typing import Tuple, Optional, List
Usage Example
from dataclasses import dataclass
from typing import Tuple, Optional, List

@dataclass
class PageAnalysis:
    page_number: int
    image_b64: str
    text_content: str
    dimensions: Tuple[int, int]
    analysis_result: Optional[str] = None
    content_type: Optional[str] = None
    key_elements: Optional[List[str]] = None

# Create a PageAnalysis instance for a simple page
page_analysis = PageAnalysis(
    page_number=1,
    image_b64="iVBORw0KGgoAAAANSUhEUgAAAAEAAAABCAYAAAAfFcSJAAAADUlEQVR42mNk+M9QDwADhgGAWjR9awAAAABJRU5ErkJggg==",
    text_content="This is the text content of page 1.",
    dimensions=(800, 1100)
)

# Create a PageAnalysis instance with optional fields
detailed_analysis = PageAnalysis(
    page_number=2,
    image_b64="base64_encoded_image_data_here",
    text_content="Page 2 contains a table and chart.",
    dimensions=(800, 1100),
    analysis_result="This page contains financial data with a summary table and trend chart.",
    content_type="mixed",
    key_elements=["table", "chart", "header"]
)

# Access attributes
print(f"Page {page_analysis.page_number}: {page_analysis.dimensions}")
print(f"Content type: {detailed_analysis.content_type}")
print(f"Key elements: {detailed_analysis.key_elements}")
Best Practices
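- The class is a plain (non-frozen) dataclass, so attributes can be reassigned after instantiation. Treat instances as immutable by convention and avoid modifying attributes unless necessary, for example to fill in the optional analysis fields.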
- The image_b64 field can contain large amounts of data for high-resolution pages. Consider memory implications when storing many PageAnalysis objects.
- Always provide the required fields (page_number, image_b64, text_content, dimensions) during instantiation. Optional fields can be set later if needed.
- Use meaningful values for content_type to enable consistent categorization across your application (e.g., establish a fixed set of content types).
- The key_elements list should contain standardized element names for consistency in downstream processing.
- When serializing PageAnalysis objects (e.g., to JSON), be aware that the image_b64 field may significantly increase payload size.
- Page numbers should typically start at 1 to match conventional PDF page numbering, though 0-indexing is also acceptable if used consistently.
- The dimensions tuple should represent (width, height) in pixels, matching the resolution of the image_b64 data.
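Two of the practices above, keeping image_b64 out of serialized payloads and avoiding in-place mutation, can be sketched with the standard library. The to_summary_json helper and its include_image flag are hypothetical names for illustration, not part of the documented module.

```python
import json
from dataclasses import asdict, dataclass, replace
from typing import List, Optional, Tuple


@dataclass
class PageAnalysis:
    page_number: int
    image_b64: str
    text_content: str
    dimensions: Tuple[int, int]
    analysis_result: Optional[str] = None
    content_type: Optional[str] = None
    key_elements: Optional[List[str]] = None


def to_summary_json(page: PageAnalysis, include_image: bool = False) -> str:
    """Hypothetical helper: serialize a page, dropping the image by default."""
    data = asdict(page)
    if not include_image:
        data.pop("image_b64")
    return json.dumps(data)


page = PageAnalysis(1, "A" * 4000, "Quarterly totals", (800, 1100))

# replace() returns a new instance with the given fields changed,
# leaving the original object untouched.
analyzed = replace(page, content_type="table", key_elements=["header", "table"])

summary = json.loads(to_summary_json(analyzed))
full_size = len(to_summary_json(analyzed, include_image=True))
small_size = len(to_summary_json(analyzed))
```

Note that JSON has no tuple type, so dimensions comes back as a list after a round trip and would need converting if a tuple is required.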
Similar Components
AI-powered semantic similarity - components with related functionality:
- class MultiPageAnalysisResult 81.5% similar
- class DocumentSummary 70.6% similar
- class AnalysisResult 67.7% similar
- class AnalysisResult_v1 67.4% similar
- class DataSection 66.9% similar