function compare_pdf_content

Maturity: 46

Compares the textual content similarity between two PDF files by extracting text samples and computing a similarity ratio using sequence matching.

File: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
Lines: 150 - 158
Complexity: moderate

Purpose

This function estimates how similar two PDF documents are based on their textual content. It extracts a text sample from the first few pages of each PDF via extract_text_from_pdf_sample (documented as returning up to 5000 characters) and passes both samples to difflib.SequenceMatcher, which yields a similarity ratio between 0.0 (no matching content) and 1.0 (identical samples). This is useful for detecting duplicate documents, finding similar PDFs, or identifying versions of the same document.
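As a rough illustration of the ratio described above, the sketch below applies SequenceMatcher directly to short strings. The strings are made-up stand-ins for extracted PDF text, not output from the real extractor.

```python
from difflib import SequenceMatcher

# Hypothetical stand-ins for text samples extracted from two PDFs.
text_a = "Invoice 1001: total due 250.00 EUR"
text_b = "Invoice 1002: total due 250.00 EUR"

# Comparing a sample with itself gives the maximum ratio.
identical = SequenceMatcher(None, text_a, text_a).ratio()

# A single differing character keeps the ratio close to, but below, 1.0.
near_duplicate = SequenceMatcher(None, text_a, text_b).ratio()

# Strings with no characters in common give the minimum ratio.
unrelated = SequenceMatcher(None, "abc", "xyz").ratio()

print(identical, near_duplicate, unrelated)
```

Near-duplicates such as re-issued documents tend to land near the top of the range, which is why a high threshold is a common choice for duplicate detection.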

Source Code

def compare_pdf_content(file1: str, file2: str) -> float:
    """Compare PDF content similarity (first few pages)"""
    text1 = extract_text_from_pdf_sample(file1)
    text2 = extract_text_from_pdf_sample(file2)
    
    if not text1 or not text2:
        return 0.0
    
    return SequenceMatcher(None, text1, text2).ratio()

Parameters

Name   Type  Default  Kind
file1  str   -        positional_or_keyword
file2  str   -        positional_or_keyword

Parameter Details

file1: Path to the first PDF file as a string. Should be a valid file path pointing to an existing PDF document. Can be absolute or relative path.

file2: Path to the second PDF file as a string. Should be a valid file path pointing to an existing PDF document. Can be absolute or relative path.

Return Value

Type: float

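Returns a float between 0.0 and 1.0 representing the similarity ratio between the text samples of the two PDF documents. A value of 1.0 means the extracted samples are identical. A value of 0.0 means either that the samples share no matching characters or that no text could be extracted from at least one file. Intermediate values come from SequenceMatcher.ratio(), which computes 2*M/T, where M is the number of matched characters and T is the combined length of both samples.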
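To make the intermediate values concrete, the following sketch recomputes the ratio by hand from the matching blocks and compares it with SequenceMatcher.ratio(). The two input strings are arbitrary examples.

```python
from difflib import SequenceMatcher

a, b = "abcd", "bcde"
matcher = SequenceMatcher(None, a, b)

# M: total size of all matching blocks (the final block is a zero-size sentinel).
matched = sum(block.size for block in matcher.get_matching_blocks())

# T: combined length of both sequences.
total = len(a) + len(b)

# ratio() is defined as 2.0 * M / T.
manual_ratio = 2.0 * matched / total

print(matched, total, manual_ratio, matcher.ratio())
```

Here the shared run "bcd" gives M = 3 and T = 8, so both the manual calculation and ratio() yield 0.75.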

Dependencies

  • PyPDF2
  • difflib (Python standard library)

Required Imports

from difflib import SequenceMatcher

Usage Example

# The import path is an assumption; adjust it to your package layout.
# extract_text_from_pdf_sample lives in the same module, so it does not
# need to be redefined here.
from enhanced_document_comparison import compare_pdf_content

# Compare two PDF files
file1_path = 'document1.pdf'
file2_path = 'document2.pdf'

similarity_score = compare_pdf_content(file1_path, file2_path)

# Note: 0.0 can also mean that no text was extracted from one of the files.
if similarity_score > 0.9:
    print(f'Documents are highly similar: {similarity_score:.2%}')
elif similarity_score > 0.5:
    print(f'Documents are moderately similar: {similarity_score:.2%}')
else:
    print(f'Documents are different: {similarity_score:.2%}')

Best Practices

  • Ensure both PDF file paths are valid and accessible before calling this function to avoid file not found errors.
  • The function returns 0.0 whenever either extracted text is empty, so a 0.0 result alone does not distinguish 'no similarity' from 'extraction failed'. Check the extracted text separately if that distinction matters.
  • This function only compares a sample of pages (not the entire document), so it may not be suitable for detecting differences in later pages of long documents.
  • The SequenceMatcher algorithm is case-sensitive and whitespace-sensitive, so formatting differences may affect the similarity score.
  • For large-scale PDF comparison operations, consider caching extracted text to avoid repeated PDF parsing.
  • The function depends on 'extract_text_from_pdf_sample' being available in the same scope - ensure this dependency is properly defined or imported.
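Several of the points above (ambiguous 0.0 results, case and whitespace sensitivity, repeated parsing) can be addressed with a thin wrapper. The following is a minimal sketch, not part of the original module: normalize_text and make_cached_comparer are hypothetical names, and the extractor is passed in as a parameter so that any function with the signature path -> text can be used.

```python
import re
from difflib import SequenceMatcher
from functools import lru_cache
from typing import Callable, Optional


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace runs so layout noise counts less."""
    return re.sub(r"\s+", " ", text).strip().lower()


def make_cached_comparer(
    extract: Callable[[str], str], cache_size: int = 256
) -> Callable[[str, str], Optional[float]]:
    """Build a comparer around any extractor with the signature path -> text."""

    @lru_cache(maxsize=cache_size)
    def cached_text(path: str) -> str:
        # Each path is parsed at most once while it stays in the cache.
        return normalize_text(extract(path) or "")

    def compare(file1: str, file2: str) -> Optional[float]:
        text1, text2 = cached_text(file1), cached_text(file2)
        if not text1 or not text2:
            # No usable text was extracted; report this separately from 0.0.
            return None
        return SequenceMatcher(None, text1, text2).ratio()

    return compare
```

A caller would pass the module's extract_text_from_pdf_sample as the extract argument and treat a None result as "could not compare". Because the cache is keyed by path only, it should be rebuilt if files may change on disk between comparisons.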

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function compare_documents_v1 70.4% similar

    Compares two sets of PDF documents by matching document codes, detecting signatures, calculating content similarity, and generating detailed comparison results with signature information.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function extract_text_from_pdf_sample 67.0% similar

    Extracts text content from the first few pages of a PDF file for content comparison purposes, returning up to 5000 characters.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function fuzzy_match_score 59.4% similar

    Calculates a fuzzy string similarity score between two input strings using the SequenceMatcher algorithm, returning a ratio between 0.0 and 1.0.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function test_extraction_methods 58.4% similar

    A test function that compares two PDF text extraction methods (regular llmsherpa and OCR-based Tesseract) on a specific purchase order document from FileCloud, checking for vendor name detection.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_extraction_methods.py
  • function find_best_match 56.5% similar

    Finds the best matching document from a list of candidates by comparing hash, size, filename, and content similarity with configurable confidence thresholds.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py