🔍 Code Extractor

function test_end_date_extraction

Maturity: 53

Tests end date extraction for contract documents that previously had missing end dates: downloads each document from FileCloud, extracts its text, analyzes it with an LLM, and reports the results.

File:
/tf/active/vicechatdev/contract_validity_analyzer/test_missing_end_dates.py
Lines:
47 - 229
Complexity:
complex

Purpose

This function serves as an integration test to validate improvements in contract end date extraction. It processes a predefined list of test documents (TEST_DOCUMENTS) that historically had missing end dates, downloads them from FileCloud, extracts text content, analyzes them using an LLM client to identify contract metadata (type, start date, end date, status), and generates a comprehensive report showing which documents now have successfully extracted end dates. The function tracks success rates, logs detailed results, saves analysis to JSON, and reports LLM token usage statistics.
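The "end date found" check described above treats the literal string 'null', the empty string, and None as missing. A minimal standalone sketch of that predicate (has_end_date is a hypothetical name, mirroring the inline check in the source):

```python
def has_end_date(analysis_result):
    """Return True when the LLM analysis produced a usable end date.

    Mirrors the inline check in the test: the LLM may return the literal
    string 'null', an empty string, or omit the key entirely, all of
    which count as 'missing'.
    """
    end_date = analysis_result.get('end_date')
    return bool(end_date) and end_date not in ('null', None, '')
```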

Source Code

def test_end_date_extraction():
    """Test end date extraction for documents with previously missing end dates."""
    logger = setup_test_logging()
    logger.info("Starting end date extraction test for documents with missing end dates")
    
    try:
        # Load configuration
        config = Config()
        
        # Initialize components
        fc_client = FileCloudClient(config.get_section('filecloud'))
        doc_processor = DocumentProcessor(config.get_section('document_processing'))
        llm_client = LLMClient(config.get_section('llm'))
        
        # Connect to FileCloud
        if not fc_client.connect():
            logger.error("Failed to connect to FileCloud")
            return False
            
        logger.info(f"Testing {len(TEST_DOCUMENTS)} documents with previously missing end dates")
        
        results = []
        
        for i, filename in enumerate(TEST_DOCUMENTS, 1):
            logger.info(f"\n{'='*60}")
            logger.info(f"Testing document {i}/{len(TEST_DOCUMENTS)}: {filename}")
            logger.info(f"{'='*60}")
            
            try:
                # Search for the document
                documents = fc_client.search_documents()
                target_doc = None
                
                for doc in documents:
                    if doc['filename'] == filename:
                        target_doc = doc
                        break
                
                if not target_doc:
                    logger.warning(f"Document not found: {filename}")
                    results.append({
                        'filename': filename,
                        'status': 'not_found',
                        'error': 'Document not found in FileCloud'
                    })
                    continue
                
                # Download the document
                logger.info(f"Downloading: {filename}")
                local_path = fc_client.download_document(target_doc['full_path'])
                
                if not local_path or not os.path.exists(local_path):
                    logger.error(f"Failed to download: {filename}")
                    results.append({
                        'filename': filename,
                        'status': 'download_failed',
                        'error': 'Failed to download document'
                    })
                    continue
                
                # Extract text from document
                logger.info(f"Extracting text from: {filename}")
                document_text = doc_processor.extract_text(local_path)
                
                if not document_text or len(document_text.strip()) < 100:
                    logger.warning(f"Little or no text extracted from: {filename}")
                    results.append({
                        'filename': filename,
                        'status': 'extraction_failed',
                        'error': 'No meaningful text extracted',
                        'text_length': len(document_text) if document_text else 0
                    })
                    continue
                
                logger.info(f"Extracted {len(document_text)} characters of text")
                
                # Analyze with LLM using optimized 2-step approach
                logger.info(f"Analyzing contract: {filename}")
                analysis_result = llm_client.analyze_contract(document_text, filename)
                
                # Log results
                logger.info(f"Analysis completed for: {filename}")
                logger.info(f"Contract Type: {analysis_result.get('contract_type', 'Unknown')}")
                logger.info(f"Start Date: {analysis_result.get('start_date', 'Not found')}")
                logger.info(f"End Date: {analysis_result.get('end_date', 'Not found')}")
                logger.info(f"Is In Effect: {analysis_result.get('is_in_effect', 'Unknown')}")
                logger.info(f"Confidence: {analysis_result.get('confidence', 0.0)}")
                logger.info(f"Analysis Notes: {analysis_result.get('analysis_notes', 'None')}")
                
                # Check if this is an improvement (end date found when previously missing)
                end_date_found = analysis_result.get('end_date') and analysis_result.get('end_date') not in ['null', None, '']
                improvement_status = "✓ IMPROVED - End date found!" if end_date_found else "✗ Still missing end date"
                logger.info(f"Improvement Status: {improvement_status}")
                
                # Store result
                result = {
                    'filename': filename,
                    'status': 'analyzed',
                    'contract_type': analysis_result.get('contract_type'),
                    'start_date': analysis_result.get('start_date'),
                    'end_date': analysis_result.get('end_date'),
                    'is_in_effect': analysis_result.get('is_in_effect'),
                    'confidence': analysis_result.get('confidence'),
                    'analysis_notes': analysis_result.get('analysis_notes'),
                    'text_length': len(document_text),
                    'token_usage': analysis_result.get('_metadata', {}).get('token_usage', {})
                }
                
                results.append(result)
                
                # Clean up downloaded file
                try:
                    os.remove(local_path)
                except OSError:
                    # Ignore cleanup failures; the temp file is orphaned at worst
                    pass
                    
            except Exception as e:
                logger.error(f"Error processing {filename}: {e}")
                results.append({
                    'filename': filename,
                    'status': 'error',
                    'error': str(e)
                })
        
        # Disconnect from FileCloud
        fc_client.disconnect()
        
        # Analyze results
        logger.info(f"\n{'='*60}")
        logger.info("TEST RESULTS SUMMARY")
        logger.info(f"{'='*60}")
        
        successful_analyses = [r for r in results if r['status'] == 'analyzed']
        found_end_dates = [r for r in successful_analyses if r.get('end_date') and r['end_date'] != 'null']
        
        logger.info(f"Total documents tested: {len(TEST_DOCUMENTS)}")
        logger.info(f"Successfully analyzed: {len(successful_analyses)}")
        logger.info(f"End dates found: {len(found_end_dates)}")
        if successful_analyses:
            logger.info(f"Success rate: {len(found_end_dates)/len(successful_analyses)*100:.1f}% (of successfully analyzed)")
        
        # Detailed results
        logger.info(f"\nDETAILED RESULTS:")
        improvements = 0
        for result in results:
            if result['status'] == 'analyzed':
                end_date_found = result.get('end_date') and result['end_date'] not in ['null', None, '']
                end_date_status = "✓ Found" if end_date_found else "✗ Missing"
                if end_date_found:
                    improvements += 1
                logger.info(f"  {result['filename']}")
                logger.info(f"    Type: {result.get('contract_type', 'Unknown')}")
                logger.info(f"    Start: {result.get('start_date', 'Not found')}")
                logger.info(f"    End: {result.get('end_date', 'Not found')} ({end_date_status})")
                logger.info(f"    Confidence: {result.get('confidence', 0.0)}")
                logger.info(f"    Notes: {result.get('analysis_notes', 'None')}")
                logger.info("")
            else:
                logger.info(f"  {result['filename']}: {result['status']} - {result.get('error', 'Unknown error')}")
        
        logger.info(f"IMPROVEMENT SUMMARY:")
        logger.info(f"Documents that now have end dates: {improvements}/{len(successful_analyses)}")
        if successful_analyses:
            logger.info(f"Improvement rate: {improvements/len(successful_analyses)*100:.1f}% (of successfully analyzed)")
        
        # Save results to JSON file
        results_file = 'test_missing_end_dates_results.json'
        with open(results_file, 'w') as f:
            json.dump(results, f, indent=2, default=str)
        
        logger.info(f"\nDetailed results saved to: {results_file}")
        
        # LLM usage stats
        usage_stats = llm_client.get_usage_stats()
        if usage_stats.get('total_tokens', 0) > 0:
            logger.info(f"\nLLM Usage Statistics:")
            logger.info(f"  Total tokens: {usage_stats['total_tokens']:,}")
            logger.info(f"  Prompt tokens: {usage_stats['total_prompt_tokens']:,}")
            logger.info(f"  Completion tokens: {usage_stats['total_completion_tokens']:,}")
        
        return len(found_end_dates) > 0
        
    except Exception as e:
        logger.error(f"Test failed with error: {e}")
        return False

Return Value

Returns a boolean value: True if at least one document had its end date successfully extracted (len(found_end_dates) > 0), False if the test failed with an error or no end dates were found. This indicates whether the end date extraction functionality is working for at least some of the test documents.
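In an automated pipeline this boolean maps naturally onto a process exit status. A minimal sketch (exit_code_for is a hypothetical helper, not part of the module):

```python
import sys

def exit_code_for(success: bool) -> int:
    """Map the test's boolean result to a shell exit status (0 = pass)."""
    return 0 if success else 1

# A CI entry point might then do:
# sys.exit(exit_code_for(test_end_date_extraction()))
```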

Dependencies

  • os
  • sys
  • json
  • pathlib
  • logging

Required Imports

import os
import sys
import json
from pathlib import Path
from config.config import Config
from utils.filecloud_client import FileCloudClient
from utils.document_processor import DocumentProcessor
from utils.llm_client import LLMClient
from utils.logging_utils import setup_logging, get_logger
import logging

Usage Example

# Define test documents list
TEST_DOCUMENTS = [
    'contract_2023_vendor_a.pdf',
    'agreement_2022_supplier_b.docx',
    'lease_2021_property_c.pdf'
]

# Define test logging setup
def setup_test_logging():
    logging.basicConfig(level=logging.INFO)
    return logging.getLogger(__name__)

# Run the test
import os
import sys
import json
from pathlib import Path
from config.config import Config
from utils.filecloud_client import FileCloudClient
from utils.document_processor import DocumentProcessor
from utils.llm_client import LLMClient
from utils.logging_utils import setup_logging, get_logger
import logging

# Execute test
success = test_end_date_extraction()
if success:
    print('Test passed: At least one end date was successfully extracted')
else:
    print('Test failed: No end dates were extracted or error occurred')

# Review results
with open('test_missing_end_dates_results.json', 'r') as f:
    results = json.load(f)
    for result in results:
        print(f"{result['filename']}: {result.get('end_date', 'Not found')}")

Best Practices

  • Ensure TEST_DOCUMENTS global variable is defined before calling this function with a list of document filenames to test
  • Verify all configuration sections ('filecloud', 'document_processing', 'llm') are properly configured before running
  • Monitor disk space as documents are temporarily downloaded during processing
  • Review the generated 'test_missing_end_dates_results.json' file for detailed analysis results
  • Check LLM token usage statistics to monitor API costs during testing
  • Ensure FileCloud credentials have read access to all test documents
  • The function cleans up downloaded files automatically, but verify temp directory if errors occur
  • Use this function in a test environment before production deployment to validate extraction improvements
  • Consider the function's return value (boolean) for automated test pipelines - True indicates at least partial success
  • Review logger output for detailed per-document analysis including confidence scores and analysis notes
  • The function continues processing remaining documents even if an individual document fails; check the status field in the results
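Since failed documents remain in the results list with a status field, the saved JSON can be tallied after a run. A minimal sketch (summarize_statuses is a hypothetical helper):

```python
import json
from collections import Counter

def summarize_statuses(results):
    """Tally per-document outcomes, e.g. 'analyzed', 'not_found', 'error'."""
    return Counter(r['status'] for r in results)

# Against the file the test writes:
# with open('test_missing_end_dates_results.json') as f:
#     print(summarize_statuses(json.load(f)))
```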

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function test_single_document 90.4% similar

    Tests end date extraction from a specific PDF document by downloading it from FileCloud, extracting text, and using LLM-based analysis to identify contract expiry dates.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_single_document.py
  • function test_local_document 85.0% similar

    Integration test function that validates end date extraction from a local PDF document using document processing and LLM-based analysis.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_local_document.py
  • function test_simulated_document 82.7% similar

    Integration test function that validates end date extraction from a simulated contract document containing an explicit term clause, using a two-step LLM-based analysis process.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_simulated_document.py
  • function test_with_simulated_content 73.9% similar

    Tests LLM-based contract analysis prompts using simulated NDA content containing a term clause to verify extraction of contract dates and metadata.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_local_document.py
  • function test_llm_extraction 73.0% similar

    A test function that validates LLM-based contract data extraction by processing a sample contract and verifying the extracted fields against expected values.

    From: /tf/active/vicechatdev/contract_validity_analyzer/test_extractor.py