test_international_tax_ids

function test_international_tax_ids

Maturity: 49

A test function that validates an LLM client's ability to extract tax identification numbers and business registration numbers from a multi-party international contract document across 8 different countries.

File:
/tf/active/vicechatdev/contract_validity_analyzer/test_international_tax_ids.py

Lines:
18 - 163

Complexity:
moderate

Purpose

This function serves as an integration test for LLM-based document analysis, specifically the extraction of international tax and business registration identifiers: US EIN, Belgian VAT and company registration numbers, UK company registration and VAT numbers, Australian ABN/ACN, German Handelsregisternummer and USt-IdNr, French SIRET/SIREN, Dutch KvK and BTW numbers, and Canadian BN and HST registration numbers. It builds a synthetic Master Service Agreement with parties from 8 countries, sends it to an LLM for analysis, and passes when at least one expected identifier is found for at least 75% of the countries (6 of 8). Email extraction is also checked and reported, but it does not affect the result.

Source Code

def test_international_tax_ids():
    """Test that the LLM client can extract tax IDs in various international formats."""
    
    # Sample contract text with various international tax ID formats
    test_document = """
    MASTER SERVICE AGREEMENT

    This Master Service Agreement ("Agreement") is entered into on March 10, 2024, 
    between ViceBio Ltd, a company incorporated in England and Wales, and the following parties:

    PARTY 1: TechCorp Solutions Inc. (United States)
    Address: 123 Innovation Drive, San Francisco, CA 94105
    Email: contracts@techcorp.com
    Federal Tax ID (EIN): 12-3456789
    Contact: john.doe@techcorp.com

    PARTY 2: European Innovations BVBA (Belgium)
    Address: Rue de la Innovation 45, 1000 Brussels, Belgium
    Email: contact@euro-innovations.be
    VAT Number: BE0664510277
    Company Registration: 0664.510.277
    Legal Contact: marie.dubois@euro-innovations.be

    PARTY 3: Advanced Systems Ltd (United Kingdom)
    Address: 10 Downing Street, London SW1A 2AA, UK
    Email: info@advancedsystems.co.uk
    Company Registration Number: 12345678
    VAT Number: GB123456789
    Director: david.smith@advancedsystems.co.uk

    PARTY 4: Digital Solutions Pty Ltd (Australia)
    Address: 100 Collins Street, Melbourne VIC 3000, Australia
    Email: admin@digitalsolutions.com.au
    Australian Business Number (ABN): 12 345 678 901
    Australian Company Number (ACN): 123 456 789
    Manager: sarah.johnson@digitalsolutions.com.au

    PARTY 5: Innovation Labs GmbH (Germany)
    Address: Unter den Linden 1, 10117 Berlin, Germany
    Email: kontakt@innovationlabs.de
    Handelsregisternummer: HRB 12345
    USt-IdNr: DE123456789
    Geschäftsführer: thomas.mueller@innovationlabs.de

    PARTY 6: Tech Solutions SARL (France)
    Address: 1 Avenue des Champs-Élysées, 75008 Paris, France
    Email: contact@techsolutions.fr
    SIRET: 12345678901234
    SIREN: 123456789
    Directeur: pierre.martin@techsolutions.fr

    PARTY 7: Digital Innovations B.V. (Netherlands)
    Address: Damrak 1, 1012 LG Amsterdam, Netherlands
    Email: info@digital-innovations.nl
    KvK nummer: 12345678
    BTW nummer: NL123456789B01
    Director: anna.vanderberg@digital-innovations.nl

    PARTY 8: Maple Tech Inc. (Canada)
    Address: 100 Queen Street West, Toronto, ON M5H 2N2, Canada
    Email: contact@mapletech.ca
    Business Number (BN): 123456789RC0001
    HST Registration: 123456789RT0001
    CEO: michael.thompson@mapletech.ca

    This Agreement shall commence on March 10, 2024 and shall remain in effect for a period of two (2) years 
    from the effective date, unless terminated earlier in accordance with the terms herein.

    All parties agree to the terms and conditions set forth in this Agreement.
    """
    
    # Initialize LLM client
    config = {
        'provider': 'openai',
        'model': 'gpt-4o',
        'temperature': 0.0,
        'max_tokens': 4000
    }
    
    llm_client = LLMClient(config)
    
    # Test the contract analysis
    print("Testing international tax ID extraction...")
    try:
        result = llm_client.analyze_contract(test_document, "international_contract.pdf")
        
        print("\nAnalysis Result:")
        print(json.dumps(result, indent=2))
        
        # Define expected tax IDs by format
        expected_tax_ids = {
            'US': ['12-3456789'],
            'Belgium': ['BE0664510277', '0664.510.277'],
            'UK': ['12345678', 'GB123456789'],
            'Australia': ['12 345 678 901', '123 456 789'],
            'Germany': ['HRB 12345', 'DE123456789'],
            'France': ['12345678901234', '123456789'],
            'Netherlands': ['12345678', 'NL123456789B01'],
            'Canada': ['123456789RC0001', '123456789RT0001']
        }
        
        # Check if tax IDs were extracted
        extracted_tax_ids = result.get('third_party_tax_ids', [])
        print(f"\nExtracted tax IDs: {extracted_tax_ids}")
        
        # Count successful extractions by country/format
        found_formats = {}
        for country, ids in expected_tax_ids.items():
            found_count = 0
            for expected_id in ids:
                if any(expected_id in extracted_id for extracted_id in extracted_tax_ids):
                    found_count += 1
            found_formats[country] = f"{found_count}/{len(ids)}"
            
        print("\nTax ID extraction results by country:")
        for country, result_ratio in found_formats.items():
            success = result_ratio.split('/')[0] != '0'
            status = "✓" if success else "✗"
            print(f"  {status} {country}: {result_ratio} formats found")
        
        # Check email extraction as well
        extracted_emails = result.get('third_party_emails', [])
        expected_email_domains = ['techcorp.com', 'euro-innovations.be', 'advancedsystems.co.uk', 
                                'digitalsolutions.com.au', 'innovationlabs.de', 'techsolutions.fr',
                                'digital-innovations.nl', 'mapletech.ca']
        
        found_domains = 0
        for domain in expected_email_domains:
            if any(domain in email for email in extracted_emails):
                found_domains += 1
                
        print(f"\nEmail extraction: {found_domains}/{len(expected_email_domains)} domains found")
        
        # Overall success assessment
        total_expected_countries = len(expected_tax_ids)
        successful_countries = len([c for c, r in found_formats.items() if r.split('/')[0] != '0'])
        
        success_rate = (successful_countries / total_expected_countries) * 100
        print(f"\nOverall tax ID extraction success: {successful_countries}/{total_expected_countries} countries ({success_rate:.1f}%)")
        
        # Return success if at least 75% of the countries had one or more IDs found
        return success_rate >= 75.0
        
    except Exception as e:
        print(f"Error during analysis: {e}")
        return False
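
One subtlety in the matching loop above: it uses substring containment (expected_id in extracted_id), so a short expected ID can be counted as found when it only appears inside a longer, unrelated ID. The sketch below contrasts that check with an exact comparison on normalized values. The helper names normalize_tax_id, found_exact, and found_substring are hypothetical and are not part of the original module.

```python
import re
from typing import Iterable


def normalize_tax_id(value: str) -> str:
    """Drop whitespace, dots, and hyphens and upper-case the rest,
    so formatting variants of the same ID compare equal."""
    return re.sub(r"[\s.\-]", "", value).upper()


def found_substring(expected_id: str, extracted_ids: Iterable[str]) -> bool:
    """The check used by the test: substring containment."""
    return any(expected_id in item for item in extracted_ids)


def found_exact(expected_id: str, extracted_ids: Iterable[str]) -> bool:
    """Hypothetical stricter check: equality after normalization."""
    target = normalize_tax_id(expected_id)
    return any(normalize_tax_id(item) == target for item in extracted_ids)


# Suppose only the French SIRET was extracted, not the UK registration number.
extracted = ["12345678901234"]

print(found_substring("12345678", extracted))  # True (false positive)
print(found_exact("12345678", extracted))      # False

# Normalization still accepts formatting differences in the same ID.
print(found_exact("12 345 678 901", ["12345678901"]))  # True
```

Exact comparison has its own trade-off: if the model returns labelled values such as "EIN: 12-3456789", an equality check would miss them, so which check is appropriate depends on the output format that analyze_contract actually produces.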

Return Value

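Returns a boolean: True if the LLM extracted tax IDs for at least 75% (6 of 8) of the countries represented in the test document; False otherwise, or if analyze_contract or the subsequent checks raise an exception. A country counts as successful when at least one of its expected IDs is found. Note that LLMClient(config) is constructed outside the try block, so an exception raised there propagates to the caller instead of producing False.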
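
The threshold arithmetic can be illustrated with a small worked example. The per-country ratios below are made up for illustration, and the helper names country_success_rate and id_success_rate are hypothetical; only the "found/total" string format and the 75% cut-off come from the source.

```python
def country_success_rate(found_formats: dict) -> float:
    """Percentage of countries with at least one matched ID.
    Values are 'found/total' strings, as built by the test."""
    successful = sum(
        1 for ratio in found_formats.values() if ratio.split("/")[0] != "0"
    )
    return successful / len(found_formats) * 100


def id_success_rate(found_formats: dict) -> float:
    """Percentage of individual IDs matched (not used by the test)."""
    found = sum(int(r.split("/")[0]) for r in found_formats.values())
    total = sum(int(r.split("/")[1]) for r in found_formats.values())
    return found / total * 100


# Illustrative outcome: two countries have no match at all.
found_formats = {
    "US": "1/1", "Belgium": "2/2", "UK": "1/2", "Australia": "0/2",
    "Germany": "1/2", "France": "2/2", "Netherlands": "0/2", "Canada": "1/2",
}

rate = country_success_rate(found_formats)
print(rate, rate >= 75.0)                        # 75.0 True
print(round(id_success_rate(found_formats), 1))  # 53.3
```

In this example the test passes with 6 of 8 countries even though only 8 of the 15 individual IDs were matched, which is why the criterion is best described as per-country rather than per-ID.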

Dependencies

json
pathlib
utils.llm_client

Required Imports

import json
from pathlib import Path
from utils.llm_client import LLMClient

Usage Example

# Ensure OpenAI API key is set
import os
os.environ['OPENAI_API_KEY'] = 'your-api-key-here'

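# Import the test function (run from the contract_validity_analyzer directory
# so that test_international_tax_ids.py and the utils package are importable)
from test_international_tax_ids import test_international_tax_ids

# Run the test function
success = test_international_tax_ids()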

if success:
    print("Test passed: LLM successfully extracted international tax IDs")
else:
    print("Test failed: LLM did not meet the 75% extraction threshold")

# Expected output includes:
# - Detailed JSON analysis result
# - Per-country extraction statistics
# - Email domain extraction results
# - Overall success rate percentage

Best Practices

This is a test function; run it in a testing environment, not in production.
Ensure the LLMClient class implements an analyze_contract method that returns a dictionary containing the third_party_tax_ids and third_party_emails keys.
The function uses temperature=0.0 to make results as repeatable as possible, which is appropriate for testing; hosted models can still vary slightly between runs.
The 75% success threshold is hardcoded; consider making it configurable for different testing scenarios.
The function prints detailed output to the console; consider using a logging framework for larger test suites.
Consider API costs: each run makes a live call to GPT-4o with a multi-party contract document.
The test document is embedded in the function; for maintainability, consider externalizing test data.
The except clause catches all exceptions generically; consider more specific exception handling.
Tax IDs are matched by substring, so a short expected ID such as 12345678 also matches inside a longer extracted one such as 12345678901234; consider exact comparison on normalized values.
The function checks both tax IDs and emails, but only the tax ID result determines the return value.
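
Several of the suggestions above (configurable threshold, logging instead of print, and running without a live API call) can be combined in one sketch. Everything here is an assumption for illustration: FakeLLMClient and run_extraction_check are hypothetical names, and the stub only mimics the analyze_contract(text, filename) call and the third_party_tax_ids key that the test relies on.

```python
import logging

logger = logging.getLogger("tax_id_test")


class FakeLLMClient:
    """Hypothetical stand-in for LLMClient: returns a canned result
    and makes no API call."""

    def __init__(self, canned_result: dict):
        self._canned_result = canned_result

    def analyze_contract(self, text: str, filename: str) -> dict:
        return self._canned_result


def run_extraction_check(client, document, expected_tax_ids, threshold=75.0):
    """Return True when the share of countries with at least one
    matched ID (substring match, as in the test) meets the threshold."""
    result = client.analyze_contract(document, "international_contract.pdf")
    extracted = result.get("third_party_tax_ids", [])
    successful = 0
    for country, ids in expected_tax_ids.items():
        hit = any(expected in item for expected in ids for item in extracted)
        logger.info("%s: %s", country, "found" if hit else "missing")
        successful += hit
    rate = successful / len(expected_tax_ids) * 100
    logger.info("success rate: %.1f%%", rate)
    return rate >= threshold


expected = {"US": ["12-3456789"], "Germany": ["HRB 12345", "DE123456789"]}
client = FakeLLMClient(
    {"third_party_tax_ids": ["12-3456789"], "third_party_emails": []}
)

print(run_extraction_check(client, "contract text", expected, threshold=50.0))  # True
print(run_extraction_check(client, "contract text", expected))                  # False
```

A stub like this only exercises the scoring logic; it says nothing about how well the real model extracts IDs, so it complements the live integration test and does not replace it.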

Similar Components

AI-powered semantic similarity - components with related functionality:

function test_new_fields 79.9% similar

A test function that validates an LLM client's ability to extract third-party email addresses and tax identification numbers from contract documents.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_new_fields.py

function test_edge_cases 77.8% similar

Tests edge cases and variations in European tax ID formats by analyzing a sample contract document containing Swiss, Norwegian, Swedish, and Danish tax identifiers.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_international_tax_ids.py

function test_llm_client 69.7% similar

Tests the LLM client functionality by analyzing a sample contract text and verifying the extraction of key contract metadata such as third parties, dates, and status.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_implementation.py

function test_llm_extraction 67.5% similar

A test function that validates LLM-based contract data extraction by processing a sample contract and verifying the extracted fields against expected values.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_extractor.py

function test_local_document 62.2% similar

Integration test function that validates end date extraction from a local PDF document using document processing and LLM-based analysis.
From: /tf/active/vicechatdev/contract_validity_analyzer/test_local_document.py
