🔍 Code Extractor

function extract_batch

Maturity: 66

Batch processes a list of vendors from an Excel file to extract their email addresses by searching through Microsoft 365 mailboxes using AI-powered email analysis.

File: /tf/active/vicechatdev/find_email/extract_vendor_batch.py
Lines: 44 - 115
Complexity: complex

Purpose

This function orchestrates a complete vendor email extraction workflow. It loads vendor data from an Excel file with load_vendor_list, initializes a VendorEmailExtractor with Microsoft Graph API and OpenAI credentials taken from vendor_email_config, and calls extract_for_vendor_list to search organizational mailboxes and extract high-confidence vendor email addresses using AI analysis. Non-empty results are saved to a timestamped Excel file next to the input file, and summary statistics are printed afterwards. Test mode restricts the run to the first 3 vendors. Resume is always requested (resume=True is hard-coded in the call), so an interrupted run can be restarted.

Source Code

def extract_batch(
    vendor_excel_file: str,
    max_mailboxes: Optional[int] = None,
    max_emails_per_mailbox: int = DEFAULT_MAX_EMAILS_PER_MAILBOX,
    days_back: int = DEFAULT_DAYS_BACK,
    test_mode: bool = False
):
    """
    Extract vendor emails for all vendors in the list
    
    Args:
        vendor_excel_file: Path to enriched vendor Excel file
        max_mailboxes: Limit mailboxes searched (None = all)
        max_emails_per_mailbox: Max emails per mailbox per vendor
        days_back: Days to search back
        test_mode: If True, only process first 3 vendors
    """
    print("\n" + "="*60)
    print("VENDOR EMAIL BATCH EXTRACTION")
    print("="*60)
    
    # Load vendors
    vendors = load_vendor_list(vendor_excel_file)
    
    if test_mode:
        print("\n⚠️  TEST MODE: Processing only first 3 vendors")
        vendors = vendors[:3]
    
    # Create extractor
    extractor = VendorEmailExtractor(
        tenant_id=TENANT_ID,
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
        openai_api_key=OPENAI_API_KEY,
        domain=DOMAIN
    )
    
    # Extract emails for all vendors
    results_df = extractor.extract_for_vendor_list(
        vendor_list=vendors,
        max_mailboxes=max_mailboxes,
        max_emails_per_mailbox=max_emails_per_mailbox,
        days_back=days_back,
        resume=True  # Allow resuming if interrupted
    )
    
    # Results are already in 3-column format: Vendor, Retained Emails, Source Mailboxes
    # Save the results
    if not results_df.empty:
        output_file = Path(vendor_excel_file).parent / f"vendors_mailbox_emails_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
        results_df.to_excel(output_file, index=False)
        
        print(f"\n✅ Results saved: {output_file}")
        
        # Show summary
        print("\n" + "="*60)
        print("EXTRACTION SUMMARY")
        print("="*60)
        print(f"Total vendors processed: {len(results_df)}")
        vendors_with_emails = results_df[results_df['Retained Emails'].str.len() > 0]
        print(f"Vendors with HIGH confidence emails: {len(vendors_with_emails)}")
        
        # Count total emails across all vendors (duplicates are not removed)
        total_emails = 0
        for emails_str in results_df['Retained Emails']:
            if emails_str:
                total_emails += len([e.strip() for e in emails_str.split(',') if e.strip()])
        print(f"Total HIGH confidence emails found: {total_emails}")
        print("="*60 + "\n")
    else:
        print("\n⚠️  No results to save")
        print("="*60 + "\n")

Parameters

Name                    Type           Default                         Kind
vendor_excel_file       str            -                               positional_or_keyword
max_mailboxes           Optional[int]  None                            positional_or_keyword
max_emails_per_mailbox  int            DEFAULT_MAX_EMAILS_PER_MAILBOX  positional_or_keyword
days_back               int            DEFAULT_DAYS_BACK               positional_or_keyword
test_mode               bool           False                           positional_or_keyword

Parameter Details

vendor_excel_file: String path to an Excel file containing enriched vendor data. The path is passed to load_vendor_list, and the vendors it returns are handed to the extractor. The results file is written to the same directory as this file. Required parameter.

max_mailboxes: Optional integer limiting the number of mailboxes to search. If None (default), all available mailboxes will be searched. Use this to limit scope for testing or performance reasons.

max_emails_per_mailbox: Integer specifying the maximum number of emails to retrieve from each mailbox for each vendor. Defaults to DEFAULT_MAX_EMAILS_PER_MAILBOX constant. Controls the depth of search per mailbox.

days_back: Integer specifying how many days in the past to search for emails. Defaults to DEFAULT_DAYS_BACK constant. Limits the time window for email retrieval.
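
As an illustration of how a days_back value bounds the search window, the sketch below turns it into a UTC cutoff and formats that cutoff as a Microsoft Graph style filter on receivedDateTime. How VendorEmailExtractor actually applies days_back is not shown in the source, so the helper names cutoff_for and graph_date_filter are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def cutoff_for(days_back: int, now: Optional[datetime] = None) -> datetime:
    """Return the earliest timestamp that still falls inside the window.

    Hypothetical helper: `now` is injectable so the result is reproducible.
    """
    if days_back < 0:
        raise ValueError("days_back must be non-negative")
    now = now or datetime.now(timezone.utc)
    return now - timedelta(days=days_back)


def graph_date_filter(cutoff: datetime) -> str:
    """Format a timezone-aware cutoff as an OData filter expression.

    Assumption: the extractor filters on receivedDateTime; the source
    does not confirm this.
    """
    stamp = cutoff.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    return f"receivedDateTime ge {stamp}"


if __name__ == "__main__":
    print(graph_date_filter(cutoff_for(30)))
```

A larger days_back widens the window and therefore the number of messages each mailbox query may return, which interacts with max_emails_per_mailbox.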

test_mode: Boolean flag that when True, limits processing to only the first 3 vendors in the list. Useful for testing the pipeline without processing the entire vendor list. Defaults to False.
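
The test-mode truncation is a plain list slice (vendors[:3] in the source), so a list with fewer than 3 vendors passes through unchanged. The sketch below generalizes that idea to an optional limit, where None means no limit. The helper name take is hypothetical, and whether the extractor applies max_mailboxes this way is an assumption.

```python
from typing import List, Optional, TypeVar

T = TypeVar("T")


def take(items: List[T], limit: Optional[int]) -> List[T]:
    """Return the first `limit` items, or a copy of all items when limit is None."""
    if limit is not None and limit < 0:
        raise ValueError("limit must be non-negative or None")
    # Slicing with None as the upper bound keeps the whole list.
    return items[:limit]


if __name__ == "__main__":
    vendors = ["Acme", "Globex", "Initech", "Umbrella"]
    print(take(vendors, 3))     # same effect as vendors[:3] in test mode
    print(take(vendors, None))  # no limit
```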

Return Value

This function does not return a value (implicitly returns None). Instead, it produces two side effects. (1) If the results are non-empty, it saves an Excel file with three columns: 'Vendor', 'Retained Emails' (comma-separated high-confidence emails), and 'Source Mailboxes' (where the emails were found). The file is named vendors_mailbox_emails_YYYYMMDD_HHMMSS.xlsx and is written to the same directory as the input file. If the results are empty, no file is written and a "No results to save" warning is printed. (2) It prints a banner and summary statistics to stdout: total vendors processed, vendors with at least one retained email, and the total email count. The total counts every listed address, so an address retained for several vendors is counted once per vendor.
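
The summary statistics can be reproduced outside the function from the 'Retained Emails' column alone. The following is a minimal sketch, not the author's code: it takes the column as an iterable of cell values, tolerates missing cells (None or NaN, which the original loop does not guard against), and additionally reports a de-duplicated count. The names split_emails and summarize are hypothetical.

```python
from typing import Dict, Iterable, List


def split_emails(cell: object) -> List[str]:
    """Split one comma-separated cell into trimmed, non-empty addresses."""
    if not isinstance(cell, str):
        return []  # None, NaN, or any other non-string cell counts as empty
    return [part.strip() for part in cell.split(",") if part.strip()]


def summarize(retained: Iterable[object]) -> Dict[str, int]:
    """Compute the printed summary figures plus a unique-address count."""
    rows = [split_emails(cell) for cell in retained]
    all_emails = [email.lower() for row in rows for email in row]
    return {
        "vendors_processed": len(rows),
        "vendors_with_emails": sum(1 for row in rows if row),
        "total_emails": len(all_emails),
        "unique_emails": len(set(all_emails)),
    }


if __name__ == "__main__":
    column = ["a@x.com, b@x.com", "", None, "A@x.com"]
    print(summarize(column))
```

With a DataFrame loaded from the output file, the column could be passed as results_df['Retained Emails'].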

Dependencies

  • pandas (reading and writing .xlsx files also requires an Excel engine such as openpyxl)
  • pathlib
  • typing
  • sys
  • argparse

Required Imports

import sys
import pandas as pd
from pathlib import Path
from typing import List, Optional
from vendor_email_extractor import VendorEmailExtractor
from vendor_email_config import TENANT_ID, CLIENT_ID, CLIENT_SECRET, OPENAI_API_KEY, DOMAIN, DEFAULT_DAYS_BACK, DEFAULT_MAX_EMAILS_PER_MAILBOX

Usage Example

# Assuming vendor_email_config.py is properly configured with credentials
# and load_vendor_list function is available

# Basic usage - process all vendors
extract_batch(
    vendor_excel_file='vendors_enriched.xlsx'
)

# Test mode - process only first 3 vendors
extract_batch(
    vendor_excel_file='vendors_enriched.xlsx',
    test_mode=True
)

# Custom parameters - limit scope and time window
extract_batch(
    vendor_excel_file='vendors_enriched.xlsx',
    max_mailboxes=10,
    max_emails_per_mailbox=50,
    days_back=30,
    test_mode=False
)

# Results will be saved to:
# vendors_mailbox_emails_YYYYMMDD_HHMMSS.xlsx
# in the same directory as the input file
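
Because every run writes a new timestamped file, repeated runs accumulate several result files in the input directory. The sketch below picks the most recent one; it relies on the YYYYMMDD_HHMMSS timestamp sorting chronologically when sorted as text. The helper name latest_results_file is hypothetical.

```python
from pathlib import Path
from typing import Optional


def latest_results_file(vendor_excel_file: str) -> Optional[Path]:
    """Return the newest vendors_mailbox_emails_*.xlsx next to the input file.

    Returns None when no results file exists yet.
    """
    folder = Path(vendor_excel_file).parent
    # Zero-padded timestamps sort lexicographically in chronological order.
    candidates = sorted(folder.glob("vendors_mailbox_emails_*.xlsx"))
    return candidates[-1] if candidates else None


if __name__ == "__main__":
    print(latest_results_file("vendors_enriched.xlsx"))
```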

Best Practices

  • Always test with test_mode=True first to verify the pipeline works correctly before processing all vendors
  • Ensure vendor_excel_file contains properly formatted vendor data that the VendorEmailExtractor can process
  • Monitor API rate limits for both Microsoft Graph API and OpenAI API during batch processing
  • Use max_mailboxes parameter to limit scope during initial testing or when working with large organizations
  • Resume is always enabled (resume=True is passed to extract_for_vendor_list), so an interrupted run can be restarted; how progress is stored and picked up is determined by VendorEmailExtractor
  • Review the output Excel file to verify email extraction quality before using the results
  • Consider adjusting days_back parameter based on vendor communication patterns (older vendors may need longer lookback)
  • Ensure sufficient disk space for the output Excel file, especially when processing many vendors
  • Only HIGH confidence emails appear in the results; the filtering happens inside VendorEmailExtractor, so review its configuration if too few results are returned
  • Check that all required credentials in vendor_email_config.py are valid before running batch extraction
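
One way to act on the credentials check is a small pre-flight step that reports which settings are empty before any API call is made. This is a sketch under assumptions: it only detects missing or blank values, not invalid ones, and the helper name missing_settings is hypothetical. In practice the dictionary would be built from the constants imported from vendor_email_config.

```python
from typing import Dict, List, Optional


def missing_settings(settings: Dict[str, Optional[str]]) -> List[str]:
    """Return the names of settings that are None, empty, or whitespace only."""
    return [name for name, value in settings.items() if not (value or "").strip()]


if __name__ == "__main__":
    # Placeholder values for illustration; real values come from vendor_email_config.
    settings = {
        "TENANT_ID": "example-tenant",
        "CLIENT_ID": "example-client",
        "CLIENT_SECRET": "",
        "OPENAI_API_KEY": None,
        "DOMAIN": "example.com",
    }
    missing = missing_settings(settings)
    if missing:
        print("Missing configuration values: " + ", ".join(missing))
```

Running this before extract_batch avoids starting a long batch that would fail at the first authenticated request.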

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function main_v28 75.5% similar

    Command-line entry point that parses arguments and orchestrates the extraction of vendor emails from all vicebio.com mailboxes using Microsoft Graph API.

    From: /tf/active/vicechatdev/find_email/extract_vendor_batch.py
  • function main_v27 73.5% similar

    Demonstrates example usage of the VendorEmailExtractor class by searching for vendor emails across Office 365 mailboxes and displaying results.

    From: /tf/active/vicechatdev/find_email/vendor_email_extractor.py
  • function test_email_search 63.0% similar

    Tests the email search functionality of a VendorEmailExtractor instance by searching for emails containing common business terms in the first available mailbox.

    From: /tf/active/vicechatdev/find_email/test_vendor_extractor.py
  • function run_all_tests 62.3% similar

    Orchestrates a comprehensive test suite for the Vendor Email Extractor system, verifying configuration, authentication, mailbox access, email search, and LLM connectivity.

    From: /tf/active/vicechatdev/find_email/test_vendor_extractor.py
  • class VendorEmailExtractor 62.2% similar

    Extract vendor email addresses from all organizational mailboxes

    From: /tf/active/vicechatdev/find_email/vendor_email_extractor.py