function extract_batch
Batch processes a list of vendors from an Excel file to extract their email addresses by searching through Microsoft 365 mailboxes using AI-powered email analysis.
/tf/active/vicechatdev/find_email/extract_vendor_batch.py
44 - 115
complex
Purpose
This function orchestrates a complete vendor email extraction workflow. It loads vendor data from an Excel file, initializes a VendorEmailExtractor with Microsoft Graph API and OpenAI credentials, searches through organizational mailboxes for vendor-related emails, extracts high-confidence vendor email addresses using AI analysis, and saves the results to a timestamped Excel file. It supports test mode for processing a subset of vendors, resume capability for interrupted runs, and provides detailed progress reporting and summary statistics.
Source Code
def extract_batch(
    vendor_excel_file: str,
    max_mailboxes: Optional[int] = None,
    max_emails_per_mailbox: int = DEFAULT_MAX_EMAILS_PER_MAILBOX,
    days_back: int = DEFAULT_DAYS_BACK,
    test_mode: bool = False
):
    """
    Extract vendor emails for all vendors in the list

    Args:
        vendor_excel_file: Path to enriched vendor Excel file
        max_mailboxes: Limit mailboxes searched (None = all)
        max_emails_per_mailbox: Max emails per mailbox per vendor
        days_back: Days to search back
        test_mode: If True, only process first 3 vendors
    """
    print("\n" + "="*60)
    print("VENDOR EMAIL BATCH EXTRACTION")
    print("="*60)

    # Load vendors
    vendors = load_vendor_list(vendor_excel_file)

    if test_mode:
        print("\n⚠️ TEST MODE: Processing only first 3 vendors")
        vendors = vendors[:3]

    # Create extractor
    extractor = VendorEmailExtractor(
        tenant_id=TENANT_ID,
        client_id=CLIENT_ID,
        client_secret=CLIENT_SECRET,
        openai_api_key=OPENAI_API_KEY,
        domain=DOMAIN
    )

    # Extract emails for all vendors
    results_df = extractor.extract_for_vendor_list(
        vendor_list=vendors,
        max_mailboxes=max_mailboxes,
        max_emails_per_mailbox=max_emails_per_mailbox,
        days_back=days_back,
        resume=True  # Allow resuming if interrupted
    )

    # Results are already in 3-column format: Vendor, Retained Emails, Source Mailboxes
    # Save the results
    if not results_df.empty:
        output_file = Path(vendor_excel_file).parent / f"vendors_mailbox_emails_{pd.Timestamp.now().strftime('%Y%m%d_%H%M%S')}.xlsx"
        results_df.to_excel(output_file, index=False)
        print(f"\n✅ Results saved: {output_file}")

        # Show summary
        print("\n" + "="*60)
        print("EXTRACTION SUMMARY")
        print("="*60)
        print(f"Total vendors processed: {len(results_df)}")
        vendors_with_emails = results_df[results_df['Retained Emails'].str.len() > 0]
        print(f"Vendors with HIGH confidence emails: {len(vendors_with_emails)}")

        # Count total unique emails
        total_emails = 0
        for emails_str in results_df['Retained Emails']:
            if emails_str:
                total_emails += len([e.strip() for e in emails_str.split(',') if e.strip()])
        print(f"Total HIGH confidence emails found: {total_emails}")
        print("="*60 + "\n")
    else:
        print("\n⚠️ No results to save")
        print("="*60 + "\n")
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| vendor_excel_file | str | - | positional_or_keyword |
| max_mailboxes | Optional[int] | None | positional_or_keyword |
| max_emails_per_mailbox | int | DEFAULT_MAX_EMAILS_PER_MAILBOX | positional_or_keyword |
| days_back | int | DEFAULT_DAYS_BACK | positional_or_keyword |
| test_mode | bool | False | positional_or_keyword |
Parameter Details
vendor_excel_file: String path to an Excel file containing enriched vendor data. This file should have vendor information that the extractor can use to identify and search for vendor-related emails. Required parameter.
max_mailboxes: Optional integer limiting the number of mailboxes to search. If None (default), all available mailboxes will be searched. Use this to limit scope for testing or performance reasons.
max_emails_per_mailbox: Integer specifying the maximum number of emails to retrieve from each mailbox for each vendor. Defaults to the DEFAULT_MAX_EMAILS_PER_MAILBOX constant imported from vendor_email_config. Controls the depth of search per mailbox.
days_back: Integer specifying how many days in the past to search for emails. Defaults to the DEFAULT_DAYS_BACK constant imported from vendor_email_config. Limits the time window for email retrieval.
test_mode: Boolean flag that when True, limits processing to only the first 3 vendors in the list. Useful for testing the pipeline without processing the entire vendor list. Defaults to False.
Return Value
This function does not return a value (it implicitly returns None). Instead, it produces two side effects. (1) If the extractor returns a non-empty result, the function saves an Excel file with three columns: 'Vendor', 'Retained Emails' (comma-separated high-confidence emails), and 'Source Mailboxes' (where the emails were found). The file is written to the same directory as the input file and named vendors_mailbox_emails_YYYYMMDD_HHMMSS.xlsx. If the result is empty, no file is written and the function prints "No results to save". (2) It prints progress information to stdout and, when results exist, a summary with the total vendors processed, the number of vendors with at least one retained email, and the total email count.
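The summary figures can be recomputed from the 'Retained Emails' column alone. The following is a minimal sketch, assuming each cell holds a comma-separated string that may be empty. The helper names `count_emails` and `summarize` are hypothetical and not part of the module, and the guard for non-string cells is an added assumption for files read back from Excel, where empty cells typically come back as missing values. As in the function's own summary, the total is not de-duplicated across vendors.

```python
from typing import Iterable, Optional, Tuple


def count_emails(cell: Optional[object]) -> int:
    """Count non-empty, comma-separated entries in one 'Retained Emails' cell."""
    # Assumption: cells read back from Excel may be None or float NaN, not "".
    if not isinstance(cell, str):
        return 0
    return len([part for part in (p.strip() for p in cell.split(",")) if part])


def summarize(cells: Iterable[Optional[object]]) -> Tuple[int, int, int]:
    """Return (vendors processed, vendors with emails, total emails)."""
    counts = [count_emails(cell) for cell in cells]
    return len(counts), sum(1 for n in counts if n > 0), sum(counts)


# Hypothetical column values for three vendors.
column = ["a@acme.example, b@acme.example", "", "sales@globex.example"]
print(summarize(column))
```

With a real output file, the same helper could be applied to the column after loading it with pandas, for example `summarize(df['Retained Emails'])`.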
Dependencies
- pandas
- pathlib
- typing
- sys
- argparse
Required Imports
import sys
import pandas as pd
from pathlib import Path
from typing import List, Optional
from vendor_email_extractor import VendorEmailExtractor
from vendor_email_config import TENANT_ID, CLIENT_ID, CLIENT_SECRET, OPENAI_API_KEY, DOMAIN, DEFAULT_DAYS_BACK, DEFAULT_MAX_EMAILS_PER_MAILBOX
Usage Example
# Assuming vendor_email_config.py is properly configured with credentials
# and load_vendor_list function is available

# Basic usage - process all vendors
extract_batch(
    vendor_excel_file='vendors_enriched.xlsx'
)

# Test mode - process only first 3 vendors
extract_batch(
    vendor_excel_file='vendors_enriched.xlsx',
    test_mode=True
)

# Custom parameters - limit scope and time window
extract_batch(
    vendor_excel_file='vendors_enriched.xlsx',
    max_mailboxes=10,
    max_emails_per_mailbox=50,
    days_back=30,
    test_mode=False
)

# Results will be saved to:
# vendors_mailbox_emails_YYYYMMDD_HHMMSS.xlsx
# in the same directory as the input file
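Since argparse appears among the dependencies, the script is presumably driven from the command line, but its actual parser is not shown here. The sketch below is one way such a wrapper might map arguments onto extract_batch's parameters. The flag names and the helpers `build_parser` and `to_kwargs` are assumptions for illustration only.

```python
import argparse
from typing import Any, Dict, List, Optional


def build_parser() -> argparse.ArgumentParser:
    """Build a parser whose options mirror extract_batch's parameters."""
    parser = argparse.ArgumentParser(description="Batch vendor email extraction")
    parser.add_argument("vendor_excel_file", help="Path to enriched vendor Excel file")
    parser.add_argument("--max-mailboxes", type=int, default=None,
                        help="Limit mailboxes searched (default: all)")
    parser.add_argument("--max-emails-per-mailbox", type=int, default=None,
                        help="Max emails per mailbox per vendor")
    parser.add_argument("--days-back", type=int, default=None,
                        help="Days to search back")
    parser.add_argument("--test", dest="test_mode", action="store_true",
                        help="Only process the first 3 vendors")
    return parser


def to_kwargs(argv: Optional[List[str]] = None) -> Dict[str, Any]:
    """Turn command-line arguments into keyword arguments for extract_batch."""
    args = build_parser().parse_args(argv)
    kwargs: Dict[str, Any] = {
        "vendor_excel_file": args.vendor_excel_file,
        "max_mailboxes": args.max_mailboxes,
        "test_mode": args.test_mode,
    }
    # Forward only the options the user set, so that extract_batch's own
    # defaults (DEFAULT_MAX_EMAILS_PER_MAILBOX, DEFAULT_DAYS_BACK) still apply.
    if args.max_emails_per_mailbox is not None:
        kwargs["max_emails_per_mailbox"] = args.max_emails_per_mailbox
    if args.days_back is not None:
        kwargs["days_back"] = args.days_back
    return kwargs


# Example with an explicit argument list instead of sys.argv.
print(to_kwargs(["vendors_enriched.xlsx", "--test", "--days-back", "30"]))
```

In a script entry point, the result could be passed straight through as `extract_batch(**to_kwargs())`. Leaving unset options out of the dictionary keeps the defaults defined in one place, the configuration module.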
Best Practices
- Always test with test_mode=True first to verify the pipeline works correctly before processing all vendors
- Ensure vendor_excel_file contains properly formatted vendor data that the VendorEmailExtractor can process
- Monitor API rate limits for both Microsoft Graph API and OpenAI API during batch processing
- Use max_mailboxes parameter to limit scope during initial testing or when working with large organizations
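- The function always passes resume=True to extract_for_vendor_list, so an interrupted run can be restarted; how progress is persisted between runs is determined by VendorEmailExtractor, not by this function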
- Review the output Excel file to verify email extraction quality before using the results
- Consider adjusting days_back parameter based on vendor communication patterns (older vendors may need longer lookback)
- Ensure sufficient disk space for the output Excel file, especially when processing many vendors
- The function only retains HIGH confidence emails - review VendorEmailExtractor configuration if too few results are returned
- Check that all required credentials in vendor_email_config.py are valid before running batch extraction
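The credential check in the last point can be partly automated before a long batch run. This is a minimal sketch under stated assumptions: it only verifies that the five settings passed to VendorEmailExtractor are present and non-blank, and it cannot tell whether they are actually accepted by Microsoft Graph or OpenAI. The helper `missing_settings` and the example values are hypothetical.

```python
from typing import Any, List, Mapping

# The five names extract_batch reads from vendor_email_config.
REQUIRED_SETTINGS = ("TENANT_ID", "CLIENT_ID", "CLIENT_SECRET", "OPENAI_API_KEY", "DOMAIN")


def missing_settings(settings: Mapping[str, Any]) -> List[str]:
    """Return the names of required settings that are absent or blank."""
    return [
        name for name in REQUIRED_SETTINGS
        if not str(settings.get(name) or "").strip()
    ]


# Hypothetical configuration with one blank and one absent value.
example = {
    "TENANT_ID": "tenant-placeholder",
    "CLIENT_ID": "client-placeholder",
    "CLIENT_SECRET": "   ",
    "OPENAI_API_KEY": "key-placeholder",
}
print(missing_settings(example))
```

Against the real configuration, the same check could be run as `missing_settings(vars(vendor_email_config))` and the batch aborted if the returned list is not empty.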
Similar Components
AI-powered semantic similarity - components with related functionality:
- function main_v28 (75.5% similar)
- function main_v27 (73.5% similar)
- function test_email_search (63.0% similar)
- function run_all_tests (62.3% similar)
- class VendorEmailExtractor (62.2% similar)