🔍 Code Extractor

function generate_failure_report

Maturity: 48

Analyzes processing results from a JSON file, generates a comprehensive failure report with statistics and error categorization, and exports detailed failure information to a CSV file.

File:
/tf/active/vicechatdev/mailsearch/generate_failure_report.py
Lines:
11 - 134
Complexity:
moderate

Purpose

This function is designed for post-processing analysis of document processing results. It reads a JSON file containing processing outcomes, separates successful and failed operations, and categorizes failures by error type. It then prints statistics and detailed failure information to the console, saves a CSV report of all failures, and offers actionable recommendations based on common error patterns. It is particularly useful for debugging batch document processing operations and identifying systematic issues.

Source Code

def generate_failure_report(results_json: str, output_dir: str = "./output"):
    """Generate detailed failure analysis report"""
    
    output_dir = Path(output_dir)
    
    # Load results
    with open(results_json, 'r') as f:
        results = json.load(f)
    
    # Separate successes and failures
    successes = [r for r in results if r.get('success', False)]
    failures = [r for r in results if not r.get('success', False)]
    
    print(f"\n{'='*80}")
    print(f"Document Analyzer - Failure Report")
    print(f"{'='*80}\n")
    print(f"Total documents: {len(results)}")
    print(f"Successful: {len(successes)} ({len(successes)/len(results)*100:.1f}%)")
    print(f"Failed: {len(failures)} ({len(failures)/len(results)*100:.1f}%)")
    print(f"\n{'='*80}\n")
    
    if not failures:
        print("✓ All documents processed successfully!")
        return
    
    # Analyze failure reasons
    error_types = Counter()
    error_details = {}
    
    for fail in failures:
        error = fail.get('error', 'Unknown error')
        error_type = error.split(':')[0] if ':' in error else error
        error_types[error_type] += 1
        
        if error_type not in error_details:
            error_details[error_type] = []
        error_details[error_type].append({
            'filename': fail.get('filename', 'Unknown'),
            'error': error,
            'pdf_path': fail.get('pdf_path', ''),
            'email_date': fail.get('email_date', ''),
            'email_subject': fail.get('email_subject', '')
        })
    
    # Print summary by error type
    print("Failure Summary by Error Type:")
    print("-" * 80)
    for error_type, count in error_types.most_common():
        print(f"  {error_type}: {count} documents ({count/len(failures)*100:.1f}% of failures)")
    
    print(f"\n{'='*80}\n")
    
    # Print detailed failures by type
    print("Detailed Failure Analysis:")
    print("-" * 80)
    
    for error_type, docs in sorted(error_details.items()):
        print(f"\n[{error_type}] - {len(docs)} documents:")
        print("-" * 80)
        for i, doc in enumerate(docs[:10], 1):  # Show first 10 of each type
            print(f"\n  {i}. {doc['filename']}")
            print(f"     Path: {doc['pdf_path']}")
            print(f"     Error: {doc['error']}")
            print(f"     Email Date: {doc['email_date']}")
            if doc['email_subject']:
                print(f"     Subject: {doc['email_subject'][:80]}...")
        
        if len(docs) > 10:
            print(f"\n  ... and {len(docs) - 10} more documents with this error type")
    
    # Save detailed failures to CSV
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_path = output_dir / f"{timestamp}_failure_report.csv"
    
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        fieldnames = [
            'filename', 'pdf_path', 'email_date', 'email_subject', 
            'sender', 'error', 'error_type', 'processing_date'
        ]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        
        for fail in failures:
            error = fail.get('error', 'Unknown error')
            error_type = error.split(':')[0] if ':' in error else error
            
            writer.writerow({
                'filename': fail.get('filename', ''),
                'pdf_path': fail.get('pdf_path', ''),
                'email_date': fail.get('email_date', ''),
                'email_subject': fail.get('email_subject', ''),
                'sender': fail.get('sender', ''),
                'error': error,
                'error_type': error_type,
                'processing_date': fail.get('processing_date', '')
            })
    
    print(f"\n{'='*80}")
    print(f"Detailed failure report saved to: {csv_path}")
    print(f"{'='*80}\n")
    
    # Provide recommendations
    print("\nRecommendations:")
    print("-" * 80)
    
    if "File not found" in error_types:
        print("• File not found errors:")
        print("  - Check if files were moved or deleted")
        print("  - Verify download_register.csv paths are correct")
        print("  - Consider cleaning up register entries for missing files")
    
    if "Insufficient text extracted" in error_types:
        print("• Insufficient text errors:")
        print("  - Files may be image-only PDFs with poor quality")
        print("  - Try increasing OCR DPI (currently 400)")
        print("  - Check if files are corrupted")
    
    if "Error extracting text" in error_types or "OCR" in str(error_types):
        print("• OCR/extraction errors:")
        print("  - Files may be corrupted or encrypted")
        print("  - Try manual inspection of these files")
        print("  - Consider alternative PDF processing tools")
    
    print()

Parameters

Name          Type  Default     Kind
results_json  str   -           positional_or_keyword
output_dir    str   './output'  positional_or_keyword

Parameter Details

results_json: Path to a JSON file containing processing results. Expected to be a list of dictionaries where each dictionary represents a document processing result with keys like 'success' (boolean), 'error' (string), 'filename', 'pdf_path', 'email_date', 'email_subject', 'sender', and 'processing_date'. The file must exist and be readable.

output_dir: Directory path where the failure report CSV will be saved. Defaults to './output'. The directory will be used as-is (not created if it doesn't exist), so it should exist beforehand. The CSV filename will be auto-generated with a timestamp prefix.
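Because the function uses the directory as-is, a caller can guard against a missing output directory with a short pathlib snippet (a minimal sketch; the './output' path matches the function's default):

```python
from pathlib import Path

# The function writes into output_dir without creating it, so create it up front
output_dir = Path('./output')
output_dir.mkdir(parents=True, exist_ok=True)
```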

Return Value

Returns None. The function produces side effects: prints a formatted report to stdout and writes a CSV file to the specified output directory. The CSV file contains columns: filename, pdf_path, email_date, email_subject, sender, error, error_type, and processing_date.
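Since the CSV schema is fixed, downstream tooling can consume the report with csv.DictReader. A minimal sketch (the sample rows and file name here are illustrative, not produced by the function):

```python
import csv
from collections import Counter

# Sample rows mimicking two of the report's columns (hypothetical data)
sample = [
    {'filename': 'a.pdf', 'error_type': 'File not found'},
    {'filename': 'b.pdf', 'error_type': 'File not found'},
    {'filename': 'c.pdf', 'error_type': 'Insufficient text extracted'},
]

csv_path = 'sample_failure_report.csv'
with open(csv_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['filename', 'error_type'])
    writer.writeheader()
    writer.writerows(sample)

# Read the report back and tally failures per error type
with open(csv_path, newline='', encoding='utf-8') as f:
    by_type = Counter(row['error_type'] for row in csv.DictReader(f))

print(by_type.most_common())  # [('File not found', 2), ('Insufficient text extracted', 1)]
```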

Dependencies

  • json
  • csv
  • pathlib
  • datetime
  • collections

Required Imports

import json
import csv
from pathlib import Path
from datetime import datetime
from collections import Counter

Usage Example

import json
import csv
from pathlib import Path
from datetime import datetime
from collections import Counter

# Ensure output directory exists
output_dir = Path('./output')
output_dir.mkdir(exist_ok=True)

# Create sample results JSON file
results = [
    {
        'success': True,
        'filename': 'doc1.pdf',
        'pdf_path': '/path/to/doc1.pdf',
        'email_date': '2024-01-15',
        'email_subject': 'Invoice',
        'sender': 'sender@example.com',
        'processing_date': '2024-01-20'
    },
    {
        'success': False,
        'filename': 'doc2.pdf',
        'pdf_path': '/path/to/doc2.pdf',
        'error': 'File not found: /path/to/doc2.pdf',
        'email_date': '2024-01-16',
        'email_subject': 'Report',
        'sender': 'sender2@example.com',
        'processing_date': '2024-01-20'
    }
]

with open('results.json', 'w') as f:
    json.dump(results, f)

# Generate failure report
generate_failure_report('results.json', './output')

# This will print statistics to console and create a CSV file
# in ./output/ with name like '20240120_143022_failure_report.csv'

Best Practices

  • Ensure the output directory exists before calling this function, as it does not create it automatically
  • The results_json file should follow the expected schema with 'success', 'error', 'filename', 'pdf_path', 'email_date', 'email_subject', 'sender', and 'processing_date' keys
  • The function limits console output to the first 10 documents per error type to avoid overwhelming output, but all failures are saved to the CSV
  • Error types are extracted by splitting on ':' character, so error messages should follow the pattern 'ErrorType: detailed message' for proper categorization
  • The function handles missing keys gracefully with .get() methods and default values, but providing complete data yields better reports
  • An empty results list will raise a ZeroDivisionError when the success percentage is computed, so ensure results_json contains at least one entry
  • Review the recommendations section in the output for actionable insights based on detected error patterns
  • The generated CSV file includes a timestamp in its name to prevent overwriting previous reports
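The error-categorization rule noted above can be checked in isolation. This helper (a hypothetical name, mirroring the inline expression used inside generate_failure_report) shows how error strings map to error_type buckets:

```python
def classify(error: str) -> str:
    # Everything before the first ':' becomes the error_type bucket,
    # matching the inline expression in generate_failure_report
    return error.split(':')[0] if ':' in error else error

print(classify('File not found: /path/to/doc2.pdf'))           # File not found
print(classify('Error extracting text: page 3 is corrupted'))  # Error extracting text
print(classify('Timeout'))                                     # Timeout
```

Messages without a ':' fall through unchanged, which is why free-form errors each become their own category in the summary.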

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function generate_report 66.5% similar

    Generates a text-based statistical analysis report from session data and saves it to the configured reports folder.

    From: /tf/active/vicechatdev/full_smartstat/app.py
  • function create_csv_report 57.4% similar

    Creates two CSV reports (summary and detailed) from warranty data, writing warranty information to files with different levels of detail.

    From: /tf/active/vicechatdev/convert_disclosures_to_table.py
  • function create_csv_report_improved 56.4% similar

    Creates two CSV reports from warranty data: a summary report with key fields and a detailed report with all fields including full disclosures.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function save_results_v1 55.7% similar

    Saves a list of dictionary results to both CSV and JSON file formats with UTF-8 encoding.

    From: /tf/active/vicechatdev/mailsearch/enhanced_document_comparison.py
  • function test_extraction_debugging 55.0% similar

    A test function that validates the extraction debugging functionality of a DocumentProcessor by creating test files, simulating document extraction, and verifying debug log creation.

    From: /tf/active/vicechatdev/vice_ai/test_extraction_debug.py