function generate_failure_report
Analyzes processing results from a JSON file, generates a comprehensive failure report with statistics and error categorization, and exports detailed failure information to a CSV file.
File: /tf/active/vicechatdev/mailsearch/generate_failure_report.py
Lines: 11 - 134
Complexity: moderate
Purpose
This function is designed for post-processing analysis of document processing results. It reads a JSON file containing processing outcomes, separates successful and failed operations, categorizes failures by error type, generates console output with statistics and detailed failure information, saves a CSV report of all failures, and provides actionable recommendations based on common error patterns. It's particularly useful for debugging batch document processing operations and identifying systematic issues.
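The expected input can be sketched as follows. This is an illustrative sample, not taken from real data: only the 'success' key drives the success/failure split, while the remaining keys are read defensively with .get().

```python
import json

# Illustrative sketch of the results JSON the function expects: a list of
# per-document dicts. Only 'success' determines the success/failure split.
results = [
    {"success": True, "filename": "doc1.pdf"},
    {
        "success": False,
        "filename": "doc2.pdf",
        "pdf_path": "/path/to/doc2.pdf",
        "error": "File not found: /path/to/doc2.pdf",
        "email_date": "2024-01-16",
        "email_subject": "Report",
        "sender": "sender2@example.com",
        "processing_date": "2024-01-20",
    },
]

# The same split the function performs internally
failures = [r for r in results if not r.get("success", False)]
# failures contains only the doc2.pdf record
```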
Source Code
def generate_failure_report(results_json: str, output_dir: str = "./output"):
    """Generate detailed failure analysis report"""
    output_dir = Path(output_dir)

    # Load results
    with open(results_json, 'r') as f:
        results = json.load(f)

    # Separate successes and failures
    successes = [r for r in results if r.get('success', False)]
    failures = [r for r in results if not r.get('success', False)]

    print(f"\n{'='*80}")
    print(f"Document Analyzer - Failure Report")
    print(f"{'='*80}\n")
    print(f"Total documents: {len(results)}")
    print(f"Successful: {len(successes)} ({len(successes)/len(results)*100:.1f}%)")
    print(f"Failed: {len(failures)} ({len(failures)/len(results)*100:.1f}%)")
    print(f"\n{'='*80}\n")

    if not failures:
        print("✓ All documents processed successfully!")
        return

    # Analyze failure reasons
    error_types = Counter()
    error_details = {}
    for fail in failures:
        error = fail.get('error', 'Unknown error')
        error_type = error.split(':')[0] if ':' in error else error
        error_types[error_type] += 1
        if error_type not in error_details:
            error_details[error_type] = []
        error_details[error_type].append({
            'filename': fail.get('filename', 'Unknown'),
            'error': error,
            'pdf_path': fail.get('pdf_path', ''),
            'email_date': fail.get('email_date', ''),
            'email_subject': fail.get('email_subject', '')
        })

    # Print summary by error type
    print("Failure Summary by Error Type:")
    print("-" * 80)
    for error_type, count in error_types.most_common():
        print(f"  {error_type}: {count} documents ({count/len(failures)*100:.1f}% of failures)")
    print(f"\n{'='*80}\n")

    # Print detailed failures by type
    print("Detailed Failure Analysis:")
    print("-" * 80)
    for error_type, docs in sorted(error_details.items()):
        print(f"\n[{error_type}] - {len(docs)} documents:")
        print("-" * 80)
        for i, doc in enumerate(docs[:10], 1):  # Show first 10 of each type
            print(f"\n  {i}. {doc['filename']}")
            print(f"     Path: {doc['pdf_path']}")
            print(f"     Error: {doc['error']}")
            print(f"     Email Date: {doc['email_date']}")
            if doc['email_subject']:
                print(f"     Subject: {doc['email_subject'][:80]}...")
        if len(docs) > 10:
            print(f"\n  ... and {len(docs) - 10} more documents with this error type")

    # Save detailed failures to CSV
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    csv_path = output_dir / f"{timestamp}_failure_report.csv"
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        fieldnames = [
            'filename', 'pdf_path', 'email_date', 'email_subject',
            'sender', 'error', 'error_type', 'processing_date'
        ]
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        for fail in failures:
            error = fail.get('error', 'Unknown error')
            error_type = error.split(':')[0] if ':' in error else error
            writer.writerow({
                'filename': fail.get('filename', ''),
                'pdf_path': fail.get('pdf_path', ''),
                'email_date': fail.get('email_date', ''),
                'email_subject': fail.get('email_subject', ''),
                'sender': fail.get('sender', ''),
                'error': error,
                'error_type': error_type,
                'processing_date': fail.get('processing_date', '')
            })

    print(f"\n{'='*80}")
    print(f"Detailed failure report saved to: {csv_path}")
    print(f"{'='*80}\n")

    # Provide recommendations
    print("\nRecommendations:")
    print("-" * 80)
    if "File not found" in error_types:
        print("• File not found errors:")
        print("  - Check if files were moved or deleted")
        print("  - Verify download_register.csv paths are correct")
        print("  - Consider cleaning up register entries for missing files")
    if "Insufficient text extracted" in error_types:
        print("• Insufficient text errors:")
        print("  - Files may be image-only PDFs with poor quality")
        print("  - Try increasing OCR DPI (currently 400)")
        print("  - Check if files are corrupted")
    if "Error extracting text" in error_types or "OCR" in str(error_types):
        print("• OCR/extraction errors:")
        print("  - Files may be corrupted or encrypted")
        print("  - Try manual inspection of these files")
        print("  - Consider alternative PDF processing tools")
    print()
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| results_json | str | - | positional_or_keyword |
| output_dir | str | './output' | positional_or_keyword |
Parameter Details
results_json: Path to a JSON file containing processing results. Expected to be a list of dictionaries where each dictionary represents a document processing result with keys like 'success' (boolean), 'error' (string), 'filename', 'pdf_path', 'email_date', 'email_subject', 'sender', and 'processing_date'. The file must exist and be readable.
output_dir: Directory path where the failure report CSV will be saved. Defaults to './output'. The directory will be used as-is (not created if it doesn't exist), so it should exist beforehand. The CSV filename will be auto-generated with a timestamp prefix.
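Because the function does not create the directory itself, a safe pattern is to create it just before the call. A minimal sketch (the temporary location here is purely illustrative):

```python
import tempfile
from pathlib import Path

# Sketch: create output_dir up front, since generate_failure_report writes
# its CSV into the directory without creating it. A temp location is used
# here only so the example is self-contained.
output_dir = Path(tempfile.mkdtemp()) / "output"
output_dir.mkdir(parents=True, exist_ok=True)
```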
Return Value
Returns None. The function produces side effects: prints a formatted report to stdout and writes a CSV file to the specified output directory. The CSV file contains columns: filename, pdf_path, email_date, email_subject, sender, error, error_type, and processing_date.
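Since the return value is None, any downstream analysis works off the CSV. A minimal sketch of reading a report back and tallying error types, using an in-memory sample in place of a real report file:

```python
import csv
import io
from collections import Counter

# In-memory stand-in for a generated failure report (real files carry a
# timestamp prefix in their name and the full eight-column schema).
sample = (
    "filename,error_type\n"
    "doc2.pdf,File not found\n"
    "doc3.pdf,File not found\n"
    "doc4.pdf,OCR failed\n"
)
rows = list(csv.DictReader(io.StringIO(sample)))
counts = Counter(row["error_type"] for row in rows)
# counts: {'File not found': 2, 'OCR failed': 1}
```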
Dependencies
json, csv, pathlib, datetime, collections
Required Imports
import json
import csv
from pathlib import Path
from datetime import datetime
from collections import Counter
Usage Example
import json
import csv
from pathlib import Path
from datetime import datetime
from collections import Counter
# Ensure output directory exists
output_dir = Path('./output')
output_dir.mkdir(exist_ok=True)
# Create sample results JSON file
results = [
{
'success': True,
'filename': 'doc1.pdf',
'pdf_path': '/path/to/doc1.pdf',
'email_date': '2024-01-15',
'email_subject': 'Invoice',
'sender': 'sender@example.com',
'processing_date': '2024-01-20'
},
{
'success': False,
'filename': 'doc2.pdf',
'pdf_path': '/path/to/doc2.pdf',
'error': 'File not found: /path/to/doc2.pdf',
'email_date': '2024-01-16',
'email_subject': 'Report',
'sender': 'sender2@example.com',
'processing_date': '2024-01-20'
}
]
with open('results.json', 'w') as f:
json.dump(results, f)
# Generate failure report
generate_failure_report('results.json', './output')
# This will print statistics to console and create a CSV file
# in ./output/ with name like '20240120_143022_failure_report.csv'
Best Practices
- Ensure the output directory exists before calling this function, as it does not create it automatically
- The results_json file should follow the expected schema with 'success', 'error', 'filename', 'pdf_path', 'email_date', 'email_subject', 'sender', and 'processing_date' keys
- The function limits console output to the first 10 documents per error type to avoid overwhelming output, but all failures are saved to the CSV
- Error types are extracted by splitting on ':' character, so error messages should follow the pattern 'ErrorType: detailed message' for proper categorization
- The function handles missing keys gracefully with .get() methods and default values, but providing complete data yields better reports
- Review the recommendations section in the output for actionable insights based on detected error patterns
- The generated CSV file includes a timestamp in its name to prevent overwriting previous reports
- An empty results list will raise a ZeroDivisionError when the success/failure percentages are computed, so only call the function when at least one result exists
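The categorization rule mentioned in the bullets above can be sketched in isolation: everything before the first ':' becomes the error type, and messages without a colon are used verbatim.

```python
from collections import Counter

# Sketch of the error-type extraction rule: split on the first ':',
# falling back to the whole message when no colon is present.
errors = [
    "File not found: /data/a.pdf",
    "File not found: /data/b.pdf",
    "Insufficient text extracted",
]
types = Counter(e.split(":")[0] if ":" in e else e for e in errors)
# types: {'File not found': 2, 'Insufficient text extracted': 1}
```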
Similar Components
AI-powered semantic similarity - components with related functionality:
- function generate_report (66.5% similar)
- function create_csv_report (57.4% similar)
- function create_csv_report_improved (56.4% similar)
- function save_results_v1 (55.7% similar)
- function test_extraction_debugging (55.0% similar)