extract_warranty_data_improved

function extract_warranty_data_improved

Maturity: 49

Parses markdown-formatted warranty documentation to extract structured warranty data including IDs, titles, sections, disclosure text, and reference citations.

File:
/tf/active/vicechatdev/improved_convert_disclosures_to_table.py

Lines:
75 - 165

Complexity:
complex

Purpose

This function processes markdown content in which each warranty is introduced by a '## [ID] - [Title]' header. For every warranty it extracts the metadata (ID, title, section name, source document count), the warranty text, and the disclosure content, and it records the reference numbers cited in the disclosure. The header pattern accepts IDs with up to three chained parenthetical suffixes, such as '1.1(a)(i)(Note)'. Escaped newlines are normalized before parsing, inline reference citations are preserved in the full disclosure, and a trailing References section is split off and returned separately. It is designed for document processing pipelines that convert semi-structured warranty documentation into structured data for analysis, reporting, or database storage.
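The header pattern from the source code can be tried in isolation to see which IDs it accepts. The sketch below copies the regular expression verbatim; the sample headers are hypothetical and exist only for illustration.

```python
import re

# Header pattern copied from the function's source code.
WARRANTY_HEADER = re.compile(
    r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'
)

# Hypothetical headers, for illustration only.
sample = (
    "## 1.1 - Limited Warranty\n"
    "## 2.3(a) - Title to Assets\n"
    "## 2.3(a)(ii) - Encumbrances\n"
    "## 4.1(b)(iv)(Note) - Tax Filings\n"
    "## References\n"
)

# Each match yields (warranty ID, title).
headers = [(m.group(1), m.group(2)) for m in WARRANTY_HEADER.finditer(sample)]
for warranty_id, title in headers:
    print(warranty_id, "|", title)
```

The '## References' line is not matched because the ID part must start with digits or dots, which is why the function has to locate that section with a separate search.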

Source Code

def extract_warranty_data_improved(markdown_content):
    """Extract warranty data from improved markdown content with proper references."""
    warranties = []
    
    # First, normalize the content by converting escaped newlines to actual newlines
    normalized_content = markdown_content.replace('\\n', '\n')
    
    # Find all warranty sections using a more flexible pattern
    # Look for ## followed by warranty ID - Title pattern
    warranty_pattern = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'
    
    # Find all warranty sections
    warranty_matches = list(re.finditer(warranty_pattern, normalized_content))
    logger.info(f"Found {len(warranty_matches)} warranty sections")
    
    # Extract references section for later use
    references_section = ""
    ref_match = re.search(r'\n## References\n(.+)$', normalized_content, re.DOTALL)
    if ref_match:
        references_section = ref_match.group(1)
        logger.info("Found references section")
    else:
        logger.warning("References section not found")
    
    for i, match in enumerate(warranty_matches):
        warranty_id = match.group(1).strip()
        warranty_title = match.group(2).strip()
        
        # Find the content between this warranty and the next one (or end of file)
        start_pos = match.end()
        if i + 1 < len(warranty_matches):
            end_pos = warranty_matches[i + 1].start()
            content = normalized_content[start_pos:end_pos]
        else:
            # For the last warranty, stop at the references section
            if ref_match:
                end_pos = ref_match.start()
                content = normalized_content[start_pos:end_pos]
            else:
                content = normalized_content[start_pos:]
        
        logger.info(f"Processing warranty: {warranty_id} - {warranty_title}")
        
        # Extract section name (look for **Section**: pattern)
        section_match = re.search(r'\*\*Section\*\*:\s*(.+?)(?:\n|\*\*)', content)
        section_name = section_match.group(1).strip() if section_match else ""
        
        # Extract source documents count
        source_docs_match = re.search(r'\*\*Source Documents Found\*\*:\s*(\d+)', content)
        source_docs_count = source_docs_match.group(1) if source_docs_match else "0"
        
        # Extract warranty text (between ### Warranty Text and ### Disclosure)
        warranty_text_match = re.search(r'### Warranty Text\s*\n\n(.+?)\n\n### Disclosure', content, re.DOTALL)
        warranty_text = clean_text(warranty_text_match.group(1)) if warranty_text_match else ""
        
        # Extract disclosure content (everything after ### Disclosure until next --- or end)
        disclosure_match = re.search(r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)', content, re.DOTALL)
        disclosure_content = disclosure_match.group(1) if disclosure_match else ""
        
        # If disclosure_content is empty, try a more relaxed pattern
        if not disclosure_content:
            disclosure_match = re.search(r'### Disclosure\s*\n(.+?)(?=\n---\n|$)', content, re.DOTALL)
            disclosure_content = disclosure_match.group(1) if disclosure_match else ""
        
        # Clean disclosure content but preserve references
        if disclosure_content:
            # Don't apply markdown cleaning to preserve inline references like [1], [2], etc.
            disclosure_content = re.sub(r'\s+', ' ', disclosure_content).strip()
        
        # Create both summary and full versions
        disclosure_summary = disclosure_content[:500] + "..." if len(disclosure_content) > 500 else disclosure_content
        
        # Extract referenced numbers from the disclosure
        referenced_numbers = set()
        if disclosure_content:
            ref_pattern = r'\[(\d+)\]'
            referenced_numbers = set(re.findall(ref_pattern, disclosure_content))
        
        warranties.append({
            'Warranty_ID': warranty_id,
            'Warranty_Title': warranty_title,
            'Section_Name': section_name,
            'Source_Documents_Count': source_docs_count,
            'Warranty_Text': warranty_text,
            'Disclosure_Summary': clean_text(disclosure_summary),
            'Full_Disclosure': disclosure_content,  # Keep references intact
            'Referenced_Numbers': list(referenced_numbers)
        })
    
    logger.info(f"Extracted {len(warranties)} warranties")
    return warranties, references_section

Parameters

Name	Type	Default	Kind
`markdown_content`	-	-	positional_or_keyword

Parameter Details

markdown_content: A string containing markdown-formatted warranty documentation. Expected to have warranty sections marked with '## [ID] - [Title]' headers, subsections for 'Warranty Text' and 'Disclosure' marked with '###' headers, metadata fields like '**Section**: [name]' and '**Source Documents Found**: [count]', inline reference citations in [number] format, and an optional '## References' section at the end. The content may contain escaped newlines (the two-character sequence \n), which are converted to real newlines before parsing.
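The newline normalization is a single string replacement, and its effect is easy to check on its own. In this minimal sketch the input string is a hypothetical example of content that arrived with escaped newlines.

```python
# A string holding literal backslash-n sequences instead of real line breaks
# (hypothetical input, e.g. text that was escaped before being stored).
raw = "## 1.1 - Limited Warranty\\n**Section**: General Warranties\\n"

# The same call the function makes as its first step.
normalized = raw.replace('\\n', '\n')

print(raw.count('\n'))         # real newlines before normalization
print(normalized.count('\n'))  # real newlines after normalization
```

Because the replacement is unconditional, any backslash-n sequence that was meant literally in the source text would also be turned into a line break.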

Return Value

Returns a tuple containing two elements.
(1) A list of dictionaries, one per warranty, with these keys:
'Warranty_ID': string, the warranty identifier
'Warranty_Title': string, the warranty title
'Section_Name': string, the section this warranty belongs to (empty string if not found)
'Source_Documents_Count': string, number of source documents found ('0' if not found)
'Warranty_Text': string, cleaned warranty text content (empty string if not found)
'Disclosure_Summary': string, the first 500 characters of the disclosure with '...' appended if truncated, passed through clean_text
'Full_Disclosure': string, complete disclosure text with whitespace collapsed and reference citations preserved
'Referenced_Numbers': list of strings, the unique reference numbers cited in the disclosure, in no guaranteed order
(2) A string containing the content of the References section if found, otherwise an empty string.
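Since the count and the reference numbers come back as strings, callers may want a typed copy of each record. The following is a minimal sketch under that assumption; `normalize_record` is a hypothetical helper, not part of the source module, and the record values are made up.

```python
# One record shaped like the dictionaries described above (values are made up).
record = {
    'Warranty_ID': '1.1',
    'Warranty_Title': 'Limited Warranty',
    'Section_Name': 'General Warranties',
    'Source_Documents_Count': '3',
    'Warranty_Text': 'This product is warranted for 1 year from date of purchase.',
    'Disclosure_Summary': 'Covers manufacturing defects [1] but excludes normal wear [2].',
    'Full_Disclosure': 'Covers manufacturing defects [1] but excludes normal wear [2].',
    'Referenced_Numbers': ['2', '1'],
}

def normalize_record(rec):
    """Return a copy with an integer count and numerically sorted references."""
    out = dict(rec)  # shallow copy so the original record is left untouched
    out['Source_Documents_Count'] = int(rec['Source_Documents_Count'])
    out['Referenced_Numbers'] = sorted(int(n) for n in rec['Referenced_Numbers'])
    return out

typed = normalize_record(record)
print(typed['Source_Documents_Count'], typed['Referenced_Numbers'])
```

Sorting here also gives the reference list a stable order, which the function itself does not promise.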

Dependencies

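re
logging
logger (module-level logging.Logger used by the function; not defined inside it)
clean_text (text-cleaning helper called by the function; not defined inside it)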

Required Imports

import re
import logging

Usage Example

import re
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def clean_text(text):
    """Simple text cleaning function."""
    return re.sub(r'\s+', ' ', text).strip()

markdown_content = '''## 1.1 - Limited Warranty
**Section**: General Warranties
**Source Documents Found**: 3

### Warranty Text

This product is warranted for 1 year from date of purchase.

### Disclosure

The warranty covers manufacturing defects [1] but excludes normal wear [2].

---

## 1.2 - Extended Warranty
**Section**: Optional Coverage
**Source Documents Found**: 1

### Warranty Text

Extended coverage available for purchase.

### Disclosure

Extended warranty provides additional 2 years of coverage [3].

## References

[1] Manufacturing Defects Policy
[2] Wear and Tear Guidelines
[3] Extended Coverage Terms
'''

# Assumes extract_warranty_data_improved is defined above or imported into this module
warranties, references = extract_warranty_data_improved(markdown_content)

for warranty in warranties:
    print(f"ID: {warranty['Warranty_ID']}")
    print(f"Title: {warranty['Warranty_Title']}")
    print(f"References: {warranty['Referenced_Numbers']}")
    print(f"Disclosure: {warranty['Disclosure_Summary']}")
    print('---')

print(f"\nReferences Section:\n{references}")

Best Practices

Ensure the markdown content follows the expected structure with '## [ID] - [Title]' headers for warranties and '###' subheaders for Warranty Text and Disclosure sections
Configure logging before calling this function to capture processing information and warnings
The function calls a 'clean_text' helper and a module-level 'logger'; define both in the same module before calling it
Full_Disclosure keeps inline reference citations (e.g., [1], [2]) because only whitespace is collapsed. Disclosure_Summary is additionally passed through clean_text, so whether citations survive there depends on that helper - use Full_Disclosure when references are needed
Handle the case where the references section might not be found (empty string returned as second tuple element)
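Warranty IDs can carry up to three chained parenthetical suffixes in a fixed order, like '1.1(a)(i)(Note)' - the regex pattern accommodates this
The function uses regex with the DOTALL flag for multi-line matching, and the patterns are strict about layout: Warranty Text is only found when a blank line follows '### Warranty Text' and precedes '### Disclosure', and a disclosure is cut at the first '---' separator line it contains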
Source_Documents_Count is returned as a string, not an integer - convert if numerical operations are needed
Referenced_Numbers holds the unique citation numbers found in the disclosure text, as strings. They are collected through a set, so their order is not guaranteed - sort them if order matters
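The practices above about the references string and Referenced_Numbers can be combined into a lookup step. This is a sketch only: `index_references` and `resolve` are hypothetical helpers, and the "[n] description" layout of the References section is an assumption taken from the usage example.

```python
import re

# References text in the "[n] description" layout of the usage example;
# the layout of a real References section is an assumption.
references = (
    "\n[1] Manufacturing Defects Policy\n"
    "[2] Wear and Tear Guidelines\n"
    "[3] Extended Coverage Terms\n"
)

def index_references(references_section):
    """Map each reference number (as a string) to its description."""
    pairs = re.findall(r'^\[(\d+)\][ \t]*(.+)$', references_section, re.MULTILINE)
    return {number: text.strip() for number, text in pairs}

def resolve(referenced_numbers, ref_index):
    """Return (number, description) pairs in numeric order; None marks a missing entry."""
    return [(n, ref_index.get(n)) for n in sorted(referenced_numbers, key=int)]

ref_index = index_references(references)
resolved = resolve(['2', '1', '7'], ref_index)
for number, description in resolved:
    print(number, description)
```

An empty references string yields an empty index, so every cited number resolves to None instead of raising an error.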

Similar Components

AI-powered semantic similarity - components with related functionality:

function extract_warranty_data 96.2% similar

Parses markdown-formatted warranty documentation to extract structured warranty information including IDs, titles, sections, source document counts, warranty text, and disclosure content.
From: /tf/active/vicechatdev/convert_disclosures_to_table.py
function extract_warranty_sections 85.7% similar

Parses markdown content to extract warranty section headers, returning a list of dictionaries containing section IDs and titles for table of contents generation.
From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
function create_enhanced_word_document 72.4% similar

Converts markdown-formatted warranty disclosure content into a formatted Microsoft Word document with hierarchical headings, styled text, lists, and special formatting for block references.
From: /tf/active/vicechatdev/improved_word_converter.py
function main_v8 69.2% similar

Orchestrates the conversion of an improved markdown file containing warranty disclosures into multiple tabular formats (CSV, Excel, Word) with timestamp-based file naming.
From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
function main_v15 68.5% similar

Converts a markdown file containing warranty disclosure data into multiple tabular formats (CSV, Excel, Word) with timestamped output files.
From: /tf/active/vicechatdev/convert_disclosures_to_table.py
