🔍 Code Extractor

function extract_warranty_data

Maturity: 46

Parses markdown-formatted warranty documentation to extract structured warranty information including IDs, titles, sections, source document counts, warranty text, and disclosure content.

File: /tf/active/vicechatdev/convert_disclosures_to_table.py
Lines: 75 - 139
Complexity: moderate

Purpose

This function processes markdown in which each warranty begins with a level-2 heading of the form '## [ID] - [Title]' and contains '### Warranty Text' and '### Disclosure' subsections. It first converts literal '\n' sequences to real newlines, then locates every warranty heading, slices out the text up to the next heading (or the end of the input), and extracts the section name, the source document count, the warranty text, and the disclosure. Warranty IDs may include parenthetical suffixes such as '1.2(a)(i)'. Each warranty becomes one dictionary that holds both a truncated summary and the full disclosure, which makes the result suitable for analysis, reporting, or database storage.
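
As an illustration of the "structured data for reporting" use case, the following is a minimal sketch that serializes records of the returned shape to CSV using only the standard library. The helper name warranties_to_csv and the sample record are assumptions made for this example; they are not part of convert_disclosures_to_table.py.

```python
import csv
import io

# Hypothetical record shaped like the dictionaries extract_warranty_data returns.
warranties = [
    {
        "Warranty_ID": "1.1",
        "Warranty_Title": "Product Warranty",
        "Section_Name": "General Warranties",
        "Source_Documents_Count": "3",
        "Warranty_Text": "This product is warranted for 1 year.",
        "Disclosure_Summary": "Warranty does not cover misuse.",
        "Full_Disclosure": "Warranty does not cover misuse.",
    },
]


def warranties_to_csv(rows):
    """Serialize warranty records to CSV text (hypothetical helper)."""
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column order follows the key order of the first record.
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = warranties_to_csv(warranties)
print(csv_text)
```

Returning an empty string for an empty list keeps the helper safe to call when no warranty sections were found.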

Source Code

def extract_warranty_data(markdown_content):
    """Extract warranty data from markdown content."""
    warranties = []
    
    # First, normalize the content by converting escaped newlines to actual newlines
    normalized_content = markdown_content.replace('\\n', '\n')
    
    # Find all warranty sections using a more flexible pattern
    # Look for ## followed by warranty ID - Title pattern
    warranty_pattern = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'
    
    # Find all warranty sections
    warranty_matches = list(re.finditer(warranty_pattern, normalized_content))
    logger.info(f"Found {len(warranty_matches)} warranty sections")
    
    for i, match in enumerate(warranty_matches):
        warranty_id = match.group(1).strip()
        warranty_title = match.group(2).strip()
        
        # Find the content between this warranty and the next one (or end of file)
        start_pos = match.end()
        if i + 1 < len(warranty_matches):
            end_pos = warranty_matches[i + 1].start()
            content = normalized_content[start_pos:end_pos]
        else:
            content = normalized_content[start_pos:]
        
        logger.info(f"Processing warranty: {warranty_id} - {warranty_title}")
        
        # Extract section name (look for **Section**: pattern)
        section_match = re.search(r'\*\*Section\*\*:\s*(.+?)(?:\n|\*\*)', content)
        section_name = section_match.group(1).strip() if section_match else ""
        
        # Extract source documents count
        source_docs_match = re.search(r'\*\*Source Documents Found\*\*:\s*(\d+)', content)
        source_docs_count = source_docs_match.group(1) if source_docs_match else "0"
        
        # Extract warranty text (between ### Warranty Text and ### Disclosure)
        warranty_text_match = re.search(r'### Warranty Text\s*\n\n(.+?)\n\n### Disclosure', content, re.DOTALL)
        warranty_text = clean_text(warranty_text_match.group(1)) if warranty_text_match else ""
        
        # Extract disclosure content (everything after ### Disclosure until next --- or end)
        disclosure_match = re.search(r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)', content, re.DOTALL)
        disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""
        
        # If disclosure_content is empty, try a more relaxed pattern
        if not disclosure_content:
            disclosure_match = re.search(r'### Disclosure\s*\n(.+?)(?=\n---\n|$)', content, re.DOTALL)
            disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""
        
        # Create both summary and full versions
        disclosure_summary = disclosure_content[:500] + "..." if len(disclosure_content) > 500 else disclosure_content
        
        warranties.append({
            'Warranty_ID': warranty_id,
            'Warranty_Title': warranty_title,
            'Section_Name': section_name,
            'Source_Documents_Count': source_docs_count,
            'Warranty_Text': warranty_text,
            'Disclosure_Summary': disclosure_summary,
            'Full_Disclosure': disclosure_content
        })
    
    logger.info(f"Extracted {len(warranties)} warranties")
    return warranties
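
The core of the loop above is slicing the text between one header match and the next. The sketch below isolates that step with a deliberately simplified header pattern; split_sections and HEADER are illustrative names, and the simplified pattern accepts fewer ID shapes than the function's own.

```python
import re

# Simplified header pattern for illustration only; the real function's
# pattern also accepts parenthetical suffixes such as (a) or (i).
HEADER = re.compile(r"## ([\d.]+) - (.+?)\n")


def split_sections(text):
    """Return (id, title, body) tuples, one per '## ID - Title' header."""
    matches = list(HEADER.finditer(text))
    sections = []
    for i, match in enumerate(matches):
        start = match.end()
        # The body ends where the next header starts, or at the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections.append((match.group(1), match.group(2).strip(), text[start:end]))
    return sections


sample = "## 1.1 - First\nbody one\n## 1.2 - Second\nbody two\n"
sections = split_sections(sample)
print(sections)
```

Any text that precedes the first matching header is not part of any section, which is also how the function above behaves.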

Parameters

Name              Type  Default  Kind
markdown_content  -     -        positional_or_keyword

Parameter Details

markdown_content: A string containing markdown-formatted warranty documentation. Each warranty section is expected to start with a '## [ID] - [Title]' header, followed by the bold metadata fields '**Section**:' and '**Source Documents Found**:' and the subsections '### Warranty Text' and '### Disclosure'. The content may contain literal '\n' sequences (a backslash followed by 'n'), which are converted to real newlines before parsing. The warranty ID pattern supports formats such as '1.2', '1.2(a)', '1.2(a)(i)', and '1.2(a)(Example)'.
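
To make the ID formats and the newline normalization concrete, here is a small sketch that applies the header pattern copied from the source to a hypothetical input. The sample headers and titles are invented for illustration.

```python
import re

# Header pattern copied from the function's source.
WARRANTY_PATTERN = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'

# A raw string keeps "\n" as two literal characters (backslash, n),
# which is the escaped form the function normalizes.
raw = r"## 1.2 - Plain\n## 1.2(a) - Lettered\n## 1.2(a)(i) - Roman\n## 1.2(a)(Example) - Named\n"
normalized = raw.replace('\\n', '\n')

# Without normalization there is no real newline, so no header matches.
ids_before = [m.group(1) for m in re.finditer(WARRANTY_PATTERN, raw)]
ids_after = [m.group(1) for m in re.finditer(WARRANTY_PATTERN, normalized)]

print(ids_before)
print(ids_after)
```

The pattern requires a real newline after the title, which is why the normalization step has to run before the header search.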

Return Value

Returns a list of dictionaries, where each dictionary represents one warranty section with the following keys:

  • 'Warranty_ID' (string): the extracted warranty identifier
  • 'Warranty_Title' (string): the warranty title
  • 'Section_Name' (string): the section name, or an empty string if not found
  • 'Source_Documents_Count' (string): the number of source documents, or '0' if not found
  • 'Warranty_Text' (string): the cleaned warranty text content
  • 'Disclosure_Summary' (string): the first 500 characters of the disclosure, with '...' appended if the disclosure is longer
  • 'Full_Disclosure' (string): the complete cleaned disclosure content

Returns an empty list if no warranty sections are found.
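
One way to work with this return shape is sketched below. WarrantyRecord, source_count, and summarize are hypothetical names introduced for this example; summarize mirrors the truncation expression in the source, and source_count shows that the count arrives as a string and needs converting before arithmetic.

```python
from typing import TypedDict


class WarrantyRecord(TypedDict):
    """Hypothetical type describing one returned dictionary."""
    Warranty_ID: str
    Warranty_Title: str
    Section_Name: str
    Source_Documents_Count: str  # a string such as "3", not an int
    Warranty_Text: str
    Disclosure_Summary: str
    Full_Disclosure: str


def source_count(record: WarrantyRecord) -> int:
    """Convert the string count to an integer for sorting or totals."""
    return int(record["Source_Documents_Count"])


def summarize(text: str, limit: int = 500) -> str:
    """Mirror the source's truncation rule: cut at `limit` and append '...'."""
    return text[:limit] + "..." if len(text) > limit else text


record: WarrantyRecord = {
    "Warranty_ID": "1.1",
    "Warranty_Title": "Product Warranty",
    "Section_Name": "General Warranties",
    "Source_Documents_Count": "3",
    "Warranty_Text": "Warranted for 1 year.",
    "Disclosure_Summary": "Short disclosure.",
    "Full_Disclosure": "Short disclosure.",
}

print(source_count(record))
print(len(summarize("x" * 501)))
```

Note that a truncated summary is 503 characters long, because the three dots are added after the 500-character cut.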

Dependencies

  • re
  • logging

Required Imports

import re
import logging

Usage Example

import re
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

def clean_text(text):
    """Stand-in for the module's clean_text helper, which is not shown on this page."""
    return text.strip()

markdown_content = '''## 1.1 - Product Warranty

**Section**: General Warranties
**Source Documents Found**: 3

### Warranty Text

This product is warranted for 1 year from date of purchase.

### Disclosure

Warranty does not cover misuse or accidental damage.

---

## 1.2 - Service Warranty

**Section**: Service Terms
**Source Documents Found**: 2

### Warranty Text

Services are warranted for 90 days.

### Disclosure

Service warranty is non-transferable.
'''

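# extract_warranty_data (see Source Code above) must be defined or imported
# in this module, after logger and clean_text exist.
warranties = extract_warranty_data(markdown_content)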
for warranty in warranties:
    print(f"ID: {warranty['Warranty_ID']}, Title: {warranty['Warranty_Title']}")
    print(f"Section: {warranty['Section_Name']}")
    print(f"Sources: {warranty['Source_Documents_Count']}")
    print(f"Text: {warranty['Warranty_Text'][:50]}...")
    print()

Best Practices

  • Define or import 'clean_text' in the same module before calling this function; it is applied to the warranty text and the disclosure, and a missing definition raises a NameError
  • Define a module-level 'logger' (for example, logging.getLogger(__name__)) before calling this function; it logs the number of sections found and each warranty processed at INFO level
  • The input markdown must use '## [ID] - [Title]' headers for warranty sections; headings that do not match the pattern are not treated as section boundaries
  • Warranty IDs may carry up to three parenthetical suffixes in this order: a single lowercase letter, a lowercase roman numeral built from i, v, and x, and a word (e.g., '1.2(a)(i)')
  • 'Disclosure_Summary' holds the first 500 characters followed by '...' when the disclosure is longer; use 'Full_Disclosure' when the complete text is needed
  • If the strict disclosure pattern (blank line after the '### Disclosure' heading) yields nothing, the function retries with a relaxed pattern that requires only a single newline
  • Check that the returned list is not empty before processing it, to handle input in which no warranty headers match
  • The function converts literal '\n' sequences to real newlines, so the input can use either form
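
The strict-then-relaxed disclosure lookup mentioned in the list above can be sketched as follows. The two patterns are copied from the source; find_disclosure is a hypothetical wrapper, and str.strip() stands in for clean_text, whose real implementation is not shown on this page.

```python
import re

# Both patterns are copied from the function's source.
STRICT = r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)'
RELAXED = r'### Disclosure\s*\n(.+?)(?=\n---\n|$)'


def find_disclosure(content):
    """Try the strict pattern first, then fall back to the relaxed one."""
    match = re.search(STRICT, content, re.DOTALL)
    text = match.group(1).strip() if match else ""  # strip() stands in for clean_text
    if not text:
        match = re.search(RELAXED, content, re.DOTALL)
        text = match.group(1).strip() if match else ""
    return text


# Blank line after the heading: the strict pattern matches.
spaced = "### Disclosure\n\nNo known exceptions.\n\n---\n"
# No blank line after the heading: only the relaxed pattern matches.
tight = "### Disclosure\nNo known exceptions.\n---\n"

print(find_disclosure(spaced))
print(find_disclosure(tight))
```

Input without a '### Disclosure' heading yields an empty string from both attempts, which is the value the function stores in 'Full_Disclosure' in that case.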

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function extract_warranty_data_improved 96.2% similar

    Parses markdown-formatted warranty documentation to extract structured warranty data including IDs, titles, sections, disclosure text, and reference citations.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py
  • function extract_warranty_sections 88.8% similar

    Parses markdown content to extract warranty section headers, returning a list of dictionaries containing section IDs and titles for table of contents generation.

    From: /tf/active/vicechatdev/enhanced_word_converter_fixed.py
  • function create_enhanced_word_document 72.6% similar

    Converts markdown-formatted warranty disclosure content into a formatted Microsoft Word document with hierarchical headings, styled text, lists, and special formatting for block references.

    From: /tf/active/vicechatdev/improved_word_converter.py
  • function main_v15 70.9% similar

    Converts a markdown file containing warranty disclosure data into multiple tabular formats (CSV, Excel, Word) with timestamped output files.

    From: /tf/active/vicechatdev/convert_disclosures_to_table.py
  • function main_v8 69.7% similar

    Orchestrates the conversion of an improved markdown file containing warranty disclosures into multiple tabular formats (CSV, Excel, Word) with timestamp-based file naming.

    From: /tf/active/vicechatdev/improved_convert_disclosures_to_table.py