function extract_warranty_data
Parses markdown-formatted warranty documentation to extract structured warranty information including IDs, titles, sections, source document counts, warranty text, and disclosure content.
/tf/active/vicechatdev/convert_disclosures_to_table.py
75 - 139
moderate
Purpose
This function processes markdown content containing warranty information structured with specific heading patterns (## for warranty sections, ### for subsections). It extracts and organizes warranty data into a list of dictionaries: it normalizes escaped newlines, parses warranty IDs (including complex patterns with parentheses), extracts metadata fields, and creates both summary and full versions of the disclosure content. It is useful for converting markdown warranty documentation into structured data for analysis, reporting, or database storage.
Source Code
def extract_warranty_data(markdown_content):
    """Extract warranty data from markdown content."""
    warranties = []

    # First, normalize the content by converting escaped newlines to actual newlines
    normalized_content = markdown_content.replace('\\n', '\n')

    # Find all warranty sections using a more flexible pattern
    # Look for ## followed by warranty ID - Title pattern
    warranty_pattern = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'

    # Find all warranty sections
    warranty_matches = list(re.finditer(warranty_pattern, normalized_content))
    logger.info(f"Found {len(warranty_matches)} warranty sections")

    for i, match in enumerate(warranty_matches):
        warranty_id = match.group(1).strip()
        warranty_title = match.group(2).strip()

        # Find the content between this warranty and the next one (or end of file)
        start_pos = match.end()
        if i + 1 < len(warranty_matches):
            end_pos = warranty_matches[i + 1].start()
            content = normalized_content[start_pos:end_pos]
        else:
            content = normalized_content[start_pos:]

        logger.info(f"Processing warranty: {warranty_id} - {warranty_title}")

        # Extract section name (look for **Section**: pattern)
        section_match = re.search(r'\*\*Section\*\*:\s*(.+?)(?:\n|\*\*)', content)
        section_name = section_match.group(1).strip() if section_match else ""

        # Extract source documents count
        source_docs_match = re.search(r'\*\*Source Documents Found\*\*:\s*(\d+)', content)
        source_docs_count = source_docs_match.group(1) if source_docs_match else "0"

        # Extract warranty text (between ### Warranty Text and ### Disclosure)
        warranty_text_match = re.search(r'### Warranty Text\s*\n\n(.+?)\n\n### Disclosure', content, re.DOTALL)
        warranty_text = clean_text(warranty_text_match.group(1)) if warranty_text_match else ""

        # Extract disclosure content (everything after ### Disclosure until next --- or end)
        disclosure_match = re.search(r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)', content, re.DOTALL)
        disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""

        # If disclosure_content is empty, try a more relaxed pattern
        if not disclosure_content:
            disclosure_match = re.search(r'### Disclosure\s*\n(.+?)(?=\n---\n|$)', content, re.DOTALL)
            disclosure_content = clean_text(disclosure_match.group(1)) if disclosure_match else ""

        # Create both summary and full versions
        disclosure_summary = disclosure_content[:500] + "..." if len(disclosure_content) > 500 else disclosure_content

        warranties.append({
            'Warranty_ID': warranty_id,
            'Warranty_Title': warranty_title,
            'Section_Name': section_name,
            'Source_Documents_Count': source_docs_count,
            'Warranty_Text': warranty_text,
            'Disclosure_Summary': disclosure_summary,
            'Full_Disclosure': disclosure_content
        })

    logger.info(f"Extracted {len(warranties)} warranties")
    return warranties
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| markdown_content | - | - | positional_or_keyword |
Parameter Details
markdown_content: A string containing markdown-formatted warranty documentation. Each warranty section is expected to start with a '## [ID] - [Title]' header, followed by the metadata lines '**Section**:' and '**Source Documents Found**:' and the subsections '### Warranty Text' and '### Disclosure'. The content may contain escaped newlines (the literal two-character sequence '\n'), which are converted to real newlines before parsing. The warranty ID pattern supports formats such as '1.2', '1.2(a)', '1.2(a)(i)', or '1.2(a)(Example)'.
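To make the supported ID formats concrete, the sketch below applies the heading pattern from the source code to a few sample headings. The helper name parse_heading and the sample headings are illustrative assumptions and are not part of the module.

```python
import re

# Heading pattern copied from the function's source code.
WARRANTY_PATTERN = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'


def parse_heading(line):
    """Return (warranty_id, title) for one heading line, or None if it does not match.

    Hypothetical helper for illustration only.
    """
    # The pattern expects the heading to end with a newline, so append one.
    match = re.match(WARRANTY_PATTERN, line + "\n")
    if match is None:
        return None
    return match.group(1).strip(), match.group(2).strip()


# Hypothetical sample headings.
samples = [
    "## 1.2 - Title",
    "## 1.2(a) - Title",
    "## 1.2(a)(i) - Title",
    "## 1.2(a)(Example) - Title",
    "## A.1 - Title",           # ID must start with digits or dots: no match
    "## 1.2(a)(b)(c) - Title",  # more parenthetical groups than the pattern allows: no match
]

for heading in samples:
    print(f"{heading!r:32} -> {parse_heading(heading)}")
```

Headings that do not match the pattern are skipped by the function without any warning, so it may be worth checking unusual ID formats this way before relying on the output.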
Return Value
Returns a list of dictionaries, one per warranty section, with the following keys:
- 'Warranty_ID' (string): the extracted warranty identifier
- 'Warranty_Title' (string): the warranty title
- 'Section_Name' (string): the section name, or an empty string if not found
- 'Source_Documents_Count' (string): the number of source documents, or '0' if not found
- 'Warranty_Text' (string): the cleaned warranty text content
- 'Disclosure_Summary' (string): the first 500 characters of the disclosure, with '...' appended if it is longer
- 'Full_Disclosure' (string): the complete cleaned disclosure content
Returns an empty list if no warranty sections are found.
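One possible way to consume the return value is to write it out as CSV with the standard library. This is a minimal sketch and not part of the module: warranties_to_csv and the sample record are hypothetical, and only the seven key names come from the description above.

```python
import csv
import io

# The seven keys described in the return value.
FIELDNAMES = [
    "Warranty_ID",
    "Warranty_Title",
    "Section_Name",
    "Source_Documents_Count",
    "Warranty_Text",
    "Disclosure_Summary",
    "Full_Disclosure",
]


def warranties_to_csv(warranties):
    """Serialize a list of warranty dictionaries to CSV text (hypothetical helper)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDNAMES)
    writer.writeheader()
    writer.writerows(warranties)
    return buffer.getvalue()


# Hypothetical record shaped like one element of the returned list.
sample = {
    "Warranty_ID": "1.1",
    "Warranty_Title": "Product Warranty",
    "Section_Name": "General Warranties",
    "Source_Documents_Count": "3",
    "Warranty_Text": "This product is warranted for 1 year from date of purchase.",
    "Disclosure_Summary": "Warranty does not cover misuse or accidental damage.",
    "Full_Disclosure": "Warranty does not cover misuse or accidental damage.",
}

csv_text = warranties_to_csv([sample])
print(csv_text)

# 'Source_Documents_Count' is a string of digits; convert it before doing arithmetic.
total_sources = sum(int(w["Source_Documents_Count"]) for w in [sample])
print(total_sources)
```

An empty input list still yields a header row, so downstream tools receive a well-formed file even when no warranty sections were found.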
Dependencies
- re
- logging
Required Imports
import re
import logging
Usage Example
import re
import logging

logger = logging.getLogger(__name__)
logging.basicConfig(level=logging.INFO)

def clean_text(text):
    """Helper function to clean text."""
    return text.strip()

# extract_warranty_data (see Source Code) is assumed to be defined in this module.
# The blank lines after each '###' heading matter: the strict patterns require them.
markdown_content = '''## 1.1 - Product Warranty

**Section**: General Warranties
**Source Documents Found**: 3

### Warranty Text

This product is warranted for 1 year from date of purchase.

### Disclosure

Warranty does not cover misuse or accidental damage.

---

## 1.2 - Service Warranty

**Section**: Service Terms
**Source Documents Found**: 2

### Warranty Text

Services are warranted for 90 days.

### Disclosure

Service warranty is non-transferable.
'''

warranties = extract_warranty_data(markdown_content)

for warranty in warranties:
    print(f"ID: {warranty['Warranty_ID']}, Title: {warranty['Warranty_Title']}")
    print(f"Section: {warranty['Section_Name']}")
    print(f"Sources: {warranty['Source_Documents_Count']}")
    print(f"Text: {warranty['Warranty_Text'][:50]}...")
    print()
Best Practices
- Define the 'clean_text' function in the same module as this function before calling it, as it's used to process extracted text
- Define a module-level 'logger' (for example with logging.getLogger) before calling this function; it is used for INFO-level progress messages, and the call fails with a NameError if it is missing
- The input markdown must follow the expected structure with '## [ID] - [Title]' headers for warranty sections
- Warranty IDs can be complex (e.g., '1.2(a)(i)') and the function handles various parenthetical patterns
- The function creates both summary (500 char limit) and full disclosure versions - choose appropriately based on use case
- If disclosure content is not found with strict patterns, the function attempts a more relaxed pattern match
- Consider validating the returned list is not empty before processing to handle cases where no warranties are found
- The function converts escaped newlines (the literal two-character sequence '\n') to actual newlines, so input can contain either format
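Three of the behaviours listed above (newline normalization, the strict-then-relaxed disclosure match, and the 500-character summary) can be sketched in isolation as follows. The two patterns and the truncation rule are taken from the source code; the helper names are hypothetical, and str.strip() stands in for clean_text, whose real implementation is not shown in this document.

```python
import re

# Patterns copied from the function's source code.
STRICT_DISCLOSURE = r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)'
RELAXED_DISCLOSURE = r'### Disclosure\s*\n(.+?)(?=\n---\n|$)'


def normalize_newlines(text):
    """Turn literal backslash-n pairs into real newline characters."""
    return text.replace('\\n', '\n')


def find_disclosure(content):
    """Try the strict pattern first, then the relaxed one; return '' if neither yields text."""
    for pattern in (STRICT_DISCLOSURE, RELAXED_DISCLOSURE):
        match = re.search(pattern, content, re.DOTALL)
        # Assumption: strip() plays the role of clean_text here.
        if match and match.group(1).strip():
            return match.group(1).strip()
    return ""


def summarize(text, limit=500):
    """Keep the first `limit` characters and append '...' only when text was cut."""
    return text[:limit] + "..." if len(text) > limit else text


# Hypothetical input with escaped newlines and no blank line after the heading,
# so only the relaxed pattern can match.
raw = "### Disclosure\\nService warranty is non-transferable.\\n---\\n"
content = normalize_newlines(raw)
disclosure = find_disclosure(content)

print(disclosure)
print(len(summarize("x" * 600)))
```

Note that a truncated summary is 503 characters long, because the '...' marker is added on top of the 500-character limit.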
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_warranty_data_improved 96.2% similar
- function extract_warranty_sections 88.8% similar
- function create_enhanced_word_document 72.6% similar
- function main_v15 70.9% similar
- function main_v8 69.7% similar