function extract_warranty_data_improved
Parses markdown-formatted warranty documentation to extract structured warranty data including IDs, titles, sections, disclosure text, and reference citations.
/tf/active/vicechatdev/improved_convert_disclosures_to_table.py
75 - 165
complex
Purpose
This function processes markdown content in which each warranty is introduced by a '## [ID] - [Title]' header. For every warranty it extracts the metadata (ID, title, section name, source document count), the warranty text, and the disclosure content, and it collects the numeric references cited in the disclosure. It handles warranty IDs with chained parenthetical suffixes such as '1.1(a)(i)', converts escaped newlines to real ones, preserves inline reference citations in the full disclosure, and returns the trailing References section separately from the main content. It is intended for document processing pipelines that convert semi-structured warranty documentation into structured data for analysis, reporting, or database storage.
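To make the ID handling concrete, here is a small sketch that applies the header pattern from the source code to a few headers. The sample headers and the `find_headers` helper are illustrative assumptions, not part of the module.

```python
import re

# Header pattern copied from the function's source code.
WARRANTY_PATTERN = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'

def find_headers(markdown):
    """Return (id, title) pairs for every header the pattern recognizes."""
    return [(m.group(1), m.group(2).strip())
            for m in re.finditer(WARRANTY_PATTERN, markdown)]

# Made-up headers for illustration only.
sample = (
    "## 1.1 - Limited Warranty\n"
    "## 2.3(a) - Title to Assets\n"
    "## 2.3(a)(ii) - Encumbrances\n"
    "## 4.1(b)(iv)(Note) - Tax Filings\n"
    "## References\n"            # no numeric ID, so it is not matched
    "## A.1 - Appendix Item\n"   # the ID must start with digits or dots
)

headers = find_headers(sample)
for warranty_id, title in headers:
    print(warranty_id, "|", title)
```

Only the first four headers are returned. The optional suffix groups are tried in a fixed order (single lowercase letter, roman numeral built from i/v/x, alphabetic word), so IDs that use a different order or character set would need an adjusted pattern.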
Source Code
def extract_warranty_data_improved(markdown_content):
    """Extract warranty data from improved markdown content with proper references."""
    warranties = []

    # First, normalize the content by converting escaped newlines to actual newlines
    normalized_content = markdown_content.replace('\\n', '\n')

    # Find all warranty sections using a more flexible pattern
    # Look for ## followed by warranty ID - Title pattern
    warranty_pattern = r'## ([\d\.]+(?:\([a-z]\))?(?:\([ivx]+\))?(?:\([A-Za-z]+\))?) - (.+?)\n'

    # Find all warranty sections
    warranty_matches = list(re.finditer(warranty_pattern, normalized_content))
    logger.info(f"Found {len(warranty_matches)} warranty sections")

    # Extract references section for later use
    references_section = ""
    ref_match = re.search(r'\n## References\n(.+)$', normalized_content, re.DOTALL)
    if ref_match:
        references_section = ref_match.group(1)
        logger.info("Found references section")
    else:
        logger.warning("References section not found")

    for i, match in enumerate(warranty_matches):
        warranty_id = match.group(1).strip()
        warranty_title = match.group(2).strip()

        # Find the content between this warranty and the next one (or end of file)
        start_pos = match.end()
        if i + 1 < len(warranty_matches):
            end_pos = warranty_matches[i + 1].start()
            content = normalized_content[start_pos:end_pos]
        else:
            # For the last warranty, stop at the references section
            if ref_match:
                end_pos = ref_match.start()
                content = normalized_content[start_pos:end_pos]
            else:
                content = normalized_content[start_pos:]

        logger.info(f"Processing warranty: {warranty_id} - {warranty_title}")

        # Extract section name (look for **Section**: pattern)
        section_match = re.search(r'\*\*Section\*\*:\s*(.+?)(?:\n|\*\*)', content)
        section_name = section_match.group(1).strip() if section_match else ""

        # Extract source documents count
        source_docs_match = re.search(r'\*\*Source Documents Found\*\*:\s*(\d+)', content)
        source_docs_count = source_docs_match.group(1) if source_docs_match else "0"

        # Extract warranty text (between ### Warranty Text and ### Disclosure)
        warranty_text_match = re.search(r'### Warranty Text\s*\n\n(.+?)\n\n### Disclosure', content, re.DOTALL)
        warranty_text = clean_text(warranty_text_match.group(1)) if warranty_text_match else ""

        # Extract disclosure content (everything after ### Disclosure until next --- or end)
        disclosure_match = re.search(r'### Disclosure\s*\n\n(.+?)(?=\n\n---\n|$)', content, re.DOTALL)
        disclosure_content = disclosure_match.group(1) if disclosure_match else ""

        # If disclosure_content is empty, try a more relaxed pattern
        if not disclosure_content:
            disclosure_match = re.search(r'### Disclosure\s*\n(.+?)(?=\n---\n|$)', content, re.DOTALL)
            disclosure_content = disclosure_match.group(1) if disclosure_match else ""

        # Clean disclosure content but preserve references
        if disclosure_content:
            # Don't apply markdown cleaning to preserve inline references like [1], [2], etc.
            disclosure_content = re.sub(r'\s+', ' ', disclosure_content).strip()

        # Create both summary and full versions
        disclosure_summary = disclosure_content[:500] + "..." if len(disclosure_content) > 500 else disclosure_content

        # Extract referenced numbers from the disclosure
        referenced_numbers = set()
        if disclosure_content:
            ref_pattern = r'\[(\d+)\]'
            referenced_numbers = set(re.findall(ref_pattern, disclosure_content))

        warranties.append({
            'Warranty_ID': warranty_id,
            'Warranty_Title': warranty_title,
            'Section_Name': section_name,
            'Source_Documents_Count': source_docs_count,
            'Warranty_Text': warranty_text,
            'Disclosure_Summary': clean_text(disclosure_summary),
            'Full_Disclosure': disclosure_content,  # Keep references intact
            'Referenced_Numbers': list(referenced_numbers)
        })

    logger.info(f"Extracted {len(warranties)} warranties")
    return warranties, references_section
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| markdown_content | - | - | positional_or_keyword |
Parameter Details
markdown_content: A string containing markdown-formatted warranty documentation. It is expected to have warranty sections marked with '## [ID] - [Title]' headers, subsections for 'Warranty Text' and 'Disclosure' marked with '###' headers, metadata fields like '**Section**: [name]' and '**Source Documents Found**: [count]', inline reference citations in [number] format, and an optional '## References' section at the end. The content may contain escaped newlines (the two-character sequence backslash followed by 'n'); these are converted to real newlines before any pattern matching.
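The newline normalization is a plain string replacement. The following minimal sketch shows its effect on made-up sample strings (they are not taken from the source file):

```python
# Illustrative input: the two-character sequence backslash + "n" stands in
# for line breaks, as it might after an upstream escaping step.
raw = "## 1.1 - Limited Warranty\\n**Section**: General Warranties\\n"

# Same replacement the function applies before any pattern matching.
normalized = raw.replace('\\n', '\n')

print(raw.count('\n'))          # number of real newlines before
print(normalized.count('\n'))   # number of real newlines after
print(normalized.splitlines())

# Caveat: the replacement is unconditional, so a literal backslash-n that is
# part of other text (here a hypothetical Windows path) is converted as well.
path_note = "See C:\\notes\\readme.txt"
print(repr(path_note.replace('\\n', '\n')))
```

If the input can legitimately contain backslash-n sequences, for instance in file paths, it may be safer to normalize before calling the function and only where needed.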
Return Value
Returns a tuple containing two elements:
- A list of dictionaries, one per warranty, with these keys:
  - 'Warranty_ID' (string): the warranty identifier
  - 'Warranty_Title' (string): the warranty title
  - 'Section_Name' (string): the section this warranty belongs to, or an empty string if no '**Section**:' field is found
  - 'Source_Documents_Count' (string): number of source documents found, or "0" if the field is missing
  - 'Warranty_Text' (string): warranty text after clean_text
  - 'Disclosure_Summary' (string): the first 500 characters of the whitespace-normalized disclosure, with '...' appended if truncated, passed through clean_text
  - 'Full_Disclosure' (string): the complete whitespace-normalized disclosure text with reference citations preserved
  - 'Referenced_Numbers' (list of strings): the unique reference numbers cited in the disclosure; the list is built from a set, so its order is not guaranteed
- A string containing the content of the References section if found, otherwise an empty string.
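Because the count and the reference numbers come back as strings, callers will often want a typed view of each record. The sketch below shows one way to do that; `to_typed_record` is a hypothetical helper and the sample record is hand-written in the shape described above.

```python
def to_typed_record(warranty):
    """Return a copy with an integer count and numerically sorted references."""
    typed = dict(warranty)
    typed['Source_Documents_Count'] = int(warranty['Source_Documents_Count'])
    # Sort by numeric value so that '10' comes after '2'.
    typed['Referenced_Numbers'] = sorted(warranty['Referenced_Numbers'], key=int)
    return typed

# Hand-written record for illustration; values are not real extraction output.
sample = {
    'Warranty_ID': '1.1',
    'Warranty_Title': 'Limited Warranty',
    'Section_Name': 'General Warranties',
    'Source_Documents_Count': '3',
    'Warranty_Text': 'This product is warranted for 1 year from date of purchase.',
    'Disclosure_Summary': 'Covers defects [1] and wear exclusions [2]; see also [10].',
    'Full_Disclosure': 'Covers defects [1] and wear exclusions [2]; see also [10].',
    'Referenced_Numbers': ['10', '2', '1'],
}

typed = to_typed_record(sample)
print(typed['Source_Documents_Count'], typed['Referenced_Numbers'])
```

Sorting with `key=int` gives a stable, numeric order regardless of how the underlying set happened to be iterated.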
Dependencies
- re
- logging
Required Imports
import re
import logging
Usage Example
import re
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def clean_text(text):
    """Simple text cleaning function."""
    return re.sub(r'\s+', ' ', text).strip()

# extract_warranty_data_improved must be defined in this same module:
# it relies on the module-level `logger` and `clean_text` set up above.

# The blank lines matter: the extraction patterns expect an empty line
# after each '###' header and before the '---' separator.
markdown_content = '''## 1.1 - Limited Warranty

**Section**: General Warranties
**Source Documents Found**: 3

### Warranty Text

This product is warranted for 1 year from date of purchase.

### Disclosure

The warranty covers manufacturing defects [1] but excludes normal wear [2].

---

## 1.2 - Extended Warranty

**Section**: Optional Coverage
**Source Documents Found**: 1

### Warranty Text

Extended coverage available for purchase.

### Disclosure

Extended warranty provides additional 2 years of coverage [3].

## References

[1] Manufacturing Defects Policy
[2] Wear and Tear Guidelines
[3] Extended Coverage Terms
'''

warranties, references = extract_warranty_data_improved(markdown_content)

for warranty in warranties:
    print(f"ID: {warranty['Warranty_ID']}")
    print(f"Title: {warranty['Warranty_Title']}")
    print(f"References: {warranty['Referenced_Numbers']}")
    print(f"Disclosure: {warranty['Disclosure_Summary']}")
    print('---')

print(f"\nReferences Section:\n{references}")
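The module name suggests the records eventually end up in a table, so one possible next step is writing them to CSV. The sketch below uses only the standard library; `warranties_to_csv`, the column selection, and the sample records are assumptions for illustration and are not part of the source module.

```python
import csv
import io

# Hypothetical column selection; adjust to the table you actually need.
FIELDS = ['Warranty_ID', 'Warranty_Title', 'Section_Name',
          'Source_Documents_Count', 'Disclosure_Summary', 'Referenced_Numbers']

def warranties_to_csv(warranties):
    """Serialize warranty dictionaries to CSV text, ignoring unlisted keys."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, extrasaction='ignore')
    writer.writeheader()
    for warranty in warranties:
        row = dict(warranty)
        # Flatten the list into one cell, sorted numerically for stable output.
        row['Referenced_Numbers'] = ', '.join(
            sorted(warranty['Referenced_Numbers'], key=int))
        writer.writerow(row)
    return buffer.getvalue()

# Hand-written records in the shape the function returns.
sample_warranties = [
    {'Warranty_ID': '1.1', 'Warranty_Title': 'Limited Warranty',
     'Section_Name': 'General Warranties', 'Source_Documents_Count': '3',
     'Warranty_Text': 'Warranted for 1 year.',
     'Disclosure_Summary': 'Covers defects [1] but excludes wear [2].',
     'Full_Disclosure': 'Covers defects [1] but excludes wear [2].',
     'Referenced_Numbers': ['2', '1']},
]

csv_text = warranties_to_csv(sample_warranties)
print(csv_text)
```

Writing to an in-memory buffer keeps the helper easy to test; to write a file instead, pass a handle opened with `newline=''` to `csv.DictWriter`.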
Best Practices
- Ensure the markdown content follows the expected structure with '## [ID] - [Title]' headers for warranties and '###' subheaders for Warranty Text and Disclosure sections
- Configure logging before calling this function to capture processing information and warnings
- Implement the 'clean_text' helper function in the same module to handle text cleaning consistently
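- The function preserves inline reference citations (e.g., [1], [2]) in the Full_Disclosure field. Disclosure_Summary is additionally passed through 'clean_text', which may alter or strip them depending on how that helper is implemented, so use Full_Disclosure when references are needed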
- Handle the case where the references section might not be found (empty string returned as second tuple element)
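- Warranty IDs can carry chained parenthetical suffixes like '1.1(a)(i)(Note)' - the regex pattern accepts up to three, in this order: a single lowercase letter, a roman numeral built from i, v and x, and an alphabetic word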
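- The function uses regex with the DOTALL flag for multi-line matching. The header pattern is not anchored to the start of a line, and the disclosure match stops at the first blank line followed by '---', so a '## [ID] - ' sequence or a horizontal rule inside a disclosure will split or truncate that entry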
- Source_Documents_Count is returned as a string, not an integer - convert if numerical operations are needed
- Referenced_Numbers are extracted as strings and returned as a list - they represent citation numbers found in the disclosure text
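The citation numbers become more useful once they are joined back to the references text. The following sketch assumes one reference per line in the form "[n] text", as in the usage example; `parse_references` and `resolve` are hypothetical helpers, and other reference layouts would need a different pattern.

```python
import re

def parse_references(references_section):
    """Map each citation number (as a string) to its reference text.

    Assumes one reference per line in the form "[n] text".
    """
    refs = {}
    for line in references_section.splitlines():
        m = re.match(r'\s*\[(\d+)\]\s*(.+)', line)
        if m:
            refs[m.group(1)] = m.group(2).strip()
    return refs

def resolve(warranty, refs):
    """Return (number, text) pairs for one warranty, sorted numerically."""
    return [(n, refs.get(n, '<missing>'))
            for n in sorted(warranty['Referenced_Numbers'], key=int)]

# Sample inputs shaped like the function's two return values.
references = (
    "[1] Manufacturing Defects Policy\n"
    "[2] Wear and Tear Guidelines\n"
    "[3] Extended Coverage Terms\n"
)
warranty = {'Warranty_ID': '1.1', 'Referenced_Numbers': ['2', '1', '7']}

refs = parse_references(references)
resolved = resolve(warranty, refs)
for number, text in resolved:
    print(f"[{number}] {text}")
```

Citations without a matching entry (number 7 here) are reported as '<missing>' rather than raising, which makes dangling references easy to spot.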
Similar Components
AI-powered semantic similarity - components with related functionality:
- function extract_warranty_data (96.2% similar)
- function extract_warranty_sections (85.7% similar)
- function create_enhanced_word_document (72.4% similar)
- function main_v8 (69.2% similar)
- function main_v15 (68.5% similar)