
class QueryParser

Maturity: 46

A parser class that converts LLM-generated query response text into structured dictionaries containing various search query types, metadata, and parameters.

File: /tf/active/vicechatdev/QA_updater/core/query_parser.py
Lines: 4 - 218
Complexity: moderate

Purpose

The QueryParser class is designed to parse unstructured text responses from Large Language Models (LLMs) into structured data dictionaries. It extracts multiple types of search queries (vector search, Google search, scientific literature, clinical trials, patents, company news), metadata (key concepts, authors, journals), and configuration parameters (lookback periods, domain restrictions). The parser handles markdown-style formatting with section headers (##), bullet points (*), and numbered lists, making it ideal for processing LLM outputs that follow a specific template format. It's particularly useful in research and information retrieval systems where LLM-generated queries need to be systematically organized and executed across multiple search platforms.
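A minimal input in the template format the parser expects (section and subsection names taken from the source shown below):

## Key Concepts
* machine learning

## Vector Search Queries
1. "advanced neural network architectures"

## Timeframe
* Optimal Lookback Period: 3 months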

Source Code

class QueryParser:
    """Parses LLM-generated query strings into structured data."""

    def __init__(self):
        """Initializes the QueryParser."""
        self.logger = logging.getLogger(__name__)

    def parse_query_response(self, response_text: str) -> dict:
        """
        Parses the LLM's response to extract structured query information.

        Args:
            response_text (str): The LLM's response text.

        Returns:
            dict: A dictionary containing parsed query information.
        """
        try:
            lines = response_text.strip().split('\n')

            result = {
                "key_concepts": [],
                "vector_search_queries": [],
                "google_search_queries": [],
                "search_operators": [],
                "domain_restrictions": [],
                "scientific_literature_queries": [],
                "clinical_trial_queries": [],
                "patent_queries": [],
                "company_news_queries": [],
                "target_journals": [],
                "key_authors": [],
                "condition_terms": [],
                "intervention_terms": [],
                "technology_categories": [],
                "company_focus": [],
                "company_names": [],
                "industry_terms": [],
                "optimal_lookback_period": 90  # Default
            }

            current_section = None
            
            self.logger.info(f"Parsing query response with {len(lines)} lines")

            for line in lines:
                line = line.strip()
                
                # Skip empty lines
                if not line:
                    continue
                    
                # Check for section headers with ## prefix
                if line.startswith('## '):
                    section_name = line[3:].lower().strip()
                    self.logger.info(f"Found section header: {section_name}")
                    
                    if 'key concept' in section_name:
                        current_section = "key_concepts"
                    elif 'vector search' in section_name:
                        current_section = "vector_search_queries"
                    elif 'google search' in section_name:
                        current_section = "google_search"
                    elif 'scientific literature' in section_name:
                        current_section = "scientific_literature_queries"
                    elif 'clinical trial' in section_name:
                        current_section = "clinical_trial_queries"
                    elif 'patent' in section_name:
                        current_section = "patent_queries"
                    elif 'company' in section_name or 'news' in section_name:
                        current_section = "company_news_queries"
                    elif 'timeframe' in section_name:
                        current_section = "timeframe"
                    else:
                        current_section = None
                    continue
                
                # Process subsections with * prefix
                if line.startswith('* '):
                    # Check if this is a subsection header with a colon
                    if ':' in line:
                        subsection = line[2:line.find(':')].lower().strip()
                        value = line[line.find(':')+1:].strip()
                        
                        if current_section == "google_search":
                            if subsection.startswith('query '):
                                # Parse a Google search query and remove any quotes
                                cleaned_value = self._remove_quotes(value)
                                result["google_search_queries"].append(cleaned_value)
                            elif subsection == 'search operators':
                                # Parse search operators
                                if '"' in value:
                                    # Extract quoted parts without the quotes
                                    operators = re.findall(r'"([^"]*)"', value)
                                    result["search_operators"] = operators
                                else:
                                    # Just split by comma
                                    result["search_operators"] = [op.strip() for op in value.split(',') if op.strip()]
                            elif subsection == 'domain restrictions':
                                # Parse domain restrictions
                                result["domain_restrictions"] = [domain.strip() for domain in value.split(',') if domain.strip()]
                        elif subsection == 'target journals':
                            result["target_journals"] = [j.strip() for j in value.split(',') if j.strip()]
                        elif subsection == 'key authors':
                            if '[' in value and ']' in value:
                                # Extract content inside brackets
                                value = value[value.find('[')+1:value.find(']')]
                            result["key_authors"] = [a.strip() for a in value.split(',') if a.strip()]
                        elif subsection == 'condition terms':
                            result["condition_terms"] = [t.strip() for t in value.split(',') if t.strip()]
                        elif subsection == 'intervention terms':
                            result["intervention_terms"] = [t.strip() for t in value.split(',') if t.strip()]
                        elif subsection == 'technology categories':
                            result["technology_categories"] = [t.strip() for t in value.split(',') if t.strip()]
                        elif subsection == 'company focus':
                            result["company_focus"] = [c.strip() for c in value.split(',') if c.strip()]
                        elif subsection == 'company names':
                            result["company_names"] = [c.strip() for c in value.split(',') if c.strip()]
                        elif subsection == 'industry terms':
                            result["industry_terms"] = [t.strip() for t in value.split(',') if t.strip()]
                        elif current_section == "timeframe" and subsection == 'optimal lookback period':
                            # Try to extract a number from the value
                            matches = re.findall(r'(\d+)(?:-(\d+))?', value)
                            if matches:
                                if matches[0][1]:  # If there's a range (e.g., 6-12)
                                    # Take the average
                                    result["optimal_lookback_period"] = (int(matches[0][0]) + int(matches[0][1])) // 2
                                else:
                                    result["optimal_lookback_period"] = int(matches[0][0])
                                
                                # Convert to days if in months
                                if 'month' in value:
                                    result["optimal_lookback_period"] *= 30
                    else:
                        # This is a regular list item with * prefix
                        item_text = self._remove_quotes(line[2:].strip())
                        
                        if current_section == "key_concepts":
                            result["key_concepts"].append(item_text)
                        elif current_section == "vector_search_queries":
                            result["vector_search_queries"].append(item_text)
                        elif current_section == "google_search":
                            result["google_search_queries"].append(item_text)
                        elif current_section == "scientific_literature_queries":
                            result["scientific_literature_queries"].append(item_text)
                        elif current_section == "clinical_trial_queries":
                            result["clinical_trial_queries"].append(item_text)
                        elif current_section == "patent_queries":
                            result["patent_queries"].append(item_text)
                        elif current_section == "company_news_queries":
                            result["company_news_queries"].append(item_text)
                    continue
                
                # Process numbered list items (1., 2., etc.)
                numbered_match = re.match(r'^\d+\.\s+(.+)$', line)
                if numbered_match:
                    item_text = self._remove_quotes(numbered_match.group(1).strip())
                    
                    if current_section == "key_concepts":
                        result["key_concepts"].append(item_text)
                    elif current_section == "vector_search_queries":
                        result["vector_search_queries"].append(item_text)
                    elif current_section == "google_search":
                        result["google_search_queries"].append(item_text)
                    elif current_section == "scientific_literature_queries":
                        result["scientific_literature_queries"].append(item_text)
                    elif current_section == "clinical_trial_queries":
                        result["clinical_trial_queries"].append(item_text)
                    elif current_section == "patent_queries":
                        result["patent_queries"].append(item_text)
                    elif current_section == "company_news_queries":
                        result["company_news_queries"].append(item_text)
                    continue

            # Log parsing results
            for key, values in result.items():
                if isinstance(values, list):
                    self.logger.info(f"Parsed {len(values)} items for {key}")
                else:
                    self.logger.info(f"Parsed value for {key}: {values}")
            
            # Ensure we have at least one query for each type
            for query_type in ["vector_search_queries", "google_search_queries", "scientific_literature_queries",
                            "clinical_trial_queries", "patent_queries", "company_news_queries"]:
                if not result[query_type]:
                    # Use key concepts as fallback
                    result[query_type] = result["key_concepts"][:1] if result["key_concepts"] else [""]

            return result

        except Exception as e:
            self.logger.exception(f"Error parsing query response: {e}")
            return {}
            
    def _remove_quotes(self, text):
        """
        Remove surrounding quotes (both single and double) from text.
        
        Args:
            text (str): The text to process
            
        Returns:
            str: Text with surrounding quotes removed
        """
        text = text.strip()
        
        # Remove surrounding double quotes
        if text.startswith('"') and text.endswith('"'):
            text = text[1:-1]
            
        # Remove surrounding single quotes
        if text.startswith("'") and text.endswith("'"):
            text = text[1:-1]
            
        return text

Parameters

No constructor parameters: the __init__ method takes no arguments beyond self. It only initializes an internal logger instance for tracking parsing operations.

Return Value

Instantiation returns a QueryParser object. The main method parse_query_response returns a dictionary with keys: 'key_concepts', 'vector_search_queries', 'google_search_queries', 'search_operators', 'domain_restrictions', 'scientific_literature_queries', 'clinical_trial_queries', 'patent_queries', 'company_news_queries', 'target_journals', 'key_authors', 'condition_terms', 'intervention_terms', 'technology_categories', 'company_focus', 'company_names', 'industry_terms', and 'optimal_lookback_period'. Most values are lists of strings, except 'optimal_lookback_period' which is an integer (default 90 days). Returns an empty dictionary {} if parsing fails.

Class Interface

Methods

__init__(self) -> None

Purpose: Initializes the QueryParser instance with a logger

Returns: None - initializes the instance

parse_query_response(self, response_text: str) -> dict

Purpose: Parses LLM-generated response text into a structured dictionary containing various query types and metadata

Parameters:

  • response_text: The raw text response from an LLM, expected to contain markdown-formatted sections with headers (##), bullet points (*), and numbered lists

Returns: A dictionary with 18 keys containing parsed query information. List-based keys include: key_concepts, vector_search_queries, google_search_queries, search_operators, domain_restrictions, scientific_literature_queries, clinical_trial_queries, patent_queries, company_news_queries, target_journals, key_authors, condition_terms, intervention_terms, technology_categories, company_focus, company_names, industry_terms. Integer key: optimal_lookback_period (in days). Returns empty dict {} on error.
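A minimal call, shown as a sketch (the input string is illustrative; the expected contents are traced by hand against the source above):

parser = QueryParser()
parsed = parser.parse_query_response("## Key Concepts\n* CRISPR gene editing")

# parsed["key_concepts"] == ['CRISPR gene editing']
# Query lists the input leaves empty fall back to the first key concept:
# parsed["patent_queries"] == ['CRISPR gene editing']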

_remove_quotes(self, text: str) -> str

Purpose: Removes surrounding single or double quotes from text strings

Parameters:

  • text: The text string to process, potentially wrapped in quotes

Returns: The input text with surrounding quotes removed (both single and double quotes are handled)
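A quick illustration of the quote-stripping behavior (a sketch; _remove_quotes is a private helper normally called internally by parse_query_response):

parser = QueryParser()

parser._remove_quotes('"machine learning"')  # -> 'machine learning'
parser._remove_quotes("'neural networks'")   # -> 'neural networks'
parser._remove_quotes('no quotes here')      # -> 'no quotes here'
# Only a surrounding pair is stripped; interior quotes are preserved:
parser._remove_quotes('say "hi" twice')      # -> 'say "hi" twice'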

Attributes

  • logger (logging.Logger, instance attribute): Logger instance for tracking parsing operations, errors, and debugging information. Initialized with the module's __name__.

Dependencies

  • logging
  • re

Required Imports

import logging
import re

Usage Example

import logging

# Import path assumes QA_updater is importable as a package; adjust to your
# project layout
from QA_updater.core.query_parser import QueryParser

# Configure logging (optional but recommended)
logging.basicConfig(level=logging.INFO)

# Instantiate the parser
parser = QueryParser()

# Example LLM response text
llm_response = '''
## Key Concepts
* machine learning
* neural networks
* deep learning

## Vector Search Queries
1. "advanced neural network architectures"
2. "machine learning optimization techniques"

## Google Search Queries
* Query 1: machine learning AND neural networks
* Search Operators: "exact phrase", site:arxiv.org
* Domain Restrictions: arxiv.org, ieee.org

## Scientific Literature Queries
* Target Journals: Nature, Science
* Key Authors: [Hinton, LeCun, Bengio]

## Timeframe
* Optimal Lookback Period: 6-12 months
'''

# Parse the response
result = parser.parse_query_response(llm_response)

# Access parsed data
print(f"Key concepts: {result['key_concepts']}")
print(f"Vector queries: {result['vector_search_queries']}")
print(f"Google queries: {result['google_search_queries']}")
print(f"Search operators: {result['search_operators']}")
print(f"Domain restrictions: {result['domain_restrictions']}")
print(f"Target journals: {result['target_journals']}")
print(f"Key authors: {result['key_authors']}")
print(f"Lookback period (days): {result['optimal_lookback_period']}")

Best Practices

  • A single QueryParser instance can be reused across calls; the class is stateless apart from its logger
  • The parser expects LLM responses to follow a specific markdown format with ## section headers and * or numbered list items
  • Check whether the returned dictionary is empty to detect parsing failures (see the sketch after this list)
  • The parser automatically provides fallback queries using key_concepts if specific query types are empty
  • Quote removal is automatic for query strings, so both quoted and unquoted inputs are handled
  • The optimal_lookback_period is automatically converted to days if specified in months
  • For range values (e.g., '6-12 months'), the parser calculates the average
  • Enable logging at INFO level or higher to track parsing progress and debug issues
  • The parser skips unrecognized sections rather than raising errors; an unexpected exception aborts parsing and returns an empty dict
  • All list-based fields default to empty lists, and optimal_lookback_period defaults to 90 days
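A defensive pattern combining the checks above (a sketch; the fallback behavior is the caller's choice):

result = parser.parse_query_response(llm_response)

if not result:
    # parse_query_response returns {} when an exception occurred during parsing
    logging.warning("Query parsing failed; falling back to defaults")
    result = {"key_concepts": [], "optimal_lookback_period": 90}  # hypothetical fallback

for query in result.get("google_search_queries", []):
    print(f"Would run Google search: {query}")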

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class QueryBasedExtractor_v1 63.7% similar

    A class that performs targeted information extraction from text using LLM-based query-guided extraction, with support for handling long documents through chunking and token management.

    From: /tf/active/vicechatdev/vice_ai/hybrid_rag_engine.py
  • class QueryBasedExtractor 62.7% similar

    A class that extracts relevant information from documents using a small LLM (Language Model), designed for Extensive and Full Reading modes in RAG systems.

    From: /tf/active/vicechatdev/docchat/rag_engine.py
  • class QueryBasedExtractor_v2 61.8% similar

    A class that performs targeted information extraction from text using LLM-based query-guided extraction, with support for handling long documents through chunking and token management.

    From: /tf/active/vicechatdev/OneCo_hybrid_RAG.py
  • class LLMClient_v1 52.8% similar

    A client class for interacting with Large Language Models (LLMs), specifically designed to work with OpenAI's chat completion API.

    From: /tf/active/vicechatdev/QA_updater/core/llm_client.py
  • class ContractDataExtractor 49.9% similar

    Extract structured data from legal contracts using LLM analysis

    From: /tf/active/vicechatdev/contract_validity_analyzer/extractor.py