🔍 Code Extractor

class QueryBasedExtractor

Maturity: 47

A class that extracts relevant information from documents using a small LLM (default: gpt-4o-mini), designed for the Extensive and Full Reading modes of RAG systems.

File: /tf/active/vicechatdev/docchat/rag_engine.py
Lines: 92-327
Complexity: moderate

Purpose

QueryBasedExtractor performs targeted information extraction from single or multiple documents based on user queries. It uses a small LLM (default: gpt-4o-mini) to extract only the most relevant passages without summarization, maintaining original wording. The class handles token counting, text chunking, document truncation, and consolidation of extractions from multiple sources. It integrates with a parent RAG engine for usage tracking and ensures extracted content fits within token limits.

Source Code

class QueryBasedExtractor:
    """
    Extract relevant information from documents using a small LLM
    Used in Extensive and Full Reading modes
    """
    
    def __init__(self, api_key: str, model_name: str = "gpt-4o-mini", 
                 max_output_tokens: int = 2048, parent_rag=None):
        """
        Initialize extractor
        
        Args:
            api_key: OpenAI API key
            model_name: Small LLM model for extraction
            max_output_tokens: Maximum tokens in extracted output
            parent_rag: Reference to parent RAG engine for usage tracking
        """
        self.api_key = api_key
        self.model_name = model_name
        self.max_output_tokens = max_output_tokens
        self.tokenizer = tiktoken.get_encoding("cl100k_base")
        self.parent_rag = parent_rag  # For usage tracking
        
        # Use shared factory to ensure GPT-5 compatibility for small LLM as well
        self.llm = get_llm_instance(
            model_name=model_name,
            temperature=0,
            max_tokens=max_output_tokens
        )
    
    def count_tokens(self, text: str) -> int:
        """Count tokens in text"""
        return len(self.tokenizer.encode(text))
    
    def chunk_text(self, text: str, max_chunk_tokens: int) -> List[str]:
        """
        Split text into chunks that fit within token limit
        
        Args:
            text: Text to chunk
            max_chunk_tokens: Maximum tokens per chunk
            
        Returns:
            List of text chunks
        """
        # Try to split on paragraphs first
        paragraphs = text.split('\n\n')
        
        chunks = []
        current_chunk = []
        current_tokens = 0
        
        for para in paragraphs:
            para_tokens = self.count_tokens(para)
            
            # If single paragraph is too large, split it further
            if para_tokens > max_chunk_tokens:
                # If we have accumulated content, save it
                if current_chunk:
                    chunks.append('\n\n'.join(current_chunk))
                    current_chunk = []
                    current_tokens = 0
                
                # Split large paragraph by sentences
                sentences = para.split('. ')
                for sentence in sentences:
                    sent_tokens = self.count_tokens(sentence)
                    if current_tokens + sent_tokens > max_chunk_tokens:
                        if current_chunk:
                            chunks.append('\n\n'.join(current_chunk))
                        current_chunk = [sentence]
                        current_tokens = sent_tokens
                    else:
                        current_chunk.append(sentence)
                        current_tokens += sent_tokens
            else:
                # Normal paragraph
                if current_tokens + para_tokens > max_chunk_tokens:
                    # Save current chunk and start new one
                    if current_chunk:
                        chunks.append('\n\n'.join(current_chunk))
                    current_chunk = [para]
                    current_tokens = para_tokens
                else:
                    current_chunk.append(para)
                    current_tokens += para_tokens
        
        # Add remaining content
        if current_chunk:
            chunks.append('\n\n'.join(current_chunk))
        
        return chunks
    
    def extract_from_document(self, document_text: str, query: str) -> str:
        """
        Extract relevant information from a document based on query
        
        Args:
            document_text: Full document text
            query: User query to guide extraction
            
        Returns:
            Extracted relevant information
        """
        # Track usage if parent RAG is available
        if self.parent_rag:
            self.parent_rag.small_llm_usage["document_extraction"] += 1
            self.parent_rag.small_llm_usage["total_calls"] += 1
        
        # If document is already small enough, return as-is
        if self.count_tokens(document_text) <= self.max_output_tokens:
            return document_text
        
        # Calculate token budget for document text in the extraction prompt
        # Model limit for the extraction LLM (using small LLM)
        model_limit = 128000  # GPT-4o-mini limit
        reserved_for_output = self.max_output_tokens  # defaults to 2048
        safety_margin = 1000
        
        # Base prompt template tokens (without document text)
        base_prompt = f"""You are performing targeted information extraction. Extract ONLY the most relevant 
information from the provided document that directly addresses the user's query.

IMPORTANT INSTRUCTIONS:
- DO NOT summarize or paraphrase - extract exact relevant passages
- Maintain original wording and details crucial for answering the query
- Include complete sentences and necessary context
- Extract passages in order of relevance
- If the document doesn't contain relevant information, state that clearly
- Maximum extraction length: {self.max_output_tokens} tokens

USER QUERY:
{query}

DOCUMENT TEXT:
[DOCUMENT_PLACEHOLDER]

EXTRACTED RELEVANT INFORMATION:
"""
        base_tokens = self.count_tokens(base_prompt)
        
        # Available tokens for document text
        max_doc_tokens = model_limit - reserved_for_output - safety_margin - base_tokens
        
        # Truncate document if needed
        doc_tokens = self.count_tokens(document_text)
        if doc_tokens > max_doc_tokens:
            logger.warning(f"Document too large ({doc_tokens} tokens), truncating to {max_doc_tokens} tokens for extraction")
            # Truncate using tokenizer
            tokens = self.tokenizer.encode(document_text)
            truncated_tokens = tokens[:max_doc_tokens]
            document_text = self.tokenizer.decode(truncated_tokens)
            document_text += "\n\n[... document truncated due to length ...]"
        
        # Create extraction prompt with size-controlled document
        prompt = f"""You are performing targeted information extraction. Extract ONLY the most relevant 
information from the provided document that directly addresses the user's query.

IMPORTANT INSTRUCTIONS:
- DO NOT summarize or paraphrase - extract exact relevant passages
- Maintain original wording and details crucial for answering the query
- Include complete sentences and necessary context
- Extract passages in order of relevance
- If the document doesn't contain relevant information, state that clearly
- Maximum extraction length: {self.max_output_tokens} tokens

USER QUERY:
{query}

DOCUMENT TEXT:
{document_text}

EXTRACTED RELEVANT INFORMATION:
"""
        
        response = self.llm.invoke(prompt)
        return response.content.strip()
    
    def extract_from_multiple_documents(self, documents: List[Dict[str, str]], 
                                       query: str) -> str:
        """
        Extract and consolidate information from multiple documents
        
        Args:
            documents: List of documents with 'text' and optionally 'file_name'
            query: User query
            
        Returns:
            Consolidated extracted information
        """
        extractions = []
        
        for i, doc in enumerate(documents):
            logger.info(f"Extracting from document {i+1}/{len(documents)}: {doc.get('file_name', 'Unknown')}")
            
            extracted = self.extract_from_document(doc['text'], query)
            
            if extracted and extracted.strip():
                # Add document attribution
                file_name = doc.get('file_name', f'Document {i+1}')
                extractions.append(f"### From: {file_name}\n\n{extracted}")
        
        # Combine all extractions
        combined = "\n\n---\n\n".join(extractions)
        
        # If still too long, do consolidation pass
        if self.count_tokens(combined) > self.max_output_tokens * 2:
            return self._consolidate_extractions(combined, query)
        
        return combined
    
    def _consolidate_extractions(self, combined_text: str, query: str) -> str:
        """Consolidate multiple extractions into a coherent summary"""
        prompt = f"""You are consolidating information extracted from multiple documents.
Create a coherent synthesis that includes the most important and relevant information
to answer the user's query, avoiding redundancy.

INSTRUCTIONS:
- Focus on the most relevant information
- Maintain original wording for key facts and details
- Remove redundant information
- Organize information logically
- Keep document attributions when significant
- Maximum length: {self.max_output_tokens * 2} tokens

USER QUERY:
{query}

EXTRACTED INFORMATION FROM MULTIPLE DOCUMENTS:
{combined_text}

CONSOLIDATED INFORMATION:
"""
        
        response = self.llm.invoke(prompt)
        return response.content.strip()

Parameters

Name              Type Default       Kind
api_key           str  required      positional-or-keyword
model_name        str  'gpt-4o-mini' positional-or-keyword
max_output_tokens int  2048          positional-or-keyword
parent_rag        -    None          positional-or-keyword

Parameter Details

api_key: OpenAI API key required for authenticating with the LLM service. Must be a valid API key string.

model_name: Name of the small LLM model to use for extraction. Defaults to 'gpt-4o-mini'. Should be a model identifier compatible with the llm_factory.get_llm_instance function.

max_output_tokens: Maximum number of tokens allowed in the extracted output. Defaults to 2048. Controls the size of extracted information to prevent exceeding model limits.

parent_rag: Optional reference to the parent RAG engine object. Used for tracking small LLM usage statistics. Can be None if usage tracking is not needed.
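
The only contract the source imposes on parent_rag is a small_llm_usage dict with "document_extraction" and "total_calls" counters, which extract_from_document() increments. A minimal stand-in (the class name is hypothetical; a real RAG engine would carry more state):

class UsageTrackingRAG:
    def __init__(self):
        # extract_from_document() increments both counters on every call
        self.small_llm_usage = {
            "document_extraction": 0,
            "total_calls": 0,
        }

tracker = UsageTrackingRAG()
extractor = QueryBasedExtractor(api_key='your-openai-api-key', parent_rag=tracker)
# ... after a few extractions:
print(tracker.small_llm_usage)  # e.g. {'document_extraction': 3, 'total_calls': 3}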

Return Value

Instantiation returns a QueryBasedExtractor configured with the specified LLM and token limits. Key method return values: extract_from_document() returns a string of extracted relevant passages; extract_from_multiple_documents() returns a consolidated string combining extractions from all documents; count_tokens() returns an integer token count; chunk_text() returns a list of text chunk strings.

Class Interface

Methods

__init__(self, api_key: str, model_name: str = 'gpt-4o-mini', max_output_tokens: int = 2048, parent_rag=None)

Purpose: Initialize the QueryBasedExtractor with LLM configuration and token limits

Parameters:

  • api_key: OpenAI API key for authentication
  • model_name: LLM model identifier (default: 'gpt-4o-mini')
  • max_output_tokens: Maximum tokens in extracted output (default: 2048)
  • parent_rag: Optional parent RAG engine reference for usage tracking

Returns: None (constructor)

count_tokens(self, text: str) -> int

Purpose: Count the number of tokens in a given text string using tiktoken encoding

Parameters:

  • text: Text string to count tokens for

Returns: Integer count of tokens in the text

chunk_text(self, text: str, max_chunk_tokens: int) -> List[str]

Purpose: Split text into chunks that fit within a specified token limit, preserving paragraph and sentence boundaries

Parameters:

  • text: Text to be chunked
  • max_chunk_tokens: Maximum number of tokens allowed per chunk

Returns: List of text chunk strings, each within the token limit
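
A quick illustration of the chunking strategy, paragraph-first with a sentence-level fallback; the texts and the token budget are invented for the demo:

extractor = QueryBasedExtractor(api_key='your-openai-api-key')

text = (
    'First paragraph, short and self-contained.\n\n'
    'Second paragraph, also short.\n\n'
    'A long third paragraph. ' + 'Filler sentence for padding. ' * 40
)

# The two short paragraphs are packed into one chunk; the oversized third
# paragraph is split sentence by sentence across further chunks.
chunks = extractor.chunk_text(text, max_chunk_tokens=60)
for i, chunk in enumerate(chunks, 1):
    print(f'chunk {i}: {extractor.count_tokens(chunk)} tokens')

Note that the '\n\n' separators added when a chunk is joined are not included in the running token count, so a chunk can land slightly over the stated budget.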

extract_from_document(self, document_text: str, query: str) -> str

Purpose: Extract relevant information from a single document based on a user query, maintaining original wording

Parameters:

  • document_text: Full text content of the document to extract from
  • query: User query that guides what information to extract

Returns: String containing extracted relevant passages from the document
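
The document budget computed inside this method is plain arithmetic over the constants in the source: the 128,000-token model limit, the reserved output budget, a 1,000-token safety margin, and the measured size of the prompt template. With the default max_output_tokens and an illustrative template size:

model_limit = 128_000        # GPT-4o-mini context limit (from the source)
reserved_for_output = 2_048  # default max_output_tokens
safety_margin = 1_000
base_tokens = 150            # illustrative: prompt template plus query

max_doc_tokens = model_limit - reserved_for_output - safety_margin - base_tokens
print(max_doc_tokens)  # 124802 tokens available for the document text

Anything beyond max_doc_tokens is cut with the tokenizer and flagged with a '[... document truncated due to length ...]' marker.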

extract_from_multiple_documents(self, documents: List[Dict[str, str]], query: str) -> str

Purpose: Extract and consolidate information from multiple documents, adding document attribution

Parameters:

  • documents: List of document dictionaries, each containing 'text' and optionally 'file_name' keys
  • query: User query to guide extraction across all documents

Returns: Consolidated string combining extractions from all documents with source attribution
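
Per the source, each per-document extraction is headed by '### From: <file_name>' and sections are joined with '---' separators, so downstream code can split the result back apart. A small sketch, with documents and query as in the Usage Example below:

result = extractor.extract_from_multiple_documents(documents, query)

# Split on the separator the method uses when combining extractions
for section in result.split('\n\n---\n\n'):
    header, _, body = section.partition('\n\n')
    print(header)  # e.g. '### From: report1.pdf'

One caveat: if the combined text exceeds 2x max_output_tokens, a consolidation pass rewrites it with the LLM and the header/separator layout is not guaranteed to survive.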

_consolidate_extractions(self, combined_text: str, query: str) -> str

Purpose: Private method to consolidate multiple extractions into a coherent synthesis when combined text is too long

Parameters:

  • combined_text: Combined text from multiple document extractions
  • query: Original user query for context

Returns: Consolidated and deduplicated string of extracted information

Attributes

  • api_key (str, instance): OpenAI API key used for LLM authentication
  • model_name (str, instance): Name of the LLM model used for extraction (e.g., 'gpt-4o-mini')
  • max_output_tokens (int, instance): Maximum number of tokens allowed in extracted output
  • tokenizer (tiktoken.Encoding, instance): Tokenizer using the cl100k_base encoding for token counting
  • parent_rag (Optional[object], instance): Reference to the parent RAG engine for tracking small-LLM usage statistics
  • llm (LLM, instance): LLM created via the get_llm_instance factory, configured with temperature=0 and max_tokens=max_output_tokens

Dependencies

  • tiktoken
  • langchain_community
  • logging

Required Imports

import tiktoken
from typing import List, Dict
import logging
from llm_factory import get_llm_instance  # shared factory ensuring GPT-5 compatibility

logger = logging.getLogger(__name__)  # module-level logger referenced by the class

Usage Example

# Initialize the extractor
api_key = 'your-openai-api-key'
extractor = QueryBasedExtractor(api_key=api_key, model_name='gpt-4o-mini', max_output_tokens=2048)

# Extract from a single document
document_text = 'Long document content here...'
query = 'What are the main findings about climate change?'
extracted_info = extractor.extract_from_document(document_text, query)
print(extracted_info)

# Extract from multiple documents
documents = [
    {'text': 'Document 1 content...', 'file_name': 'report1.pdf'},
    {'text': 'Document 2 content...', 'file_name': 'report2.pdf'}
]
consolidated = extractor.extract_from_multiple_documents(documents, query)
print(consolidated)

# Count tokens in text
token_count = extractor.count_tokens('Some text to count')
print(f'Token count: {token_count}')

# Chunk large text
chunks = extractor.chunk_text('Very long text...', max_chunk_tokens=1000)
for i, chunk in enumerate(chunks):
    print(f'Chunk {i+1}: {chunk[:100]}...')

Best Practices

  • Always provide a valid OpenAI API key during instantiation
  • Set max_output_tokens based on your downstream processing needs and model limits (see the sketch after this list)
  • Use parent_rag parameter if you need to track LLM usage statistics across your application
  • The class automatically handles document truncation when content exceeds token limits
  • For multiple documents, the class adds document attribution (file names) to help track information sources
  • The extractor prioritizes exact passage extraction over summarization to maintain accuracy
  • Token counting uses tiktoken's cl100k_base encoding, which is compatible with GPT-4 and GPT-3.5 models
  • When processing multiple documents, if combined extractions exceed 2x max_output_tokens, automatic consolidation occurs
  • The class uses temperature=0 for deterministic extraction results
  • Document chunking preserves paragraph boundaries when possible, falling back to sentence-level splitting for large paragraphs
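
One way to act on the max_output_tokens guidance above is to check an extraction against the downstream prompt budget with the extractor's own tokenizer before using it; the budget figure here is an assumption for the example:

extracted = extractor.extract_from_document(document_text, query)

answer_prompt_budget = 6_000  # hypothetical budget for the final answer prompt
if extractor.count_tokens(extracted) > answer_prompt_budget:
    # Re-extract with a tighter limit rather than truncating mid-sentence
    tighter = QueryBasedExtractor(
        api_key='your-openai-api-key',
        max_output_tokens=answer_prompt_budget // 2,
    )
    extracted = tighter.extract_from_document(document_text, query)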

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class QueryBasedExtractor_v2 90.7% similar

    A class that performs targeted information extraction from text using LLM-based query-guided extraction, with support for handling long documents through chunking and token management.

    From: /tf/active/vicechatdev/OneCo_hybrid_RAG.py
  • class QueryBasedExtractor_v1 90.2% similar

    A class that performs targeted information extraction from text using LLM-based query-guided extraction, with support for handling long documents through chunking and token management.

    From: /tf/active/vicechatdev/vice_ai/hybrid_rag_engine.py
  • class RegulatoryExtractor 64.6% similar

    A class for extracting structured metadata from regulatory guideline PDF documents using LLM-based analysis and storing the results in an Excel tracking spreadsheet.

    From: /tf/active/vicechatdev/reg_extractor.py
  • class QueryParser 62.7% similar

    A parser class that converts LLM-generated query response text into structured dictionaries containing various search query types, metadata, and parameters.

    From: /tf/active/vicechatdev/QA_updater/core/query_parser.py
  • class DocumentExtractor 58.1% similar

    A document text extraction class that supports multiple file formats including Word, PowerPoint, PDF, and plain text files, with automatic format detection and conversion capabilities.

    From: /tf/active/vicechatdev/leexi/document_extractor.py