class QueryBasedExtractor_v2
A class that performs targeted information extraction from text using LLM-based query-guided extraction, with support for handling long documents through chunking and token management.
Source: /tf/active/vicechatdev/OneCo_hybrid_RAG.py
Lines: 76 - 287
Complexity: complex
Purpose
QueryBasedExtractor is designed to extract relevant information from text documents based on user-provided queries. It uses a small LLM (default: gpt-4o-mini) to perform intelligent extraction that preserves original wording while focusing only on query-relevant content. The class handles token counting, automatic chunking for long texts, and multi-pass extraction to ensure output stays within token limits. It's particularly useful for reducing large documents to their most relevant portions before further processing or analysis.
Source Code
class QueryBasedExtractor:
    def __init__(self, max_output_tokens=1024, api_key=None, model_name="gpt-4o-mini"):
        """
        Initialize the extractor with configuration for a small LLM.

        Args:
            max_output_tokens: Maximum tokens for the extracted output
            api_key: API key for the LLM service
            model_name: Small LLM model to use
        """
        self.max_output_tokens = max_output_tokens
        self.api_key = api_key
        self.model_name = model_name

        # Set up tiktoken encoder for token counting
        import tiktoken
        self.tokenizer = tiktoken.get_encoding("cl100k_base")

        # Set up OpenAI client if API key is provided
        if api_key:
            import openai
            import os
            os.environ["OPENAI_API_KEY"] = api_key
            self.client = openai.OpenAI(api_key=api_key)

    def count_tokens(self, text):
        """Count tokens in a string."""
        return len(self.tokenizer.encode(text))

    def call_llm(self, prompt):
        """
        Call the LLM with the prompt.

        Args:
            prompt: The formatted prompt for extraction

        Returns:
            Extracted text from the LLM
        """
        from langchain_openai import ChatOpenAI

        # Use LangChain's ChatOpenAI for consistency with OneCo_hybrid_RAG
        llm = ChatOpenAI(
            model=self.model_name,
            temperature=0,
            max_tokens=self.max_output_tokens
        )
        response = llm.invoke(prompt)
        return response.content

    def create_extraction_prompt(self, queries, text):
        """
        Create a prompt for targeted information extraction based on queries.

        Args:
            queries: List of queries to guide the extraction
            text: Text to extract from

        Returns:
            Formatted prompt string
        """
        formatted_queries = "\n".join([f"- {q}" for q in queries])

        # Design an extraction-focused prompt based on OneCo_hybrid_RAG style
        prompt = f"""
You are performing targeted information extraction. Given the queries below, extract ONLY the most relevant
passages from the provided text that directly address these queries.

IMPORTANT INSTRUCTIONS:
- DO NOT summarize or paraphrase - extract the exact relevant passages
- Maintain original wording and details crucial for answering the queries
- Include complete sentences and necessary context around key points
- Extract passages in order of relevance to the queries
- If important details are in different parts of the text, include all relevant sections
- Extract ONLY information relevant to the queries
- The extraction MUST be self-contained and make sense on its own
- Maximum output length: {self.max_output_tokens} tokens

QUERIES:
{formatted_queries}

TEXT TO EXTRACT FROM:
{text}

RELEVANT EXTRACTED INFORMATION:
"""
        return prompt

    def extract(self, text, queries):
        """
        Extract relevant information from text based on queries.

        Args:
            text: Text to extract from
            queries: List of queries to guide extraction

        Returns:
            Extracted relevant information
        """
        # Check text length to determine if extraction is needed
        text_tokens = self.count_tokens(text)

        # If text is already under token limit, just return it
        if text_tokens <= self.max_output_tokens:
            print("Text is within token limit, no extraction needed.")
            return text

        # Create extraction prompt
        prompt = self.create_extraction_prompt(queries, text)

        # Check prompt size to ensure it fits in model context
        prompt_tokens = self.count_tokens(prompt)

        # For very large texts that won't fit in model context, we need chunking
        if prompt_tokens > 100000:  # Assuming context limit of a small model
            return self.process_long_text(text, queries)

        # Otherwise, do direct extraction
        print("Extracting information from text...")
        return self.call_llm(prompt)

    def process_long_text(self, text, queries):
        """
        Process very long text by splitting into chunks and extracting from each.

        Args:
            text: Long text to process
            queries: List of queries for extraction

        Returns:
            Combined extraction from all chunks
        """
        # Calculate how much space we need for queries and prompt template
        query_text = "\n".join([f"- {q}" for q in queries])
        prompt_template = self.create_extraction_prompt([], "")
        fixed_tokens = self.count_tokens(prompt_template) + self.count_tokens(query_text)

        # Calculate available space for text in each chunk.
        # 100,000 is a conservative estimate of the context window for a small
        # model like gpt-4o-mini (128k context); adjust for the actual model used.
        available_tokens = 100000 - fixed_tokens - 100  # 100 token buffer

        # Split text into chunks that fit in context window
        text_tokens = self.tokenizer.encode(text)
        chunks = []
        for i in range(0, len(text_tokens), available_tokens):
            chunk_tokens = text_tokens[i:i+available_tokens]
            chunk_text = self.tokenizer.decode(chunk_tokens)
            chunks.append(chunk_text)

        # Process each chunk and collect extractions
        all_extractions = []
        for i, chunk in enumerate(chunks):
            print(f"Processing chunk {i+1}/{len(chunks)}")

            # Create an extraction prompt for this chunk
            chunk_prompt = f"""
You are performing targeted information extraction. Extract ONLY the most relevant
passages from this text chunk that directly address the queries below.

IMPORTANT CONTEXT:
- This is chunk {i+1} of {len(chunks)} from a larger document
- Extract only information relevant to the queries
- DO NOT summarize - extract exact relevant passages
- Maintain original wording and crucial details
- Maximum extraction length: {self.max_output_tokens // len(chunks)} tokens

QUERIES:
{query_text}

TEXT CHUNK {i+1}/{len(chunks)}:
{chunk}

RELEVANT EXTRACTED INFORMATION:
"""
            extracted = self.call_llm(chunk_prompt)
            if extracted.strip():
                all_extractions.append(extracted.strip())

        # Combine all extractions
        combined = "\n\n".join(all_extractions)

        # If combined extractions are still too long, do a second pass
        if self.count_tokens(combined) > self.max_output_tokens:
            consolidation_prompt = f"""
You are performing final extraction consolidation. You have extracts from different parts
of a document that address the queries below.

Your task is to create a single coherent extract that includes ONLY the most important and
relevant passages to answer the queries, while avoiding redundancy.

IMPORTANT INSTRUCTIONS:
- Focus only on the most relevant information for the queries
- Maintain original wording from the extracts
- Remove redundant information that appears in multiple extracts
- Create a coherent, self-contained extract
- Maximum output length: {self.max_output_tokens} tokens

QUERIES:
{query_text}

EXTRACTS TO CONSOLIDATE:
{combined}

FINAL CONSOLIDATED EXTRACT:
"""
            return self.call_llm(consolidation_prompt)

        return combined
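A note on the budget arithmetic above: the per-chunk cap max_output_tokens // len(chunks) shrinks as the chunk count grows and bottoms out at zero once there are more chunks than output tokens. A minimal defensive sketch, using a hypothetical per_chunk_budget helper that is not part of the class:

# Hypothetical guard (not part of the class above): keep the per-chunk
# extraction budget from collapsing to zero on very long documents.
def per_chunk_budget(max_output_tokens: int, num_chunks: int, floor: int = 128) -> int:
    """Proportional share of the output budget, never below `floor` tokens."""
    return max(floor, max_output_tokens // max(1, num_chunks))

# With the class default of 1024 output tokens, 20 chunks would get only
# 51 tokens each; the floor keeps every chunk at 128.
assert per_chunk_budget(1024, 20) == 128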
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| max_output_tokens | int | 1024 | keyword |
| api_key | str or None | None | keyword |
| model_name | str | 'gpt-4o-mini' | keyword |
Parameter Details
max_output_tokens: Maximum number of tokens allowed in the extracted output. Default is 1024. This controls the size of the final extraction and is used to determine if extraction is needed at all (texts shorter than this are returned as-is).
api_key: OpenAI API key for authentication. If provided, sets up the OpenAI client and stores the key in environment variables. Can be None if the key is already set in the environment.
model_name: Name of the OpenAI model to use for extraction. Default is 'gpt-4o-mini'. Should be a model that supports chat completions and has sufficient context window for the extraction tasks.
Return Value
The class constructor returns a QueryBasedExtractor instance. The main extract() method returns a string containing the extracted relevant information from the input text, reduced to focus on query-relevant content and constrained by max_output_tokens. If the input text is already within the token limit, it returns the original text unchanged.
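The pass-through behavior for short inputs is easy to verify; a minimal check, assuming an extractor instance constructed as in the Usage Example below:

short_text = "The study used a randomized controlled design."
# Under max_output_tokens, so extract() returns the input unchanged.
result = extractor.extract(short_text, ["What methodology was used?"])
assert result == short_text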
Class Interface
Methods
__init__(self, max_output_tokens=1024, api_key=None, model_name='gpt-4o-mini')
Purpose: Initialize the QueryBasedExtractor with configuration for token limits, API authentication, and model selection. Sets up the tiktoken encoder and, when an API key is provided, the OpenAI client.
Parameters:
max_output_tokens: Maximum tokens for extracted output (default: 1024)
api_key: OpenAI API key for authentication (default: None)
model_name: LLM model name to use (default: 'gpt-4o-mini')
Returns: None - constructor initializes instance attributes
count_tokens(self, text: str) -> int
Purpose: Count the number of tokens in a given text string using tiktoken's cl100k_base encoding.
Parameters:
text: String to count tokens for
Returns: Integer count of tokens in the text
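Since count_tokens is a thin wrapper over tiktoken, the same count can be reproduced standalone; a minimal sketch:

import tiktoken

# cl100k_base is the encoding count_tokens uses internally.
enc = tiktoken.get_encoding("cl100k_base")
n = len(enc.encode("Query-based extraction keeps original wording."))
print(n)  # approximate for newer models; see Best Practices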
call_llm(self, prompt: str) -> str
Purpose: Call the configured LLM with a prompt and return the extracted text. Uses LangChain's ChatOpenAI for consistency.
Parameters:
prompt: Formatted prompt string for extraction
Returns: String containing the LLM's response content
create_extraction_prompt(self, queries: list, text: str) -> str
Purpose: Create a formatted prompt for targeted information extraction based on provided queries and text. Includes detailed instructions for the LLM.
Parameters:
queries: List of query strings to guide the extraction
text: Text content to extract information from
Returns: Formatted prompt string ready for LLM invocation
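Because this method is pure string formatting, the generated prompt can be inspected directly; a quick sketch, again assuming the extractor from the Usage Example:

prompt = extractor.create_extraction_prompt(
    queries=["What methodology was used?"],
    text="(document text here)"
)
# The queries render as a bulleted list under the QUERIES heading, and the
# max_output_tokens cap is interpolated into the instructions.
print(prompt[:400])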
extract(self, text: str, queries: list) -> str
Purpose: Main extraction method that extracts relevant information from text based on queries. Automatically handles short texts, normal extraction, and long text chunking.
Parameters:
text: Text document to extract information from
queries: List of query strings to guide what information to extract
Returns: String containing extracted relevant information, constrained by max_output_tokens
process_long_text(self, text: str, queries: list) -> str
Purpose: Process very long texts that exceed model context limits by splitting into chunks, extracting from each chunk, and consolidating results.
Parameters:
text: Long text document that exceeds normal context limits
queries: List of query strings for extraction guidance
Returns: String containing combined and consolidated extractions from all chunks
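The chunk count follows from the budget arithmetic inside the method; a worked example with illustrative numbers (actual prompt and query overhead depends on your queries):

# Illustrative numbers only: a 250,000-token document with ~300 tokens of
# prompt-template and query overhead.
fixed_tokens = 300
available_tokens = 100000 - fixed_tokens - 100  # 99,600 text tokens per chunk
document_tokens = 250000

num_chunks = -(-document_tokens // available_tokens)  # ceiling division -> 3
per_chunk_cap = 1024 // num_chunks                    # 341 tokens per chunk
print(num_chunks, per_chunk_cap)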
Attributes
| Name | Type | Description | Scope |
|---|---|---|---|
| max_output_tokens | int | Maximum number of tokens allowed in the extracted output | instance |
| api_key | str or None | OpenAI API key for authentication | instance |
| model_name | str | Name of the OpenAI model to use for extraction | instance |
| tokenizer | tiktoken.Encoding | Tiktoken encoder instance using cl100k_base encoding for token counting | instance |
| client | openai.OpenAI | OpenAI client instance, only created if api_key is provided | instance |
Dependencies
tiktoken, openai, langchain_openai, os
Required Imports
import tiktoken
import openai
import os
from langchain_openai import ChatOpenAI
Conditional/Optional Imports
These imports are only needed under specific conditions:
import tiktoken
Condition: imported in __init__ when the class is instantiated
Required (conditional)

import openai
Condition: imported in __init__ only if api_key is provided
Optional

import os
Condition: imported in __init__ only if api_key is provided to set environment variable
Optional

from langchain_openai import ChatOpenAI
Condition: imported in call_llm method when LLM is invoked
Required (conditional)

Usage Example
# Basic usage
from query_based_extractor import QueryBasedExtractor
# Initialize the extractor
extractor = QueryBasedExtractor(
    max_output_tokens=1024,
    api_key='your-openai-api-key',
    model_name='gpt-4o-mini'
)
# Define queries to guide extraction
queries = [
    'What are the main findings of the study?',
    'What methodology was used?',
    'What are the limitations mentioned?'
]
# Extract relevant information from a long document
long_text = """[Your long document text here]..."""
extracted_info = extractor.extract(long_text, queries)
print(f"Original tokens: {extractor.count_tokens(long_text)}")
print(f"Extracted tokens: {extractor.count_tokens(extracted_info)}")
print(f"\nExtracted content:\n{extracted_info}")
# For very long documents, the class automatically handles chunking
very_long_text = """[Very long document that exceeds context window]..."""
extracted = extractor.extract(very_long_text, queries)
Best Practices
- Always provide an API key either through the constructor or as an environment variable before calling extract() (a sketch of the environment-variable route follows this list)
- The class automatically determines if extraction is needed based on token count - texts under max_output_tokens are returned unchanged
- For very long documents (>100k tokens), the class automatically chunks the text and performs multi-pass extraction
- Queries should be specific and focused to get the best extraction results
- The extractor preserves original wording rather than summarizing, making it suitable for maintaining factual accuracy
- Token counting uses tiktoken's cl100k_base encoding, which may not exactly match the model's tokenizer but provides a good approximation
- The class uses temperature=0 for deterministic extraction results
- When processing long texts, each chunk gets a proportional token allocation (max_output_tokens / num_chunks)
- A second consolidation pass is performed if combined chunk extractions exceed max_output_tokens
- The 100,000 token threshold for chunking is conservative and may need adjustment based on the specific model's context window
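As the first practice above notes, the key can come from the environment instead of the constructor; a minimal sketch, reusing long_text and queries from the Usage Example:

import os

# Set before instantiating; both langchain_openai and the openai SDK read
# OPENAI_API_KEY from the environment.
os.environ["OPENAI_API_KEY"] = "your-openai-api-key"

extractor = QueryBasedExtractor(max_output_tokens=1024)  # api_key=None is fine
extracted = extractor.extract(long_text, queries)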
Similar Components
AI-powered semantic similarity - components with related functionality:
- class QueryBasedExtractor_v1 (97.9% similar)
- class QueryBasedExtractor (90.7% similar)
- class DocumentExtractor (61.1% similar)
- class RegulatoryExtractor (60.0% similar)
- class PDFTextExtractor (55.9% similar)