function get_document_info
Retrieves indexing status and metadata for a document, including whether it's indexed, its document ID, chunk count, and reindexing status.
/tf/active/vicechatdev/docchat/app.py
479 - 514
moderate
Purpose
This function checks if a document has been indexed in a vector database (ChromaDB) and retrieves associated metadata. It's used to determine document processing status, track chunks stored in the database, and identify if a document needs reindexing. This is typically part of a document management or RAG (Retrieval-Augmented Generation) system where documents are chunked and stored for semantic search.
Source Code
def get_document_info(file_path):
    """Get indexing status and metadata for a document"""
    try:
        # Check if document is indexed using the indexer's method
        indexed_info = document_indexer.is_document_indexed(Path(file_path))
        if not indexed_info:
            return {
                'indexed': False,
                'doc_id': None,
                'chunk_count': 0
            }
        doc_id = indexed_info.get('doc_id')
        # Get chunk count from ChromaDB using file_path (not doc_id)
        results = document_indexer.collection.get(
            where={"file_path": str(file_path)}
        )
        chunk_count = len(results['ids']) if results and 'ids' in results else 0
        return {
            'indexed': True,
            'doc_id': doc_id,
            'chunk_count': chunk_count,
            'needs_reindex': indexed_info.get('needs_reindex', False)
        }
    except Exception as e:
        logger.warning(f"[GET_DOC_INFO] Error getting info for {file_path}: {e}")
        return {
            'indexed': False,
            'doc_id': None,
            'chunk_count': 0
        }
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| file_path | - | - | positional_or_keyword |
Parameter Details
file_path: Path to the document file (string or Path object). This is used to query the document indexer and ChromaDB collection to retrieve indexing information. The path should be a valid file system path to the document being queried.
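One detail is easier to see in code: the function passes `Path(file_path)` to the indexer but `str(file_path)` to the ChromaDB filter, so a string argument reaches the filter without any path normalization. The sketch below illustrates the difference. It uses `PurePosixPath` so the result is the same on every platform, both helper names are hypothetical, and how the indexer actually stores the `file_path` metadata is an assumption not confirmed by the source.

```python
from pathlib import PurePosixPath


def where_filter_as_written(file_path):
    # Mirrors the filter in get_document_info: str() is applied to the raw argument.
    return {"file_path": str(file_path)}


def where_filter_normalized(file_path):
    # Hypothetical variant: build a path object first so that strings and
    # path objects produce the same filter value.
    return {"file_path": str(PurePosixPath(file_path))}


raw = "./docs/report.pdf"
as_path = PurePosixPath(raw)

print(where_filter_as_written(raw))      # keeps the leading "./"
print(where_filter_as_written(as_path))  # pathlib has dropped the leading "./"
print(where_filter_normalized(raw) == where_filter_normalized(as_path))
```

If chunks were stored under one spelling of the path and queried under another, the filter would match nothing and `chunk_count` would come back as 0 even though the document is indexed. Passing the same form of path that was used at indexing time avoids this.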
Return Value
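Returns a dictionary with the following keys:
- 'indexed' (bool): whether the document is indexed.
- 'doc_id' (str or None): unique identifier for the document in the index; None if the document is not indexed.
- 'chunk_count' (int): number of chunks stored in ChromaDB under this file_path (0 if not indexed).
- 'needs_reindex' (bool): whether the document requires reindexing. Present only when 'indexed' is True; defaults to False if the indexer does not report it.
On error, the function logs a warning and returns indexed=False, doc_id=None, and chunk_count=0, which is the same dictionary it returns for a document that is not indexed.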
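Because 'needs_reindex' is not always present, callers may want to read the result defensively. The following is a minimal sketch of one way to do that; `describe_index_status` is a hypothetical helper, and the sample dictionaries are illustrative values in the shape described above, not real output.

```python
def describe_index_status(info):
    """Turn a get_document_info-style dict into a short status string."""
    if not info.get("indexed"):
        # The not-indexed result and the error result look identical.
        return "not indexed (or lookup failed)"
    status = f"indexed as {info['doc_id']} with {info['chunk_count']} chunk(s)"
    # 'needs_reindex' may be missing, so fall back to False.
    if info.get("needs_reindex", False):
        status += ", reindex needed"
    return status


not_indexed = {"indexed": False, "doc_id": None, "chunk_count": 0}
stale = {"indexed": True, "doc_id": "doc-1", "chunk_count": 3, "needs_reindex": True}

print(describe_index_status(not_indexed))
print(describe_index_status(stale))
```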
Dependencies
- pathlib
- logging
Required Imports
from pathlib import Path
import logging
Usage Example
import logging
from pathlib import Path
from document_indexer import DocumentIndexer

# Setup logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# Initialize document indexer (assuming it's configured)
document_indexer = DocumentIndexer(collection_name='documents')

# Get document info
# (get_document_info is assumed to be defined in, or imported into, this module)
file_path = '/path/to/document.pdf'
info = get_document_info(file_path)

if info['indexed']:
    print(f"Document ID: {info['doc_id']}")
    print(f"Chunks: {info['chunk_count']}")
    print(f"Needs reindex: {info.get('needs_reindex', False)}")
else:
    print("Document is not indexed")
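The example above needs a configured indexer and a ChromaDB collection. For a quick check without either, the same steps can be run against stand-ins. This is only a sketch under stated assumptions: `StubCollection`, `StubIndexer`, and `get_document_info_with` are hypothetical names, the stubs model only the two calls the function makes, and the variant takes the indexer as an argument instead of reading the module-level `document_indexer`.

```python
import logging
from pathlib import Path

logger = logging.getLogger(__name__)


class StubCollection:
    """Stands in for the ChromaDB collection; only get(where=...) is modeled."""

    def __init__(self, rows):
        self._rows = rows  # list of (chunk_id, metadata) tuples

    def get(self, where):
        ids = [cid for cid, meta in self._rows
               if all(meta.get(k) == v for k, v in where.items())]
        return {"ids": ids}


class StubIndexer:
    """Stands in for the document indexer (hypothetical, for illustration)."""

    def __init__(self, indexed, rows):
        self._indexed = indexed  # maps Path -> info dict
        self.collection = StubCollection(rows)

    def is_document_indexed(self, path):
        return self._indexed.get(path)


def get_document_info_with(indexer, file_path):
    """Same steps as get_document_info, with the indexer passed in."""
    try:
        indexed_info = indexer.is_document_indexed(Path(file_path))
        if not indexed_info:
            return {"indexed": False, "doc_id": None, "chunk_count": 0}
        results = indexer.collection.get(where={"file_path": str(file_path)})
        chunk_count = len(results["ids"]) if results and "ids" in results else 0
        return {
            "indexed": True,
            "doc_id": indexed_info.get("doc_id"),
            "chunk_count": chunk_count,
            "needs_reindex": indexed_info.get("needs_reindex", False),
        }
    except Exception as e:
        logger.warning(f"[GET_DOC_INFO] Error getting info for {file_path}: {e}")
        return {"indexed": False, "doc_id": None, "chunk_count": 0}


path = "/path/to/document.pdf"
indexer = StubIndexer(
    indexed={Path(path): {"doc_id": "doc-1"}},
    rows=[("c1", {"file_path": path}), ("c2", {"file_path": path})],
)

known = get_document_info_with(indexer, path)
unknown = get_document_info_with(indexer, "/path/to/other.pdf")
print(known)
print(unknown)
```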
Best Practices
- Ensure the document_indexer global variable is properly initialized before calling this function
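- The function catches all exceptions, logs a warning, and returns the default not-indexed response, so a failed lookup cannot be told apart from an unindexed document by the return value alone; check the logs for [GET_DOC_INFO] warnings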
- The file_path parameter is converted to string when querying ChromaDB, so Path objects are acceptable
- The function queries ChromaDB by the 'file_path' metadata field, so documents must be indexed with this field for chunk_count to be accurate
- Consider caching results if calling this function frequently for the same documents
- The 'needs_reindex' key is present only when the document is indexed; read it with info.get('needs_reindex', False)
- This function depends on external state (document_indexer and logger); use proper dependency injection in production code
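The caching suggestion above could be sketched as a small time-limited cache wrapped around the lookup. `DocInfoCache` and `fake_lookup` are hypothetical names, and the 60-second lifetime is an arbitrary choice for illustration. The sketch stores only results with indexed=True, because a not-indexed result may actually be a transient error and should not be remembered.

```python
import time


class DocInfoCache:
    """Small TTL cache around a get_document_info-style lookup (illustrative)."""

    def __init__(self, lookup, ttl_seconds=30.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # str(path) -> (expires_at, info)

    def get(self, file_path):
        key = str(file_path)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and hit[0] > now:
            return hit[1]
        info = self._lookup(file_path)
        # Cache positive results only; negative results may hide an error.
        if info.get("indexed"):
            self._entries[key] = (now + self._ttl, info)
        return info

    def invalidate(self, file_path):
        """Drop the cached entry, for example after reindexing the document."""
        self._entries.pop(str(file_path), None)


calls = []


def fake_lookup(file_path):
    # Stand-in for get_document_info that records each real lookup.
    calls.append(file_path)
    return {"indexed": True, "doc_id": "doc-1", "chunk_count": 2, "needs_reindex": False}


cache = DocInfoCache(fake_lookup, ttl_seconds=60.0)
cache.get("/path/to/document.pdf")
cache.get("/path/to/document.pdf")         # served from the cache
cache.invalidate("/path/to/document.pdf")
cache.get("/path/to/document.pdf")         # looked up again
print(len(calls))
```

Passing the lookup in as an argument also keeps the cache independent of the module-level document_indexer, which makes it easy to exercise with a stand-in as shown here.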
Similar Components
AI-powered semantic similarity - components with related functionality:
- class DocumentIndexer (66.1% similar)
- function index_documents_example (59.3% similar)
- function api_documents (58.2% similar)
- function get_document_stats (57.1% similar)
- function test_incremental_indexing (56.1% similar)