
function get_document_info

Maturity: 44

Retrieves indexing status and metadata for a document, including whether it's indexed, its document ID, chunk count, and reindexing status.

File: /tf/active/vicechatdev/docchat/app.py
Lines: 479 - 514
Complexity: moderate

Purpose

This function checks if a document has been indexed in a vector database (ChromaDB) and retrieves associated metadata. It's used to determine document processing status, track chunks stored in the database, and identify if a document needs reindexing. This is typically part of a document management or RAG (Retrieval-Augmented Generation) system where documents are chunked and stored for semantic search.

Source Code

def get_document_info(file_path):
    """Get indexing status and metadata for a document"""
    try:
        # Check if document is indexed using the indexer's method
        indexed_info = document_indexer.is_document_indexed(Path(file_path))
        
        if not indexed_info:
            return {
                'indexed': False,
                'doc_id': None,
                'chunk_count': 0
            }
        
        doc_id = indexed_info.get('doc_id')
        
        # Get chunk count from ChromaDB using file_path (not doc_id)
        results = document_indexer.collection.get(
            where={"file_path": str(file_path)}
        )
        
        chunk_count = len(results['ids']) if results and 'ids' in results else 0
        
        return {
            'indexed': True,
            'doc_id': doc_id,
            'chunk_count': chunk_count,
            'needs_reindex': indexed_info.get('needs_reindex', False)
        }
        
    except Exception as e:
        logger.warning(f"[GET_DOC_INFO] Error getting info for {file_path}: {e}")
        return {
            'indexed': False,
            'doc_id': None,
            'chunk_count': 0
        }

Parameters

Name       Type  Default  Kind
file_path  -     -        positional_or_keyword

Parameter Details

file_path: Path to the document file (string or Path object). The function wraps it in Path() before passing it to the indexer's is_document_indexed method and converts it with str() for the ChromaDB metadata filter. The function does not check that the file exists on disk; it only looks up what the index has recorded for that path.
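Because the indexer receives Path(file_path) while the ChromaDB filter receives str(file_path), a string argument can reach the two lookups in slightly different forms: pathlib collapses redundant separators and "." components, whereas str() of a string leaves it untouched. Whether this matters depends on how the indexer wrote the 'file_path' metadata, which the source above does not show. The sketch below uses a hypothetical helper, normalized_path_key, to illustrate one way a caller could pass a single canonical form.

```python
from pathlib import Path, PurePosixPath


def normalized_path_key(file_path) -> str:
    """Hypothetical helper: one canonical string for a str or Path input."""
    return str(Path(file_path))


raw = "./docs//report.pdf"

# PurePosixPath is used only so this demonstration behaves the same on every OS.
print(str(PurePosixPath(raw)))  # docs/report.pdf
print(str(raw))                 # ./docs//report.pdf (unchanged)

# Both spellings map to the same key after normalization.
print(normalized_path_key(raw) == normalized_path_key("docs/report.pdf"))
```

Note that pathlib does not resolve ".." components or symlinks; a caller who needs that would have to decide separately whether Path.resolve() matches what the indexer stored.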

Return Value

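Returns a dictionary with the following keys: 'indexed' (bool) - whether the document is indexed; 'doc_id' (str or None) - unique identifier for the document in the index; 'chunk_count' (int) - number of chunks in the ChromaDB collection whose 'file_path' metadata matches the given path (0 if not indexed); 'needs_reindex' (bool) - whether the document requires reindexing, present only when 'indexed' is True and defaulting to False if the indexer does not report it. On error, the function logs a warning and returns the same dictionary as for an unindexed document (indexed=False, doc_id=None, chunk_count=0), so callers cannot distinguish a failed lookup from a document that is not indexed.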
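The return shape can be written down as a typed dictionary, which makes it explicit that 'needs_reindex' may be absent. This is a minimal sketch: the names DocumentInfo and describe are hypothetical and do not appear in app.py, and the messages are illustrative only.

```python
from typing import Optional, TypedDict


class _DocumentInfoBase(TypedDict):
    # Keys present in every return value of get_document_info.
    indexed: bool
    doc_id: Optional[str]
    chunk_count: int


class DocumentInfo(_DocumentInfoBase, total=False):
    # Only set on the indexed=True branch.
    needs_reindex: bool


def describe(info: DocumentInfo) -> str:
    """Hypothetical helper that turns the result into a status message."""
    if not info["indexed"]:
        # The error path returns the same dictionary, so both cases land here.
        return "not indexed (or the lookup failed; check the logs)"
    if info.get("needs_reindex", False):
        return f"indexed as {info['doc_id']}, reindex pending"
    if info["chunk_count"] == 0:
        return f"indexed as {info['doc_id']}, but no chunks matched the file_path filter"
    return f"indexed as {info['doc_id']} with {info['chunk_count']} chunks"


print(describe({"indexed": False, "doc_id": None, "chunk_count": 0}))
print(describe({"indexed": True, "doc_id": "d1", "chunk_count": 3, "needs_reindex": False}))
```

Reading 'needs_reindex' with .get() avoids a KeyError on the unindexed and error branches, where the key is not set.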

Dependencies

  • pathlib
  • logging

Required Imports

from pathlib import Path
import logging

Usage Example

import logging
from pathlib import Path
from document_indexer import DocumentIndexer

# Setup logger
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

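# Initialize the document indexer. get_document_info reads the module-level
# names document_indexer and logger, so both must be defined in the same
# module as the function. The constructor argument shown here is illustrative.
document_indexer = DocumentIndexer(collection_name='documents')

# Get document info (get_document_info is assumed to be defined in this module)
file_path = '/path/to/document.pdf'
info = get_document_info(file_path)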

if info['indexed']:
    print(f"Document ID: {info['doc_id']}")
    print(f"Chunks: {info['chunk_count']}")
    print(f"Needs reindex: {info.get('needs_reindex', False)}")
else:
    print("Document is not indexed")

Best Practices

  • Ensure the document_indexer global variable is initialized before calling this function
  • The function catches all exceptions and returns the same response as for an unindexed document, so check the logs for [GET_DOC_INFO] warnings when a result looks wrong
  • The file_path parameter is converted to a string when querying ChromaDB, so Path objects are acceptable
  • The function queries ChromaDB by the 'file_path' metadata field, so documents must be indexed with this field for chunk_count to be accurate
  • Consider caching results if calling this function frequently for the same documents
  • The 'needs_reindex' field is only present when 'indexed' is True; read it with info.get('needs_reindex', False)
  • This function depends on external state (document_indexer and logger); consider passing these in explicitly in production code
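The dependency-injection and caching suggestions above can be sketched together. In this example the indexer is passed as an argument instead of being read from a global, and a small time-based cache sits in front of it. Everything named here (get_document_info_from, CachedDocumentInfo, FakeIndexer, FakeCollection, the 30-second default) is hypothetical and not part of app.py; the fakes only mimic the two calls the original function makes.

```python
import logging
import time
from pathlib import Path

logger = logging.getLogger(__name__)
_NOT_INDEXED = {"indexed": False, "doc_id": None, "chunk_count": 0}


def get_document_info_from(indexer, file_path):
    """Same logic as get_document_info, with the indexer passed in."""
    try:
        indexed_info = indexer.is_document_indexed(Path(file_path))
        if not indexed_info:
            return dict(_NOT_INDEXED)
        results = indexer.collection.get(where={"file_path": str(file_path)})
        chunk_count = len(results["ids"]) if results and "ids" in results else 0
        return {
            "indexed": True,
            "doc_id": indexed_info.get("doc_id"),
            "chunk_count": chunk_count,
            "needs_reindex": indexed_info.get("needs_reindex", False),
        }
    except Exception as exc:
        logger.warning("[GET_DOC_INFO] Error getting info for %s: %s", file_path, exc)
        return dict(_NOT_INDEXED)


class CachedDocumentInfo:
    """Hypothetical cache: reuses a result until ttl_seconds have passed."""

    def __init__(self, indexer, ttl_seconds=30.0, clock=time.monotonic):
        self._indexer = indexer
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # path string -> (timestamp, info dict)

    def get(self, file_path):
        key = str(file_path)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return dict(hit[1])  # copy so callers cannot mutate the cache
        info = get_document_info_from(self._indexer, file_path)
        self._entries[key] = (now, info)
        return dict(info)

    def invalidate(self, file_path):
        """Drop a cached entry, for example after reindexing the file."""
        self._entries.pop(str(file_path), None)


class FakeCollection:
    """Stand-in for the collection: supports get(where={'file_path': ...})."""

    def __init__(self, rows):
        self._rows = rows  # list of (chunk_id, file_path) pairs
        self.calls = 0

    def get(self, where):
        self.calls += 1
        return {"ids": [cid for cid, path in self._rows if path == where["file_path"]]}


class FakeIndexer:
    """Stand-in for the indexer, for tests only."""

    def __init__(self, rows, known):
        self.collection = FakeCollection(rows)
        self._known = known  # Path -> info dict for indexed documents

    def is_document_indexed(self, path):
        return self._known.get(Path(path))


indexer = FakeIndexer(
    rows=[("c1", "docs/a.pdf"), ("c2", "docs/a.pdf")],
    known={Path("docs/a.pdf"): {"doc_id": "d1"}},
)
cache = CachedDocumentInfo(indexer, ttl_seconds=30.0)
print(cache.get("docs/a.pdf"))
print(cache.get("docs/a.pdf"))  # served from the cache within the TTL
```

One trade-off to keep in mind: this cache also stores the fallback returned after an error, so a transient failure stays visible until the entry expires or invalidate is called.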

Similar Components

AI-powered semantic similarity - components with related functionality:

  • class DocumentIndexer 66.1% similar

    A class for indexing documents into ChromaDB with support for multiple file formats (PDF, Word, PowerPoint, Excel, text files), smart incremental indexing, and document chunk management.

    From: /tf/active/vicechatdev/docchat/document_indexer.py
  • function index_documents_example 59.3% similar

    A demonstration function that indexes documents from a specified folder using a DocumentIndexer, creating the folder if it doesn't exist, and displays indexing results and collection statistics.

    From: /tf/active/vicechatdev/docchat/example_usage.py
  • function api_documents 58.2% similar

    Flask API endpoint that retrieves statistics and metadata about indexed documents from a document indexer service.

    From: /tf/active/vicechatdev/docchat/app.py
  • function get_document_stats 57.1% similar

    Retrieves aggregated statistics about controlled documents from a Neo4j database, including status and type distributions for visualization in charts.

    From: /tf/active/vicechatdev/CDocs/controllers/admin_controller.py
  • function test_incremental_indexing 56.1% similar

    Comprehensive test function that validates incremental indexing functionality of a document indexing system, including initial indexing, change detection, re-indexing, and force re-indexing scenarios.

    From: /tf/active/vicechatdev/docchat/test_incremental_indexing.py