function test_incremental_indexing
Comprehensive test function that validates incremental indexing functionality of a document indexing system, including initial indexing, change detection, re-indexing, and force re-indexing scenarios.
File: /tf/active/vicechatdev/docchat/test_incremental_indexing.py
Lines: 32 - 207
Complexity: complex
Purpose
This test function validates the complete incremental indexing workflow of a DocumentIndexer system. It creates a temporary test environment with multiple documents and runs six distinct test scenarios: (1) initial indexing of documents, (2) skipping unchanged documents on re-run, (3) detecting and re-indexing modified documents, (4) indexing only new documents while skipping existing ones, (5) force re-indexing all documents, and (6) verifying collection statistics. The function ensures that the indexing system correctly tracks document states, detects changes, and optimizes indexing operations by avoiding unnecessary re-processing.
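The scenarios above can be replayed against a small stand-in to see which counter each one is expected to move. This is a minimal sketch only: `TinyIndexer` is hypothetical, and its content-hash change detection is an assumption, since the source does not show how the real `DocumentIndexer` detects changes. It only mirrors the `index_folder(folder, force_reindex=...)` call and the result keys used by the test.

```python
import hashlib
import tempfile
from pathlib import Path


class TinyIndexer:
    """Hypothetical stand-in for DocumentIndexer; remembers one content hash per file."""

    def __init__(self):
        self._hashes = {}  # file path (str) -> SHA-256 hex digest of last indexed content

    def index_folder(self, folder, force_reindex=False):
        results = {"total": 0, "success": 0, "reindexed": 0, "skipped": 0, "failed": 0}
        for path in sorted(Path(folder).glob("*.txt")):
            results["total"] += 1
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            known = self._hashes.get(str(path))
            if known == digest and not force_reindex:
                results["skipped"] += 1  # unchanged since the last run
                continue
            if known is not None:
                results["reindexed"] += 1  # seen before: content changed or run was forced
            self._hashes[str(path)] = digest
            results["success"] += 1
        return results


def demo():
    """Replay scenarios 1-5 against TinyIndexer and return the five result dicts."""
    with tempfile.TemporaryDirectory(prefix="tiny_indexer_") as tmp:
        folder = Path(tmp)
        for i in (1, 2, 3):
            (folder / f"doc{i}.txt").write_text(f"This is document {i}.")
        indexer = TinyIndexer()
        runs = [indexer.index_folder(folder)]                 # 1: initial indexing
        runs.append(indexer.index_folder(folder))             # 2: nothing changed
        (folder / "doc2.txt").write_text("UPDATED content")
        runs.append(indexer.index_folder(folder))             # 3: one file modified
        (folder / "doc4.txt").write_text("Brand new document 4.")
        runs.append(indexer.index_folder(folder))             # 4: one file added
        runs.append(indexer.index_folder(folder, force_reindex=True))  # 5: forced
        return runs
```

Note that in this sketch, as in the assertions of the real test, `success` counts every document processed in a run, so a re-indexed document increments both `success` and `reindexed`.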
Source Code
def test_incremental_indexing():
    """Test that incremental indexing works correctly"""
    print("=" * 80)
    print("INCREMENTAL INDEXING TEST")
    print("=" * 80)
    print()

    # Create temporary test directory
    test_dir = Path(tempfile.mkdtemp(prefix="docchat_test_"))
    print(f"Created test directory: {test_dir}")

    try:
        # Initialize indexer with a test collection
        test_collection = "test_incremental_indexing"
        print(f"Initializing indexer with collection: {test_collection}")
        indexer = DocumentIndexer(collection_name=test_collection)

        # Clear collection for fresh test
        print("Clearing collection...")
        indexer.clear_collection()
        print()

        # TEST 1: Initial indexing of 3 documents
        print("TEST 1: Initial Indexing")
        print("-" * 40)
        doc1 = test_dir / "doc1.txt"
        doc2 = test_dir / "doc2.txt"
        doc3 = test_dir / "doc3.txt"
        create_test_document(doc1, "This is document 1. It contains important information.")
        create_test_document(doc2, "This is document 2. It has different content.")
        create_test_document(doc3, "This is document 3. More unique information here.")
        results = indexer.index_folder(test_dir, force_reindex=False)
        print(f"✓ Total files: {results['total']}")
        print(f"✓ New documents indexed: {results['success'] - results['reindexed']}")
        print(f"✓ Re-indexed: {results['reindexed']}")
        print(f"✓ Skipped: {results['skipped']}")
        print(f"✓ Failed: {results['failed']}")
        assert results['total'] == 3, f"Expected 3 files, got {results['total']}"
        assert results['success'] == 3, f"Expected 3 successful, got {results['success']}"
        assert results['reindexed'] == 0, f"Expected 0 re-indexed, got {results['reindexed']}"
        assert results['skipped'] == 0, f"Expected 0 skipped, got {results['skipped']}"
        assert results['failed'] == 0, f"Expected 0 failed, got {results['failed']}"
        print("✅ TEST 1 PASSED: All 3 documents indexed successfully")
        print()

        # TEST 2: Re-run indexing without changes - should skip all
        print("TEST 2: Re-indexing Without Changes")
        print("-" * 40)
        results = indexer.index_folder(test_dir, force_reindex=False)
        print(f"✓ Total files: {results['total']}")
        print(f"✓ New documents indexed: {results['success'] - results['reindexed']}")
        print(f"✓ Re-indexed: {results['reindexed']}")
        print(f"✓ Skipped: {results['skipped']}")
        print(f"✓ Failed: {results['failed']}")
        assert results['total'] == 3, f"Expected 3 files, got {results['total']}"
        assert results['skipped'] == 3, f"Expected 3 skipped, got {results['skipped']}"
        assert results['success'] == 0, f"Expected 0 new indexed, got {results['success']}"
        assert results['reindexed'] == 0, f"Expected 0 re-indexed, got {results['reindexed']}"
        print("✅ TEST 2 PASSED: All 3 documents skipped (already up-to-date)")
        print()

        # TEST 3: Modify one document - should re-index only that one
        print("TEST 3: Re-indexing After Modification")
        print("-" * 40)
        print("Modifying doc2.txt...")
        modify_test_document(doc2, "This is document 2 with UPDATED content!")
        results = indexer.index_folder(test_dir, force_reindex=False)
        print(f"✓ Total files: {results['total']}")
        print(f"✓ New documents indexed: {results['success'] - results['reindexed']}")
        print(f"✓ Re-indexed: {results['reindexed']}")
        print(f"✓ Skipped: {results['skipped']}")
        print(f"✓ Failed: {results['failed']}")
        assert results['total'] == 3, f"Expected 3 files, got {results['total']}"
        assert results['skipped'] == 2, f"Expected 2 skipped, got {results['skipped']}"
        assert results['success'] == 1, f"Expected 1 successful, got {results['success']}"
        assert results['reindexed'] == 1, f"Expected 1 re-indexed, got {results['reindexed']}"
        print("✅ TEST 3 PASSED: Only modified document re-indexed, others skipped")
        print()

        # TEST 4: Add new document - should index only the new one
        print("TEST 4: Adding New Document")
        print("-" * 40)
        doc4 = test_dir / "doc4.txt"
        create_test_document(doc4, "This is a brand new document 4.")
        results = indexer.index_folder(test_dir, force_reindex=False)
        print(f"✓ Total files: {results['total']}")
        print(f"✓ New documents indexed: {results['success'] - results['reindexed']}")
        print(f"✓ Re-indexed: {results['reindexed']}")
        print(f"✓ Skipped: {results['skipped']}")
        print(f"✓ Failed: {results['failed']}")
        assert results['total'] == 4, f"Expected 4 files, got {results['total']}"
        assert results['skipped'] == 3, f"Expected 3 skipped, got {results['skipped']}"
        assert results['success'] == 1, f"Expected 1 successful, got {results['success']}"
        assert results['reindexed'] == 0, f"Expected 0 re-indexed, got {results['reindexed']}"
        print("✅ TEST 4 PASSED: Only new document indexed, existing ones skipped")
        print()

        # TEST 5: Force re-index - should re-index all
        print("TEST 5: Force Re-indexing")
        print("-" * 40)
        results = indexer.index_folder(test_dir, force_reindex=True)
        print(f"✓ Total files: {results['total']}")
        print(f"✓ New documents indexed: {results['success'] - results['reindexed']}")
        print(f"✓ Re-indexed: {results['reindexed']}")
        print(f"✓ Skipped: {results['skipped']}")
        print(f"✓ Failed: {results['failed']}")
        assert results['total'] == 4, f"Expected 4 files, got {results['total']}"
        assert results['success'] == 4, f"Expected 4 successful, got {results['success']}"
        assert results['reindexed'] == 4, f"Expected 4 re-indexed, got {results['reindexed']}"
        assert results['skipped'] == 0, f"Expected 0 skipped, got {results['skipped']}"
        print("✅ TEST 5 PASSED: All documents force re-indexed")
        print()

        # TEST 6: Check collection stats
        print("TEST 6: Collection Statistics")
        print("-" * 40)
        stats = indexer.get_collection_stats()
        print(f"✓ Total chunks: {stats['total_chunks']}")
        print(f"✓ Total documents: {stats['total_documents']}")
        print(f"✓ Files: {', '.join(stats['file_names'])}")
        assert stats['total_documents'] == 4, f"Expected 4 documents, got {stats['total_documents']}"
        assert len(stats['file_names']) == 4, f"Expected 4 file names, got {len(stats['file_names'])}"
        print("✅ TEST 6 PASSED: Collection stats correct")
        print()

        # Clean up
        print("Cleaning up test collection...")
        indexer.clear_collection()

        print("=" * 80)
        print("✅ ALL INCREMENTAL INDEXING TESTS PASSED!")
        print("=" * 80)
        print()
        print("Summary:")
        print(" ✓ Initial indexing works")
        print(" ✓ Unchanged documents are skipped")
        print(" ✓ Modified documents are detected and re-indexed")
        print(" ✓ New documents are indexed, existing ones skipped")
        print(" ✓ Force re-index works for all documents")
        print(" ✓ Collection statistics are accurate")
        print()

    finally:
        # Clean up test directory
        if test_dir.exists():
            shutil.rmtree(test_dir)
            print(f"Cleaned up test directory: {test_dir}")
Return Value
This function does not return any value (implicitly returns None). It performs assertions throughout execution and prints detailed test results to stdout. If any assertion fails, it raises an AssertionError with a descriptive message.
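Because failure is signalled only by an AssertionError, a caller that needs a pass/fail status (for example a CI script) has to catch it. A minimal sketch of one way to do that follows; `run_as_script` is a hypothetical helper, not part of the documented module.

```python
import sys


def run_as_script(test_func):
    """Run a test function and return a process exit code (0 = pass, 1 = fail)."""
    try:
        test_func()
    except AssertionError as exc:
        # The assertion messages in the test already describe what went wrong.
        print(f"TEST FAILED: {exc}", file=sys.stderr)
        return 1
    return 0


# Usage (assumes test_incremental_indexing is defined or imported in this module):
# if __name__ == "__main__":
#     sys.exit(run_as_script(test_incremental_indexing))
```

Under a test runner such as pytest this wrapper is unnecessary, since the runner reports a raised AssertionError as a failure on its own.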
Dependencies
- pathlib
- tempfile
- shutil
- config
- document_indexer
Required Imports
from pathlib import Path
import tempfile
import shutil
import config
from document_indexer import DocumentIndexer
Usage Example
# Ensure helper functions are defined:
# def create_test_document(path, content):
#     path.write_text(content)
# def modify_test_document(path, content):
#     path.write_text(content)
# Run the test
test_incremental_indexing()
# Expected output:
# ================================================================================
# INCREMENTAL INDEXING TEST
# ================================================================================
# Created test directory: /tmp/docchat_test_xyz123
# Initializing indexer with collection: test_incremental_indexing
# ...
# ✅ ALL INCREMENTAL INDEXING TESTS PASSED!
# Summary:
# ✓ Initial indexing works
# ✓ Unchanged documents are skipped
# ✓ Modified documents are detected and re-indexed
# ✓ New documents are indexed, existing ones skipped
# ✓ Force re-index works for all documents
# ✓ Collection statistics are accurate
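The two helpers referenced in the usage example are not shown in the source, so the following is only a sketch of what they might look like. The signatures are taken from the calls in the test; the mtime bump in `modify_test_document` is an assumption, added so the change is visible even if the indexer compares modification timestamps rather than content.

```python
import os
import tempfile
from pathlib import Path


def create_test_document(path, content):
    """Write a new UTF-8 text file at `path` (assumed helper signature)."""
    Path(path).write_text(content, encoding="utf-8")


def modify_test_document(path, content):
    """Overwrite `path` and make sure its modification time moves forward."""
    path = Path(path)
    old_mtime = path.stat().st_mtime
    path.write_text(content, encoding="utf-8")
    # Force a strictly later mtime, even on filesystems with coarse timestamps.
    new_mtime = max(path.stat().st_mtime, old_mtime + 1)
    os.utime(path, (new_mtime, new_mtime))


def demo():
    """Create and modify a file; return its final content and whether mtime advanced."""
    with tempfile.TemporaryDirectory() as tmp:
        doc = Path(tmp) / "doc.txt"
        create_test_document(doc, "version 1")
        before = doc.stat().st_mtime
        modify_test_document(doc, "version 2")
        return doc.read_text(encoding="utf-8"), doc.stat().st_mtime > before
```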
Best Practices
- This is a test function and should be run in a test environment, not in production
- The function creates temporary files and directories which are cleaned up in a finally block to ensure proper resource management
- Requires helper functions create_test_document() and modify_test_document() to be defined in the same module
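- Uses a dedicated test collection name (test_incremental_indexing) to avoid conflicts with production data; the name is fixed rather than generated per run, so concurrent runs against the same backend would share the collection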
- Clears the test collection both at the start and end to ensure test isolation
- Each test scenario includes detailed assertions with descriptive error messages for debugging
- The function prints comprehensive output for manual verification and debugging
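- Only the temporary directory is cleaned up unconditionally: the final clear_collection() call sits inside the try block, so a failed assertion leaves the test collection populated until the next run clears it at startup
- The six scenarios are order-dependent: each one builds on the index state left by the previous one, so they cannot be reordered or run in isolation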
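- The DocumentIndexer must provide index_folder() with a force_reindex parameter, returning a results dictionary with keys total, success, reindexed, skipped, failed; clear_collection(); and get_collection_stats(), returning a dictionary with keys total_chunks, total_documents, file_names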
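
The results contract in the last point is checked by five near-identical print-and-assert blocks in the test. As one possible refactoring, a helper like the one below could express each scenario in a single call. `check_results` and `RESULT_KEYS` are hypothetical names introduced here for illustration, and the printed labels are simplified to plain ASCII.

```python
RESULT_KEYS = ("total", "success", "reindexed", "skipped", "failed")


def check_results(results, **expected):
    """Print the standard counters, then assert each expected value.

    `expected` maps a results key to the count a scenario requires,
    e.g. check_results(results, total=3, skipped=3).
    """
    missing = [key for key in RESULT_KEYS if key not in results]
    assert not missing, f"Results dict is missing keys: {missing}"

    print(f"Total files: {results['total']}")
    print(f"New documents indexed: {results['success'] - results['reindexed']}")
    print(f"Re-indexed: {results['reindexed']}")
    print(f"Skipped: {results['skipped']}")
    print(f"Failed: {results['failed']}")

    for key, want in expected.items():
        assert results[key] == want, f"Expected {key}={want}, got {results[key]}"
```

With this helper, scenario 3 would reduce to `check_results(results, total=3, skipped=2, success=1, reindexed=1)` after the `index_folder` call.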
Similar Components
AI-powered semantic similarity - components with related functionality:
- function index_documents_example 65.9% similar
- class DocumentIndexer 58.0% similar
- function main_v22 57.0% similar
- function get_document_info 56.1% similar
- function test_reference_system_completeness 53.1% similar