🔍 Code Extractor

function matches_source_filter

Maturity: 61

Checks if a document path matches any of the provided source filters using exact match, folder prefix match, path component sequence match, or filename match.

File: /tf/active/vicechatdev/docchat/rag_engine.py
Lines: 31 - 89
Complexity: moderate

Purpose

This function provides flexible path matching for filtering documents in a document retrieval or indexing system. It supports multiple matching strategies: exact path matching for specific files, prefix matching for folders and their contents, contiguous path component sequence matching for partial paths, and filename-only matching when the filter contains no path separators. This is useful for implementing document filtering in RAG systems, search interfaces, or document management tools where users need to filter by file paths, folders, or filenames.

Source Code

def matches_source_filter(doc_path: str, source_filters: List[str]) -> bool:
    """
    Check if a document path matches any of the source filters.
    Handles both exact file matches and folder prefix matches.
    Also matches by filename if full path doesn't match.
    
    Args:
        doc_path: Path of the document to check
        source_filters: List of file paths or folder paths
        
    Returns:
        True if document matches any filter
    """
    import logging
    from pathlib import Path
    logger = logging.getLogger(__name__)
    
    # Normalize doc_path to use forward slashes
    doc_path_normalized = doc_path.replace('\\', '/')
    
    for filter_path in source_filters:
        # Normalize filter_path
        filter_normalized = filter_path.replace('\\', '/')
        
        # Exact match (for files)
        if doc_path_normalized == filter_normalized:
            logger.debug(f"Exact match: {doc_path} == {filter_path}")
            return True
        
        # Prefix match (for folders)
        # Check if doc is in the folder (starts with folder path + /)
        if doc_path_normalized.startswith(filter_normalized + '/'):
            logger.debug(f"Folder prefix match: {doc_path} starts with {filter_path}/")
            return True
        
        # Check if filter appears as a sequence of complete path components
        # Split both paths and check if filter's components appear contiguously anywhere in doc path
        doc_parts = [p for p in doc_path_normalized.split('/') if p]
        filter_parts = [p for p in filter_normalized.split('/') if p]
        
        logger.debug(f"[FILTER_MATCH] Checking sequence match: doc_parts={doc_parts[-3:]}, filter_parts={filter_parts}")
        
        # Try to find filter_parts as a contiguous sequence in doc_parts
        if len(filter_parts) <= len(doc_parts):
            for i in range(len(doc_parts) - len(filter_parts) + 1):
                if doc_parts[i:i+len(filter_parts)] == filter_parts:
                    logger.debug(f"[FILTER_MATCH] ✅ Path component sequence match: {filter_parts} found at index {i}")
                    return True
        
        # Filename match (in case filter is just a filename without path)
        doc_filename = Path(doc_path_normalized).name
        filter_filename = Path(filter_normalized).name
        # Only match filename if filter has no path separators (is just a filename)
        if '/' not in filter_normalized and doc_filename == filter_filename:
            logger.debug(f"Filename match: {doc_filename} (from {doc_path}) == {filter_filename}")
            return True
    
    logger.debug(f"No match: {doc_path} not in filters {source_filters}")
    return False

Parameters

Name            Type       Default  Kind
doc_path        str        -        positional_or_keyword
source_filters  List[str]  -        positional_or_keyword

Parameter Details

doc_path: The full or relative path of the document to check against filters. Can use either forward slashes (/) or backslashes (\) as path separators, which will be normalized internally. Example: 'docs/project/readme.md' or 'docs\project\readme.md'

source_filters: A list of file paths, folder paths, or filenames to match against. Each filter can be: (1) an exact file path, (2) a folder path (will match all files within), (3) a partial path sequence that appears in the document path, or (4) a filename without path separators. Examples: ['docs/project', 'readme.md', 'project/readme.md']
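
The normalization applied to both parameters can be sketched in isolation. This is a minimal illustration, not code from rag_engine.py: the helper name normalize_and_split is hypothetical, and it only mirrors the two steps visible in the source (replace backslashes with forward slashes, then split on '/' and drop empty components).

```python
from typing import List


def normalize_and_split(path: str) -> List[str]:
    """Hypothetical helper mirroring the normalization inside matches_source_filter."""
    # Step 1: convert Windows-style separators to forward slashes
    normalized = path.replace('\\', '/')
    # Step 2: split into components, discarding empty ones
    # (produced by leading, trailing, or doubled slashes)
    return [part for part in normalized.split('/') if part]


print(normalize_and_split('docs\\project\\readme.md'))  # ['docs', 'project', 'readme.md']
print(normalize_and_split('docs//project/'))            # ['docs', 'project']
```

Note that the exact and prefix checks in the function compare the normalized strings, while only the sequence check works on the split components.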

Return Value

Type: bool

True if doc_path matches at least one filter in source_filters under any of the matching strategies (exact, prefix, sequence, or filename match); False otherwise, including when source_filters is an empty list.
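
Because the loop never runs for an empty source_filters list, the function returns False in that case. A caller that wants "no filters" to mean "no restriction" has to handle this itself. The sketch below shows one way to do that; filter_paths and exact_only are hypothetical names introduced for illustration, and how rag_engine.py actually calls the function is not shown in this page.

```python
from typing import Callable, Iterable, List

# Signature shared with matches_source_filter: (doc_path, source_filters) -> bool
Matcher = Callable[[str, List[str]], bool]


def filter_paths(paths: Iterable[str], source_filters: List[str], matcher: Matcher) -> List[str]:
    """Hypothetical helper: keep only the paths accepted by `matcher`."""
    if not source_filters:
        # No filters supplied: treat as "no restriction". This is an assumption
        # of the sketch, not necessarily what rag_engine.py does.
        return list(paths)
    return [p for p in paths if matcher(p, source_filters)]


def exact_only(doc_path: str, source_filters: List[str]) -> bool:
    """Stand-in matcher for the demo; pass matches_source_filter in real use."""
    return doc_path in source_filters


paths = ["docs/a.md", "docs/b.md"]
print(filter_paths(paths, [], exact_only))             # ['docs/a.md', 'docs/b.md']
print(filter_paths(paths, ["docs/b.md"], exact_only))  # ['docs/b.md']
```

Taking the matcher as an argument keeps the helper independent of any particular matching rules.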

Dependencies

  • logging
  • pathlib

Required Imports

import logging
from pathlib import Path
from typing import List

Usage Example

from typing import List
import logging
from pathlib import Path

# Configure logging to see debug messages (optional)
logging.basicConfig(level=logging.DEBUG)

# Import the function from its module (assumes the docchat directory is on sys.path),
# or paste the full definition from the Source Code section here instead.
from rag_engine import matches_source_filter

# Example 1: Exact file match
doc = "docs/project/readme.md"
filters = ["docs/project/readme.md", "other/file.txt"]
result = matches_source_filter(doc, filters)
print(f"Exact match: {result}")  # True

# Example 2: Folder prefix match
doc = "docs/project/subfolder/file.txt"
filters = ["docs/project"]
result = matches_source_filter(doc, filters)
print(f"Folder match: {result}")  # True

# Example 3: Path component sequence match
doc = "root/docs/project/readme.md"
filters = ["project/readme.md"]
result = matches_source_filter(doc, filters)
print(f"Sequence match: {result}")  # True

# Example 4: Filename-only filter (already caught by the sequence check, since 'readme.md' is a path component)
doc = "some/path/to/readme.md"
filters = ["readme.md"]
result = matches_source_filter(doc, filters)
print(f"Filename match: {result}")  # True

# Example 5: No match
doc = "docs/other/file.txt"
filters = ["docs/project", "readme.md"]
result = matches_source_filter(doc, filters)
print(f"No match: {result}")  # False

Best Practices

  • The function normalizes path separators automatically, so both forward slashes and backslashes work correctly across platforms
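  • Filters without path separators (e.g., 'readme.md') are compared against every path component, not only the filename, because the sequence match runs before the filename check; a filter such as 'docs' therefore also matches any document stored under a folder named 'docs'
  • A trailing slash on a folder filter is unnecessary: the prefix check appends '/' itself, and a filter such as 'docs/project/' still matches through the sequence match because empty components are dropped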
  • The function performs multiple matching strategies in order: exact match, prefix match, sequence match, then filename match - the first match returns True immediately
  • Enable debug logging to see detailed matching information for troubleshooting filter behavior
  • Path component sequence matching allows partial paths to match (e.g., 'project/file.txt' matches 'root/docs/project/file.txt')
  • Empty path components (from double slashes) are automatically filtered out during path splitting
  • The function is case-sensitive for path matching - ensure filter cases match document path cases on case-sensitive file systems
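
The edge cases behind these points can be checked with a condensed re-statement of the matching logic. The function below is a sketch written for this page (named matches_sketch to keep it distinct from the original); it omits logging and is intended to follow the same four checks in the same order, but it is not the code from rag_engine.py.

```python
from pathlib import Path
from typing import List


def matches_sketch(doc_path: str, source_filters: List[str]) -> bool:
    """Condensed, logging-free sketch of the four matching checks."""
    doc = doc_path.replace('\\', '/')
    doc_parts = [p for p in doc.split('/') if p]
    for raw in source_filters:
        flt = raw.replace('\\', '/')
        # 1) exact match, 2) folder prefix match
        if doc == flt or doc.startswith(flt + '/'):
            return True
        # 3) contiguous sequence of whole path components, anywhere in the path
        flt_parts = [p for p in flt.split('/') if p]
        n = len(flt_parts)
        if n <= len(doc_parts) and any(
            doc_parts[i:i + n] == flt_parts for i in range(len(doc_parts) - n + 1)
        ):
            return True
        # 4) filename match, only for filters without separators
        if '/' not in flt and Path(doc).name == Path(flt).name:
            return True
    return False


# A single-component filter also matches folder names, not just filenames
print(matches_sketch('archive/docs/notes.txt', ['docs']))       # True
# A trailing slash still matches via the sequence check
print(matches_sketch('docs/project/a.txt', ['docs/project/']))  # True
# Components must match whole: 'project' does not match 'project2'
print(matches_sketch('docs/project2/a.txt', ['docs/project']))  # False
# Matching is case-sensitive
print(matches_sketch('Docs/Readme.md', ['docs']))               # False
# An empty-string filter has zero components and matches every document
print(matches_sketch('any/file.txt', ['']))                     # True
```

The last case suggests stripping empty strings from source_filters before calling the function, since a single blank entry makes every document match.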

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function get_folder_documents 49.3% similar

    Retrieves and filters documents from a SharePoint folder by matching the folder name in document paths.

    From: /tf/active/vicechatdev/SPFCsync/test_folder_structure.py
  • function is_valid_document_file 46.6% similar

    Validates whether a given filename has an extension corresponding to a supported document type by checking against a predefined list of valid document extensions.

    From: /tf/active/vicechatdev/CDocs/utils/__init__.py
  • function search_documents 45.8% similar

    Searches for documents in a Neo4j graph database based on multiple optional filter criteria including text query, document type, department, status, and owner.

    From: /tf/active/vicechatdev/document_controller_backup.py
  • function search_documents_v1 44.6% similar

    Searches for controlled documents in a Neo4j graph database based on multiple optional filter criteria including text query, document type, department, status, and owner.

    From: /tf/active/vicechatdev/CDocs/controllers/document_controller.py
  • function search_documents_in_filecloud 44.1% similar

    Searches for controlled documents in FileCloud using text search and optional metadata filters, returning structured document information including UIDs, versions, and metadata.

    From: /tf/active/vicechatdev/CDocs/controllers/filecloud_controller.py