function matches_source_filter
Checks if a document path matches any of the provided source filters using exact match, folder prefix match, path component sequence match, or filename match.
/tf/active/vicechatdev/docchat/rag_engine.py
31 - 89
moderate
Purpose
This function provides flexible path matching for filtering documents in a document retrieval or indexing system. It supports multiple matching strategies: exact path matching for specific files, prefix matching for folders and their contents, contiguous path component sequence matching for partial paths, and filename-only matching when the filter contains no path separators. This is useful for implementing document filtering in RAG systems, search interfaces, or document management tools where users need to filter by file paths, folders, or filenames.
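The four strategies are easier to tell apart when each check reports which rule fired. The sketch below uses a hypothetical helper, `match_kind`, which is not part of rag_engine.py. It assumes the same checks in the same order as the documented function, applied to a single filter, and returns a label instead of a boolean.

```python
from pathlib import Path
from typing import Optional


def match_kind(doc_path: str, filter_path: str) -> Optional[str]:
    """Hypothetical helper: name the first strategy that matches, or None."""
    doc = doc_path.replace("\\", "/")
    flt = filter_path.replace("\\", "/")

    if doc == flt:
        return "exact"
    if doc.startswith(flt + "/"):
        return "prefix"

    # Compare complete path components, ignoring empty ones.
    doc_parts = [p for p in doc.split("/") if p]
    flt_parts = [p for p in flt.split("/") if p]
    n = len(flt_parts)
    for i in range(len(doc_parts) - n + 1):
        if doc_parts[i:i + n] == flt_parts:
            return "sequence"

    # Only reached when the filter has no separators and no component matched.
    if "/" not in flt and Path(doc).name == Path(flt).name:
        return "filename"
    return None


print(match_kind("docs/project/readme.md", "docs/project/readme.md"))   # exact
print(match_kind("docs/project/sub/file.txt", "docs/project"))          # prefix
print(match_kind("root/docs/project/readme.md", "project/readme.md"))   # sequence
print(match_kind("some/path/to/readme.md", "readme.md"))                # sequence
print(match_kind("docs/other/file.txt", "docs/project"))                # None
```

Note that a bare filename such as 'readme.md' is reported as a sequence match: a one-component filter is found by the sequence check before the filename check is reached.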
Source Code
def matches_source_filter(doc_path: str, source_filters: List[str]) -> bool:
    """
    Check if a document path matches any of the source filters.
    Handles both exact file matches and folder prefix matches.
    Also matches by filename if full path doesn't match.

    Args:
        doc_path: Path of the document to check
        source_filters: List of file paths or folder paths

    Returns:
        True if document matches any filter
    """
    import logging
    from pathlib import Path
    logger = logging.getLogger(__name__)

    # Normalize doc_path to use forward slashes
    doc_path_normalized = doc_path.replace('\\', '/')

    for filter_path in source_filters:
        # Normalize filter_path
        filter_normalized = filter_path.replace('\\', '/')

        # Exact match (for files)
        if doc_path_normalized == filter_normalized:
            logger.debug(f"Exact match: {doc_path} == {filter_path}")
            return True

        # Prefix match (for folders)
        # Check if doc is in the folder (starts with folder path + /)
        if doc_path_normalized.startswith(filter_normalized + '/'):
            logger.debug(f"Folder prefix match: {doc_path} starts with {filter_path}/")
            return True

        # Check if filter appears as complete path components
        # Split both paths and check if filter's components appear as a contiguous run anywhere in the doc path
        doc_parts = [p for p in doc_path_normalized.split('/') if p]
        filter_parts = [p for p in filter_normalized.split('/') if p]
        logger.debug(f"[FILTER_MATCH] Checking sequence match: doc_parts={doc_parts[-3:]}, filter_parts={filter_parts}")

        # Try to find filter_parts as a contiguous sequence in doc_parts
        if len(filter_parts) <= len(doc_parts):
            for i in range(len(doc_parts) - len(filter_parts) + 1):
                if doc_parts[i:i+len(filter_parts)] == filter_parts:
                    logger.debug(f"[FILTER_MATCH] ✅ Path component sequence match: {filter_parts} found at index {i}")
                    return True

        # Filename match (in case filter is just a filename without path)
        doc_filename = Path(doc_path_normalized).name
        filter_filename = Path(filter_normalized).name
        # Only match filename if filter has no path separators (is just a filename)
        if '/' not in filter_normalized and doc_filename == filter_filename:
            logger.debug(f"Filename match: {doc_filename} (from {doc_path}) == {filter_filename}")
            return True

    logger.debug(f"No match: {doc_path} not in filters {source_filters}")
    return False
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| doc_path | str | - | positional_or_keyword |
| source_filters | List[str] | - | positional_or_keyword |
Parameter Details
doc_path: The full or relative path of the document to check against filters. Can use either forward slashes (/) or backslashes (\) as path separators, which will be normalized internally. Example: 'docs/project/readme.md' or 'docs\project\readme.md'
source_filters: A list of file paths, folder paths, or filenames to match against. Each filter can be: (1) an exact file path, (2) a folder path (will match all files within), (3) a partial path sequence that appears in the document path, or (4) a filename without path separators. Examples: ['docs/project', 'readme.md', 'project/readme.md']
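The separator normalization and component splitting described above can be sketched in isolation. `split_components` is a hypothetical name used only for illustration; the two steps inside it are the ones the function applies to both the document path and each filter before the sequence check.

```python
from typing import List


def split_components(path: str) -> List[str]:
    """Hypothetical helper: normalize separators, then drop empty components."""
    normalized = path.replace("\\", "/")
    return [part for part in normalized.split("/") if part]


print(split_components(r"docs\project\readme.md"))   # ['docs', 'project', 'readme.md']
print(split_components("docs//project/readme.md"))   # ['docs', 'project', 'readme.md']
print(split_components("/docs/project/"))            # ['docs', 'project']
```

Because all three inputs reduce to comparable component lists, Windows-style paths, doubled slashes, and leading or trailing slashes do not affect the sequence check.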
Return Value
Type: bool
Returns True as soon as doc_path matches any filter in source_filters under any strategy (exact, prefix, sequence, or filename match); returns False if no filter matches, including when source_filters is empty.
Dependencies
logging
pathlib
Required Imports
import logging
from pathlib import Path
from typing import List
Usage Example
from typing import List
import logging
from pathlib import Path
# Configure logging to see debug messages (optional)
logging.basicConfig(level=logging.DEBUG)
def matches_source_filter(doc_path: str, source_filters: List[str]) -> bool:
    # ... (function code here)
    pass
# Example 1: Exact file match
doc = "docs/project/readme.md"
filters = ["docs/project/readme.md", "other/file.txt"]
result = matches_source_filter(doc, filters)
print(f"Exact match: {result}") # True
# Example 2: Folder prefix match
doc = "docs/project/subfolder/file.txt"
filters = ["docs/project"]
result = matches_source_filter(doc, filters)
print(f"Folder match: {result}") # True
# Example 3: Path component sequence match
doc = "root/docs/project/readme.md"
filters = ["project/readme.md"]
result = matches_source_filter(doc, filters)
print(f"Sequence match: {result}") # True
# Example 4: Filename-only filter (no path separators)
doc = "some/path/to/readme.md"
filters = ["readme.md"]
result = matches_source_filter(doc, filters)
print(f"Filename match: {result}") # True
# Example 5: No match
doc = "docs/other/file.txt"
filters = ["docs/project", "readme.md"]
result = matches_source_filter(doc, filters)
print(f"No match: {result}") # False
Best Practices
- The function normalizes path separators automatically, so both forward slashes and backslashes work correctly across platforms
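- Filters without path separators (e.g., 'readme.md') match any single complete path component, so 'docs' matches a folder or file named 'docs' anywhere in the path but not 'mydocs'
- For folder filtering, omit the trailing slash: the prefix check appends '/' itself, so 'docs/project/' fails that check and matches only through the component sequence check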
- The function performs multiple matching strategies in order: exact match, prefix match, sequence match, then filename match - the first match returns True immediately
- Enable debug logging to see detailed matching information for troubleshooting filter behavior
- Path component sequence matching allows partial paths to match (e.g., 'project/file.txt' matches 'root/docs/project/file.txt')
- Empty path components (from double slashes) are dropped during path splitting, which affects only the sequence check; the exact and prefix checks compare the normalized strings as given
- Matching is always case-sensitive, even on case-insensitive file systems, so filter casing must match the stored document paths exactly
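A few of these points can be checked directly. The sketch below uses a condensed restatement of the documented logic with logging omitted, so that it runs on its own. `matches_ignoring_case` is a hypothetical wrapper, shown as one possible way to get case-insensitive filtering; it is not part of the original module.

```python
from pathlib import Path
from typing import List


def matches_source_filter(doc_path: str, source_filters: List[str]) -> bool:
    """Condensed restatement of the documented checks (logging omitted)."""
    doc = doc_path.replace("\\", "/")
    doc_parts = [p for p in doc.split("/") if p]
    for raw in source_filters:
        flt = raw.replace("\\", "/")
        if doc == flt or doc.startswith(flt + "/"):
            return True
        parts = [p for p in flt.split("/") if p]
        n = len(parts)
        if any(doc_parts[i:i + n] == parts for i in range(len(doc_parts) - n + 1)):
            return True
        if "/" not in flt and Path(doc).name == Path(flt).name:
            return True
    return False


def matches_ignoring_case(doc_path: str, source_filters: List[str]) -> bool:
    """Hypothetical wrapper: lowercase both sides before matching."""
    return matches_source_filter(doc_path.lower(), [f.lower() for f in source_filters])


# Case sensitivity: differing case does not match unless both sides are lowercased.
print(matches_source_filter("Docs/Readme.md", ["readme.md"]))   # False
print(matches_ignoring_case("Docs/Readme.md", ["readme.md"]))   # True

# Single-component filters match complete components only.
print(matches_source_filter("archive/docs/notes.txt", ["docs"]))  # True
print(matches_source_filter("mydocs/notes.txt", ["docs"]))        # False

# A trailing slash still matches, via the sequence check.
print(matches_source_filter("docs/project/a.txt", ["docs/project/"]))  # True

# An empty-string filter splits into zero components and matches any document.
print(matches_source_filter("any/file.txt", [""]))  # True
```

The last case suggests dropping empty strings from user-supplied filters before calling the function, for example with `[f for f in filters if f.strip()]`.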
Similar Components
AI-powered semantic similarity - components with related functionality:
- function get_folder_documents 49.3% similar
- function is_valid_document_file 46.6% similar
- function search_documents 45.8% similar
- function search_documents_v1 44.6% similar
- function search_documents_in_filecloud 44.1% similar