function analyze_file_types
Analyzes file types within a replica database structure, counting different file categories and tracking file extensions.
/tf/active/vicechatdev/e-ink-llm/cloudtest/analyze_replica.py
132 - 168
simple
Purpose
This function processes a database dictionary containing nodes with extracted files, categorizing them by type (PDF, notebook, RM, content files) and counting occurrences of each file extension. It's designed for analyzing file distribution in a replica system, likely for a note-taking or document management application (possibly reMarkable tablet data).
Source Code
def analyze_file_types(database: Dict[str, Any]) -> Dict[str, Any]:
    """Analyze file types in the replica"""
    nodes = database.get('nodes', {})

    file_stats = {
        'pdf_files': 0,
        'notebook_files': 0,
        'rm_files': 0,
        'content_files': 0,
        'total_extracted_files': 0,
        'file_extensions': {}
    }

    for uuid, node in nodes.items():
        extracted_files = node.get('extracted_files', [])
        file_stats['total_extracted_files'] += len(extracted_files)

        for file_path in extracted_files:
            file_path_obj = Path(file_path)
            ext = file_path_obj.suffix.lower()

            if ext == '.pdf':
                file_stats['pdf_files'] += 1
            elif ext == '.rm':
                file_stats['rm_files'] += 1
            elif file_path_obj.name == 'content':
                file_stats['content_files'] += 1
            elif '_notebook' in str(file_path_obj.parent):
                file_stats['notebook_files'] += 1

            # Count extensions
            if ext:
                file_stats['file_extensions'][ext] = file_stats['file_extensions'].get(ext, 0) + 1
            else:
                file_stats['file_extensions']['[no extension]'] = file_stats['file_extensions'].get('[no extension]', 0) + 1

    return file_stats
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| database | Dict[str, Any] | - | positional_or_keyword |
Parameter Details
database: A dictionary containing replica data under a 'nodes' key. 'nodes' maps each node's UUID to a node dictionary, and each node may contain an 'extracted_files' list of file path strings. Expected structure: {'nodes': {uuid: {'extracted_files': [file_paths]}}}. Other keys on a node are ignored by this function.
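The expected structure can be sketched with a small typed helper. This is an illustration only: `Node` and `iter_extracted_files` are hypothetical names that do not appear in analyze_replica.py, and real nodes probably carry more keys than the one shown.

```python
from typing import Any, Dict, Iterator, List, Tuple, TypedDict


class Node(TypedDict, total=False):
    # Hypothetical type: lists only the key that analyze_file_types reads.
    extracted_files: List[str]


def iter_extracted_files(database: Dict[str, Any]) -> Iterator[Tuple[str, str]]:
    """Yield (uuid, file_path) pairs, tolerating missing keys."""
    nodes: Dict[str, Node] = database.get('nodes', {})
    for uuid, node in nodes.items():
        for file_path in node.get('extracted_files', []):
            yield uuid, file_path


database = {
    'nodes': {
        'uuid-1': {'extracted_files': ['/path/to/document.pdf']},
        'uuid-2': {},  # no 'extracted_files' key: contributes nothing
    }
}
pairs = list(iter_extracted_files(database))
```

The walk uses the same `.get()` defaults as the function, so a node without 'extracted_files' is skipped rather than raising an error.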
Return Value
Type: Dict[str, Any]
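Returns a dictionary with file statistics containing: 'pdf_files' (int: count of files with a .pdf extension), 'rm_files' (int: count of files with a .rm extension), 'content_files' (int: count of files named exactly 'content'), 'notebook_files' (int: count of files whose parent path contains '_notebook' and that did not already match one of the previous three categories), 'total_extracted_files' (int: total file count across all nodes), and 'file_extensions' (dict: mapping of lowercased extensions to their counts, with '[no extension]' for files without an extension). Each file increments at most one category counter, and files that match no category are counted only in 'total_extracted_files' and 'file_extensions', so the category counters do not necessarily sum to the total.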
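Because the category checks form an if/elif chain, the order of the checks decides which counter a file increments. The sketch below isolates that decision for a single path. `classify` is a hypothetical helper written for illustration and is not part of analyze_replica.py; it mirrors the order of the branches in the source code above.

```python
from pathlib import Path
from typing import Optional


def classify(file_path: str) -> Optional[str]:
    """Return the counter key a path would increment, or None if it matches no category."""
    path = Path(file_path)
    ext = path.suffix.lower()
    if ext == '.pdf':
        return 'pdf_files'
    if ext == '.rm':
        return 'rm_files'
    if path.name == 'content':
        return 'content_files'
    if '_notebook' in str(path.parent):
        return 'notebook_files'
    return None


# A .rm file inside a notebook directory counts as an RM file, not a notebook file.
rm_in_notebook = classify('/path/to/notes_notebook/page1.rm')
# Any other file inside a notebook directory falls through to the notebook counter.
other_in_notebook = classify('/path/to/notes_notebook/page1.metadata')
# A file matching none of the checks increments no category counter.
uncategorized = classify('/path/to/file.txt')
```

One consequence is that 'notebook_files' can undercount the files stored in notebook directories whenever those directories hold .pdf, .rm, or 'content' files.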
Required Imports
from pathlib import Path
from typing import Dict, Any
Usage Example
# Assumes analyze_replica.py is importable from the current working directory
from analyze_replica import analyze_file_types

# Example usage
database = {
    'nodes': {
        'uuid-1': {
            'extracted_files': [
                '/path/to/document.pdf',
                '/path/to/notes_notebook/page1.rm',
                '/path/to/content'
            ]
        },
        'uuid-2': {
            'extracted_files': [
                '/path/to/another.pdf',
                '/path/to/file.txt'
            ]
        }
    }
}

results = analyze_file_types(database)
print(f"Total files: {results['total_extracted_files']}")  # Total files: 5
print(f"PDF files: {results['pdf_files']}")  # PDF files: 2
print(f"Extensions: {results['file_extensions']}")
# Extensions: {'.pdf': 2, '.rm': 1, '[no extension]': 1, '.txt': 1}
Best Practices
- Ensure the database parameter contains a 'nodes' key with properly structured node data to avoid empty results
- File paths in 'extracted_files' should be valid path strings that can be processed by pathlib.Path
- The function safely handles missing 'nodes' or 'extracted_files' keys by using .get() with default values
- File extension matching is case-insensitive (uses .lower())
- Files without extensions are tracked under the '[no extension]' key in file_extensions dictionary
- The function does not check whether the file paths exist on the filesystem; it only analyzes the path strings stored in the database
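The extension-related points above depend on how `pathlib.Path.suffix` behaves. The sketch below reproduces the key used for the 'file_extensions' dictionary; `extension_key` is a hypothetical helper introduced for illustration, not a function from the source file.

```python
from pathlib import Path


def extension_key(file_path: str) -> str:
    """Return the 'file_extensions' key that a path would be counted under."""
    ext = Path(file_path).suffix.lower()
    return ext if ext else '[no extension]'


# Matching is case-insensitive because the suffix is lowercased.
upper_case = extension_key('/path/to/Scan.PDF')
# Only the last suffix is used, so multi-part extensions are counted by their final part.
multi_part = extension_key('/path/to/archive.tar.gz')
# A name that consists only of a leading dot and text has no suffix in pathlib.
hidden = extension_key('/path/to/.hidden')
# A bare name such as 'content' has no suffix either.
bare = extension_key('/path/to/content')
```

Here `upper_case` is '.pdf', `multi_part` is '.gz', and both `hidden` and `bare` are '[no extension]'.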
Similar Components
AI-powered semantic similarity - components with related functionality:
- function print_database_analysis (65.7% similar)
- function analyze_rm_filename_patterns (55.0% similar)
- function analyze_pylontech_document (54.2% similar)
- function main_v113 (54.1% similar)
- function analyze_hierarchy (53.3% similar)