
function analyze_file_types

Maturity: 49

Analyzes file types within a replica database structure, counting different file categories and tracking file extensions.

File:
/tf/active/vicechatdev/e-ink-llm/cloudtest/analyze_replica.py
Lines:
132 - 168
Complexity:
simple

Purpose

This function walks a database dictionary whose nodes list their extracted files, tallies each file into at most one category (PDF, .rm, content, or notebook), and counts how often each file extension occurs. It is meant for inspecting the file distribution of a replica; the sibling functions in the same module indicate that the replica mirrors reMarkable tablet data.

Source Code

def analyze_file_types(database: Dict[str, Any]) -> Dict[str, Any]:
    """Analyze file types in the replica"""
    nodes = database.get('nodes', {})
    
    file_stats = {
        'pdf_files': 0,
        'notebook_files': 0,
        'rm_files': 0,
        'content_files': 0,
        'total_extracted_files': 0,
        'file_extensions': {}
    }
    
    for uuid, node in nodes.items():
        extracted_files = node.get('extracted_files', [])
        file_stats['total_extracted_files'] += len(extracted_files)
        
        for file_path in extracted_files:
            file_path_obj = Path(file_path)
            ext = file_path_obj.suffix.lower()
            
            if ext == '.pdf':
                file_stats['pdf_files'] += 1
            elif ext == '.rm':
                file_stats['rm_files'] += 1
            elif file_path_obj.name == 'content':
                file_stats['content_files'] += 1
            elif '_notebook' in str(file_path_obj.parent):
                file_stats['notebook_files'] += 1
            
            # Count extensions
            if ext:
                file_stats['file_extensions'][ext] = file_stats['file_extensions'].get(ext, 0) + 1
            else:
                file_stats['file_extensions']['[no extension]'] = file_stats['file_extensions'].get('[no extension]', 0) + 1
    
    return file_stats

Parameters

Name      Type            Default  Kind
database  Dict[str, Any]  -        positional_or_keyword

Parameter Details

database: A dictionary containing replica data under a 'nodes' key. 'nodes' maps each node's UUID to a node dictionary, and each node dictionary may hold an 'extracted_files' list of file path strings. Expected structure: {'nodes': {uuid: {'extracted_files': [file_paths]}}}
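
The expected structure can be checked before analysis. The sketch below is a hypothetical helper (find_structure_problems is not part of analyze_replica.py) that reports deviations from the layout described above. One case worth catching: if 'extracted_files' is a single string instead of a list, the function would iterate over it character by character and treat each character as a path.

```python
from typing import Any, Dict, List


def find_structure_problems(database: Dict[str, Any]) -> List[str]:
    """Return human-readable problems with the expected database layout.

    Hypothetical helper for illustration; not part of analyze_replica.py.
    """
    nodes = database.get('nodes')
    if not isinstance(nodes, dict):
        return ["'nodes' is missing or is not a dict"]

    problems: List[str] = []
    for uuid, node in nodes.items():
        if not isinstance(node, dict):
            problems.append(f"node {uuid!r} is not a dict")
            continue
        # A missing 'extracted_files' key is fine: the analyzer defaults to [].
        files = node.get('extracted_files', [])
        if not isinstance(files, list):
            problems.append(f"node {uuid!r}: 'extracted_files' is not a list")
        elif not all(isinstance(path, str) for path in files):
            problems.append(f"node {uuid!r}: 'extracted_files' has non-string entries")
    return problems


good = {'nodes': {'uuid-1': {'extracted_files': ['/path/to/document.pdf']}}}
bad = {'nodes': {'uuid-1': {'extracted_files': '/path/to/document.pdf'}}}

print(find_structure_problems(good))
print(find_structure_problems(bad))
```

Returning a list of problems rather than raising lets the caller decide whether a malformed node should abort the analysis or only be logged.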

Return Value

Type: Dict[str, Any]

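Returns a dictionary with file statistics containing: 'pdf_files' (int: count of files with a .pdf extension), 'rm_files' (int: count of files with a .rm extension), 'content_files' (int: files whose full name is exactly 'content'), 'notebook_files' (int: files whose parent path contains '_notebook' and that matched none of the previous three checks), 'total_extracted_files' (int: total file count), and 'file_extensions' (dict: mapping of lowercased extensions to their counts, with '[no extension]' for files without an extension). The four category checks are evaluated in the order pdf, rm, content, notebook, so each file is counted in at most one category, and the category counts need not sum to 'total_extracted_files'.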
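
Because the category checks in the source form a single if/elif chain, a file lands in at most one category and some files land in none. The sketch below illustrates this with made-up paths; the function body is a copy of the one in the Source Code section, included only so the snippet runs on its own.

```python
from pathlib import Path
from typing import Any, Dict


def analyze_file_types(database: Dict[str, Any]) -> Dict[str, Any]:
    """Copy of the function from the Source Code section."""
    nodes = database.get('nodes', {})
    file_stats = {
        'pdf_files': 0, 'notebook_files': 0, 'rm_files': 0,
        'content_files': 0, 'total_extracted_files': 0,
        'file_extensions': {},
    }
    for uuid, node in nodes.items():
        extracted_files = node.get('extracted_files', [])
        file_stats['total_extracted_files'] += len(extracted_files)
        for file_path in extracted_files:
            file_path_obj = Path(file_path)
            ext = file_path_obj.suffix.lower()
            if ext == '.pdf':
                file_stats['pdf_files'] += 1
            elif ext == '.rm':
                file_stats['rm_files'] += 1
            elif file_path_obj.name == 'content':
                file_stats['content_files'] += 1
            elif '_notebook' in str(file_path_obj.parent):
                file_stats['notebook_files'] += 1
            key = ext if ext else '[no extension]'
            file_stats['file_extensions'][key] = file_stats['file_extensions'].get(key, 0) + 1
    return file_stats


# Illustrative paths only; they are not taken from a real replica.
database = {
    'nodes': {
        'uuid-1': {
            'extracted_files': [
                '/path/to/notes_notebook/page1.rm',    # .rm check wins over the notebook check
                '/path/to/notes_notebook/page1.json',  # falls through to notebook_files
                '/path/to/content',                    # full name is exactly 'content'
                '/path/to/abc.content',                # name is 'abc.content': no category
                '/path/to/Report.PDF',                 # suffix is lowercased before matching
            ]
        }
    }
}

stats = analyze_file_types(database)
categorized = (stats['pdf_files'] + stats['rm_files']
               + stats['content_files'] + stats['notebook_files'])
print(f"Categorized {categorized} of {stats['total_extracted_files']} files")
print(stats['file_extensions'])
```

Here four of the five files are categorized; 'abc.content' is still counted in 'total_extracted_files' and under the '.content' extension.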

Required Imports

from pathlib import Path
from typing import Dict, Any

Usage Example

# Assumes analyze_replica.py (the source file listed above) is on the import path
from analyze_replica import analyze_file_types

# Example usage
database = {
    'nodes': {
        'uuid-1': {
            'extracted_files': [
                '/path/to/document.pdf',
                '/path/to/notes_notebook/page1.rm',
                '/path/to/content'
            ]
        },
        'uuid-2': {
            'extracted_files': [
                '/path/to/another.pdf',
                '/path/to/file.txt'
            ]
        }
    }
}

results = analyze_file_types(database)
print(f"Total files: {results['total_extracted_files']}")
print(f"PDF files: {results['pdf_files']}")
print(f"Extensions: {results['file_extensions']}")

Best Practices

  • Ensure the database parameter contains a 'nodes' key with properly structured node data to avoid empty results
  • File paths in 'extracted_files' should be valid path strings that can be processed by pathlib.Path
  • The function safely handles missing 'nodes' or 'extracted_files' keys by using .get() with default values
  • File extension matching is case-insensitive (uses .lower())
  • Files without extensions are tracked under the '[no extension]' key in file_extensions dictionary
  • The function does not check whether file paths exist on the filesystem; it only analyzes the path strings stored in the database
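
The extension-related points above follow from how pathlib.Path.suffix behaves. The sketch below shows the lowercased suffix for a few made-up paths; note that only the last suffix of a multi-part extension is used, and that a dotfile with no further dot has an empty suffix, so it would be counted under '[no extension]'.

```python
from pathlib import Path

# Illustrative paths only; none of them needs to exist on disk.
samples = [
    '/path/to/Report.PDF',       # upper-case extension
    '/path/to/archive.tar.gz',   # multi-part extension
    '/path/to/content',          # no extension at all
    '/path/to/.hidden',          # dotfile without an extension
]

# Same expression the function uses to derive the extension key.
suffixes = {sample: Path(sample).suffix.lower() for sample in samples}

for sample, suffix in suffixes.items():
    print(Path(sample).name, repr(suffix))
```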

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function print_database_analysis 65.7% similar

    Prints a comprehensive, formatted analysis of a reMarkable tablet replica database, including statistics, hierarchy information, file types, and a content tree visualization.

    From: /tf/active/vicechatdev/e-ink-llm/cloudtest/analyze_replica.py
  • function analyze_rm_filename_patterns 55.0% similar

    Analyzes and documents the rm-filename header patterns used in reMarkable cloud sync API requests by examining raw log data and printing a comprehensive report of file naming conventions, upload sequences, and implementation requirements.

    From: /tf/active/vicechatdev/e-ink-llm/cloudtest/analyze_headers.py
  • function analyze_pylontech_document 54.2% similar

    Performs deep forensic analysis of a specific Pylontech document stored in reMarkable Cloud, examining all document components (content, metadata, pagedata, PDF) to identify patterns and differences between app-uploaded and API-uploaded documents.

    From: /tf/active/vicechatdev/e-ink-llm/cloudtest/analyze_pylontech_details.py
  • function main_v113 54.1% similar

    Analyzes and compares .content files for PDF documents stored in reMarkable cloud storage, identifying differences between working and non-working documents.

    From: /tf/active/vicechatdev/e-ink-llm/cloudtest/analyze_content_files.py
  • function analyze_hierarchy 53.3% similar

    Analyzes a hierarchical database structure to extract statistics about nodes, their relationships, depths, and types.

    From: /tf/active/vicechatdev/e-ink-llm/cloudtest/analyze_replica.py