🔍 Code Extractor

function calculate_similarity

Maturity: 54

Computes the cosine similarity between two embedding vectors, returning a score that measures their directional alignment (between 0 and 1 for typical embeddings with non-negative components).

File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
Lines: 6-21
Complexity: simple

Purpose

This function calculates the cosine similarity between two numerical vectors, a metric commonly used in machine learning and NLP to measure semantic similarity between embeddings, compare document representations, or find nearest neighbors in vector spaces. Cosine similarity is the cosine of the angle between the two vectors: 1 indicates identical direction, 0 indicates orthogonality, and -1 indicates opposite directions. For embeddings with non-negative components, scores fall between 0 and 1.
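
As a minimal sketch of the underlying formula (the cosine_manual helper below is hypothetical, not part of the documented module), the same score can be computed directly with NumPy:

import numpy as np

def cosine_manual(vec1, vec2):
    # cos(theta) = (a . b) / (||a|| * ||b||)
    a = np.asarray(vec1, dtype=float)
    b = np.asarray(vec2, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_manual([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))  # ~1.0: same direction
print(cosine_manual([1.0, 0.0], [0.0, 1.0]))            # 0.0: orthogonal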

Source Code

def calculate_similarity(vec1: List[float], vec2: List[float]) -> float:
    """
    Calculate cosine similarity between two embedding vectors.
    
    Args:
        vec1: First embedding vector
        vec2: Second embedding vector
        
    Returns:
        Cosine similarity score between 0 and 1
    """
    # Reshape vectors for sklearn's cosine_similarity
    v1 = np.array(vec1).reshape(1, -1)
    v2 = np.array(vec2).reshape(1, -1)
    
    return float(cosine_similarity(v1, v2)[0][0])

Parameters

Name  Type         Default  Kind
vec1  List[float]  -        positional_or_keyword
vec2  List[float]  -        positional_or_keyword

Parameter Details

vec1: First embedding vector as a list of floating-point numbers. Must be a non-empty list with the same dimensionality as vec2. Typically represents a numerical embedding from a machine learning model (e.g., word embeddings, sentence embeddings, or feature vectors).

vec2: Second embedding vector as a list of floating-point numbers. Must be a non-empty list with the same dimensionality as vec1. Should represent the same type of embedding as vec1 for meaningful comparison.
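
The function does not validate its inputs itself; as a hedged sketch (validate_vectors is a hypothetical helper, not part of the documented module), callers can guard against empty or mismatched vectors before calling it:

from typing import List

def validate_vectors(vec1: List[float], vec2: List[float]) -> None:
    # Raise a clear error before sklearn surfaces a lower-level shape error.
    if not vec1 or not vec2:
        raise ValueError("Embedding vectors must be non-empty")
    if len(vec1) != len(vec2):
        raise ValueError(f"Dimension mismatch: {len(vec1)} vs {len(vec2)}")

validate_vectors([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])   # passes silently
# validate_vectors([1.0, 2.0], [1.0, 2.0, 3.0])      # would raise ValueError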

Return Value

Type: float

Returns a float representing the cosine similarity of the two input vectors. Mathematically, cosine similarity ranges from -1 to 1: 1 means the vectors point in the same direction (maximum similarity), 0 means they are orthogonal, and -1 means they point in opposite directions. The docstring advertises a 0-to-1 range because the embedding vectors used here are expected to have non-negative components; for arbitrary inputs with negative components, the result can be negative.
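
For illustration (using the imports listed below), opposite-direction vectors show why the mathematical range extends below zero:

# Vectors with negative components can produce negative scores.
opposite = calculate_similarity([1.0, 0.0], [-1.0, 0.0])
print(opposite)  # -1.0: the vectors point in exactly opposite directions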

Dependencies

  • numpy
  • scikit-learn
  • typing

Required Imports

import numpy as np
from typing import List
from sklearn.metrics.pairwise import cosine_similarity

Usage Example

import numpy as np
from typing import List
from sklearn.metrics.pairwise import cosine_similarity

def calculate_similarity(vec1: List[float], vec2: List[float]) -> float:
    v1 = np.array(vec1).reshape(1, -1)
    v2 = np.array(vec2).reshape(1, -1)
    return float(cosine_similarity(v1, v2)[0][0])

# Example usage
vector1 = [1.0, 2.0, 3.0, 4.0]
vector2 = [2.0, 4.0, 6.0, 8.0]

similarity_score = calculate_similarity(vector1, vector2)
print(f"Cosine similarity: {similarity_score}")
# Output: Cosine similarity: ~1.0 (the vectors point in the same direction)

# Compare different vectors
vector3 = [1.0, 0.0, 0.0, 0.0]
vector4 = [0.0, 1.0, 0.0, 0.0]
similarity_score2 = calculate_similarity(vector3, vector4)
print(f"Cosine similarity: {similarity_score2}")
# Output: Cosine similarity: 0.0 (vectors are orthogonal)

Best Practices

  • Ensure both input vectors have the same dimensionality; mismatched dimensions will cause numpy/sklearn errors
  • Input vectors should be non-empty lists to avoid division by zero or invalid operations
  • For large-scale similarity computations, consider batch processing with sklearn's cosine_similarity on 2D arrays instead of calling this function repeatedly (see the sketch after this list)
  • Be aware that cosine similarity is scale-invariant (only considers direction, not magnitude), so vectors [1,2,3] and [2,4,6] will have similarity of 1.0
  • If working with sparse vectors or very high-dimensional data, consider using scipy.sparse matrices for memory efficiency
  • The function converts the result to float explicitly, which is useful for JSON serialization or database storage
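
As a sketch of the batch approach mentioned above (the array contents are illustrative only), a single call to sklearn's cosine_similarity on a 2D array returns all pairwise scores at once:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stack many embeddings into one array of shape (n_vectors, n_dimensions).
embeddings = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [1.0, 0.0, 0.0, 0.0],
])

# One call computes the full (n, n) pairwise similarity matrix,
# avoiding a Python-level loop over calculate_similarity.
matrix = cosine_similarity(embeddings)
print(matrix.shape)    # (3, 3)
print(matrix[0][1])    # ~1.0: rows 0 and 1 point in the same direction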

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function build_similarity_matrix (63.2% similar)

    Computes a pairwise cosine similarity matrix for a collection of embedding vectors, where each cell (i,j) represents the similarity between embedding i and embedding j.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py

  • function find_similar_documents (46.9% similar)

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py

  • function calculate_cv_v1 (46.5% similar)

    Calculates the Coefficient of Variation (CV) for a dataset, expressed as a percentage. CV measures relative variability by dividing standard deviation by mean.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/d1e252f5-950c-4ad7-b425-86b4b02c3c62/analysis_4.py

  • function calculate_cv (43.8% similar)

    Calculates the coefficient of variation (CV) for a dataset, expressed as a percentage of the standard deviation relative to the mean.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/d48d7789-9627-4e96-9f48-f90b687cd07d/analysis_1.py

  • function calculate_cv_v2 (42.1% similar)

    Calculates the coefficient of variation (CV) for a group of numerical values, expressed as a percentage.

    From: /tf/active/vicechatdev/vice_ai/smartstat_scripts/f5da873e-41e6-4f34-b3e4-f7443d4d213b/analysis_5.py