
function build_similarity_matrix

Maturity: 51

Computes a pairwise cosine similarity matrix for a collection of embedding vectors, where each cell (i,j) represents the similarity between embedding i and embedding j.

File: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
Lines: 24 - 35
Complexity: simple

Purpose

This function measures the similarity between every pair of embedding vectors in a single call by building a square matrix of cosine similarity scores. Common use cases include clustering similar documents, finding duplicate or near-duplicate content, building recommendation systems, ranking semantic search results, and analyzing relationships between text embeddings. The resulting matrix is symmetric, and its diagonal is 1.0 (perfect self-similarity) for every non-zero vector.

Source Code

def build_similarity_matrix(embeddings: List[List[float]]) -> np.ndarray:
    """
    Build a similarity matrix for all embeddings.
    
    Args:
        embeddings: List of embedding vectors
        
    Returns:
        Similarity matrix as numpy array
    """
    embeddings_array = np.array(embeddings)
    return cosine_similarity(embeddings_array)

Parameters

Name        Type               Default  Kind
embeddings  List[List[float]]  -        positional_or_keyword

Parameter Details

embeddings: A list of embedding vectors where each embedding is represented as a list of floats. All embeddings must have the same dimensionality. Typically these are dense vector representations from models like Word2Vec, BERT, or OpenAI embeddings. Empty lists or embeddings with inconsistent dimensions will cause errors. Expected shape: [[float, float, ...], [float, float, ...], ...]
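The input requirements above can be checked up front so that failures carry a clear message. The following is a minimal sketch, not part of the documented module: `validate_embeddings` is a hypothetical helper name, and the specific error messages are illustrative.

```python
import numpy as np
from typing import List


def validate_embeddings(embeddings: List[List[float]]) -> np.ndarray:
    """Hypothetical helper: return a 2D float64 array or raise ValueError."""
    if len(embeddings) == 0:
        raise ValueError("embeddings is empty")
    # Compare vector lengths explicitly instead of relying on np.array
    # to reject ragged input.
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    arr = np.asarray(embeddings, dtype=np.float64)
    # Reject NaN and +/-inf before any similarity computation.
    if not np.isfinite(arr).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return arr


arr = validate_embeddings([[1.0, 0.5, 0.2], [0.9, 0.6, 0.1]])
print(arr.shape)
```

The validated array could then be passed to build_similarity_matrix, which accepts anything np.array can convert.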

Return Value

Type: np.ndarray

Returns a 2D NumPy array of cosine similarity scores. The matrix is square, with shape (n, n) where n is the number of input embeddings. Values range from -1 to 1: 1 means the two vectors point in the same direction, 0 means they are orthogonal, and -1 means they point in opposite directions. The matrix is symmetric (matrix[i][j] == matrix[j][i], up to floating-point rounding), and the diagonal is 1.0 for every non-zero vector (self-similarity). For float input the dtype is float64.
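The contents of the returned matrix can be illustrated in plain NumPy: scale each row to unit length, then take all pairwise dot products. This sketch shows the underlying math only and is not the library's implementation; `cosine_matrix_numpy` is a hypothetical name used for illustration.

```python
import numpy as np


def cosine_matrix_numpy(embeddings):
    """Illustrative NumPy-only cosine similarity matrix (hypothetical helper)."""
    x = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    # Avoid division by zero: all-zero vectors stay all-zero,
    # so their similarity to everything (including themselves) is 0.
    norms[norms == 0.0] = 1.0
    unit = x / norms
    return unit @ unit.T


embeddings = [
    [1.0, 0.5, 0.2],
    [0.9, 0.6, 0.1],
    [0.1, 0.2, 0.9],
]
sim = cosine_matrix_numpy(embeddings)

print(sim.shape)                        # square: (n, n)
print(np.allclose(sim, sim.T))          # symmetric
print(np.allclose(np.diag(sim), 1.0))   # unit diagonal for non-zero vectors
```

Embeddings 0 and 1 point in nearly the same direction, so sim[0, 1] is close to 1, while sim[0, 2] is much lower.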

Dependencies

  • numpy
  • scikit-learn (imported as sklearn)
  • typing

Required Imports

import numpy as np
from typing import List
from sklearn.metrics.pairwise import cosine_similarity

Usage Example

import numpy as np
from typing import List
from sklearn.metrics.pairwise import cosine_similarity

def build_similarity_matrix(embeddings: List[List[float]]) -> np.ndarray:
    embeddings_array = np.array(embeddings)
    return cosine_similarity(embeddings_array)

# Example usage
embeddings = [
    [1.0, 0.5, 0.2],
    [0.9, 0.6, 0.1],
    [0.1, 0.2, 0.9]
]

similarity_matrix = build_similarity_matrix(embeddings)
print(similarity_matrix)
print(f"Shape: {similarity_matrix.shape}")
print(f"Similarity between embedding 0 and 1: {similarity_matrix[0][1]:.4f}")

Best Practices

  • Ensure all embeddings have the same dimensionality before passing to this function
  • For large datasets (thousands of embeddings), consider the memory requirements as the resulting matrix size is O(n²)
  • You do not need to normalize embeddings beforehand: sklearn's cosine_similarity L2-normalizes each vector internally, so the result is the same either way
  • Consider batch processing for very large embedding collections to avoid memory issues; sparse matrices help only when the embeddings themselves are sparse, and this function converts its input to a dense array
  • The resulting matrix is symmetric, so you only need to process the upper or lower triangle for efficiency in downstream tasks
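  • Check for NaN or infinite values in input embeddings before calling this function; with default settings, sklearn's input validation rejects them with a ValueError rather than producing a similarity matrix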
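The memory and symmetry points above can be combined. A dense float64 matrix needs 8·n² bytes, so 50,000 embeddings take roughly 20 GB. The sketch below computes similarities one block of rows at a time and reports each pair only once (i < j), so the full matrix is never held in memory. It uses NumPy only; `iter_similar_pairs`, `threshold`, and `chunk_size` are hypothetical names chosen for illustration and are not part of the documented module.

```python
import numpy as np
from typing import Iterator, List, Tuple


def iter_similar_pairs(
    embeddings: List[List[float]],
    threshold: float = 0.9,
    chunk_size: int = 1024,
) -> Iterator[Tuple[int, int, float]]:
    """Yield (i, j, similarity) for i < j with similarity >= threshold."""
    x = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # all-zero vectors get similarity 0
    unit = x / norms
    n = unit.shape[0]
    for start in range(0, n, chunk_size):
        # Only a (chunk_size, n) block exists at any one time.
        block = unit[start:start + chunk_size] @ unit.T
        rows, cols = np.nonzero(block >= threshold)
        for r, c in zip(rows, cols):
            i, j = start + int(r), int(c)
            if i < j:  # upper triangle only: skips the diagonal and mirrored pairs
                yield i, j, float(block[r, c])


embeddings = [
    [1.0, 0.5, 0.2],
    [0.9, 0.6, 0.1],
    [0.1, 0.2, 0.9],
]
pairs = list(iter_similar_pairs(embeddings, threshold=0.9, chunk_size=2))
print(pairs)
```

Peak memory for the similarity scores scales with chunk_size × n instead of n², at the cost of a Python-level loop over the matching pairs.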

Similar Components

AI-powered semantic similarity - components with related functionality:

  • function calculate_similarity 63.2% similar

    Computes the cosine similarity between two embedding vectors, returning a normalized score between 0 and 1 that measures their directional alignment.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • function find_similar_documents 58.4% similar

    Identifies pairs of similar documents by comparing their embeddings and returns those exceeding a specified similarity threshold, sorted by similarity score.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
  • class SimilarityCleaner 45.3% similar

    A document cleaning class that identifies and removes duplicate or highly similar documents based on embedding vector similarity, keeping only representative documents from each similarity group.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/cleaners/similarity_cleaner.py
  • class TextClusterer 39.4% similar

    A class that clusters similar documents based on their embeddings using various clustering algorithms (K-means, Agglomerative, DBSCAN) and optionally generates summaries for each cluster.

    From: /tf/active/vicechatdev/chromadb-cleanup/src/clustering/text_clusterer.py
  • function cross_index 37.8% similar

    Efficiently indexes into a Cartesian product of iterables without materializing the full product, using a linear index to retrieve the corresponding tuple of values.

    From: /tf/active/vicechatdev/patches/util.py