function build_similarity_matrix
Computes a pairwise cosine similarity matrix for a collection of embedding vectors, where each cell (i,j) represents the similarity between embedding i and embedding j.
/tf/active/vicechatdev/chromadb-cleanup/src/utils/similarity_utils.py
24 - 35
simple
Purpose
This function measures the similarity between every pair of embedding vectors at once by building a square matrix of cosine similarity scores. Common use cases include clustering similar documents, finding duplicate or near-duplicate content, building recommendation systems, ranking semantic search results, and analyzing relationships between text embeddings. The resulting matrix is symmetric, and its diagonal is 1.0 (perfect self-similarity) for every non-zero embedding.
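As an illustration of the near-duplicate use case, the sketch below thresholds the upper triangle of the matrix. The helper name `find_near_duplicates` and the 0.95 threshold are assumptions made for this example and are not part of the source module.

```python
from typing import List, Tuple

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def build_similarity_matrix(embeddings: List[List[float]]) -> np.ndarray:
    # Same body as the documented function, repeated so the sketch is self-contained.
    return cosine_similarity(np.array(embeddings))


def find_near_duplicates(
    embeddings: List[List[float]], threshold: float = 0.95
) -> List[Tuple[int, int, float]]:
    """Hypothetical helper: return (i, j, score) for pairs with i < j and score >= threshold."""
    sim = build_similarity_matrix(embeddings)
    # k=1 skips the diagonal, so each unordered pair is visited exactly once.
    rows, cols = np.triu_indices(sim.shape[0], k=1)
    keep = sim[rows, cols] >= threshold
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows[keep], cols[keep])]


embeddings = [
    [1.0, 0.5, 0.2],
    [0.9, 0.6, 0.1],
    [0.1, 0.2, 0.9],
]
pairs = find_near_duplicates(embeddings, threshold=0.95)
print(pairs)  # only the pair (0, 1) clears the threshold for this data
```

A suitable threshold depends on the embedding model and the data, so the value used here should be treated as a placeholder.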
Source Code
def build_similarity_matrix(embeddings: List[List[float]]) -> np.ndarray:
    """
    Build a similarity matrix for all embeddings.

    Args:
        embeddings: List of embedding vectors

    Returns:
        Similarity matrix as numpy array
    """
    embeddings_array = np.array(embeddings)
    return cosine_similarity(embeddings_array)
Parameters
| Name | Type | Default | Kind |
|---|---|---|---|
| embeddings | List[List[float]] | - | positional_or_keyword |
Parameter Details
embeddings: A list of embedding vectors where each embedding is represented as a list of floats. All embeddings must have the same dimensionality. Typically these are dense vector representations from models like Word2Vec, BERT, or OpenAI embeddings. Empty lists or embeddings with inconsistent dimensions will cause errors. Expected shape: [[float, float, ...], [float, float, ...], ...]
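The input requirements above can be checked before the call. The following is a minimal sketch of such a pre-check; `validate_embeddings` is a hypothetical helper, not something the source module provides, and it uses NumPy only.

```python
from typing import List

import numpy as np


def validate_embeddings(embeddings: List[List[float]]) -> np.ndarray:
    """Hypothetical pre-check: return a 2D float64 array or raise ValueError."""
    if len(embeddings) == 0:
        raise ValueError("embeddings is empty")
    # Collect the distinct vector lengths; a valid input has exactly one.
    dims = {len(vec) for vec in embeddings}
    if len(dims) != 1:
        raise ValueError(f"inconsistent embedding dimensions: {sorted(dims)}")
    if dims == {0}:
        raise ValueError("embeddings have zero dimensions")
    arr = np.asarray(embeddings, dtype=np.float64)
    if not np.isfinite(arr).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return arr


checked = validate_embeddings([[1.0, 0.5, 0.2], [0.9, 0.6, 0.1]])
print(checked.shape)
```

Failing early with a specific message tends to be easier to debug than an error raised from inside the similarity computation.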
Return Value
Type: np.ndarray
Returns a 2D NumPy array of cosine similarity scores with shape (n, n), where n is the number of input embeddings. Values range from -1 to 1: 1 means two vectors point in the same direction (regardless of magnitude), 0 means they are orthogonal, and -1 means they point in opposite directions. The matrix is symmetric (matrix[i][j] == matrix[j][i], up to floating-point rounding), and the diagonal is 1.0 (up to rounding) for every non-zero embedding. An all-zero embedding scores 0 against every vector, including itself. For Python float input the dtype is float64.
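The properties described above can be checked on a small hand-picked input. This sketch calls sklearn's cosine_similarity directly; the four 2D vectors are made up for illustration.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

vectors = np.array([
    [1.0, 0.0],    # reference direction
    [2.0, 0.0],    # same direction as the reference, different length
    [0.0, 3.0],    # orthogonal to the reference
    [-1.0, 0.0],   # opposite to the reference
])
sim = cosine_similarity(vectors)

print(sim.shape)                        # one row and one column per input vector
print(np.allclose(sim, sim.T))          # symmetry check
print(np.allclose(np.diag(sim), 1.0))   # self-similarity check for non-zero vectors
print(sim[0])                           # reference vector against all four vectors
```

Comparing with np.allclose rather than == avoids false failures caused by floating-point rounding.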
Dependencies
- numpy
- sklearn
- typing
Required Imports
import numpy as np
from typing import List
from sklearn.metrics.pairwise import cosine_similarity
Usage Example
import numpy as np
from typing import List
from sklearn.metrics.pairwise import cosine_similarity
def build_similarity_matrix(embeddings: List[List[float]]) -> np.ndarray:
    embeddings_array = np.array(embeddings)
    return cosine_similarity(embeddings_array)

# Example usage
embeddings = [
    [1.0, 0.5, 0.2],
    [0.9, 0.6, 0.1],
    [0.1, 0.2, 0.9]
]
similarity_matrix = build_similarity_matrix(embeddings)
print(similarity_matrix)
print(f"Shape: {similarity_matrix.shape}")
print(f"Similarity between embedding 0 and 1: {similarity_matrix[0][1]:.4f}")
Best Practices
- Ensure all embeddings have the same dimensionality before passing to this function
- For large datasets (thousands of embeddings), consider the memory requirements as the resulting matrix size is O(n²)
- Pre-normalizing the embeddings is not required: sklearn's cosine_similarity L2-normalizes each vector internally, so the scores depend only on direction, not magnitude
- Consider using sparse matrices or batch processing for very large embedding collections to avoid memory issues
- The resulting matrix is symmetric, so you only need to process the upper or lower triangle for efficiency in downstream tasks
- Check for NaN or infinite values in input embeddings before calling this function; sklearn's input validation rejects non-finite values with a ValueError instead of returning a matrix
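The memory and upper-triangle points above can be combined into one batch-processing sketch. It assumes the downstream task only needs pairs above a threshold; `iter_similar_pairs`, the chunk size, and the random test data are hypothetical and not part of the source module.

```python
from typing import Iterator, Tuple

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity


def iter_similar_pairs(
    embeddings: np.ndarray, threshold: float, chunk_size: int = 1024
) -> Iterator[Tuple[int, int, float]]:
    """Hypothetical helper: yield (i, j, score) with i < j without building the full n x n matrix."""
    n = embeddings.shape[0]
    for start in range(0, n, chunk_size):
        stop = min(start + chunk_size, n)
        # Only a block of at most (chunk_size x n) scores is held in memory at a time.
        block = cosine_similarity(embeddings[start:stop], embeddings)
        rows, cols = np.nonzero(block >= threshold)
        for r, c in zip(rows, cols):
            i = start + int(r)
            j = int(c)
            if i < j:  # upper triangle only: skips the diagonal and mirrored pairs
                yield i, j, float(block[r, c])


rng = np.random.default_rng(0)
data = rng.normal(size=(50, 8))  # made-up data standing in for real embeddings
chunked = sorted((i, j) for i, j, _ in iter_similar_pairs(data, threshold=0.5, chunk_size=16))
print(len(chunked))
```

Each block is discarded after it has been scanned, so peak memory grows with chunk_size times n instead of n squared.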
Similar Components
AI-powered semantic similarity - components with related functionality:
- function calculate_similarity 63.2% similar
- function find_similar_documents 58.4% similar
- class SimilarityCleaner 45.3% similar
- class TextClusterer 39.4% similar
- function cross_index 37.8% similar