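This module contains functions and classes for computing similarities across a collection of vectors (documents) in the Vector Space Model.
The main classes, each described below, are a sequential-scan similarity class, MatrixSimilarity, and SparseMatrixSimilarity.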
Once the similarity object has been initialized, you can query for document similarity simply by
>>> similarities = similarity_object[query_vector]
or iterate over within-corpus similarities with
>>> for similarities in similarity_object:
...     ...
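The two access patterns above map onto Python's indexing and iteration protocols. The following is a minimal sketch of that interface, not the module's own implementation: ToySimilarity and its helper are hypothetical names, and documents are assumed to be dense lists of floats.

```python
import math


class ToySimilarity:
    """Hypothetical stand-in for a similarity object; not the module's own class."""

    def __init__(self, corpus):
        # Keep every document in memory as a dense list of floats.
        self.corpus = [list(doc) for doc in corpus]

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return dot / norm if norm else 0.0

    def __getitem__(self, query):
        # similarity_object[query_vector] returns one float per corpus document.
        return [self._cosine(query, doc) for doc in self.corpus]

    def __iter__(self):
        # Iterating yields, for each corpus document, its similarities to the whole corpus.
        for doc in self.corpus:
            yield self[doc]


index = ToySimilarity([[1.0, 0.0], [0.0, 2.0], [3.0, 3.0]])
query_sims = index[[1.0, 0.0]]   # similarities of one query vector
all_sims = list(index)           # within-corpus similarities, one list per document
```

Here all_sims holds one list per document, so its diagonal entries are each document's similarity to itself.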
Compute similarity against a corpus of documents by storing its term-document (or concept-document) matrix in memory. The similarity measure is the cosine between two vectors.
This allows fast similarity searches (each query is a single matrix-vector multiplication), but gives up the memory independence of an iterated corpus: the whole matrix must fit in RAM.
The matrix is internally stored as a numpy array.
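One way to see why a query reduces to a single matrix-vector multiplication: if every row is scaled to unit length when the index is built, the dot product of a unit-length query with each row is exactly the cosine. The sketch below assumes that approach and uses plain lists in place of a numpy array so that it has no dependencies; unit, build_index and query are hypothetical helper names.

```python
import math


def unit(vec):
    """Scale a dense vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def build_index(corpus):
    """Store every document as a unit-length row of an in-memory matrix."""
    return [unit(doc) for doc in corpus]


def query(matrix, vec):
    """Cosine similarities as one matrix-vector product over normalized rows."""
    q = unit(vec)
    return [sum(a * b for a, b in zip(row, q)) for row in matrix]


matrix = build_index([[3.0, 4.0], [0.0, 5.0], [4.0, 0.0]])
sims = query(matrix, [0.0, 2.0])
```

Normalizing once at build time moves the cost of computing norms out of the query path, which is the main reason an in-memory index can answer queries quickly.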
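If numBest is left unspecified, similarity queries return the full list of similarities: one float for every document in the corpus (including the query document itself, if it belongs to the corpus).
If numBest is set, queries return the numBest most similar documents as a list of (document index, similarity) pairs, sorted by decreasing similarity: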
>>> sms = MatrixSimilarity(corpus, numBest=3)
>>> sms[vec12]
[(12, 1.0), (30, 0.95), (5, 0.45)]
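The relationship between the full list and the numBest result can be sketched as a top-n selection over (index, score) pairs. This is an illustration only: n_best is a hypothetical helper, the scores are made-up values, and the library's own selection logic may differ.

```python
import heapq


def n_best(similarities, num_best=None):
    """Return the full list, or the num_best highest (index, score) pairs."""
    if num_best is None:
        return list(similarities)
    # nlargest returns the pairs sorted by decreasing score.
    return heapq.nlargest(num_best, enumerate(similarities), key=lambda pair: pair[1])


full = [0.1, 0.45, 1.0, 0.0, 0.95]   # illustrative scores, one per document
top3 = n_best(full, num_best=3)      # [(2, 1.0), (4, 0.95), (1, 0.45)]
```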
Return similarity of sparse vector doc to all documents in the corpus.
doc may be a bag-of-words iterable (a standard corpus document), a numpy array, or a scipy.sparse matrix.
Compute cosine similarity against a corpus of documents. This is done by a full sequential scan of the corpus.
If your corpus is reasonably small (fits in RAM), consider using MatrixSimilarity or SparseMatrixSimilarity instead, for (much) faster similarity searches.
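A sequential scan can be sketched as a generator that compares the query with one document at a time, so the corpus never has to fit in memory. The example assumes bag-of-words documents given as (term_id, weight) pairs; bow_cosine, scan and stream_corpus are hypothetical names used only for illustration.

```python
import math


def bow_cosine(a, b):
    """Cosine between two bag-of-words documents given as (term_id, weight) pairs."""
    wa, wb = dict(a), dict(b)
    dot = sum(w * wb.get(term, 0.0) for term, w in wa.items())
    norm_a = math.sqrt(sum(w * w for w in wa.values()))
    norm_b = math.sqrt(sum(w * w for w in wb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def scan(corpus, query):
    """Yield one similarity per document, holding only one document at a time."""
    for doc in corpus:
        yield bow_cosine(query, doc)


def stream_corpus():
    # Stand-in for a corpus that is read lazily, for example from disk.
    yield [(0, 1.0), (2, 1.0)]
    yield [(1, 3.0)]
    yield [(0, 2.0)]


sims = list(scan(stream_corpus(), [(0, 1.0)]))
```

The trade-off is that every query pays for a full pass over the corpus, which is why the in-memory classes are preferable when the data fits in RAM.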
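If numBest is left unspecified, similarity queries return the full list of similarities: one float for every document in the corpus (including the query document itself, if it belongs to the corpus).
If numBest is set, queries return the numBest most similar documents as a list of (document index, similarity) pairs, sorted by decreasing similarity: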
>>> sms = Similarity(corpus, numBest=3)
>>> sms[vec12]
[(12, 1.0), (30, 0.95), (5, 0.45)]
Compute similarity against a corpus of documents by storing its sparse term-document (or concept-document) matrix in memory. The similarity measure is the cosine between two vectors.
This allows fast similarity searches (each query is a single sparse matrix-vector multiplication), but gives up the memory independence of an iterated corpus: the whole matrix must fit in RAM.
The matrix is internally stored as a scipy.sparse.csr_matrix.
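To illustrate why the compressed sparse row (CSR) layout suits this kind of query, the sketch below packs bag-of-words rows into the three CSR arrays and multiplies them by a dense vector, touching only the stored non-zero entries. Plain lists stand in for scipy's data structures, and to_csr and csr_matvec are hypothetical helpers, not part of the module.

```python
def to_csr(rows):
    """Pack bag-of-words rows into CSR arrays: data, column indices, row pointers."""
    data, indices, indptr = [], [], [0]
    for row in rows:
        for term_id, weight in sorted(row):
            data.append(weight)
            indices.append(term_id)
        # indptr[i]:indptr[i + 1] delimits the entries of row i.
        indptr.append(len(data))
    return data, indices, indptr


def csr_matvec(data, indices, indptr, vec):
    """Multiply the CSR matrix by a dense vector, visiting only stored entries."""
    result = []
    for start, end in zip(indptr, indptr[1:]):
        result.append(sum(data[k] * vec[indices[k]] for k in range(start, end)))
    return result


data, indices, indptr = to_csr([[(0, 1.0), (3, 2.0)], [], [(2, 0.5)]])
product = csr_matvec(data, indices, indptr, [1.0, 0.0, 4.0, 1.0])
```

Memory use and query cost both scale with the number of non-zero weights rather than with the full matrix size, which is the advantage over a dense array when documents are sparse.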
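If numBest is left unspecified, similarity queries return the full list of similarities: one float for every document in the corpus (including the query document itself, if it belongs to the corpus).
If numBest is set, queries return the numBest most similar documents as a list of (document index, similarity) pairs, sorted by decreasing similarity: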
>>> sms = SparseMatrixSimilarity(corpus, numBest=3)
>>> sms[vec12]
[(12, 1.0), (30, 0.95), (5, 0.45)]
Return similarity of sparse vector doc to all documents in the corpus.
doc may be a bag-of-words iterable (a standard corpus document), a numpy array, or a scipy.sparse matrix.