similarities.docsim – Document similarity queries

This module contains functions and classes for computing similarities across a collection of vectors (documents) in the Vector Space Model.

The main classes are:

  1. Similarity – answers similarity queries by linearly scanning over the corpus. This is slow, but memory use does not grow with the size of the corpus.
  2. MatrixSimilarity – stores the whole corpus in memory, computes similarity by in-memory matrix-vector multiplication. This is much faster than the general Similarity, so use this when dealing with smaller corpora (must fit in RAM).
  3. SparseMatrixSimilarity – same as MatrixSimilarity, but uses less memory if the vectors are sparse.

Once the similarity object has been initialized, you can query for document similarity with:

>>> similarities = similarity_object[query_vector]

or iterate over within-corpus similarities with:

>>> for similarities in similarity_object:
...     pass  # process similarities here

class gensim.similarities.docsim.MatrixSimilarity(corpus, numBest=None, dtype=numpy.float32, numFeatures=None)

Compute similarity against a corpus of documents by storing its term-document (or concept-document) matrix in memory. The similarity measure used is cosine between two vectors.

This allows fast similarity searches (a single matrix-vector multiplication), but loses the memory-independence of an iterative corpus.

The matrix is internally stored as a numpy array.
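
The idea of answering a query with one in-memory matrix-vector product can be sketched with plain NumPy. This is a minimal illustration, not gensim's implementation: the toy matrix, the `normalize_rows` helper and the one-row-per-document layout are all assumptions made for the example.

```python
import numpy as np

def normalize_rows(matrix):
    """Scale each row to unit Euclidean length; all-zero rows stay zero."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty documents
    return matrix / norms

# Hypothetical toy corpus: 3 documents over 4 features, one row per document.
corpus_matrix = np.array(
    [[1.0, 0.0, 2.0, 0.0],
     [0.0, 3.0, 0.0, 0.0],
     [2.0, 0.0, 4.0, 0.0]],
    dtype=np.float32,
)
index = normalize_rows(corpus_matrix)

# The query is normalized the same way, so dot products are cosines.
query = np.array([1.0, 0.0, 2.0, 0.0], dtype=np.float32)
query = query / np.linalg.norm(query)

# One matrix-vector product yields the similarity to every document.
similarities = index @ query
```

Because every row and the query have unit length, each entry of `similarities` is the cosine of the angle between the query and one document.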

If numBest is left unspecified, similarity queries return a full list: one float for every document in the corpus (including the query document itself, if it is part of the corpus).

If numBest is set, queries return the numBest most similar documents as a list of (document index, similarity) pairs, sorted by decreasing similarity:

>>> sms = MatrixSimilarity(corpus, numBest = 3)
>>> sms[vec12]
[(12, 1.0), (30, 0.95), (5, 0.45)]

getSimilarities(doc)

Return similarity of sparse vector doc to all documents in the corpus.

doc may be either a bag-of-words iterable (standard corpus document), or a numpy array, or a scipy.sparse matrix.
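
One way to accept all three input types is to convert them to a common representation before computing similarities. The helper below is a hypothetical sketch (the name `to_dense_vector` and its `num_features` parameter are not part of the documented API), and it assumes a bag-of-words document is an iterable of (feature id, weight) pairs.

```python
import numpy as np
from scipy import sparse

def to_dense_vector(doc, num_features):
    """Convert a supported document representation to a 1-D float32 array."""
    if sparse.issparse(doc):
        # A 1 x num_features sparse matrix becomes a flat dense vector.
        return np.asarray(doc.toarray(), dtype=np.float32).ravel()
    if isinstance(doc, np.ndarray):
        return doc.astype(np.float32).ravel()
    # Otherwise assume a bag-of-words iterable of (feature_id, weight) pairs.
    vec = np.zeros(num_features, dtype=np.float32)
    for feature_id, weight in doc:
        vec[feature_id] = weight
    return vec

# The same document in the three accepted forms.
bow = [(0, 1.0), (2, 2.0)]
from_bow = to_dense_vector(bow, 4)
from_array = to_dense_vector(np.array([1.0, 0.0, 2.0, 0.0]), 4)
from_sparse = to_dense_vector(sparse.csr_matrix([[1.0, 0.0, 2.0, 0.0]]), 4)
```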

classmethod load(fname)
Load a previously saved object from file (also see save).
save(fname)
Save the object to file via pickling (also see load).

class gensim.similarities.docsim.Similarity(corpus, numBest=None)

Compute cosine similarity against a corpus of documents. This is done by a full sequential scan of the corpus.

If your corpus is reasonably small (fits in RAM), consider using MatrixSimilarity or SparseMatrixSimilarity instead, for (much) faster similarity searches.
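
The sequential scan can be pictured as a generator that visits one document at a time, so only the current document needs to be held in memory. This is a sketch under assumptions, not the class's actual code: `cosine` and `scan_similarities` are hypothetical names, and documents are assumed to be lists of (feature id, weight) pairs with unique feature ids.

```python
import math

def cosine(bow_a, bow_b):
    """Cosine similarity between two bag-of-words documents."""
    weights_b = dict(bow_b)
    dot = sum(w * weights_b.get(fid, 0.0) for fid, w in bow_a)
    norm_a = math.sqrt(sum(w * w for _, w in bow_a))
    norm_b = math.sqrt(sum(w * w for w in weights_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty document is treated as similar to nothing
    return dot / (norm_a * norm_b)

def scan_similarities(corpus, query):
    """Yield one similarity per document, reading the corpus once in order."""
    for doc in corpus:
        yield cosine(query, doc)

# Hypothetical toy corpus; any iterable of documents would do,
# including one that streams documents from disk.
corpus = [
    [(0, 1.0), (2, 2.0)],
    [(1, 3.0)],
    [(0, 2.0), (2, 4.0)],
]
sims = list(scan_similarities(corpus, [(0, 1.0), (2, 2.0)]))
```

The cost is one pass over the whole corpus per query, which is why the in-memory classes are much faster when the corpus fits in RAM.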

If numBest is left unspecified, similarity queries return a full list: one float for every document in the corpus (including the query document itself, if it is part of the corpus).

If numBest is set, queries return the numBest most similar documents as a list of (document index, similarity) pairs, sorted by decreasing similarity:

>>> sms = Similarity(corpus, numBest = 3)
>>> sms[vec12]
[(12, 1.0), (30, 0.95), (5, 0.45)]
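
The relationship between the full list and the numBest result can be sketched as a sort followed by a cut. The `top_n` helper is hypothetical and only illustrates the output shape shown above; how the library selects the best documents internally is not stated here.

```python
import numpy as np

def top_n(similarities, n):
    """Return the n highest-scoring (document_index, similarity) pairs, best first."""
    similarities = np.asarray(similarities)
    # Negate so that argsort's ascending order puts the largest scores first.
    order = np.argsort(-similarities, kind="stable")[:n]
    return [(int(i), float(similarities[i])) for i in order]

# A hypothetical full result: one score per document in a 4-document corpus.
full = [0.45, 0.10, 1.0, 0.95]
best = top_n(full, 3)
```

If `n` exceeds the number of documents, the sketch simply returns every document in sorted order.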
classmethod load(fname)
Load a previously saved object from file (also see save).
save(fname)
Save the object to file via pickling (also see load).

class gensim.similarities.docsim.SparseMatrixSimilarity(corpus, numBest=None, dtype=numpy.float32)

Compute similarity against a corpus of documents by storing its sparse term-document (or concept-document) matrix in memory. The similarity measure used is cosine between two vectors.

This allows fast similarity searches (simple sparse matrix-vector multiplication), but loses the memory-independence of an iterative corpus.

The matrix is internally stored as a scipy.sparse.csr matrix.
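
A rough picture of the sparse variant, using SciPy directly: only the non-zero weights are stored, and a query is still answered by a single product. The toy data, the one-row-per-document layout and the dense query vector are assumptions made for this sketch; it does not reproduce the class's internals.

```python
import numpy as np
from scipy import sparse

# Hypothetical toy corpus in coordinate form: 3 documents, 4 features.
rows = [0, 0, 1, 2, 2]
cols = [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 2.0, 4.0]
index = sparse.csr_matrix((vals, (rows, cols)), shape=(3, 4), dtype=np.float32)

# Normalize each row to unit length without densifying the matrix.
row_norms = np.sqrt(np.asarray(index.multiply(index).sum(axis=1)).ravel())
row_norms[row_norms == 0.0] = 1.0  # leave empty documents untouched
index = (sparse.diags(1.0 / row_norms) @ index).tocsr()

# A unit-length query; a dense vector is used here for simplicity.
query = np.array([1.0, 0.0, 2.0, 0.0], dtype=np.float32)
query = query / np.linalg.norm(query)

# Sparse matrix times vector gives one cosine similarity per document.
similarities = index @ query
```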

If numBest is left unspecified, similarity queries return a full list: one float for every document in the corpus (including the query document itself, if it is part of the corpus).

If numBest is set, queries return the numBest most similar documents as a list of (document index, similarity) pairs, sorted by decreasing similarity:

>>> sms = SparseMatrixSimilarity(corpus, numBest = 3)
>>> sms[vec12]
[(12, 1.0), (30, 0.95), (5, 0.45)]

getSimilarities(doc)

Return similarity of sparse vector doc to all documents in the corpus.

doc may be either a bag-of-words iterable (standard corpus document), or a numpy array, or a scipy.sparse matrix.

classmethod load(fname)
Load a previously saved object from file (also see save).
save(fname)
Save the object to file via pickling (also see load).
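
The save/load pair follows the usual pickle round trip. The sketch below uses a hypothetical stand-in class, `TinyIndex`, rather than a gensim object, and it pickles only the instance's attribute dictionary, a simplification chosen so the snippet runs regardless of how the class is imported.

```python
import os
import pickle
import tempfile

class TinyIndex:
    """Hypothetical stand-in for a similarity object; not a gensim class."""

    def __init__(self, vectors):
        self.vectors = vectors

    def save(self, fname):
        # Write the instance state to disk in binary pickle format.
        with open(fname, "wb") as fout:
            pickle.dump(self.__dict__, fout, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname):
        # Rebuild an instance from previously saved state.
        with open(fname, "rb") as fin:
            state = pickle.load(fin)
        obj = cls.__new__(cls)
        obj.__dict__.update(state)
        return obj

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "index.pkl")
    original = TinyIndex([[1.0, 0.0], [0.0, 1.0]])
    original.save(path)
    restored = TinyIndex.load(path)
```

Unpickling can execute arbitrary code, so files should only be loaded from sources you trust.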
