models.lsimodel – Latent Semantic Indexing

Module for Latent Semantic Indexing.

class gensim.models.lsimodel.LsiModel(corpus=None, id2word=None, numTopics=200, extraDims=10, chunks=100, dtype=<type 'numpy.float64'>)

Objects of this class allow building and maintaining a model for Latent Semantic Indexing (also known as Latent Semantic Analysis).

The main methods are:

  1. constructor, which initializes the projection into latent topics space,
  2. the [] method, which returns the representation of any input document in the latent space,
  3. the addDocuments() method, which allows for incrementally updating the model with new documents.

Model persistence is achieved via its load/save methods.

numTopics is the number of requested factors (latent dimensions).

After the model has been trained, you can estimate topics for an arbitrary, unseen document using the topics = self[document] bracket notation. You can also add new training documents with self.addDocuments, so that training can be stopped and resumed at any time; the LSI transformation is available at any point in between.

extraDims is the number of extra dimensions that will be computed internally (i.e. numTopics + extraDims in total) to improve the numerical properties of the SVD algorithm. These extra dimensions are eventually chopped off for the final projection. Set to 0 to save memory; set to a value between ~10 and 2 * numTopics for increased SVD precision.

If you specify a corpus, it will be used to train the model. See the method addDocuments for a description of the chunks and decay parameters.

The algorithm is based on Brand, 2006: Fast low-rank modifications of the thin singular value decomposition.

Example:

>>> lsi = LsiModel(corpus, numTopics = 10)
>>> print lsi[doc_tfidf]
>>> lsi.addDocuments(corpus2) # update LSI on additional documents
>>> print lsi[doc_tfidf]
addDocuments(corpus, chunks=100, decay=1.0, reorth=False, updateProjection=True)

Update singular value decomposition factors to take into account a new corpus of documents.

Training proceeds in chunks of chunks documents at a time. This parameter trades off increased speed (bigger chunks) against a lower memory footprint (smaller chunks). The default processes 100 documents at a time.

Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows SVD to gradually “forget” old observations and give more preference to new ones. The decay is applied once after every chunks documents.
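The effect of decay can be sketched in plain NumPy. This is illustrative only: the real implementation uses the incremental Brand update rather than recomputing the SVD, and the chunk sizes and variable names below are made up for the example. Before each chunk is folded in, the singular values accumulated so far are shrunk by decay, so earlier documents carry less and less weight:

```python
import numpy as np

# Illustrative only: the real model uses the incremental Brand update;
# here each step naively recomputes the SVD of the (decayed) old factors
# merged with the next chunk of term-document columns.
rng = np.random.default_rng(42)
chunks = [rng.standard_normal((8, 3)) for _ in range(4)]  # 4 chunks, 3 docs each

k = 3        # number of factors to keep (numTopics)
decay = 0.8  # < 1.0: older observations are gradually forgotten

U, S, _ = np.linalg.svd(chunks[0], full_matrices=False)
U, S = U[:, :k], S[:k]
for chunk in chunks[1:]:
    S = decay * S  # shrink everything seen so far, once per chunk
    merged = np.hstack([U @ np.diag(S), chunk])
    U, S, _ = np.linalg.svd(merged, full_matrices=False)
    U, S = U[:, :k], S[:k]  # truncate back to k factors

print(S)  # k decayed singular values, in decreasing order
```

With decay = 1.0 this reduces to plain incremental training over the concatenated chunks; with decay < 1.0 the spectrum drifts towards whatever the most recent chunks look like.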

This function corresponds to the general update of Brand (section 2), specialized for A = docs.T and B trivial (only append the new columns). For a function that supports arbitrary updates (appending columns, erasing columns, column revisions and recentering), see the svdUpdate function in this module.

classmethod load(fname)
Load a previously saved object from file (also see save).
printTopic(topicNo, topN=10)

Return the specified topic (0 <= topicNo < self.numTopics) as a string in human-readable format.

>>> lsimodel.printTopic(10, topN = 5)
'-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
save(fname)
Save the object to file via pickling (also see load).
svdAddCols(docs, decay=1.0, reorth=False)

If X = self.u * self.s * self.v^T is the current decomposition, update it so that self.u * self.s * self.v^T = [X docs.T], that is, append new columns to the original matrix.

docs is a dense matrix containing the new observations as rows.
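The column-append identity above can be sketched in plain NumPy. This is a minimal sketch, not gensim's actual code: svd_append_cols is a hypothetical name, the new documents are passed already as columns, and the real method additionally applies decay and truncates back to numTopics + extraDims factors:

```python
import numpy as np

def svd_append_cols(U, S, V, D):
    """Append columns D to X = U @ diag(S) @ V.T (Brand 2006, section 2),
    returning U2, S2, V2 with [X, D] = U2 @ diag(S2) @ V2.T."""
    r, c = S.size, D.shape[1]
    M = U.T @ D                      # components of D inside span(U)
    P, Ra = np.linalg.qr(D - U @ M)  # orthonormal basis for the residual
    # small core matrix combining old singular values with the new columns
    K = np.block([[np.diag(S), M],
                  [np.zeros((c, r)), Ra]])
    Uk, S2, Vkt = np.linalg.svd(K, full_matrices=False)
    # rotate the extended left basis; extend V with identity rows
    U2 = np.hstack([U, P]) @ Uk
    Vext = np.block([[V, np.zeros((V.shape[0], c))],
                     [np.zeros((c, r)), np.eye(c)]])
    V2 = Vext @ Vkt.T
    return U2, S2, V2

# check: the updated factors reproduce the original matrix plus new columns
rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
U, S, Vt = np.linalg.svd(X, full_matrices=False)
D = rng.standard_normal((5, 2))
U2, S2, V2 = svd_append_cols(U, S, Vt.T, D)
assert np.allclose(U2 @ np.diag(S2) @ V2.T, np.hstack([X, D]))
```

The only expensive step is the SVD of the small (r + c) x (r + c) core matrix K; the original matrix X is never needed, which is what makes the update streamable.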

gensim.models.lsimodel.iterSvd(corpus, numTerms, numFactors, numIter=200, initRate=None, convergence=0.0001)

Perform iterative Singular Value Decomposition on a streaming corpus, returning the numFactors greatest factors (i.e., not necessarily the full spectrum).

The parameters numIter (maximum number of iterations) and initRate (gradient descent step size) guide the convergence of the algorithm. It requires numFactors passes over the corpus.

See Genevieve Gorrell: Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. EACL 2006.

Use of this function is deprecated; although it works, it is several orders of magnitude slower than the direct (non-stochastic) version based on Brand (which also operates in a single pass), so use svdAddCols/svdUpdate to compute the SVD iteratively. I keep this function here purely for backup reasons.

gensim.models.lsimodel.svdUpdate(U, S, V, a, b)

Update the SVD of an (m x n) matrix X = U * S * V^T so that [X + a * b^T] = U' * S' * V'^T, and return U', S', V'.

The original matrix X is not needed at all, so this function implements flexible online updates to an existing decomposition.

a and b are (m, 1) and (n, 1) matrices.

You can set V to None if you're not interested in the right singular vectors. In that case, the returned V' will also be None (saves memory).

This is the rank-1 update as described in Brand, 2006: Fast low-rank modifications of the thin singular value decomposition.
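The rank-1 update can be sketched in plain NumPy. This is a minimal illustration of Brand's construction, not gensim's actual code: svd_update_rank1 is a hypothetical name, and for simplicity it always keeps V (rather than supporting V=None):

```python
import numpy as np

def svd_update_rank1(U, S, V, a, b):
    """Rank-1 thin-SVD update (Brand 2006): given X = U @ diag(S) @ V.T,
    return U2, S2, V2 with X + np.outer(a, b) = U2 @ diag(S2) @ V2.T."""
    r = S.size
    # project a onto the current left subspace; keep the normalized residual
    m = U.T @ a
    p = a - U @ m
    Ra = np.linalg.norm(p)
    P = p / Ra if Ra > 1e-12 else np.zeros_like(p)
    # same for b and the right subspace
    n = V.T @ b
    q = b - V @ n
    Rb = np.linalg.norm(q)
    Q = q / Rb if Rb > 1e-12 else np.zeros_like(q)
    # small (r + 1) x (r + 1) core matrix: old spectrum plus the rank-1 term
    K = np.zeros((r + 1, r + 1))
    K[:r, :r] = np.diag(S)
    K += np.outer(np.append(m, Ra), np.append(n, Rb))
    # diagonalize the core, then rotate the extended bases
    Uk, S2, Vkt = np.linalg.svd(K)
    U2 = np.hstack([U, P[:, None]]) @ Uk
    V2 = np.hstack([V, Q[:, None]]) @ Vkt.T
    return U2, S2, V2

# check on a random matrix
rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))
U, S, Vt = np.linalg.svd(X, full_matrices=False)
a, b = rng.standard_normal(6), rng.standard_normal(4)
U2, S2, V2 = svd_update_rank1(U, S, Vt.T, a, b)
assert np.allclose(U2 @ np.diag(S2) @ V2.T, X + np.outer(a, b))
```

Column appends, deletions, and recentering are all special cases of this update for particular choices of a and b, which is why the general svdUpdate can support arbitrary modifications.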