models.lsimodel – Latent Semantic Indexing

Module for Latent Semantic Indexing.

It actually contains several algorithms for decomposition of extremely large corpora, a combination of which effectively and transparently allows building LSI models for:

  • corpora much larger than RAM: only constant memory needed, independent of the corpus size (though still dependent on the feature set size)
  • corpora that are streamed: documents are only accessed sequentially, not random-accessed
  • corpora that cannot even be temporarily stored, each document can only be seen once and must be processed immediately (one-pass algorithm)
  • distributed computing for ultra large corpora, making use of a cluster of machines
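
For example, a model can be trained over a corpus that is streamed straight from disk (the file name below is hypothetical; any corpus in gensim's sparse bag-of-words format will do):

>>> from gensim import corpora, models
>>> corpus = corpora.MmCorpus('wiki_en_tfidf.mm') # documents are streamed from disk, never loaded into RAM at once
>>> lsi = models.LsiModel(corpus, numTopics=400, chunks=20000) # memory footprint is independent of corpus size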

Wall-clock performance on the English Wikipedia (2G corpus positions, 3.2M documents, 100K features, 0.5G non-zero entries in the final TF-IDF matrix), requesting the top 400 LSI factors:

algorithm                               serial   distributed
======================================  =======  ===========
one-pass update algo (chunks=factors)   109h     19h
one-pass merge algo (chunks=40K docs)   8.5h     2.3h
two-pass randomized algo (chunks=40K)   2.5h     N/A [1]

serial = Core 2 Duo MacBook Pro 2.53GHz, 4GB RAM, vecLib

distributed = cluster of six logical nodes on four physical machines, each with dual core Xeon 2.0GHz, 4GB RAM, ATLAS

[1] The two-pass algo could be distributed too, but most of the time is already spent reading/decompressing the input from disk, and the extra network traffic due to data distribution would likely make it slower overall.
class gensim.models.lsimodel.LsiModel(corpus=None, numTopics=200, id2word=None, chunks=20000, decay=1.0, distributed=False, onepass=False)

Objects of this class allow building and maintaining a model for Latent Semantic Indexing (also known as Latent Semantic Analysis).

The main methods are:

  1. constructor, which initializes the projection into the latent topic space,
  2. the [] method, which returns the representation of any input document in the latent space,
  3. the addDocuments() method, which allows for incrementally updating the model with new documents.

Model persistence is achieved via its load/save methods.

numTopics is the number of requested factors (latent dimensions).

After the model has been trained, you can estimate topics for an arbitrary, unseen document, using the topics = self[document] bracket notation. You can also add new training documents, with self.addDocuments, so that training can be stopped and resumed at any time, and the LSI transformation is available at any point.

If you specify a corpus, it will be used to train the model. See the method addDocuments for a description of the chunks and decay parameters.

If your document stream is one-pass only (the stream cannot be repeated), turn on onepass to force a single-pass SVD algorithm (slower).

Turn on distributed to force distributed computing.

Example:

>>> lsi = LsiModel(corpus, numTopics=10)
>>> print lsi[doc_tfidf] # project some document into LSI space
>>> lsi.addDocuments(corpus2) # update LSI on additional documents
>>> print lsi[doc_tfidf]
addDocuments(corpus, chunks=None, decay=None)

Update singular value decomposition to take into account a new corpus of documents.

Training proceeds in chunks of chunks documents at a time. The size of chunks is a tradeoff between increased speed (bigger chunks) vs. lower memory footprint (smaller chunks). If the distributed mode is on, each chunk is sent to a different worker/computer.

Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows SVD to gradually “forget” old observations (documents) and give more preference to new ones.
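
For example (month_1 and month_2 are hypothetical placeholders for any two corpora in bag-of-words format):

>>> lsi = LsiModel(month_1, numTopics=200)
>>> lsi.addDocuments(month_2, chunks=10000, decay=0.9) # give less weight to the older month_1 observations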

classmethod load(fname)
Load a previously saved object from file (also see save).
printDebug(numTopics=5, numWords=10)

Print (to log) the most salient words of the first numTopics topics.

Unlike printTopics(), this looks for words that are significant for a particular topic and not for others. This should result in a more human-interpretable description of topics.
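
Example (the output goes to the logging framework, so make sure logging is configured to show informational messages):

>>> import logging
>>> logging.basicConfig(level=logging.INFO)
>>> lsi.printDebug(numTopics=5, numWords=10) # log the 10 most salient words for each of the first 5 topics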

printTopic(topicNo, topN=10)

Return a specified topic (= left singular vector), 0 <= topicNo < self.numTopics, as a string.

Return only the topN words which contribute the most to the direction of the topic (both negative and positive).

>>> lsimodel.printTopic(10, topN=5)
'-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
save(fname)
Save the object to file via pickling (also see load).
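
Example of a save/load round trip (the file path is hypothetical):

>>> lsi.save('/tmp/wiki_lsi.pkl')
>>> lsi = LsiModel.load('/tmp/wiki_lsi.pkl')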
gensim.models.lsimodel.clipSpectrum(s, k, discard=0.001)

Given eigenvalues s, return how many factors should be kept to avoid storing spurious (tiny, numerically unstable) values.

This will ignore the tail of the spectrum with relative combined mass < min(discard, 1/k).

The returned value is clipped against k (= at most k factors).
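
A rough numpy sketch of this clipping rule (illustrative only; the name clip_sketch and the exact cut-off computation are assumptions, not the library code):

import numpy

def clip_sketch(s, k, discard=0.001):
    rel_tail = 1.0 - numpy.cumsum(s) / numpy.sum(s)  # relative mass left after each prefix of the spectrum
    keep = 1 + numpy.sum(rel_tail > min(discard, 1.0 / k))  # keep factors while the tail is still significant
    return min(int(keep), k)  # never more than k factors

# e.g. clip_sketch(numpy.array([10.0, 5.0, 1.0, 1e-9]), 4) keeps 3 factors; the tiny tail value is dropped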

gensim.models.lsimodel.iterSvd(corpus, numTerms, numFactors, numIter=200, initRate=None, convergence=0.0001)

Perform iterative Singular Value Decomposition on a streaming corpus, returning the numFactors greatest factors U, S, V^T (i.e., not necessarily the full spectrum).

The parameters numIter (maximum number of iterations) and initRate (gradient descent step size) guide the convergence of the algorithm.

The algorithm performs at most numFactors * numIter passes over the corpus.

Use of this function is deprecated; although it works, it is several orders of magnitude slower than our own, direct (non-stochastic) version (which operates in a single pass, too, and can be distributed). I keep this function here purely for backup reasons.

See Genevieve Gorrell: Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. EACL 2006.

gensim.models.lsimodel.stochasticSvd(corpus, rank, num_terms=None, chunks=20000, extra_dims=None, dtype=numpy.float64, eps=1e-06)

Return U, S – the left singular vectors and the singular values of the streamed input corpus.

This may actually return fewer than the requested number of top rank factors, in case the input is of lower rank. Also note that the decomposition is unique up to the sign of the left singular vectors (columns of U).

This is a streamed, two-pass algorithm, without power-iterations. In case you can only afford a single pass over the input corpus, set onepass=True in LsiModel and avoid using this algorithm.

The decomposition algorithm is based on Halko, Martinsson, Tropp. Finding structure with randomness, 2009.
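
For intuition, here is a minimal dense-matrix sketch of that randomized two-pass scheme (illustrative only: it assumes the whole corpus fits in a numpy array A, whereas the real function streams a sparse corpus in chunks):

import numpy

def stochastic_svd_sketch(A, rank, extra_dims=10):
    # pass 1: sample the range of A with random projections, then orthonormalize the sample
    O = numpy.random.normal(size=(A.shape[1], rank + extra_dims))
    Q, _ = numpy.linalg.qr(A.dot(O))
    # pass 2: project A onto the sampled subspace and decompose the small projected matrix
    B = Q.T.dot(A)
    U_b, s, _ = numpy.linalg.svd(B, full_matrices=False)
    return Q.dot(U_b)[:, :rank], s[:rank]  # left singular vectors U and singular values S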

gensim.models.lsimodel.svdUpdate(U, S, V, a, b)

Update SVD of an (m x n) matrix X = U * S * V^T so that [X + a * b^T] = U’ * S’ * V’^T and return U’, S’, V’.

The original matrix X is not needed at all, so this function implements one-pass streaming rank-1 updates to an existing decomposition.

a and b are (m, 1) and (n, 1) matrices.

You can set V to None if you’re not interested in the right singular vectors. In that case, the returned V’ will also be None (saves memory).

The blocked merge algorithm in LsiModel.addDocuments() is much faster; I keep this function here purely for backup reasons.

This is the rank-1 update as described in Brand, 2006: Fast low-rank modifications of the thin singular value decomposition, but without separating the basis from rotations.
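
For reference, a plain numpy sketch of such a rank-1 update (illustrative only; it treats a and b as 1-d arrays and skips the optimizations and edge-case handling of the actual function):

import numpy

def svd_update_sketch(U, S, V, a, b):
    # project the update vectors onto the current left/right subspaces
    m = U.T.dot(a)
    p = a - U.dot(m)
    Ra = numpy.linalg.norm(p)
    n = V.T.dot(b)
    q = b - V.dot(n)
    Rb = numpy.linalg.norm(q)
    P = p / Ra if Ra > 1e-12 else numpy.zeros_like(p)
    Q = q / Rb if Rb > 1e-12 else numpy.zeros_like(q)
    # build the small (rank+1) x (rank+1) core matrix K so that X + a*b^T = [U P] K [V Q]^T
    r = len(S)
    K = numpy.zeros((r + 1, r + 1))
    K[:r, :r] = numpy.diag(S)
    K += numpy.outer(numpy.append(m, Ra), numpy.append(n, Rb))
    uK, sK, vKt = numpy.linalg.svd(K)
    # rotate the extended bases by the factors of K; the result has rank at most r + 1
    return numpy.column_stack([U, P]).dot(uK), sK, numpy.column_stack([V, Q]).dot(vKt.T)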