Module for Latent Semantic Indexing.
Objects of this class allow building and maintaining a model for Latent Semantic Indexing (also known as Latent Semantic Analysis).
The main methods are:
1. the constructor, which optionally trains the model on a supplied corpus,
2. the [] method, which returns the representation of an input document in the latent topic space,
3. the addDocuments() method, which updates the model with new training documents.
Model persistence is achieved via its load/save methods.
numTopics is the number of requested factors (latent dimensions).
After the model has been trained, you can estimate topics for an arbitrary, unseen document using the topics = self[document] dictionary notation. You can also add new training documents with self.addDocuments, so that training can be stopped and resumed at any time; the LSI transformation is available at any point during this process.
If you specify a corpus, it will be used to train the model. See the method addDocuments for a description of the chunks and decay parameters.
The algorithm will automatically try to find active nodes on other computers and run in a distributed manner; if this fails, it falls back to serial mode (single core). To suppress distributed computing, set the serial_only constructor parameter to True.
Example:
>>> lsi = LsiModel(corpus, numTopics = 10)
>>> print lsi[doc_tfidf]
>>> lsi.addDocuments(corpus2) # update LSI on additional documents
>>> print lsi[doc_tfidf]
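Model persistence and single-core operation follow the same pattern; a minimal sketch, with the filename chosen purely for illustration:
>>> lsi.save('/tmp/model.lsi') # persist the trained model to disk
>>> lsi = LsiModel.load('/tmp/model.lsi') # load it back later; training can then be resumed
>>> lsi = LsiModel(corpus, numTopics = 10, serial_only = True) # suppress distributed computing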
Update singular value decomposition factors to take into account a new corpus of documents.
Training proceeds in chunks of chunks documents at a time (chunks is the parameter name). In distributed mode, each chunk is sent to a different worker/computer. The chunk size is a tradeoff between increased speed (larger chunks) and lower memory footprint (smaller chunks). The default is to process 10,000 documents at a time.
Setting decay < 1.0 causes re-orientation towards new data trends in the input document stream, by giving less emphasis to old observations. This allows the SVD to gradually “forget” old observations and give more preference to new ones. The decay is applied once after each chunk of chunks documents.
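For instance, to process 5,000 documents per chunk while gently discounting older observations (the numbers are illustrative only, assuming chunks and decay are accepted as keyword arguments):
>>> lsi.addDocuments(another_corpus, chunks = 5000, decay = 0.9)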
Print (to log) the most salient words of the first numTopics topics.
Unlike printTopics(), this looks for words that are significant for a particular topic and not for others. This should result in a more human-interpretable description of topics.
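A typical call, with the argument shown only as an illustration:
>>> lsi.printDebug(numTopics = 5) # log the most distinctive words of the first 5 topics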
Return a specified topic (= left singular vector), 0 <= topicNo < self.numTopics, as a string.
Return only the topN words which contribute the most to the direction of the topic (both negative and positive).
>>> lsimodel.printTopic(10, topN = 5)
'-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
Perform iterative Singular Value Decomposition on a streaming corpus, returning the numFactors greatest factors U, S, V^T (i.e., not necessarily the full spectrum).
The parameters numIter (maximum number of iterations) and initRate (gradient descent step size) guide the convergence of the algorithm.
The algorithm performs at most numFactors * numIter passes over the corpus.
See Genevieve Gorrell: Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. EACL 2006.
Use of this function is deprecated; although it works, it is several orders of magnitude slower than our own direct (non-stochastic) version, which also operates in a single pass and can be distributed.
I keep this function here purely for backup reasons.
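For orientation, here is a minimal numpy sketch of the underlying Hebbian idea (Sanger's/Gorrell's rule) applied to a stream of dense document vectors; the function name, the fixed learning rate and the dense input are assumptions of this sketch, not the actual routine described above:

    import numpy

    def ghaLeftVectors(docVecs, numFactors, numIter = 200, rate = 0.001):
        """Estimate the leading left singular vectors of a term-document matrix
        from a stream of dense document (column) vectors, one small gradient
        step per document (illustrative sketch only)."""
        numTerms = len(docVecs[0])
        w = 0.01 * numpy.random.RandomState(42).randn(numFactors, numTerms) # rows ~ left singular vectors
        for iteration in xrange(numIter): # several passes over the stream
            for x in docVecs:
                x = numpy.asarray(x, dtype = float)
                y = numpy.dot(w, x) # project the document onto the current factors
                # Hebbian term plus deflation of already-learned factors (Sanger's rule)
                w += rate * (numpy.outer(y, x) - numpy.dot(numpy.tril(numpy.outer(y, y)), w))
        norms = numpy.sqrt((w * w).sum(axis = 1))
        return w / norms[:, numpy.newaxis] # normalize each factor to unit length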
Update SVD of an (m x n) matrix X = U * S * V^T so that [X + a * b^T] = U’ * S’ * V’^T and return U’, S’, V’.
The original matrix X is not needed at all, so this function implements flexible online updates to an existing decomposition.
a and b are (m, 1) and (n, 1) matrices.
You can set V to None if you’re not interested in the right singular vectors. In that case, the returned V’ will also be None (saves memory).
This is the rank-1 update as described in Brand, 2006: Fast low-rank modifications of the thin singular value decomposition, but without separating the basis from rotations.
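A minimal numpy sketch of that rank-1 update, with a and b taken as 1-d arrays and the updated basis recomputed directly (no separate rotation bookkeeping); the names are hypothetical, and a real implementation would typically truncate the result back to the original rank:

    import numpy

    def svdUpdateRank1(u, s, v, a, b):
        """Given the thin SVD X = U * diag(S) * V^T, return U', S', V' such that
        X + a * b^T = U' * diag(S') * V'^T (illustrative sketch of Brand, 2006)."""
        m = numpy.dot(u.T, a) # component of a inside span(U)
        p = a - numpy.dot(u, m) # component of a orthogonal to span(U)
        ra = numpy.sqrt(numpy.dot(p, p))
        p = p / ra if ra > 1e-10 else numpy.zeros_like(p)
        n = numpy.dot(v.T, b) # component of b inside span(V)
        q = b - numpy.dot(v, n) # component of b orthogonal to span(V)
        rb = numpy.sqrt(numpy.dot(q, q))
        q = q / rb if rb > 1e-10 else numpy.zeros_like(q)
        # build the small (k+1) x (k+1) core matrix K = [diag(S) 0; 0 0] + [m; ra] * [n; rb]^T
        k = len(s)
        core = numpy.zeros((k + 1, k + 1))
        core[:k, :k] = numpy.diag(s)
        core += numpy.outer(numpy.append(m, ra), numpy.append(n, rb))
        uCore, sNew, vtCore = numpy.linalg.svd(core, full_matrices = False)
        uNew = numpy.dot(numpy.hstack([u, p[:, numpy.newaxis]]), uCore) # updated left singular vectors
        vNew = numpy.dot(numpy.hstack([v, q[:, numpy.newaxis]]), vtCore.T) # updated right singular vectors
        return uNew, sNew, vNew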