Module for Latent Semantic Indexing.
Objects of this class allow building and maintaining a model for Latent Semantic Indexing (also known as Latent Semantic Analysis).
The main methods are the constructor, which builds the latent space from a training corpus, and the [] notation, which transforms new documents into that space.
Model persistency is achieved via its load/save methods.
Find latent space based on the corpus provided.
numTopics is the number of requested factors (latent dimensions).
After the model has been initialized, you can estimate topics for an arbitrary, unseen document, using the topics = self[document] dictionary notation.
Example:
>>> lsi = LsiModel(corpus, numTopics = 10)
>>> print lsi[doc_tfidf]
Run SVD decomposition on the corpus. This defines the latent space into which terms and documents will be mapped.
The SVD is computed incrementally, in blocks of chunks documents at a time. At the end, a self.projection matrix is constructed that can be used to transform documents into the latent space. The U, S, V decomposition itself is discarded, unless keepDecomposition is True, in which case it is stored in self.u, self.s and self.v.
dtype dictates the precision used for intermediate computations; the final projection is, however, always stored as numpy.float32.
The algorithm is adapted from: M. Brand. 2006. Fast low-rank modifications of the thin singular value decomposition.
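The projection idea above can be sketched in NumPy. This is a minimal illustration, not the module's incremental algorithm: a plain batch SVD stands in for Brand's update, and the names build_projection/transform are illustrative, not part of the API.

```python
import numpy as np

def build_projection(corpus_dense, num_topics, dtype=np.float64):
    """Illustrative sketch: truncated SVD of a dense term-document matrix.

    corpus_dense: (num_terms, num_docs) array, documents as columns.
    Returns a (num_topics, num_terms) projection, stored as float32
    (mirroring the precision behaviour described above)."""
    u, s, vt = np.linalg.svd(corpus_dense.astype(dtype), full_matrices=False)
    # keep only the num_topics greatest factors
    projection = u[:, :num_topics].T  # shape: (num_topics, num_terms)
    return projection.astype(np.float32)

def transform(projection, doc_vec):
    """Map a full (dense) document vector into the latent space."""
    return projection @ np.asarray(doc_vec, dtype=np.float32)
```

A document is then represented by num_topics coordinates instead of num_terms, which is what the [] notation returns (in sparse form) for the real model.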
Return a specified topic (0 <= topicNo < self.numTopics) as a string, in human-readable format.
>>> lsimodel.printTopic(10, topN = 5)
'-0.340 * "category" + 0.298 * "$M$" + 0.183 * "algebra" + -0.174 * "functor" + -0.168 * "operator"'
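The string shown above can be produced by selecting the topN words with the largest absolute weight in one topic row. The helper below is a hypothetical sketch of that formatting, not the module's actual implementation; id2word stands for any mapping from word id to word string.

```python
import numpy as np

def format_topic(weights, id2word, topN=5):
    """Hypothetical sketch of the formatting shown above: pick the topN
    words with the largest absolute weight and join them into a
    '+'-separated, human-readable string."""
    best = np.abs(weights).argsort()[::-1][:topN]
    return ' + '.join('%.3f * "%s"' % (weights[i], id2word[i]) for i in best)
```

Note that negative weights print with their minus sign, giving terms like -0.340 * "category", exactly as in the example output.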
Update the singular value decomposition factors to take into account the new documents in docs.
This function corresponds to the general update of Brand (section 2), specialized for A = docs.T and B trivial (no update to the matrix rows).
The documents are assumed to be a list of full vectors (i.e., not sparse 2-tuples).
Compute the new decomposition u', s', v' so that if the current matrix X decomposes to u * s * v^T ~= X, then u' * s' * v'^T ~= [X docs^T].
u, s, v and their new values u', s', v' are stored within self (i.e., as self.u, self.v etc.).
self.v can be set to None, in which case it is completely ignored. This speeds things up a little and saves a lot of memory, especially for huge corpora (the size of v grows linearly with the number of added documents).
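The column-append update described above can be sketched as follows. This is an illustrative, unoptimized rendering of Brand's section-2 update (new documents as appended columns, no row update), not the module's code; svd_add_cols is an assumed name.

```python
import numpy as np

def svd_add_cols(u, s, v, docs):
    """Sketch of Brand's update: given X ~= u @ diag(s) @ v.T, return the
    decomposition of [X, A], where A holds the new documents as columns.

    u: (m, k), s: (k,), v: (n, k); docs: list of full vectors of length m."""
    a = np.asarray(docs).T                       # (m, c): new columns A
    m_proj = u.T @ a                             # component inside current span
    p = a - u @ m_proj                           # component orthogonal to it
    q, r = np.linalg.qr(p)                       # orthonormal basis for p
    k, c = len(s), a.shape[1]
    # small (k+c) x (k+c) core matrix K = [[diag(s), m_proj], [0, r]]
    kmat = np.block([[np.diag(s), m_proj],
                     [np.zeros((c, k)), r]])
    uk, sk, vkt = np.linalg.svd(kmat, full_matrices=False)
    u_new = np.hstack([u, q]) @ uk               # rotate the extended basis
    n = v.shape[0]
    v_pad = np.block([[v, np.zeros((n, c))],     # block-diagonal [v, I]
                      [np.zeros((c, k)), np.eye(c)]])
    v_new = v_pad @ vkt.T
    return u_new, sk, v_new
```

The identity behind it: [X, A] = [u, q] @ K @ blockdiag(v, I)^T, so an SVD of the small core K yields the updated factors without redecomposing the whole matrix.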
Perform iterative Singular Value Decomposition on a streaming matrix (corpus), returning the numFactors greatest factors (i.e., not necessarily the full spectrum).
The parameters numIter (maximum number of iterations) and initRate (gradient descent step size) control convergence of the algorithm.
See Genevieve Gorrell: Generalized Hebbian Algorithm for Incremental Singular Value Decomposition in Natural Language Processing. EACL 2006.
Use of this function is deprecated; although it works, it is several orders of magnitude slower than the direct (non-stochastic) version based on Brand. Use svdAddCols/svdUpdate to compute SVD iteratively. I keep this function here purely for backup reasons.
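To give a flavour of the Hebbian approach, here is a minimal sketch for estimating only the single largest singular triplet by gradient steps over the matrix columns. This is an assumption-laden simplification of Gorrell's algorithm (which extracts multiple factors and differs in detail); the explicit renormalization after each pass is an implementation choice here, not part of the original method.

```python
import numpy as np

def hebbian_top_factor(x, rate=0.01, num_iter=200, seed=0):
    """Sketch of the Hebbian idea: estimate the largest singular triplet
    of x via stochastic gradient steps over its columns.

    Renormalizing u after each pass keeps the iteration stable; this is
    an implementation choice of this sketch."""
    rng = np.random.RandomState(seed)
    m, n = x.shape
    u = rng.rand(m)
    u /= np.linalg.norm(u)
    for _ in range(num_iter):
        for j in range(n):
            col = x[:, j]
            y = u @ col                  # Hebbian activation
            u += rate * y * col         # reinforce the dominant direction
        u /= np.linalg.norm(u)
    s = np.linalg.norm(x.T @ u)         # singular value for this direction
    v = x.T @ u / s                     # corresponding right singular vector
    return u, s, v
```

Each full pass acts roughly like one step of power iteration on x @ x.T, which is why convergence is so much slower than the direct update above.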
Update the SVD of an (m x n) matrix X = U * S * V^T so that [X + a * b^T] = U' * S' * V'^T, and return U', S', V'.
a and b are (m, 1) and (n, 1) column vectors, forming a rank-1 term a * b^T, so that svdUpdate can simulate incremental addition of one new document and/or term to an already existing decomposition.
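The rank-1 case can be sketched as follows, again following Brand's construction. This is an illustrative, unoptimized version; svd_update is an assumed name, not the module's exact signature.

```python
import numpy as np

def svd_update(u, s, v, a, b):
    """Sketch of the rank-1 update: given X = u @ diag(s) @ v.T, return
    u', s', v' with u' @ diag(s') @ v'.T = X + outer(a, b).

    a: (m,), b: (n,). Illustrative, not optimized."""
    m_proj = u.T @ a
    p = a - u @ m_proj                  # part of a outside the column span
    ra = np.linalg.norm(p)
    p = p / ra if ra > 1e-12 else np.zeros_like(p)
    n_proj = v.T @ b
    q = b - v @ n_proj                  # part of b outside the row span
    rb = np.linalg.norm(q)
    q = q / rb if rb > 1e-12 else np.zeros_like(q)
    k = len(s)
    # core matrix: diag(s) padded by one row/column, plus the rank-1 term
    kmat = np.zeros((k + 1, k + 1))
    kmat[:k, :k] = np.diag(s)
    kmat += np.outer(np.append(m_proj, ra), np.append(n_proj, rb))
    uk, sk, vkt = np.linalg.svd(kmat)
    u_new = np.hstack([u, p[:, None]]) @ uk
    v_new = np.hstack([v, q[:, None]]) @ vkt.T
    return u_new, sk, v_new
```

The key point is that only the small (k+1) x (k+1) core matrix is redecomposed, so the cost per update is independent of the corpus size.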