This module contains math helper functions.
Wrap a term-document matrix on disk (in matrix-market format), and present it as an object which supports iteration over the rows (~documents).
Note that the file is read into memory one document at a time rather than as a whole matrix (unlike scipy.io.mmread). This makes it possible to process corpora that are larger than the available RAM.
Initialize the matrix reader.
The input refers to a file on the local filesystem, which is expected to be in the sparse (coordinate) Matrix Market format. Documents are assumed to be rows of the matrix (and document features are columns).
input is either a string (file path) or a file-like object that supports seek(0) (e.g. gzip.GzipFile, bz2.BZ2File).
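The row-at-a-time reading described above can be sketched roughly as follows. This is an illustration only, not the module's actual reader: the function name iter_mm_rows is hypothetical, and the sketch assumes that entries in the file are sorted by row index (the Matrix Market format itself does not guarantee this) and that rows without any entries are simply skipped.

```python
import io


def iter_mm_rows(fileobj):
    """Yield (row_id, [(col_id, value), ...]) one row at a time.

    Hypothetical helper; assumes coordinate entries are sorted by row.
    """
    fileobj.seek(0)  # the input must support seek(0)
    stats_seen = False
    current_row, current_vec = None, []
    for line in fileobj:
        if isinstance(line, bytes):  # e.g. gzip.GzipFile yields bytes
            line = line.decode("utf8")
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip the banner and comment lines
        if not stats_seen:
            # first data line holds "num_rows num_cols num_nonzeros"
            stats_seen = True
            continue
        row, col, value = line.split()
        # Matrix Market indices are 1-based; convert to 0-based
        row, col, value = int(row) - 1, int(col) - 1, float(value)
        if row != current_row:
            if current_row is not None:
                yield current_row, current_vec
            current_row, current_vec = row, []
        current_vec.append((col, value))
    if current_row is not None:
        yield current_row, current_vec


sample = io.StringIO(
    "%%MatrixMarket matrix coordinate real general\n"
    "3 4 4\n"
    "1 1 0.5\n"
    "1 3 2.0\n"
    "3 2 1.0\n"
    "3 4 4.0\n"
)
rows = list(iter_mm_rows(sample))
```

Only the entries of the current row are held in memory at any time, which is what allows a matrix larger than RAM to be traversed.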
Store corpus in Matrix Market format.
Save the vector space representation of an entire corpus to disk.
Note that the documents are processed one at a time, so the whole corpus may be larger than the available RAM.
Write a single sparse vector to the file.
A sparse vector is any iterable yielding (field id, field value) pairs.
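A minimal sketch of streaming a corpus to Matrix Market format, one sparse vector at a time, might look like this. The function name write_mm, the 50-character placeholder width, and the choice to rewrite the statistics line after the last document are assumptions made for illustration; they are not taken from the module itself.

```python
import io


def write_mm(fileobj, corpus):
    """Write an iterable of sparse vectors in Matrix Market coordinate format.

    Hypothetical helper; returns (num_docs, num_terms, num_nonzeros).
    """
    fileobj.write("%%MatrixMarket matrix coordinate real general\n")
    stats_pos = fileobj.tell()
    # Reserve a fixed-width line for "rows cols nnz"; these numbers are
    # only known once the whole corpus has been streamed through.
    fileobj.write(" " * 50 + "\n")
    num_docs, num_terms, num_nnz = 0, 0, 0
    for doc_no, vector in enumerate(corpus):
        for field_id, value in sorted(vector):
            # Matrix Market indices are 1-based
            fileobj.write("%i %i %s\n" % (doc_no + 1, field_id + 1, value))
            num_terms = max(num_terms, field_id + 1)
            num_nnz += 1
        num_docs = doc_no + 1
    # Go back and fill in the reserved statistics line
    fileobj.seek(stats_pos)
    fileobj.write(("%i %i %i" % (num_docs, num_terms, num_nnz)).ljust(50))
    return num_docs, num_terms, num_nnz


buf = io.StringIO()
stats = write_mm(buf, [[(0, 1.0), (2, 0.5)], [(1, 2.0)]])
```

Because each document is written and then discarded, memory use depends on the largest single document rather than on the size of the corpus. The sketch requires a seekable output and would need a wider placeholder for very large counts.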
Convert corpus into a sparse matrix, in scipy.sparse.csc_matrix format.
The corpus must not be empty: it has to contain at least one document.
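To illustrate how a corpus maps onto compressed sparse column storage, the sketch below builds the three CSC arrays in plain Python. It assumes that each document becomes one column and that the number of rows is inferred from the largest field id seen; both are assumptions of this example, and the function name corpus_to_csc_arrays is hypothetical.

```python
def corpus_to_csc_arrays(corpus):
    """Build CSC arrays (data, indices, indptr) and a shape from a corpus.

    Hypothetical helper; each document is treated as one column.
    """
    data, indices, indptr = [], [], [0]
    num_terms = 0
    for doc in corpus:
        for term_id, value in sorted(doc):
            indices.append(term_id)  # row index of this entry
            data.append(value)
            num_terms = max(num_terms, term_id + 1)
        # indptr[j + 1] marks where column j ends in data/indices
        indptr.append(len(data))
    num_docs = len(indptr) - 1
    if num_docs == 0:
        raise ValueError("corpus must contain at least one document")
    return data, indices, indptr, (num_terms, num_docs)


data, indices, indptr, shape = corpus_to_csc_arrays(
    [[(0, 1.0), (2, 3.0)], [], [(1, 2.0)]]
)
```

If SciPy is available, these arrays can be passed on as scipy.sparse.csc_matrix((data, indices, indptr), shape=shape).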
Scale a vector to unit length. The only exception is the zero vector, which is returned unchanged.
If the input is sparse (list of 2-tuples), output will also be sparse. Otherwise, output will be a numpy array.
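For the sparse case, the scaling can be sketched as below. The sketch assumes that "unit length" means the Euclidean (L2) norm and covers only the list-of-2-tuples input; the dense numpy case is omitted. The name unit_vector is hypothetical.

```python
import math


def unit_vector(vec):
    """Scale a sparse vector (list of (id, value) 2-tuples) to unit length.

    Hypothetical helper; uses the Euclidean norm, which is an assumption.
    """
    length = math.sqrt(sum(value * value for _, value in vec))
    if length == 0.0:
        # the zero vector cannot be normalized; return it unchanged
        return vec
    return [(field_id, value / length) for field_id, value in vec]


scaled = unit_vector([(0, 3.0), (5, 4.0)])
zero = [(1, 0.0)]
```

Field ids are left untouched; only the values are divided by the vector's length.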