This module contains basic interfaces used throughout the whole gensim package.
The interfaces are realized as abstract base classes (i.e., some optional functionality is provided in the interface itself, so that the interfaces can be subclassed).
Interface (abstract base class) for corpora. A corpus is simply an iterable, where each iteration step yields one document:
>>> for doc in corpus:
...     # do something with the doc...
...     pass
A document is a sequence of (fieldId, fieldValue) 2-tuples:
>>> for attr_id, attr_value in doc:
...     # do something with the attribute
...     pass
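The two loops above can be made concrete with a small sketch. The `TinyCorpus` class and its `token2id` attribute are hypothetical names used only for illustration and are not part of gensim; the sketch assumes each document is a whitespace-separated string and that the field value is a simple term count.

```python
class TinyCorpus:
    """Hypothetical in-memory corpus, used only to illustrate the protocol."""

    def __init__(self, texts):
        self.texts = texts
        # Assign an integer id to each distinct token, in order of first appearance.
        self.token2id = {}
        for text in texts:
            for token in text.split():
                self.token2id.setdefault(token, len(self.token2id))

    def __iter__(self):
        # Each iteration step yields one document: a list of (id, value) 2-tuples.
        for text in self.texts:
            counts = {}
            for token in text.split():
                token_id = self.token2id[token]
                counts[token_id] = counts.get(token_id, 0) + 1
            yield sorted(counts.items())


corpus = TinyCorpus(["human computer interface", "computer computer system"])
docs = [doc for doc in corpus]
```

Because the corpus is only required to be iterable, the documents could just as well be read lazily from disk inside `__iter__`, one at a time.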
Note that although a default len() method is provided, it is very inefficient (performs a linear scan through the corpus to determine its length). Wherever the corpus size is needed and known in advance (or at least doesn’t change so that it can be cached), the len() method should be overridden.
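The cost of a linear-scan `len()` and the benefit of overriding it can be sketched as follows. Both classes are hypothetical stand-ins rather than gensim code, and the `passes` counter exists only to make the number of full scans visible.

```python
class ScanningCorpus:
    """Hypothetical corpus whose length is found by iterating over every document."""

    def __init__(self, docs):
        self.docs = docs
        self.passes = 0  # number of full passes made over the data

    def __iter__(self):
        self.passes += 1
        return iter(self.docs)

    def __len__(self):
        # Linear scan: one full pass over the corpus on every call.
        return sum(1 for _ in self)


class CachedLengthCorpus(ScanningCorpus):
    """Same corpus, but the length is computed once and then cached."""

    def __init__(self, docs):
        super().__init__(docs)
        self._length = None

    def __len__(self):
        if self._length is None:
            self._length = super().__len__()
        return self._length


data = [[(0, 1.0)], [(1, 2.0)], []]
scanning = ScanningCorpus(data)
cached = CachedLengthCorpus(data)
lengths = (len(scanning), len(scanning), len(cached), len(cached))
```

Caching is only safe when the corpus does not change between calls; if the size is known up front, `__len__` can simply return it.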
See the gensim.corpora.svmlightcorpus module for an example of a corpus.
Saving the corpus with the save method (inherited from utils.SaveLoad) will only store the in-memory (binary, pickled) object representation, i.e. the stream state, and not the documents themselves. See the saveCorpus static method for serializing the actual stream content.
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Save an existing corpus to disk.
Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.
>>> MmCorpus.saveCorpus('file.mm', corpus)
Some corpora also support an index of where each document begins, so that the documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In this case, serialize calls saveCorpus internally and saves the index at the same time, so store the corpus with:
>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents
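The idea behind such an index can be sketched in a few lines. This is not gensim's on-disk format or its implementation: the functions `save_with_index` and `load_document` and the one-document-per-line `id:value` layout are assumptions made purely for illustration.

```python
import os
import tempfile


def save_with_index(path, corpus):
    """Write one document per line and return the byte offset where each begins."""
    offsets = []
    with open(path, "wb") as fout:
        for doc in corpus:
            offsets.append(fout.tell())
            line = " ".join("%d:%r" % (field_id, value) for field_id, value in doc)
            fout.write(line.encode("utf8") + b"\n")
    return offsets


def load_document(path, offsets, docno):
    """Seek straight to document `docno` instead of scanning from the start."""
    with open(path, "rb") as fin:
        fin.seek(offsets[docno])
        line = fin.readline().decode("utf8")
    pairs = (pair.split(":") for pair in line.split())
    return [(int(field_id), float(value)) for field_id, value in pairs]


corpus = [[(0, 1.0), (2, 0.5)], [], [(1, 3.0)]]
path = os.path.join(tempfile.mkdtemp(), "corpus.txt")
offsets = save_with_index(path, corpus)
third = load_document(path, offsets, 2)
```

Looking up a document costs one seek and one line read regardless of its position, which is what makes random access possible once the offsets are stored alongside the data.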
Abstract interface for similarity searches over a corpus.
In all instances, there is a corpus against which we want to perform the similarity search.
For each similarity search, the input is a document and the output is its similarities to individual corpus documents.
Similarity queries are realized by calling self[query_document].
There is also a convenience wrapper, where iterating over self yields similarities of each document in the corpus against the whole corpus (i.e., the query is each corpus document in turn).
Return similarity of a sparse vector doc to all documents in the corpus.
The document is assumed to be either of unit length or empty.
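A brute-force sketch of this interface is shown below. It is not gensim's implementation: the class `DotProductSimilarity`, the helper `unit_vector`, and the method name `get_similarities` are illustrative choices, and cosine similarity is assumed as the measure. Because the query is normalized to unit length first, a plain dot product against unit-length indexed documents gives the cosine.

```python
import math


def unit_vector(doc):
    """Scale a sparse document to unit Euclidean length; empty documents stay empty."""
    norm = math.sqrt(sum(value * value for _, value in doc))
    if not norm:
        return []
    return [(field_id, value / norm) for field_id, value in doc]


class DotProductSimilarity:
    """Hypothetical brute-force similarity index over a small corpus."""

    def __init__(self, corpus):
        # Store every corpus document as a unit-length {id: value} mapping.
        self.index = [dict(unit_vector(doc)) for doc in corpus]

    def get_similarities(self, doc):
        # `doc` is assumed to be of unit length or empty.
        return [
            sum(value * indexed.get(field_id, 0.0) for field_id, value in doc)
            for indexed in self.index
        ]

    def __getitem__(self, query):
        # Similarity query: self[query_document].
        return self.get_similarities(unit_vector(query))

    def __iter__(self):
        # Convenience wrapper: each indexed document is the query in turn.
        for indexed in self.index:
            yield self.get_similarities(sorted(indexed.items()))


index = DotProductSimilarity([[(0, 1.0)], [(1, 2.0)], [(0, 3.0), (1, 4.0)]])
sims = index[[(0, 5.0)]]
all_sims = [row for row in index]
```

An empty query yields a similarity of zero against every corpus document, which is why the method can accept it without special handling.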
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Interface for transformations. A ‘transformation’ is any object which accepts a sparse document via the dictionary notation [] and returns another sparse document in its stead.
See the gensim.models.tfidfmodel module for an example of a transformation.
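The bracket notation can be sketched with a deliberately trivial transformation. `ScaleTransformation` is a hypothetical class, not part of gensim; it only shows how `__getitem__` maps one sparse document to another.

```python
class ScaleTransformation:
    """Hypothetical transformation that multiplies every value by a constant."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, doc):
        # Accept a sparse document and return a new sparse document.
        return [(field_id, value * self.factor) for field_id, value in doc]


halve = ScaleTransformation(0.5)
transformed = halve[[(0, 2.0), (3, 4.0)]]
# Transformations compose by applying one to the output of another.
quartered = halve[halve[[(0, 2.0), (3, 4.0)]]]
```

A real transformation would typically learn its parameters from a training corpus first; only the `[]` call shown here is required by the interface described above.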
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).