This module contains basic interfaces used throughout the whole gensim package.
The interfaces are realized as abstract base classes (i.e., some optional functionality is provided in the interface itself, so the interfaces are meant to be subclassed).
Interface for corpora. A corpus is simply an iterable, where each iteration step yields one document. A document is a list of (fieldId, fieldValue) 2-tuples.
See the corpora package for some example corpus implementations.
Note that although a default len() method is provided, it is very inefficient (performs a linear scan through the corpus to determine its length). Wherever the corpus size is needed and known in advance (or at least doesn’t change so that it can be cached), the len() method should be overridden.
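The following is a minimal sketch of the idea, not gensim's actual code. `CorpusSketch` and `InMemoryCorpus` are hypothetical names introduced here for illustration. The base class supplies the slow linear-scan `len()` described above, and the subclass overrides it because its size is known in advance.

```python
class CorpusSketch:
    """Hypothetical stand-in for the corpus interface (not gensim's class)."""

    def __iter__(self):
        # Subclasses must yield one document per iteration step.
        raise NotImplementedError("subclasses must implement __iter__")

    def __len__(self):
        # Default fallback: a linear scan over the whole corpus.
        return sum(1 for _ in self)


class InMemoryCorpus(CorpusSketch):
    """Hypothetical corpus whose size is known, so len() is overridden."""

    def __init__(self, documents):
        # Each document is a list of (fieldId, fieldValue) 2-tuples.
        self.documents = list(documents)

    def __iter__(self):
        return iter(self.documents)

    def __len__(self):
        # Constant-time answer instead of scanning the corpus.
        return len(self.documents)


docs = [[(0, 1.0), (2, 3.0)], [(1, 2.0)]]
corpus = InMemoryCorpus(docs)
corpus_size = len(corpus)
first_document = next(iter(corpus))
```

A corpus that streams documents from disk would keep the same shape, with `__iter__` reading one document at a time and `__len__` returning a cached count.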
Interface for transformations. A ‘transformation’ is any object which accepts a sparse document via the dictionary notation [] (that is, transformation[document]) and returns another sparse document in its place.
See the tfidfmodel module for an example of a transformation.
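As a rough illustration of the bracket notation, the sketch below defines a toy transformation. `ScaleTransformation` is a hypothetical class made up for this example and is not part of gensim. It assumes a sparse document is a list of (fieldId, fieldValue) 2-tuples, as in the corpus interface.

```python
class ScaleTransformation:
    """Hypothetical transformation: multiplies every field value by a factor."""

    def __init__(self, factor):
        self.factor = factor

    def __getitem__(self, document):
        # Accept a sparse document and return a new sparse document.
        return [(field_id, value * self.factor) for field_id, value in document]


doubled = ScaleTransformation(2.0)
document = [(0, 1.0), (3, 0.5)]
result = doubled[document]  # [(0, 2.0), (3, 1.0)]
```

Implementing `__getitem__` is what makes the `transformation[document]` syntax work. The input document is left unchanged and a new list is returned.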