This package contains algorithms for extracting document representations from their raw bag-of-words counts.
Remap feature ids to new values.
Given a mapping from old ids to new ids, this wraps a corpus so that iterating over VocabTransform[corpus] returns the same vectors but with the new ids.
Old ids may be absent from the mapping; features that have no counterpart among the new ids are discarded. This can be used to filter the vocabulary of a corpus “online”:
>>> old2new = dict((oldid, newid) for newid, oldid in enumerate(ids_you_want_to_keep))
>>> vt = VocabTransform(old2new)
>>> for vec_with_new_ids in vt[corpus_with_old_ids]:
...     pass  # process each remapped vector here
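To make the remapping concrete, here is a minimal pure-Python sketch of the behaviour described above. It is not the library's implementation: `remap_vector` and `RemappedCorpus` are hypothetical names introduced for illustration, and the sketch assumes each document is a list of `(feature_id, weight)` pairs.

```python
def remap_vector(bow, old2new):
    """Replace old ids with new ids; pairs whose old id is unmapped are dropped."""
    return sorted((old2new[oldid], weight) for oldid, weight in bow if oldid in old2new)


class RemappedCorpus:
    """Hypothetical wrapper that lazily remaps every document of a corpus."""

    def __init__(self, corpus, old2new):
        self.corpus = corpus
        self.old2new = old2new

    def __iter__(self):
        # Documents are remapped one at a time, so the corpus is never copied.
        for bow in self.corpus:
            yield remap_vector(bow, self.old2new)


# Keep only features 7, 2 and 5, renumbering them 0, 1 and 2.
ids_to_keep = [7, 2, 5]
old2new = {oldid: newid for newid, oldid in enumerate(ids_to_keep)}

corpus = [[(2, 1.0), (3, 4.0), (7, 2.0)], [(5, 3.0)], [(0, 1.0)]]
remapped = list(RemappedCorpus(corpus, old2new))
```

In this sketch a document whose features are all discarded comes back as an empty list, and the output pairs are sorted by new id, which is a choice made here rather than something the text above guarantees.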
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
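As a rough illustration of what persistence via pickling amounts to, the sketch below round-trips a plain id mapping through the standard `pickle` module. It deliberately uses only the standard library rather than the save and load methods described above, whose exact signatures are not shown here; the file name `old2new.pkl` is an arbitrary choice.

```python
import os
import pickle
import tempfile

# A small old-id -> new-id mapping standing in for the object to persist.
old2new = {7: 0, 2: 1, 5: 2}

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "old2new.pkl")

    # "Save": serialize the object to a file.
    with open(path, "wb") as fout:
        pickle.dump(old2new, fout, protocol=pickle.HIGHEST_PROTOCOL)

    # "Load": read the same object back from that file.
    with open(path, "rb") as fin:
        restored = pickle.load(fin)
```

Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.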