Corpus in the Matrix Market format.
Return the document at file offset offset (in bytes).
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Save a corpus in the Matrix Market format to disk.
This function is automatically called by MmCorpus.serialize; don't call it directly, call serialize instead.
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
- the saveCorpus method, which returns a sequence of byte offsets, one for each saved document,
- the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access, `mm[42]` etc.
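
The offset-indexing idea behind serialize can be sketched without the library. The code below is only an illustration, not the MmCorpus implementation: it assumes a toy one-JSON-document-per-line file format instead of Matrix Market, and the function names save_corpus, serialize, docbyoffset and get_document are hypothetical stand-ins that merely mirror the roles described above.

```python
import json
import os
import pickle
import tempfile


def save_corpus(fname, corpus):
    """Write one document per line and return the byte offset of each one.

    Toy format (an assumption for this sketch): each document is a JSON list
    of [term_id, weight] pairs on its own line.
    """
    offsets = []
    with open(fname, "wb") as fout:
        for doc in corpus:
            offsets.append(fout.tell())  # where this document starts
            fout.write(json.dumps(doc).encode("utf-8") + b"\n")
    return offsets


def serialize(fname, corpus, index_fname=None):
    """Save the corpus, then pickle the offsets to index_fname (or fname.index)."""
    offsets = save_corpus(fname, corpus)
    if index_fname is None:
        index_fname = fname + ".index"
    with open(index_fname, "wb") as fout:
        pickle.dump(offsets, fout)
    return index_fname


def docbyoffset(fname, offset):
    """Return the document that starts at `offset` bytes into the file."""
    with open(fname, "rb") as fin:
        fin.seek(offset)
        line = fin.readline().decode("utf-8")
    return [tuple(pair) for pair in json.loads(line)]


def get_document(fname, docno, index_fname=None):
    """Random access: look up the offset of document `docno`, then seek to it."""
    if index_fname is None:
        index_fname = fname + ".index"
    with open(index_fname, "rb") as fin:
        offsets = pickle.load(fin)
    return docbyoffset(fname, offsets[docno])


# Small demonstration in a temporary directory.
demo_dir = tempfile.mkdtemp()
demo_fname = os.path.join(demo_dir, "toy_corpus.txt")
demo_corpus = [[(0, 1.0), (2, 3.0)], [], [(1, 0.5)]]
demo_index = serialize(demo_fname, demo_corpus)
third_doc = get_document(demo_fname, 2)
```

Because the index stores one offset per document, fetching document n costs a single seek and one line read rather than a scan of the whole file, which is what makes indexing such as `mm[42]` practical on a corpus stored on disk.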