Corpus in SVMlight format.
Quoting http://svmlight.joachims.org/: The input file contains the training examples. The first lines may contain comments and are ignored if they start with #. Each of the following lines represents one training example and is of the following format:
<line> .=. <target> <feature>:<value> <feature>:<value> ... <feature>:<value> # <info>
<target> .=. +1 | -1 | 0 | <float>
<feature> .=. <integer> | "qid"
<value> .=. <float>
<info> .=. <string>
The “qid” feature (used for SVMlight ranking), if present, is ignored.
Although not mentioned in the specification above, SVMlight also expects its feature ids to be 1-based (counting starts at 1). We convert features to 0-based ids internally by decrementing all ids when loading an SVMlight input file, and increment them again when saving as SVMlight.
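The line format and the id shift described above can be illustrated with a minimal sketch. The helper name parse_svmlight_line is hypothetical and is not part of the library; it only assumes the grammar quoted above, skips "qid", and decrements feature ids to make them 0-based.

```python
def parse_svmlight_line(line):
    """Parse one SVMlight line (hypothetical helper, not the library's code).

    Returns (doc, target), where doc is a list of 0-based (feature_id, value)
    pairs, or None for blank and comment-only lines.
    """
    line = line.split('#', 1)[0].strip()  # drop the trailing "# <info>" part
    if not line:
        return None
    parts = line.split()
    target = parts[0]
    doc = []
    for field in parts[1:]:
        feature, value = field.rsplit(':', 1)
        if feature == 'qid':  # ranking feature, ignored
            continue
        doc.append((int(feature) - 1, float(value)))  # 1-based -> 0-based
    return doc, target


doc, target = parse_svmlight_line('+1 qid:3 1:0.5 7:2.0 # some info')
```

Here feature 1 in the file becomes id 0 in memory, and feature 7 becomes id 6.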
Initialize the corpus from a file.
Return the document stored at file position offset.
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Save a corpus in the SVMlight format.
The SVMlight <target> class tag is set to 0 for all documents.
This function is automatically called by SvmLightCorpus.serialize; don’t call it directly, call serialize instead.
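As a rough sketch of what saving implies for a single document, assuming documents are lists of 0-based (feature_id, value) pairs: the hypothetical helper format_svmlight_line below (not a library function) writes a target of 0 and increments each id back to 1-based.

```python
def format_svmlight_line(doc, target=0):
    """Render one document as an SVMlight line (hypothetical helper).

    doc is a list of 0-based (feature_id, value) pairs; ids are written
    1-based, and the target defaults to 0 as described above.
    """
    pairs = ' '.join('%i:%s' % (feature_id + 1, value) for feature_id, value in doc)
    return '%s %s\n' % (target, pairs)


line = format_svmlight_line([(0, 0.5), (6, 2.0)])
```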
Iterate through the document stream corpus, saving the documents to fname and recording the byte offset of each document. Save the resulting index structure to the file index_fname (or fname.index if index_fname is not set).
This relies on the underlying corpus class serializer providing (in addition to standard iteration):
- the save_corpus method, which returns a sequence of byte offsets, one for each saved document,
- the docbyoffset(offset) method, which returns a document positioned at offset bytes within the persistent storage (file).
Example:
>>> from gensim.corpora import MmCorpus
>>> MmCorpus.serialize('test.mm', corpus)
>>> mm = MmCorpus('test.mm') # `mm` document stream now has random access
>>> print(mm[42]) # retrieve document no. 42, etc.
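The offset-based contract that serialize relies on can be sketched in plain Python. This is an illustration only, not the library's implementation: save_with_offsets and doc_by_offset are hypothetical stand-ins for save_corpus and docbyoffset, and the file name is made up.

```python
import os
import tempfile


def save_with_offsets(fname, corpus):
    """Write one SVMlight line per document; return each line's start offset."""
    offsets = []
    with open(fname, 'wb') as fout:
        for doc in corpus:
            offsets.append(fout.tell())  # byte position where this document starts
            pairs = ' '.join('%i:%s' % (i + 1, v) for i, v in doc)
            fout.write(('0 %s\n' % pairs).encode('utf8'))
    return offsets


def doc_by_offset(fname, offset):
    """Seek to a recorded offset and parse the single document stored there."""
    with open(fname, 'rb') as fin:
        fin.seek(offset)
        fields = fin.readline().decode('utf8').split()
    doc = []
    for field in fields[1:]:  # fields[0] is the target
        feature, value = field.split(':')
        doc.append((int(feature) - 1, float(value)))
    return doc


corpus = [[(0, 1.0)], [(2, 0.5), (4, 2.0)]]
path = os.path.join(tempfile.mkdtemp(), 'corpus.svmlight')
index = save_with_offsets(path, corpus)   # the "index structure"
second = doc_by_offset(path, index[1])    # random access to document no. 1
```

Because the index stores one offset per document, retrieving document number n costs a single seek and one line read rather than a scan of the file.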