corpora.lowcorpus – Corpus in List-of-Words format

Corpus in the List-of-Words format used by GibbsLDA++.

class gensim.corpora.lowcorpus.LowCorpus(fname, id2word=None, line2words=splitOnSpace)

The List-of-Words corpus handles input in the GibbsLDA++ format.

Quoting http://gibbslda.sourceforge.net/#3.2_Input_Data_Format:

Both data for training/estimating the model and new data (i.e., previously 
unseen data) have the same format as follows:

[M]
[document1]
[document2]
...
[documentM]

in which the first line is the total number for documents [M]. Each line
after that is one document. [document_i] is the i-th document of the dataset
that consists of a list of N_i words/terms.

[document_i] = [word_i1] [word_i2] ... [word_iN_i]

in which all [word_ij] (i=1..M, j=1..N_i) are text strings and they are separated
by the blank character.
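
To make the quoted layout concrete, here is a minimal sketch that parses List-of-Words text in plain Python. It does not use gensim; the helper name parse_low and the sample documents are made up for illustration, and the check against the header count is an assumption of this sketch, not documented behaviour of LowCorpus.

```python
import io

def parse_low(stream):
    """Parse List-of-Words text: the first line is M, then one document per line."""
    # The header line holds the declared number of documents.
    declared = int(stream.readline().strip())
    # Every remaining line is one document; words are separated by blanks.
    docs = [line.split() for line in stream]
    # Sanity check added for this sketch only.
    if len(docs) != declared:
        raise ValueError("header says %d documents, found %d" % (declared, len(docs)))
    return docs

# Hypothetical sample input with M = 3 documents.
sample = "3\nhuman computer interface\ngraph trees graph\nsystem user\n"
docs = parse_low(io.StringIO(sample))
print(docs)
```

Each document comes back as a list of strings in its original order, which is the tokenized form that a word-to-id mapping can then be applied to.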

Initialize the corpus from a file.

id2word and line2words are optional parameters.

If provided, id2word is a dictionary mapping between wordIds (integers) and words (strings). If not provided, the mapping is constructed from the documents.

line2words is a function which converts lines into tokens. Defaults to simple splitting on spaces.
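
The roles of the two optional parameters can be sketched without gensim as follows. Everything here is illustrative: split_on_space is a stand-in for the default tokenizer, split_lowercase is a hypothetical custom line2words, and build_id2word and to_bow are invented helpers. How LowCorpus itself assigns word ids is not stated above, so the sorted-vocabulary numbering is only an assumption of this sketch.

```python
from collections import Counter

def split_on_space(line):
    # Stand-in for the default tokenizer: split on spaces, drop empty tokens.
    return [w for w in line.strip().split(" ") if w]

def split_lowercase(line):
    # Hypothetical custom line2words: case-fold, then split on whitespace.
    return line.lower().split()

def build_id2word(lines, line2words=split_on_space):
    # Number the sorted vocabulary (an arbitrary choice made for this sketch).
    vocab = sorted({w for line in lines for w in line2words(line)})
    return dict(enumerate(vocab))

def to_bow(line, id2word, line2words=split_on_space):
    # Convert one document line into sorted (word_id, count) pairs.
    word2id = {w: i for i, w in id2word.items()}
    counts = Counter(w for w in line2words(line) if w in word2id)
    return sorted((word2id[w], n) for w, n in counts.items())

lines = ["Graph trees graph", "system user"]
id2word = build_id2word(lines, line2words=split_lowercase)
bow = to_bow(lines[0], id2word, line2words=split_lowercase)
print(id2word)
print(bow)
```

With the default tokenizer, "Graph" and "graph" would count as two distinct words; swapping in a different line2words changes the vocabulary without touching the file.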

classmethod load(fname)

Load a previously saved object from file (also see save).

save(fname)

Save the object to file via pickling (also see load).

static saveCorpus(fname, corpus, id2word=None)

Save a corpus in the List-of-Words format.
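
As a rough picture of what writing this format involves, the sketch below turns a bag-of-words corpus back into List-of-Words text. It is not the saveCorpus implementation: write_low is a hypothetical helper, it assumes integer counts, and it requires an explicit id2word mapping because the behaviour for id2word=None is not described here.

```python
import io

def write_low(stream, corpus, id2word):
    """Write (word_id, count) documents as List-of-Words text."""
    corpus = list(corpus)
    # Header line: the total number of documents M.
    stream.write("%d\n" % len(corpus))
    for doc in corpus:
        words = []
        for word_id, count in doc:
            # Repeat each word as many times as it occurred in the document.
            words.extend([id2word[word_id]] * int(count))
        stream.write(" ".join(words) + "\n")

# Hypothetical mapping and corpus used only for this example.
id2word = {0: "graph", 1: "system", 2: "trees", 3: "user"}
corpus = [[(0, 2), (2, 1)], [(1, 1), (3, 1)]]
buf = io.StringIO()
write_low(buf, corpus, id2word)
text = buf.getvalue()
print(text)
```

A bag-of-words document keeps no word order, so the written lines group repeated words together instead of reproducing the original text.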
