Blei’s LDA-C format.
Corpus in Blei’s LDA-C format.
The corpus is represented as two files: one describing the documents, and another describing the mapping between words and their ids.
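Each document is one line: N fieldId1:fieldValue1 fieldId2:fieldValue2 ... fieldIdN:fieldValueN, where N is the number of distinct words in the document, each fieldId is a word id, and each fieldValue is the count of that word in the document.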
The vocabulary is a file with words, one word per line; the word at line K has an implicit id=K (lines are counted from zero).
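The two file layouts described above can be read with a few lines of plain Python. This is a minimal sketch, not the library's own loader: the helper names parse_ldac_line and load_vocab are hypothetical, and it assumes integer counts and word ids that start at zero.

```python
def parse_ldac_line(line):
    """Parse one document line into a list of (word_id, count) pairs."""
    parts = line.split()
    n_fields = int(parts[0])  # leading N: number of id:count fields that follow
    pairs = []
    for field in parts[1:]:
        word_id, count = field.split(":")
        # Assumption: counts are integers, as in the original LDA-C tool.
        pairs.append((int(word_id), int(count)))
    if len(pairs) != n_fields:
        raise ValueError(
            "expected %d fields, found %d" % (n_fields, len(pairs))
        )
    return pairs


def load_vocab(lines):
    """Map each vocabulary line to its implicit id (assumed zero-based)."""
    return {word_id: word.strip() for word_id, word in enumerate(lines)}


doc = parse_ldac_line("3 0:2 4:1 7:5")
vocab = load_vocab(["apple\n", "banana\n", "cherry\n"])
```

Here doc holds the pairs (0, 2), (4, 1) and (7, 5), and vocab maps 0, 1 and 2 to the three words in file order.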
Initialize the corpus from a file.
fnameVocab is the file with vocabulary; if not specified, it defaults to fname.vocab, that is, fname with the .vocab suffix appended.
Save a corpus in the LDA-C format.
There are actually two files saved: fname and fname.vocab, where fname.vocab is the vocabulary file.
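The sketch below illustrates how the two output files relate to each other. It is an illustration under stated assumptions rather than the library's implementation: save_ldac is a hypothetical helper, the corpus is assumed to be a list of documents made of (word_id, count) pairs, and the "---" placeholder written for ids missing from the mapping is an assumption of this example.

```python
import os
import tempfile


def save_ldac(fname, corpus, id2word):
    """Write documents to fname and the vocabulary to fname.vocab."""
    with open(fname, "w", encoding="utf-8") as fout:
        for doc in corpus:
            # One line per document: N followed by N id:count fields.
            fields = [str(len(doc))]
            fields.extend("%d:%d" % (word_id, count) for word_id, count in doc)
            fout.write(" ".join(fields) + "\n")

    # The vocabulary file has one word per line; the line position is the id,
    # so gaps in the id range must still produce a line.
    num_terms = max(id2word) + 1 if id2word else 0
    with open(fname + ".vocab", "w", encoding="utf-8") as fout:
        for word_id in range(num_terms):
            # Assumption: "---" stands in for ids with no known word.
            fout.write(id2word.get(word_id, "---") + "\n")


# Usage: write a tiny corpus to a temporary directory and read both files back.
with tempfile.TemporaryDirectory() as tmpdir:
    fname = os.path.join(tmpdir, "corpus.lda-c")
    save_ldac(
        fname,
        [[(0, 2), (2, 1)], [(1, 3)]],
        {0: "apple", 1: "banana", 2: "cherry"},
    )
    with open(fname, encoding="utf-8") as f:
        doc_lines = f.read().splitlines()
    with open(fname + ".vocab", encoding="utf-8") as f:
        vocab_lines = f.read().splitlines()
```

Writing a line for every id up to the largest one keeps the implicit line-to-id mapping of the vocabulary file intact even when some ids are unused.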