corpora.dmlcorpus – Corpus in DML-CZ format

Corpus for the DML-CZ project.

class gensim.corpora.dmlcorpus.DmlConfig(configId, resultDir, acceptLangs=None)

DmlConfig contains parameters necessary for the abstraction of a ‘corpus of articles’ (see the DmlCorpus class).

Articles may come from different sources (different locations on disk or network, different file formats, etc.), so the main purpose of DmlConfig is to keep all sources in one place.

Apart from gluing sources together, DmlConfig also decides where to store output files and which articles to accept for the corpus (an additional filter over the sources).
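The role described above can be illustrated with a minimal sketch. SketchConfig, add_source and accepts are hypothetical names chosen for this example, not gensim's implementation, and treating accept_langs=None as "accept any language" is an assumption.

```python
class SketchConfig:
    """Hypothetical stand-in for a DmlConfig-like object (not gensim's code)."""

    def __init__(self, config_id, result_dir, accept_langs=None):
        self.config_id = config_id
        self.result_dir = result_dir  # where output files would be stored
        # Assumption: None means "accept articles in any language".
        self.accept_langs = None if accept_langs is None else set(accept_langs)
        self.sources = {}  # source id -> location of that source

    def add_source(self, source_id, location):
        # Keep all article sources in one place, keyed by their id.
        self.sources[source_id] = location

    def accepts(self, lang):
        # Additional filter over the sources: accept only selected languages.
        return self.accept_langs is None or lang in self.accept_langs


config = SketchConfig("demo", "/tmp/results", accept_langs=["eng"])
config.add_source("local", "/data/articles")
config.add_source("remote", "/mnt/share/articles")
```

Here config.accepts("eng") is True and config.accepts("cze") is False, while both sources remain reachable through config.sources.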

class gensim.corpora.dmlcorpus.DmlCorpus

DmlCorpus implements a collection of articles. It is initialized via a DmlConfig object, which holds information about where to look for the articles and how to process them.

Apart from being a regular corpus (bag-of-words iterable with a len() method), DmlCorpus has methods for building a dictionary (mapping between words and their ids).
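As an illustration of what "regular corpus" means here, the following sketch shows an object that can be iterated over to yield bag-of-words documents and that supports len(). ListCorpus is a hypothetical class for this example only; it keeps everything in memory, which DmlCorpus need not do.

```python
class ListCorpus:
    """Hypothetical in-memory corpus: iterable of bag-of-words documents."""

    def __init__(self, bow_docs):
        # Each document is a list of (token id, count) 2-tuples.
        self._docs = [list(doc) for doc in bow_docs]

    def __iter__(self):
        # A fresh iterator on every call, so the corpus can be read repeatedly.
        for doc in self._docs:
            yield doc

    def __len__(self):
        return len(self._docs)


corpus = ListCorpus([[(0, 2), (1, 1)], [(1, 3)]])
num_docs = len(corpus)
first_pass = list(corpus)
second_pass = list(corpus)
```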

articleDir(docNo)

Return absolute normalized path on filesystem to article no. docNo.

buildDictionary()

Populate dictionary mapping and statistics.

This is done by sequentially retrieving the article fulltexts, splitting them into tokens and converting tokens to their ids (creating new ids as necessary).
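The procedure can be sketched in plain Python as follows. This is a simplified illustration, not gensim's code: build_dictionary and doc2bow are names chosen for the example, and tokenization by lowercasing and splitting on whitespace is an assumption.

```python
from collections import Counter


def build_dictionary(texts):
    """Return (token2id, doc_freq) built by one sequential pass over texts."""
    token2id = {}
    doc_freq = {}  # token id -> number of documents containing the token
    for text in texts:
        tokens = text.lower().split()  # assumed tokenizer
        for token in tokens:
            if token not in token2id:
                token2id[token] = len(token2id)  # create a new id as necessary
        for token in set(tokens):
            token_id = token2id[token]
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return token2id, doc_freq


def doc2bow(text, token2id):
    """Convert one text to a sorted list of (token id, count) pairs."""
    counts = Counter(token2id[t] for t in text.lower().split() if t in token2id)
    return sorted(counts.items())


texts = ["the cat sat", "the dog sat down"]
token2id, doc_freq = build_dictionary(texts)
bow = doc2bow("the cat the", token2id)
```

Ids are assigned in order of first appearance, so "the" receives id 0 and "cat" id 1 in this example.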

getMeta(docNo)

Return metadata for article no. docNo.

classmethod load(fname)

Load a previously saved object from file (also see save).

processConfig(config, shuffle=False)

Parse the directories specified in the config, looking for suitable articles.

This updates the self.documents attribute, which holds a list of (source id, article uri) 2-tuples. Each tuple uniquely identifies one article.

Note that some articles are ignored based on config settings (for example, if the article’s language doesn’t match any language specified in the config).
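A rough sketch of this collection step is shown below. The collect_documents function, the shape of the sources argument, and the seed parameter are assumptions made for illustration; the real method reads its sources and filters from the config object.

```python
import random


def collect_documents(sources, accept_langs=None, shuffle=False, seed=None):
    """Return a list of (source id, article uri) 2-tuples.

    sources maps a source id to a list of (article uri, language) pairs.
    """
    documents = []
    for source_id, articles in sorted(sources.items()):
        for uri, lang in articles:
            # Ignore articles whose language is not accepted.
            if accept_langs is not None and lang not in accept_langs:
                continue
            documents.append((source_id, uri))
    if shuffle:
        # Optional shuffling; the seed only makes the example repeatable.
        random.Random(seed).shuffle(documents)
    return documents


sources = {
    "a": [("a/1", "eng"), ("a/2", "cze")],
    "b": [("b/1", "eng")],
}
documents = collect_documents(sources, accept_langs={"eng"})
shuffled = collect_documents(sources, accept_langs={"eng"}, shuffle=True, seed=0)
```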

save(fname)

Save the object to file via pickling (also see load).
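The save and load pair can be sketched with the standard pickle module. PickledState is a hypothetical class for this example; it pickles only the instance's attribute dictionary, which is a simplification chosen here and may differ from how gensim stores the object.

```python
import os
import pickle
import tempfile


class PickledState:
    """Hypothetical object with save/load methods (not gensim's code)."""

    def __init__(self, documents):
        self.documents = list(documents)

    def save(self, fname):
        # Pickle only the attribute dictionary of this instance.
        with open(fname, "wb") as fout:
            pickle.dump(self.__dict__, fout, protocol=pickle.HIGHEST_PROTOCOL)

    @classmethod
    def load(cls, fname):
        with open(fname, "rb") as fin:
            state = pickle.load(fin)
        obj = cls.__new__(cls)  # bypass __init__; state is restored below
        obj.__dict__.update(state)
        return obj


path = os.path.join(tempfile.mkdtemp(), "corpus.pkl")
original = PickledState([("local", "a/1"), ("remote", "b/1")])
original.save(path)
restored = PickledState.load(path)
```

Unpickling can execute arbitrary code, so files saved this way should only be loaded from trusted sources.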

saveAsText()

Store the corpus to disk, in a human-readable text format.

This actually saves multiple files:

  1. Pure document-term co-occurrence frequency counts, as a Matrix Market file.
  2. Token to integer mapping, as a text file.
  3. Document to document URI mapping, as a text file.

The exact filesystem paths and filenames are determined from the config.
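The three outputs listed above might look like the following sketch. The save_as_text function, the file names, and the tab-separated layout of the two mapping files are assumptions for illustration; only the Matrix Market coordinate layout (a header line, a "rows columns entries" line, then 1-based "row column value" triples) follows the published format.

```python
import os
import tempfile


def save_as_text(bow_docs, id2word, doc_uris, out_dir, prefix="sketch"):
    """Write three text files and return their paths (file names are made up)."""
    mm_path = os.path.join(out_dir, prefix + "_bow.mm")
    words_path = os.path.join(out_dir, prefix + "_wordids.txt")
    docs_path = os.path.join(out_dir, prefix + "_docids.txt")

    # 1. Document-term counts in Matrix Market coordinate format (1-based).
    num_entries = sum(len(doc) for doc in bow_docs)
    with open(mm_path, "w", encoding="utf-8") as fout:
        fout.write("%%MatrixMarket matrix coordinate integer general\n")
        fout.write("%d %d %d\n" % (len(bow_docs), len(id2word), num_entries))
        for doc_no, doc in enumerate(bow_docs, start=1):
            for term_id, count in sorted(doc):
                fout.write("%d %d %d\n" % (doc_no, term_id + 1, count))

    # 2. Token id to token mapping, one tab-separated pair per line.
    with open(words_path, "w", encoding="utf-8") as fout:
        for term_id in sorted(id2word):
            fout.write("%d\t%s\n" % (term_id, id2word[term_id]))

    # 3. Document number to document URI mapping.
    with open(docs_path, "w", encoding="utf-8") as fout:
        for doc_no, uri in enumerate(doc_uris):
            fout.write("%d\t%s\n" % (doc_no, uri))

    return mm_path, words_path, docs_path


out_dir = tempfile.mkdtemp()
mm_path, words_path, docs_path = save_as_text(
    bow_docs=[[(0, 2), (1, 1)], [(1, 3)]],
    id2word={0: "cat", 1: "dog"},
    doc_uris=["a/1", "b/1"],
    out_dir=out_dir,
)
```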

static saveCorpus(fname, corpus, id2word=None)

Save an existing corpus to disk.

Some formats also support saving the dictionary (feature_id->word mapping), which can in this case be provided by the optional id2word parameter.

>>> MmCorpus.saveCorpus('file.mm', corpus)

Some corpora also support an index of where each document begins, so that documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In this case, serialize calls saveCorpus internally and saves the index at the same time, so store the corpus with:

>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents
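The idea behind such an index can be sketched without gensim: record the byte offset at which each document starts while writing, then seek straight to that offset when reading. The functions below and the "id:count" line format are hypothetical and chosen only to keep the example short; they are not the Matrix Market format and not gensim's implementation.

```python
import io


def serialize_with_index(docs, stream):
    """Write one document per line; return the byte offset of each document."""
    offsets = []
    for doc in docs:
        offsets.append(stream.tell())  # where this document begins
        line = " ".join("%d:%d" % pair for pair in doc) + "\n"
        stream.write(line.encode("utf-8"))
    return offsets


def read_document(stream, offsets, doc_no):
    """Jump directly to document doc_no instead of scanning from the start."""
    stream.seek(offsets[doc_no])
    items = stream.readline().decode("utf-8").split()
    return [tuple(int(x) for x in item.split(":")) for item in items]


buffer = io.BytesIO()  # stands in for a file opened in binary mode
docs = [[(0, 2), (1, 1)], [], [(1, 3)]]
offsets = serialize_with_index(docs, buffer)
third = read_document(buffer, offsets, 2)
```

Reading the third document touches only its own line, regardless of how many documents precede it.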