Corpus for the DML-CZ project.
DmlConfig contains parameters necessary for the abstraction of a ‘corpus of articles’ (see the DmlCorpus class).
Articles may come from different sources (= different locations on disk/network, different file formats etc.), so the main purpose of DmlConfig is to keep all sources in one place.
Apart from gluing sources together, DmlConfig also decides where to store output files and which articles to accept for the corpus (= an additional filter over the sources).
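The three responsibilities described above (registering sources, choosing output locations, filtering articles) can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual DmlConfig interface: the class name SketchConfig, its methods and the "config id as filename prefix" convention are all hypothetical.

```python
import os


class SketchConfig:
    """Hypothetical stand-in for a config object; not the real DmlConfig API."""

    def __init__(self, config_id, result_dir, accept_langs=None):
        self.config_id = config_id
        self.result_dir = result_dir
        # Assumption: an empty set means "accept articles in any language".
        self.accept_langs = set(accept_langs or [])
        self.sources = {}  # source id -> source object

    def add_source(self, source_id, source):
        # Keep every article source in one place, keyed by its id.
        self.sources[source_id] = source

    def result_file(self, name):
        # Assumption: all output files share one directory and a common prefix.
        return os.path.join(self.result_dir, f"{self.config_id}_{name}")

    def accepts_language(self, lang):
        # The additional filter over the sources: reject unwanted languages.
        return not self.accept_langs or lang in self.accept_langs
```

With a design like this, the corpus never needs to know where files live or which articles are wanted; it only asks the config.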
DmlCorpus implements a collection of articles. It is initialized via a DmlConfig object, which holds information about where to look for the articles and how to process them.
Apart from being a regular corpus (bag-of-words iterable with a len() method), DmlCorpus has methods for building a dictionary (mapping between words and their ids).
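To make "bag-of-words iterable with a len() method" concrete, here is a minimal sketch of such a corpus. The class TinyCorpus and its in-memory document list are assumptions for illustration only; DmlCorpus itself reads articles from the configured sources.

```python
from collections import Counter


class TinyCorpus:
    """Hypothetical in-memory corpus: iterating yields one bag-of-words per document."""

    def __init__(self, tokenized_docs, token2id):
        self.docs = tokenized_docs
        self.token2id = token2id

    def __len__(self):
        # Number of documents in the corpus.
        return len(self.docs)

    def __iter__(self):
        for tokens in self.docs:
            # Count occurrences of each known token id; unknown tokens are skipped.
            counts = Counter(self.token2id[t] for t in tokens if t in self.token2id)
            # A bag-of-words vector is a sorted list of (token id, count) pairs.
            yield sorted(counts.items())
```

For example, `list(TinyCorpus([["a", "b", "a"]], {"a": 0, "b": 1}))` yields one document in which id 0 occurs twice and id 1 once.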
Populate dictionary mapping and statistics.
This is done by sequentially retrieving the article fulltexts, splitting them into tokens and converting tokens to their ids (creating new ids as necessary).
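The sequential pass described above could look roughly like the sketch below. It assumes whitespace tokenization and lowercasing, and it tracks document frequency as the only statistic; the function name build_dictionary and these choices are illustrative assumptions, not the tokenizer or statistics the project actually uses.

```python
def build_dictionary(fulltexts):
    """Map each token to an integer id, assigning new ids on first sight."""
    token2id = {}
    doc_freq = {}  # token id -> number of documents containing the token
    for text in fulltexts:
        # Assumption: naive lowercase + whitespace tokenization.
        tokens = text.lower().split()
        for token in tokens:
            if token not in token2id:
                # New ids are consecutive integers starting at 0.
                token2id[token] = len(token2id)
        # Count each token at most once per document.
        for token_id in {token2id[t] for t in tokens}:
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return token2id, doc_freq
```

Because ids are assigned in order of first appearance, the mapping depends on the order in which articles are retrieved.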
Parse the directories specified in the config, looking for suitable articles.
This updates the self.documents attribute, which keeps a list of (source id, article uri) 2-tuples. Each tuple uniquely identifies one article.
Note that some articles are ignored based on config settings (for example, if the article’s language doesn’t match any language specified in the config).
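The directory scan and the language filter can be sketched as below. Everything here is an assumption made for illustration: sources are modelled as a plain dict of source id to base directory, each file counts as one article, and detect_language is a hypothetical callback supplied by the caller.

```python
import os


def find_articles(sources, accept_langs, detect_language):
    """Return a list of (source id, article uri) 2-tuples for accepted articles."""
    documents = []
    for source_id, base_dir in sorted(sources.items()):
        for dirpath, dirnames, filenames in os.walk(base_dir):
            dirnames.sort()  # make the traversal order deterministic
            for name in sorted(filenames):
                uri = os.path.join(dirpath, name)
                # Assumption: an empty accept_langs set disables the filter.
                if accept_langs and detect_language(uri) not in accept_langs:
                    continue  # ignored based on config settings
                documents.append((source_id, uri))
    return documents
```

The source id is kept in each tuple because the same uri could, in principle, exist under two different sources.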
Store the corpus to disk, in a human-readable text format.
This actually saves multiple files. The exact filesystem paths and filenames are determined from the config.
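As a rough illustration of saving a corpus as several human-readable files, the sketch below writes two of them. The choice of files (bag-of-words triples and a token id mapping), their tab-separated layout, and the names save_as_text, "_bow.txt" and "_wordids.txt" are all assumptions; the text above does not specify which files are actually written.

```python
import os


def save_as_text(bow_docs, token2id, result_dir, prefix):
    """Write a corpus as plain text; returns the paths of the files written."""
    os.makedirs(result_dir, exist_ok=True)
    # Hypothetical naming scheme: a shared prefix plus a per-file suffix.
    vectors_path = os.path.join(result_dir, f"{prefix}_bow.txt")
    wordids_path = os.path.join(result_dir, f"{prefix}_wordids.txt")

    # File 1: one "doc number, token id, count" line per non-zero entry.
    with open(vectors_path, "w", encoding="utf-8") as out:
        for doc_no, bow in enumerate(bow_docs):
            for token_id, count in bow:
                out.write(f"{doc_no}\t{token_id}\t{count}\n")

    # File 2: the token id to token mapping, ordered by id.
    with open(wordids_path, "w", encoding="utf-8") as out:
        for token, token_id in sorted(token2id.items(), key=lambda item: item[1]):
            out.write(f"{token_id}\t{token}\n")

    return vectors_path, wordids_path
```

Keeping the mapping in its own file lets the vectors stay purely numeric while remaining interpretable.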