VariKN Language Modeling Toolkit manual page

The variable-order Kneser-Ney smoothed n-gram model toolkit is a specialized toolkit for growing and pruning Kneser-Ney smoothed n-gram models. If you are looking for a general-purpose language modeling toolkit, you should probably look elsewhere (for example, the SRILM toolkit). Python wrappers generated with SWIG are also provided; see python-wrapper/tests for examples.

General

File processing features

The tools support reading and writing of plain ASCII files, stdin and stdout, gzipped files, bzip2-compressed files and UNIX pipes. Whether a file is used as is or decompressed is determined by the file suffix (.gz, .bz2, .anything_else). Stdin and stdout are denoted by "-". Reading from a pipe can be forced with the UNIX pipe symbol (for example "| cat file.txt | preprocess.pl"; note the leading "|").
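As an illustration, any of the following forms can be given wherever the tools below take a file argument (the file names are hypothetical):

train.txt                                plain text, used as is
train.txt.gz                             compressed/decompressed on the fly
train.txt.bz2                            compressed/decompressed on the fly
-                                        read from stdin / write to stdout
"| zcat train.txt.gz | preprocess.pl"    forced read from a UNIX pipe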

Default tags

The sentence start tag "<s>" and end tag "</s>" should be added to the training data by the user. It is also possible to train models without explicit sentence boundaries, but this is not theoretically justified. For sub-word models, the tag "<w>" is reserved to signify a word break. For sub-word models with sentence breaks, the data is assumed to be processed into the following format:
<s> <w> w1-1 w1-2 w1-3 <w> w2-1 <w> w3-1 w3-2 <w> </s>
where wA-B is the Bth part of the Ath word.
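For a plain word-level model, each training sentence should correspondingly be wrapped in the sentence tags, for example (an illustrative line of training data):
<s> this is an example sentence </s>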

Programs:

counts2kn

counts2kn trains a Kneser-Ney smoothed model from the training data using all n-grams up to the given order n, and then optionally prunes the model.

counts2kn [OPTIONS] text_in lm_out

text_in should contain the training set (excluding the held-out part used for discount optimization) and the results are written to lm_out.

Mandatory options:
-n --norder       The desired n-gram model order.

Other options:
-h --help         Print help.
-o --opti         The file containing the held-out part of the training set, used for training the discounts. A suitable size for the held-out set is around 100 000 words/tokens. If not set, leave-one-out discount estimates are used.
-a --arpa         Output arpa instead of binary. Recommended for compatibility with other tools; this is the only output compatible with the SRILM toolkit.
-x --narpa        Output nonstandard interpolated arpa instead of binary. Saves a little memory during model creation, but the resulting models should be converted to standard back-off form.
-p --prunetreshold  Pruning threshold for removing the least useful n-grams from the model. 0.0 for no pruning, 1.0 for lots of pruning. Corresponds to epsilon in [1].
-f --nfirst       Number of the most common words to be included in the language model vocabulary.
-d --ndrop        Drop words seen fewer than x times from the language model vocabulary.
-s --smallvocab   Assume that the vocabulary does not exceed 65000 words. Saves a lot of memory.
-A --absolute     Use absolute discounting instead of Kneser-Ney smoothing.
-C --clear_history  Clear the language model history at sentence boundaries. Recommended.
-3 --3nzer        Use modified KN smoothing, that is, three discount parameters per model order. Recommended. Increases memory consumption somewhat; omit it if memory is tight.
-O --cutoffs      Use count cutoffs, --cutoffs "val1 val2 ... valN". Remove n-grams seen at most val times. A value is specified for each order of the model; if cutoffs are only given for the first few orders, the last cutoff value is used for all higher-order n-grams. All unigrams are included in any case, so if several cutoff values are given, val1 has no real effect (see the example after this list).
-L --longint      Store the counts in a "long int" variable instead of "int". This is necessary when the number of tokens in the training set exceeds what can be stored in a regular integer. Increases memory consumption somewhat.
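As a sketch of how these options combine (the file names are illustrative, not part of the toolkit), a 4-gram arpa model with count cutoffs could be trained with:

counts2kn -a -C -3 -n 4 -O "0 1 2" -o held_out.txt train.txt model.arpa

With these cutoffs, all unigrams are kept, bigrams seen at most once are removed, and trigrams and 4-grams seen at most twice are removed (the last value applies to all higher orders).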

varigram_kn

Performs incremental growing of a Kneser-Ney smoothed n-gram model.

varigram_kn [OPTIONS] text_in lm_out

text_in should contain the training set (excluding the held-out part used for discount optimization) and the results are written to lm_out.

Mandatory options:
-D --dscale       The threshold for accepting new n-grams to the model. 0.05 for generating a fairly small model, 0.001 for a large model. Corresponds to delta in [1].

Other options:
-h --help         Print help.
-o --opti         The file containing the held-out part of the training set, used for training the discounts. A suitable size for the held-out set is around 100 000 words/tokens. If not specified, leave-one-out estimates are used for the discounts.
-n --norder       Maximum n-gram order that will be searched.
-a --arpa         Output arpa instead of binary. Recommended for compatibility with other tools; this is the only output compatible with the SRILM toolkit.
-x --narpa        Output nonstandard interpolated arpa instead of binary. Saves a little memory during model creation, but the resulting models should be converted to standard back-off form.
-E --dscale2      Pruning threshold for removing the least useful n-grams from the model. 1.0 for lots of pruning, 0 for no pruning. Corresponds to epsilon in [1].
-f --nfirst       Number of the most common words to be included in the language model vocabulary.
-d --ndrop        Drop words seen fewer than x times from the language model vocabulary.
-s --smallvocab   Assume that the vocabulary does not exceed 65000 words. Saves a lot of memory.
-A --absolute     Use absolute discounting instead of Kneser-Ney smoothing.
-C --clear_history  Clear the language model history at sentence boundaries. Recommended.
-3 --3nzer        Use modified KN smoothing, that is, three discount parameters per model order. Recommended. Increases memory consumption somewhat; omit it if memory is tight.
-S --smallmem     Do not load the training data into memory; instead, read it from disk each time it is needed. Saves some memory but slows training down somewhat.
-O --cutoffs      Use count cutoffs, --cutoffs "val1 val2 ... valN". Remove n-grams seen at most val times. A value is specified for each order of the model; if cutoffs are only given for the first few orders, the last cutoff value is used for all higher-order n-grams. All unigrams are included in any case, so if several cutoff values are given, val1 has no real effect.
-L --longint      Store the counts in a "long int" variable instead of "int". This is necessary when the number of tokens in the training set exceeds what can be stored in a regular integer. Increases memory consumption somewhat.
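As a rough sketch (file names are illustrative, and the -D and -E values are merely examples within the ranges given above, not recommendations), a grown model could be trained with:

varigram_kn -a -C -3 -D 0.01 -E 0.1 -n 8 -o held_out.txt train.txt.gz grown.arpa.gz

Smaller -D and -E values produce larger models.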

perplexity

Calculates the model perplexity and cross-entropy with respect to the test set.

perplexity [OPTIONS] text_in results_out

text_in should contain the test set and the results are printed to results_out.

Mandatory options:
-a --arpa         The input language model is in standard arpa or interpolated arpa format.
-A --bin          The input language model is in binary format. Either "-a" or "-A" must be specified.

Other options:
-h --help         Print help.
-C --ccs          File containing the list of context cues that should be ignored during perplexity computation.
-W --wb           File containing the word break symbols. The language model is assumed to be a sub-word n-gram model in which word breaks are explicitly marked.
-X --mb           File containing morph break prefixes or postfixes. The language model is assumed to be a sub-word n-gram model in which morphs that are not preceded (or followed) by a word break are marked with a prefix (or postfix) string. Prefix strings start with "^" (e.g. "^#" tells that a token starting with "#" is not preceded by a word break) and postfix strings end with "$" (e.g. "+$" tells that a token ending with "+" is not followed by a word break). The file should also include the sentence start and end tags (e.g. "^<s>" and "^</s>"), otherwise they are considered regular words.
-u --unk          The given string is used as the unknown word symbol. For compatibility reasons only.
-U --unkwarn      Warn if unknown tokens are seen.
-i --interpolate  Interpolate with the given arpa LM.
-I --inter_coeff  Interpolation coefficient. The interpolated model is weighted by coeff and the main model by 1.0-coeff (see the sketch after this list).
-t --init_hist    The number of symbols assumed to be known at the sentence start. Normally 1; for sub-word n-grams the initial word break should also be assumed known and this should be set to 2. Default 0 (fix this).
-S --probstream   The file to which the individual probabilities assigned to each token are written.
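As an illustrative sketch (file names are hypothetical), evaluating an arpa model interpolated with a second background model could look like:

perplexity -a main.arpa -i background.arpa -I 0.3 -t 1 test_set.txt results.txt

Here the background model gets weight 0.3 and the main model weight 0.7, following the definition of --inter_coeff above.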

simpleinterpolate2arpa

Outputs an approximate interpolation of two arpa backoff models. Exact interpolation cannot be expressed as an arpa model. Experimental code.

simpleinterpolate2arpa "lm1_in.arpa,weight1;lm2_in.arpa,weight2" arpa_out
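For example (file names and weights are purely illustrative):

simpleinterpolate2arpa "news.arpa,0.7;web.arpa,0.3" mixed.arpa

combines the two input models with weights 0.7 and 0.3 and writes the approximate interpolation to mixed.arpa.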

Examples

To train a 3-gram model do:
counts2kn -an 3 -p 0.1 -o held_out.txt train.txt model.arpa
or
counts2kn -asn 3 -p 0.1 -o held_out.txt train.txt.gz model.arpa.bz2
Adding the flag "-s" reduces memory use, but limits the vocabulary to fewer than 65000 words.

To evaluate the just created model:
perplexity -t 1 -a model.arpa.bz2 test_set.txt -
or
perplexity -S stream_out.txt.gz -t 1 -a model.arpa.bz2 "| cat test_set.txt | preprocess.pl" out.txt
Note that for evaluating a language model based on sub-word units, the parameter -t 2 should be used, since the first two tokens (sentence start and word break) are assumed to be known.
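As a sketch of such a sub-word evaluation (file names are hypothetical; wb.txt is assumed to contain the word break symbol "<w>"):
perplexity -t 2 -W wb.txt -a subword_model.arpa.gz subword_test.txt -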

To create a grown model do:
varigram_kn -a -o held_out.txt -D 0.1 -E 0.25 -s -C train.txt grown.arpa.gz

Known needs for improvement