corpora.dictionary – Construct word<->id mappings

This module implements the concept of Dictionary – a mapping between words and their integer ids.

Dictionaries can be created from a corpus, and can later be pruned according to document frequency (removing (un)common words via the Dictionary.filter_extremes() method), saved/loaded from disk (via the Dictionary.save() and Dictionary.load() methods), etc.
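
As a quick orientation before the method reference, a minimal sketch (the toy corpus is illustrative):

>>> from gensim.corpora import Dictionary
>>>
>>> texts = [["human", "interface", "computer"],
...          ["survey", "user", "computer", "system"]]
>>> dictionary = Dictionary(texts)                                # build the word <-> id mapping
>>> bow = dictionary.doc2bow("human computer interface".split())  # bag-of-words: list of (token_id, count)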

class gensim.corpora.dictionary.Dictionary(documents=None)

Dictionary encapsulates the mapping between normalized words and their integer ids.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation, optionally also updating the dictionary mapping with newly encountered words and their ids.

add_documents(documents)

Build dictionary from a collection of documents. Each document is a list of tokens: tokenized and normalized utf-8 encoded strings.

This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.

>>> print(Dictionary(["máma mele maso".split(), "ema má máma".split()]))
Dictionary(5 unique tokens)
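
A sketch of incremental updates with add_documents (the documents here are illustrative):

>>> dictionary = Dictionary(["máma mele maso".split()])
>>> len(dictionary)
3
>>> dictionary.add_documents(["ema má máma".split()])  # extend the mapping in place
>>> len(dictionary)  # 'máma' was already known, so only 'ema' and 'má' are new
5
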
compactify()

Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.
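
A small sketch of the gap-closing behaviour (the token id is looked up via token2id, so the sketch does not depend on how ids were originally assigned):

>>> dictionary = Dictionary([["a", "b", "c"]])
>>> dictionary.filter_tokens(bad_ids=[dictionary.token2id["b"]])  # may leave a gap in the id series
>>> dictionary.compactify()                                       # reassign consecutive ids 0..len-1
>>> sorted(dictionary.keys())
[0, 1]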

doc2bow(document, allow_update=False, return_missing=False)

Convert document (a list of words) into bag-of-words format: a list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string.

If allow_update is set, also update the dictionary in the process: create ids for new words and, for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is const, i.e. read-only.
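
A short sketch of the update and return_missing behaviour (the exact ids assigned depend on the order in which words were added, so none are shown):

>>> dictionary = Dictionary(["máma mele maso".split()])
>>> bow = dictionary.doc2bow("ema má máma".split())  # read-only: 'ema' and 'má' are silently ignored
>>> bow, missing = dictionary.doc2bow("ema má máma".split(), return_missing=True)
>>> # missing maps each out-of-dictionary word to its in-document count: {'ema': 1, 'má': 1}
>>> bow = dictionary.doc2bow("ema má máma".split(), allow_update=True)  # now 'ema' and 'má' get fresh ids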

filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

Filter out tokens that appear in

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of the total corpus size, not an absolute number);
  3. after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n is None).

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
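
A sketch on a toy corpus, with the thresholds loosened from the defaults above:

>>> texts = [["common", "mid", "rare1"], ["common", "mid", "rare2"], ["common", "rare3"]]
>>> dictionary = Dictionary(texts)
>>> dictionary.filter_extremes(no_below=2, no_above=0.9, keep_n=None)
>>> len(dictionary)  # only 'mid' survives: 'common' is in all documents, each 'rareN' in just one
1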

filter_tokens(bad_ids=None, good_ids=None)

Remove the selected bad_ids tokens from all dictionary mappings, or keep the selected good_ids in the mapping and remove the rest.

bad_ids is a collection of word ids to be removed; good_ids is a collection of word ids to be kept.
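
A brief sketch of both modes:

>>> dictionary = Dictionary([["a", "b", "c"]])
>>> dictionary.filter_tokens(bad_ids=[dictionary.token2id["c"]])   # drop 'c'
>>> dictionary.filter_tokens(good_ids=[dictionary.token2id["a"]])  # keep only 'a'
>>> sorted(dictionary.token2id)
['a']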

keys()

Return a list of all token ids.

classmethod load(fname)

Load a previously saved object from file (also see save).

static load_from_text(fname)

Load a previously stored Dictionary from a text file. Mirror function to save_as_text.

save(fname)

Save the object to file via pickling (also see load).
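
A round-trip sketch (the file path is illustrative):

>>> dictionary = Dictionary([["a", "b"]])
>>> dictionary.save('/tmp/example.dict')          # pickle the whole object to disk
>>> loaded = Dictionary.load('/tmp/example.dict')
>>> loaded.token2id == dictionary.token2id
True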

save_as_text(fname)

Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE].

Note: use save()/load() to store in binary (pickle) format instead.
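
A text-format round-trip sketch (again, the file path is illustrative):

>>> dictionary = Dictionary([["a", "b"]])
>>> dictionary.save_as_text('/tmp/example_dict.txt')  # one "id<TAB>word<TAB>docfreq" line per token
>>> restored = Dictionary.load_from_text('/tmp/example_dict.txt')
>>> restored.token2id == dictionary.token2id
True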