corpora.dictionary – Construct word<->id mappings

This module implements the concept of Dictionary – a mapping between words and their integer ids.

Dictionaries can be created from a corpus and later pruned according to document frequency (removing (un)common words via the Dictionary.filter_extremes() method), saved to and loaded from disk (via the Dictionary.save() and Dictionary.load() methods), and so on.

class gensim.corpora.dictionary.Dictionary(documents=None)

Dictionary encapsulates the mapping between normalized words and their integer ids.

The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.

add_documents(documents)

Build dictionary from a collection of documents. Each document is a list of tokens: tokenized and normalized utf-8 encoded strings.

This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.

>>> print(Dictionary(["máma mele maso".split(), "ema má máma".split()]))
Dictionary(5 unique tokens)

compactify()

Assign new word ids to all words.

This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.

doc2bow(document, allow_update=False, return_missing=False)

Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string. No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.

If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.

If allow_update is not set, this function is read-only: the dictionary is left unchanged.

filter_extremes(no_below=5, no_above=0.5, keep_n=100000)

Filter out tokens that appear in

  1. fewer than no_below documents (absolute number), or
  2. more than no_above documents (fraction of the total corpus size, not an absolute number);
  3. after (1) and (2), keep only the first keep_n most frequent tokens (or keep all if keep_n is None).

After the pruning, shrink resulting gaps in word ids.

Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!

filter_tokens(bad_ids=None, good_ids=None)

Remove the selected bad_ids tokens from all dictionary mappings, or, keep selected good_ids in the mapping and remove the rest.

bad_ids and good_ids are collections of word ids: the former to be removed, the latter to be kept.

static from_corpus(corpus)

Create Dictionary from an existing corpus. This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus.

This will scan the term-document count matrix for all word ids that appear in it, then construct and return a Dictionary that maps each word_id -> str(word_id).

keys()

Return a list of all token ids.

classmethod load(fname)

Load a previously saved object from file (also see save).

static load_from_text(fname)

Load a previously stored Dictionary from a text file. Mirror function to save_as_text.

save(fname)

Save the object to file via pickling (also see load).

save_as_text(fname)

Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE].

Note: use save/load to store in binary format instead (pickle).