dictionary

This module implements the concept of Dictionary – a mapping between words and their internal ids.

The actual process of id translation proceeds in three steps:
  1. get the input word (e.g. ‘Answering’)
  2. map the word to its normalized form (e.g. ‘answer’)
  3. map the normalized form to an integer id (e.g. 42)
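The three steps above can be sketched in plain Python; the lowercasing normalizer and the `token2id` mapping below are illustrative stand-ins, not the module's internals.

```python
# A toy sketch of the three-step id translation; `normalize` and
# `token2id` are illustrative stand-ins, not gensim internals.
def word_to_id(word, normalize, token2id):
    # step 1: take the input surface form, e.g. 'Answering'
    # step 2: map it to its normalized form
    norm = normalize(word)
    # step 3: map the normalized form to its integer id
    return token2id[norm]

token2id = {'answer': 42, 'question': 7}
word_to_id('Answer', str.lower, token2id)  # 42
```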

Dictionaries can be created from a corpus and can later be pruned according to document frequency (removing (un)common words via the filterExtremes() method), saved to and loaded from disk via the save() and load() methods, etc.

class gensim.corpora.dictionary.Dictionary

Dictionary encapsulates mappings between words, their normalized forms and ids of those normalized forms.

The main function is doc2bow, which converts a collection of words into its bag-of-words representation, optionally also updating the dictionary mappings with new words and their ids.

doc2bow(document, normalizeWord, allowUpdate=False)

Convert document (a list of words) into bag-of-words format = list of (tokenId, tokenCount) 2-tuples.

normalizeWord must be a function that accepts one utf-8 encoded string and returns another. Possible choices include the identity function, lowercasing, stemming, etc.

If allowUpdate is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its self.docFreq by one.
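The behaviour described above can be sketched as follows. This is an illustrative reimplementation, not gensim's code: the `token2id` and `doc_freq` dictionaries stand in for the Dictionary object's internal mappings.

```python
from collections import Counter

# Sketch of doc2bow: count normalized tokens, assign fresh ids when
# allow_update is true, and return sorted (tokenId, tokenCount) pairs.
# token2id and doc_freq are stand-ins for the dictionary's internal state.
def doc2bow(document, normalize_word, token2id, doc_freq, allow_update=False):
    counts = Counter(normalize_word(word) for word in document)
    result = {}
    for token, freq in counts.items():
        if token not in token2id:
            if not allow_update:
                continue  # unknown words are silently ignored
            token2id[token] = len(token2id)  # create an id for the new word
        result[token2id[token]] = freq
    if allow_update:
        for token_id in result:
            # document frequency counts documents, not occurrences
            doc_freq[token_id] = doc_freq.get(token_id, 0) + 1
    return sorted(result.items())

token2id, doc_freq = {}, {}
bow = doc2bow("The cat saw the dog".split(), str.lower,
              token2id, doc_freq, allow_update=True)
# bow == [(0, 2), (1, 1), (2, 1), (3, 1)]
```

Note that ‘the’ contributes a count of 2 to the bag-of-words vector but only 1 to its document frequency.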

filterExtremes(noBelow=5, noAbove=0.5)
Filter out tokens that appear in
  1. fewer than noBelow documents (absolute number), or
  2. more than noAbove documents (fraction of the total corpus size, not an absolute number).

At the same time, rebuild the dictionary, shrinking the resulting gaps in tokenIds (lowering len(self) and freeing up memory in the process).

Note that the same token may have a different tokenId before and after the call to this function!
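The keep/drop rule can be sketched as follows; `dfs` and `num_docs` are illustrative names for the per-token document frequencies and the corpus size.

```python
# Keep only tokens whose document frequency lies within the given bounds;
# dfs maps tokenId -> number of documents containing that token.
def good_ids(dfs, num_docs, no_below=5, no_above=0.5):
    return [tid for tid, df in dfs.items()
            if df >= no_below and df / num_docs <= no_above]

dfs = {0: 1, 1: 10, 2: 80}
kept = good_ids(dfs, num_docs=100)
# token 0 is too rare (1 < 5), token 2 too common (0.8 > 0.5): kept == [1]
```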

filterTokens(badIds)
Remove the selected tokens from all dictionary mappings.
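A minimal sketch of removing a set of bad ids from a token-to-id mapping; the function and variable names here are illustrative, not the method's implementation.

```python
# Drop every mapping whose id is listed in bad_ids; names are illustrative.
def filter_tokens(token2id, bad_ids):
    bad = set(bad_ids)
    return {token: tid for token, tid in token2id.items() if tid not in bad}

token2id = {'cat': 0, 'dog': 1, 'bird': 2}
filter_tokens(token2id, [1])  # {'cat': 0, 'bird': 2}
```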
static fromDocuments(documents, normalizeWord)

Build a dictionary from a collection of documents. Each document is a list of words (i.e. tokenized strings).

The normalizeWord function is used to convert each word to its utf-8 encoded canonical form (identity, lowercasing, stemming, ...); use whichever normalization suits you.

>>> print Dictionary.fromDocuments(["máma mele maso".split(), "ema má mama".split()], utils.deaccent)
Dictionary(5 unique tokens covering 6 surface forms)
classmethod load(fname)
Load a previously saved object from file (also see save).
rebuildDictionary()

Assign new tokenIds to all tokens.

This is done to make tokenIds more compact, i.e. after some tokens have been removed via filterTokens() and there are gaps in the tokenId series. Calling this method removes the gaps.
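The compaction can be sketched like this; the function name and the deterministic sorted-order assignment are assumptions of the sketch, not guarantees of the method.

```python
# Remap surviving (possibly sparse) ids onto the contiguous range 0..len-1.
# The sorted order here is an assumption for determinism, not gensim's rule.
def compactify(token2id):
    id_map = {old: new for new, old in enumerate(sorted(set(token2id.values())))}
    return {token: id_map[old] for token, old in token2id.items()}

sparse = {'cat': 2, 'dog': 7, 'bird': 11}
dense = compactify(sparse)
# dense == {'cat': 0, 'dog': 1, 'bird': 2}
```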

save(fname)
Save the object to file via pickling (also see load).
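Since save and load boil down to pickling, the round trip can be illustrated with the standard library alone; the object and file path below are arbitrary examples.

```python
import os
import pickle
import tempfile

# Illustration of a pickle round trip; the object and path are arbitrary.
obj = {'answer': 42}
fname = os.path.join(tempfile.mkdtemp(), 'dictionary.pkl')
with open(fname, 'wb') as f:
    pickle.dump(obj, f)      # what save(fname) does, in essence
with open(fname, 'rb') as f:
    loaded = pickle.load(f)  # what load(fname) does, in essence
```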
class gensim.corpora.dictionary.Token(token, intId)
Object representing a single token.
