This module implements the concept of Dictionary – a mapping between words and their integer ids.
Dictionaries can be created from a corpus and later pruned according to document frequency (removing overly common or rare words via the Dictionary.filter_extremes() method), saved to and loaded from disk (via the Dictionary.save() and Dictionary.load() methods), and so on.
Dictionary encapsulates the mapping between normalized words and their integer ids.
The main function is doc2bow, which converts a collection of words to its bag-of-words representation, optionally also updating the dictionary mapping with newly encountered words and their ids.
Build a dictionary from a collection of documents. Each document is a list of tokens, i.e. tokenized and normalized utf-8 encoded strings.
This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.
>>> print(Dictionary(["máma mele maso".split(), "ema má máma".split()]))
Dictionary(5 unique tokens)
Assign new word ids to all words.
This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.
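The id remapping can be sketched in plain Python (a simplified illustration, not gensim's actual code; here `token2id` and `dfs` stand in for the Dictionary attributes of the same names, and new dicts are returned rather than mutated in place):

```python
def compactify(token2id, dfs):
    """Reassign ids 0..N-1 so there are no gaps after token removal.

    Simplified sketch: returns new (token2id, dfs) dicts with consecutive ids.
    """
    # map each surviving old id to a fresh consecutive id
    id_map = {old_id: new_id for new_id, old_id in enumerate(sorted(token2id.values()))}
    new_token2id = {word: id_map[old_id] for word, old_id in token2id.items()}
    new_dfs = {id_map[old_id]: df for old_id, df in dfs.items()}
    return new_token2id, new_dfs

# ids 0 and 3 were removed earlier, leaving gaps in the id series
token2id = {"cat": 1, "dog": 4, "fish": 2}
dfs = {1: 10, 4: 3, 2: 7}
print(compactify(token2id, dfs))
```

The relative order of the surviving ids is preserved, so the word that previously had the smallest id still does afterwards.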
Convert a document (a list of words) into the bag-of-words format, i.e. a list of (tokenId, tokenCount) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string.
If allow_update is set, also update the dictionary in the process: create ids for new words and, for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function is const, i.e. read-only.
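The core of this behavior can be sketched in plain Python (a simplified stand-in for the real method, not gensim's implementation; `token2id` and `dfs` mirror the Dictionary attributes of the same names):

```python
from collections import Counter

def doc2bow(document, token2id, dfs, allow_update=False):
    """Convert a list of tokens to a sparse [(token_id, count), ...] vector.

    Simplified sketch: token2id maps word -> id, dfs maps id -> document
    frequency. With allow_update=True, unseen words get fresh ids and
    document frequencies are incremented; otherwise unseen words are ignored.
    """
    counts = Counter(document)
    result = {}
    for word, freq in counts.items():
        if word not in token2id:
            if not allow_update:
                continue  # read-only mode: skip unknown words
            token2id[word] = len(token2id)  # assign the next free id
        result[token2id[word]] = freq
    if allow_update:
        # each distinct word in this document adds 1 to its document frequency
        for token_id in result:
            dfs[token_id] = dfs.get(token_id, 0) + 1
    return sorted(result.items())

token2id, dfs = {}, {}
print(doc2bow("to be or not to be".split(), token2id, dfs, allow_update=True))
# → [(0, 2), (1, 2), (2, 1), (3, 1)]
```

Note that repeated words collapse into a single (id, count) pair, and words absent from the document simply do not appear in the output vector.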
Filter out tokens that appear in fewer than no_below documents (absolute number) or in more than no_above documents (fraction of the total corpus size).
After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
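A minimal pure-Python sketch of this pruning step (the no_below/no_above thresholds mirror gensim's filter_extremes parameters; gensim's real method additionally caps the vocabulary size and compacts the ids afterwards, which is why word ids may change):

```python
def filter_extremes(token2id, dfs, num_docs, no_below=2, no_above=0.5):
    """Keep only tokens whose document frequency df satisfies
    no_below <= df and df / num_docs <= no_above.

    Simplified sketch; returns new (token2id, dfs) dicts.
    """
    good_ids = {
        token_id
        for token_id, df in dfs.items()
        if df >= no_below and df / num_docs <= no_above
    }
    token2id = {word: tid for word, tid in token2id.items() if tid in good_ids}
    dfs = {tid: df for tid, df in dfs.items() if tid in good_ids}
    return token2id, dfs

# "the" is too common (9 of 10 docs), "xyzzy" too rare (1 doc)
token2id = {"the": 0, "cat": 1, "xyzzy": 2}
dfs = {0: 9, 1: 4, 2: 1}
print(filter_extremes(token2id, dfs, num_docs=10))
```

Filtering by document frequency (how many documents a word occurs in) rather than raw term frequency is what makes this robust to a word being repeated many times within a single document.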
Remove the selected bad_ids tokens from all dictionary mappings, or keep only the selected good_ids in the mapping and remove the rest.
bad_ids is a collection of word ids to be removed; good_ids is a collection of word ids to be kept.
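The two modes can be sketched as follows (a simplified stand-in, not gensim's code; `token2id` and `dfs` mirror the Dictionary attributes of the same names):

```python
def filter_tokens(token2id, dfs, bad_ids=None, good_ids=None):
    """Remove bad_ids from the mappings, or keep only good_ids.

    Simplified sketch: exactly one of bad_ids / good_ids should be given;
    returns new (token2id, dfs) dicts.
    """
    if bad_ids is not None:
        drop = set(bad_ids)
        keep = lambda tid: tid not in drop
    else:
        kept = set(good_ids)
        keep = lambda tid: tid in kept
    token2id = {word: tid for word, tid in token2id.items() if keep(tid)}
    dfs = {tid: df for tid, df in dfs.items() if keep(tid)}
    return token2id, dfs

token2id = {"a": 0, "b": 1, "c": 2}
dfs = {0: 5, 1: 3, 2: 1}
print(filter_tokens(token2id, dfs, bad_ids=[1]))
```

Note that this alone leaves gaps in the id series; that is what the gap-shrinking step described above cleans up.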
Return a list of all token ids.
Load a previously saved object from file (also see save).
Load a previously stored Dictionary from a text file. Mirror function to save_as_text.
Save the object to file via pickling (also see load).
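The save/load pair is essentially pickling the whole object. A minimal stand-alone sketch of the pattern (the `Greeter` class here is a hypothetical stand-in for illustration, not part of the library):

```python
import os
import pickle
import tempfile

class Greeter:
    """Hypothetical object with gensim-style save/load methods."""
    def __init__(self, word):
        self.word = word

    def save(self, fname):
        with open(fname, "wb") as f:
            pickle.dump(self, f)  # serialize the entire object to disk

    @classmethod
    def load(cls, fname):
        with open(fname, "rb") as f:
            return pickle.load(f)  # restore the object later

fname = os.path.join(tempfile.mkdtemp(), "obj.pkl")
Greeter("hello").save(fname)
print(Greeter.load(fname).word)  # → hello
```

Pickle preserves all instance attributes, so the loaded object is ready to use without any re-initialization.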
Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE].
Note: use save/load to store in binary format instead (pickle).
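A minimal sketch of the text format and its mirror reader (simplified stand-ins for the real methods; `token2id` and `dfs` mirror the Dictionary attributes of the same names):

```python
import os
import tempfile

def save_as_text(fname, token2id, dfs):
    """Write one 'id<TAB>word<TAB>document frequency' line per token."""
    with open(fname, "w", encoding="utf-8") as f:
        for word, token_id in sorted(token2id.items(), key=lambda kv: kv[1]):
            f.write("%d\t%s\t%d\n" % (token_id, word, dfs.get(token_id, 0)))

def load_from_text(fname):
    """Mirror of save_as_text: rebuild token2id and dfs from the file."""
    token2id, dfs = {}, {}
    with open(fname, encoding="utf-8") as f:
        for line in f:
            token_id, word, df = line.rstrip("\n").split("\t")
            token2id[word] = int(token_id)
            dfs[int(token_id)] = int(df)
    return token2id, dfs

fname = os.path.join(tempfile.mkdtemp(), "dict.txt")
save_as_text(fname, {"cat": 0, "dog": 1}, {0: 3, 1: 5})
print(load_from_text(fname))
```

The text format is human-readable and easy to diff or grep, at the cost of being slower and larger than the binary pickle produced by save/load.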