This module implements the concept of Dictionary – a mapping between words and their integer ids.
Dictionaries can be created from a corpus and can later be pruned according to document frequency (removing (un)common words via the Dictionary.filterExtremes() method), saved to and loaded from disk (via the Dictionary.save() and Dictionary.load() methods), and so on.
Dictionary encapsulates mappings between normalized words and their integer ids.
The main function is doc2bow, which converts a collection of words to its bag-of-words representation, optionally also updating the dictionary mapping with newly encountered words and their ids.
Build dictionary from a collection of documents. Each document is a list of tokens = tokenized and normalized utf-8 encoded strings.
This is only a convenience wrapper for calling doc2bow on each document with allowUpdate=True.
>>> print Dictionary.fromDocuments(["máma mele maso".split(), "ema má máma".split()])
Dictionary(5 unique tokens)
Assign new word ids to all words.
This is done to make the ids more compact, e.g. after some tokens have been removed via filterTokens() and there are gaps in the id series. Calling this method will remove the gaps.
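For illustration, a minimal sketch of the re-numbering step (assuming the method is exposed as compactify(), a name not stated above, and that Dictionary is already imported from this module):

>>> dictionary = Dictionary.fromDocuments(["máma mele maso".split(), "ema má máma".split()])
>>> dictionary.filterTokens(badIds=[0])  # removing a token leaves a gap in the id sequence
>>> dictionary.compactify()              # re-assign contiguous ids starting from 0, closing the gap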
Convert document (a list of words) into the bag-of-words format = list of (tokenId, tokenCount) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string.
If allowUpdate is set, also update the dictionary in the process: create ids for new words and, at the same time, update document frequencies (for each word appearing in this document, increase its count in self.dfs by one).
If allowUpdate is not set, this function is const, i.e. read-only.
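For example (a sketch using only the names described above; the exact (tokenId, tokenCount) values depend on which ids happened to be assigned first):

>>> dictionary = Dictionary.fromDocuments(["máma mele maso".split()])
>>> dictionary.doc2bow("ema má máma".split())                    # read-only: unknown words "ema" and "má" are ignored
>>> dictionary.doc2bow("ema má máma".split(), allowUpdate=True)  # also assigns ids to "ema"/"má" and updates self.dfs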
Filter out tokens that appear in too few or too many documents, i.e. extremely rare and extremely common words (the exact thresholds are controlled by the method's parameters).
After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
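A hypothetical usage sketch: the parameter names noBelow (minimum number of documents, absolute) and noAbove (maximum fraction of all documents) are assumptions, not taken from the text above, and corpus.txt is a made-up file with one document per line:

>>> dictionary = Dictionary.fromDocuments(line.split() for line in open('corpus.txt'))
>>> dictionary.filterExtremes(noBelow=5, noAbove=0.5)  # drop words in fewer than 5 documents or in more than 50% of them
>>> # surviving words may now map to different ids than before the call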
Remove the selected badIds tokens from all dictionary mappings, or keep only the selected goodIds in the mappings and remove the rest.
badIds is a collection of word ids to be removed; goodIds is a collection of word ids to keep.
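For example (the ids below are purely illustrative):

>>> dictionary.filterTokens(badIds=[0, 7])      # remove just these two word ids from all mappings
>>> dictionary.filterTokens(goodIds=[1, 2, 3])  # or: keep only these ids and remove everything else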
Return a list of all token ids.
Load a previously saved object from file (also see save).
Load a previously stored Dictionary from a text file. Mirror function to saveAsText.
Save the object to file via pickling (also see load).
Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE].
Note: use save/load instead to store in the binary (pickle) format.
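A usage sketch tying the persistence calls together; the file paths are made up, and the text-format loader is assumed to be exposed as Dictionary.loadFromText() (the mirror of saveAsText mentioned above):

>>> dictionary.save('/tmp/corpus.dict')               # binary (pickle) format
>>> dictionary = Dictionary.load('/tmp/corpus.dict')
>>> dictionary.saveAsText('/tmp/corpus.txt')          # id<TAB>word<TAB>document frequency lines
>>> dictionary = Dictionary.loadFromText('/tmp/corpus.txt')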