This module implements the concept of Dictionary – a mapping between words and their integer ids.
Dictionaries can be created from a corpus and later pruned according to document frequency (removing (un)common words via the Dictionary.filter_extremes() method), saved to and loaded from disk (via the Dictionary.save() and Dictionary.load() methods), etc.
Dictionary encapsulates the mapping between normalized words and their integer ids.
The main function is doc2bow, which converts a collection of words to its bag-of-words representation: a list of (word_id, word_frequency) 2-tuples.
Build a dictionary from a collection of documents. Each document is a list of tokens, i.e. tokenized and normalized utf-8 encoded strings.
This is only a convenience wrapper for calling doc2bow on each document with allow_update=True.
>>> print(Dictionary(["máma mele maso".split(), "ema má máma".split()]))
Dictionary(5 unique tokens)
Assign new word ids to all words.
This is done to make the ids more compact, e.g. after some tokens have been removed via filter_tokens() and there are gaps in the id series. Calling this method will remove the gaps.
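The gap removal described above can be sketched in plain Python. This is a minimal illustration, not the library's implementation: `compactify_sketch` is a hypothetical helper, and keeping the relative order of the old ids is an assumption.

```python
def compactify_sketch(token2id):
    """Renumber ids to 0..n-1 and return (new_token2id, old_to_new)."""
    # Assumption: the relative order of the old ids is preserved.
    old_ids = sorted(token2id.values())
    old_to_new = {old: new for new, old in enumerate(old_ids)}
    new_token2id = {token: old_to_new[i] for token, i in token2id.items()}
    return new_token2id, old_to_new


# Ids 1, 2 and 4..6 are missing, e.g. after tokens were removed.
gappy = {"máma": 0, "maso": 3, "ema": 7}
compact, remap = compactify_sketch(gappy)
print(compact)  # {'máma': 0, 'maso': 1, 'ema': 2}
```

Any bag-of-words vectors built with the old ids would need the same `old_to_new` remapping to stay consistent.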
Convert document (a list of words) into the bag-of-words format = list of (token_id, token_count) 2-tuples. Each word is assumed to be a tokenized and normalized utf-8 encoded string. No further preprocessing is done on the words in document; apply tokenization, stemming etc. before calling this method.
If allow_update is set, then also update dictionary in the process: create ids for new words. At the same time, update document frequencies – for each word appearing in this document, increase its document frequency (self.dfs) by one.
If allow_update is not set, this function is read-only (const): the dictionary is not modified.
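The conversion and the effect of allow_update can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the library's code: `doc2bow_sketch` is a hypothetical stand-alone function, the mappings are passed in as plain dicts, and skipping unknown tokens in read-only mode is an assumption.

```python
from collections import Counter


def doc2bow_sketch(token2id, dfs, document, allow_update=False):
    """Return a sorted list of (token_id, token_count) 2-tuples for `document`."""
    counts = Counter(document)  # token -> occurrences in this document
    if allow_update:
        for token in sorted(counts):
            if token not in token2id:
                # Assumption: ids are dense, so the next free id is len(token2id).
                token2id[token] = len(token2id)
            token_id = token2id[token]
            # Each token counts once per document, however often it occurs.
            dfs[token_id] = dfs.get(token_id, 0) + 1
    # Tokens without an id (possible only in read-only mode) are skipped;
    # this behaviour is an assumption of the sketch.
    return sorted((token2id[t], n) for t, n in counts.items() if t in token2id)


token2id, dfs = {}, {}
bow = doc2bow_sketch(token2id, dfs, "máma mele maso máma".split(), allow_update=True)
readonly_bow = doc2bow_sketch(token2id, dfs, "ema má máma".split())
print(bow)           # [(0, 1), (1, 1), (2, 2)]
print(readonly_bow)  # [(2, 1)]
```

The second call leaves `token2id` and `dfs` untouched, which is the read-only behaviour described above.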
Filter out tokens that appear in too few or too many documents, i.e. prune the dictionary by document frequency.
After the pruning, shrink resulting gaps in word ids.
Note: Due to the gap shrinking, the same word may have a different word id before and after the call to this function!
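A minimal sketch of document-frequency pruning followed by gap shrinking, in plain Python. The function name and the thresholds `min_docs` and `max_fraction` are hypothetical names chosen for illustration; they are not claimed to be the method's actual parameters.

```python
def prune_by_df_sketch(token2id, dfs, num_docs, min_docs, max_fraction):
    """Drop too-rare and too-common tokens, then renumber the surviving ids."""
    keep = {
        token: i
        for token, i in token2id.items()
        if min_docs <= dfs.get(i, 0) <= max_fraction * num_docs
    }
    # Shrink the gaps left by the removed tokens.
    remap = {old: new for new, old in enumerate(sorted(keep.values()))}
    new_token2id = {token: remap[i] for token, i in keep.items()}
    new_dfs = {remap[i]: dfs[i] for i in keep.values()}
    return new_token2id, new_dfs


token2id = {"the": 0, "cat": 1, "zyzzyva": 2, "dog": 3}
dfs = {0: 10, 1: 4, 2: 1, 3: 5}
pruned, pruned_dfs = prune_by_df_sketch(
    token2id, dfs, num_docs=10, min_docs=2, max_fraction=0.5
)
print(pruned)  # {'cat': 0, 'dog': 1}
```

Here "dog" moves from id 3 to id 1, which is the id change the note above warns about.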
Remove the selected bad_ids tokens from all dictionary mappings, or keep only the selected good_ids in the mappings and remove the rest.
bad_ids is a collection of word ids to be removed; good_ids is a collection of word ids to be kept.
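The two selection modes can be sketched in plain Python. This is a hypothetical helper operating on a plain token-to-id dict, not the library's code; it leaves gaps in the id series, which a separate compaction step would close.

```python
def filter_tokens_sketch(token2id, bad_ids=None, good_ids=None):
    """Return a copy of token2id without bad_ids, or with only good_ids."""
    result = dict(token2id)
    if bad_ids is not None:
        bad = set(bad_ids)
        result = {token: i for token, i in result.items() if i not in bad}
    if good_ids is not None:
        good = set(good_ids)
        result = {token: i for token, i in result.items() if i in good}
    return result


token2id = {"ema": 0, "maso": 1, "máma": 2}
without_bad = filter_tokens_sketch(token2id, bad_ids=[1])
only_good = filter_tokens_sketch(token2id, good_ids=[1])
print(without_bad)  # {'ema': 0, 'máma': 2}
print(only_good)    # {'maso': 1}
```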
Create Dictionary from an existing corpus. This can be useful if you only have a term-document BOW matrix (represented by corpus), but not the original text corpus.
This will scan the term-document count matrix for all word ids that appear in it, then construct and return a Dictionary that maps each word_id -> str(word_id).
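The scan described above can be sketched in plain Python, assuming the corpus is an iterable of documents, each a list of (word_id, count) 2-tuples. The helper name is hypothetical, and it returns a plain dict rather than a Dictionary object.

```python
def id2word_from_corpus_sketch(corpus):
    """Collect every word id seen in a BOW corpus and map it to str(word_id)."""
    seen_ids = set()
    for document in corpus:
        for word_id, _count in document:
            seen_ids.add(word_id)
    return {word_id: str(word_id) for word_id in sorted(seen_ids)}


corpus = [[(0, 1), (2, 3)], [(2, 1), (5, 2)]]
id2word = id2word_from_corpus_sketch(corpus)
print(id2word)  # {0: '0', 2: '2', 5: '5'}
```

Because the original words are unknown, the string form of each id serves as a placeholder token.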
Return a list of all token ids.
Load a previously saved object from file (also see save).
Load a previously stored Dictionary from a text file. Mirror function to save_as_text.
Save the object to file via pickling (also see load).
Save this Dictionary to a text file, in format: id[TAB]word_utf8[TAB]document frequency[NEWLINE].
Note: use save/load to store in binary format instead (pickle).
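The tab-separated text format described for save_as_text, and its mirror load step, can be sketched with plain Python streams. The function names are hypothetical, and writing the lines in id order is an assumption of this sketch, not a documented property of the format.

```python
import io


def write_text_sketch(token2id, dfs, stream):
    """Write one id[TAB]word[TAB]document frequency line per token."""
    # Assumption: lines are written in increasing id order.
    for token, token_id in sorted(token2id.items(), key=lambda item: item[1]):
        stream.write(f"{token_id}\t{token}\t{dfs.get(token_id, 0)}\n")


def read_text_sketch(stream):
    """Parse lines written by write_text_sketch back into two dicts."""
    token2id, dfs = {}, {}
    for line in stream:
        token_id, token, df = line.rstrip("\n").split("\t")
        token2id[token] = int(token_id)
        dfs[int(token_id)] = int(df)
    return token2id, dfs


buffer = io.StringIO()
write_text_sketch({"ema": 0, "máma": 1}, {0: 1, 1: 2}, buffer)
text = buffer.getvalue()
restored_token2id, restored_dfs = read_text_sketch(io.StringIO(text))
print(text, end="")
```

When writing to a real file, opening it with an explicit `encoding="utf-8"` would match the utf-8 word column in the format description.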