USAGE: %(program)s WIKI_XML_DUMP OUTPUT_PREFIX [VOCABULARY_SIZE]
Convert articles from a Wikipedia dump to (sparse) vectors. The input is a bz2-compressed dump of Wikipedia articles, in XML format.
This actually creates three files under OUTPUT_PREFIX: a plain-text mapping between words and their integer ids, a bag-of-words corpus in Matrix Market format, and its TF-IDF transformation, also in Matrix Market format.
The output Matrix Market files can then be compressed (e.g., by bzip2) to save disk space; gensim’s corpus iterators can work with compressed input, too.
VOCABULARY_SIZE controls how many of the most frequent words to keep (after removing all tokens that appear in more than 10 percent of all documents). Defaults to 100,000.
Example: ./wikicorpus.py ~/gensim/results/enwiki-20100622-pages-articles.xml.bz2 ~/gensim/results/wiki_en
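To override the default vocabulary size, pass the optional third argument explicitly, e.g. to keep 200,000 words:
./wikicorpus.py ~/gensim/results/enwiki-20100622-pages-articles.xml.bz2 ~/gensim/results/wiki_en 200000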
Convert a corpus to another, with different feature ids.
Given a mapping between old ids and new ids (some old ids may be missing, i.e. the mapping need not be a bijection), this will wrap a corpus so that iterating over VocabTransform[corpus] returns the same vectors but with the new ids.
Old features that have no counterpart in the new ids are discarded. This can be used to filter the vocabulary of a corpus:
>>> old2new = dict((oldid, newid) for newid, oldid in enumerate(remaining_ids))
>>> id2token = dict((newid, oldid2token[oldid]) for oldid, newid in old2new.items())
>>> vt = VocabTransform(old2new, id2token)
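Iterating over the wrapped corpus then yields the same documents with the remapped ids (a sketch; corpus stands for any iterable of sparse bag-of-words vectors in the old id space):
>>> for bow_with_new_ids in vt[corpus]:
...     print(bow_with_new_ids)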
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Treat a Wikipedia articles dump (*articles.xml.bz2) as a (read-only) corpus.
The documents are extracted on-the-fly, so that the whole (massive) dump can stay compressed on disk.
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2') # create word->word_id mapping, takes almost 8h
>>> wiki.saveAsText('wiki_en_vocab200k') # another 8h, creates a file in MatrixMarket format plus file with id->word
Initialize the corpus. This scans the corpus once to determine its vocabulary: only the keep_words most frequent words that appear in at least noBelow documents are kept.
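For example, assuming these thresholds are exposed as keyword arguments of the constructor (the values below are only illustrative):
>>> wiki = WikiCorpus('enwiki-20100622-pages-articles.xml.bz2', noBelow=20, keep_words=100000)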
Iterate over the dump, returning text version of each article.
Only articles of sufficient length are returned (short articles, redirects etc. are ignored).
Note that this iterates over the texts; if you want vectors, just use the standard corpus interface instead of this function:
>>> for vec in wiki_corpus:
...     print(vec)
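If you do want the raw token streams, a sketch of iterating over them directly (assuming this method is exposed as getTexts, by analogy with the other camelCase names in this module):
>>> for text in wiki_corpus.getTexts():
...     print(text)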
Load a previously saved object from file (also see save).
Load previously stored mapping between words and their ids.
The result can be used as the id2word parameter for input to transformations.
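A sketch of the intended workflow, with hypothetical loader and file names (the transformation shown is gensim's TF-IDF model, assumed to accept an id2word argument):
>>> id2word = WikiCorpus.loadDictionary('wiki_en_vocab200k_wordids.txt')  # hypothetical method and file name
>>> tfidf = TfidfModel(corpus=wiki_corpus, id2word=id2word)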
Save the object to file via pickling (also see load).
Store the corpus to disk, in a human-readable text format.
This actually saves two files: the corpus itself in Matrix Market format, plus a plain-text file with the id->word mapping.
Save an existing corpus to disk.
Some formats also support saving the dictionary (feature_id->word mapping), in which case it can be provided via the optional id2word parameter.
>>> MmCorpus.saveCorpus('file.mm', corpus)
Some corpora also support an index of where each document begins, so that documents on disk can be accessed in O(1) time (see the corpora.IndexedCorpus base class). In that case, serialize calls saveCorpus internally and stores the index alongside it, so you will want to save the corpus with:
>>> MmCorpus.serialize('file.mm', corpus) # stores index as well, allowing random access to individual documents
Store id->word mapping to a file, in format id[TAB]word_utf8[TAB]document frequency[NEWLINE].
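For illustration, a few stored lines might look like this (the words and frequencies are made up):
0	anarchism	122
1	philosophy	2941
2	movement	4057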
Filter out wiki mark-up from utf8 string raw, leaving only text.
Tokenize a piece of text from Wikipedia. The input string content is assumed to be mark-up free (see filterWiki()).
Return list of tokens as utf8 bytestrings. Ignore words shorter than 2 or longer than 15 characters (not bytes!).
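A minimal sketch of the behaviour described above (not the module's actual implementation; the regular expression is an assumption):
>>> import re
>>> def simple_tokenize(content):
...     # keep only word tokens between 2 and 15 characters long, as utf8 bytestrings
...     tokens = re.findall(r'\w+', content, re.UNICODE)
...     return [token.encode('utf8') for token in tokens if 2 <= len(token) <= 15]
>>> tokens = simple_tokenize(u'a short Wikipedia sentence')  # 'a' is dropped (too short)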