utils – Various utility functions

This module contains various general utility functions.

class gensim.utils.FakeDict(num_terms)

Objects of this class act as dictionaries that map integer -> str(integer) for the specified range of integers [0, num_terms).

This is meant to avoid allocating a real dictionary when num_terms is huge, which would be a waste of memory.

keys()

Override dict.keys(), which is used to determine the maximum internal id of a corpus (i.e., the vocabulary dimensionality).

HACK: To avoid materializing the whole range(0, self.num_terms), this returns [self.num_terms - 1] only.
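
A quick illustration of the behaviour described above:

>>> d = FakeDict(3)
>>> d[1]  # each id maps to its string form
'1'
>>> d.keys()  # only the highest id is reported, per the HACK above
[2]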

class gensim.utils.RepeatCorpus(corpus, reps)

Used in the tutorial on distributed computing and likely not useful anywhere else.

Wrap a corpus as another corpus of length reps. This is achieved by repeating documents from corpus over and over again, until the requested length len(result) == reps is reached. Repetition is done on the fly (efficiently, via itertools).

>>> corpus = [[(1, 0.5)], []] # 2 documents
>>> list(RepeatCorpus(corpus, 5)) # repeat 2.5 times to get 5 documents
[[(1, 0.5)], [], [(1, 0.5)], [], [(1, 0.5)]]

classmethod load(fname)

Load a previously saved object from file (also see save).

save(fname)

Save the object to file via pickling (also see load).

class gensim.utils.SaveLoad

Objects which inherit from this class have save/load functions, which un/pickle them to disk.

This uses cPickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions.

classmethod load(fname)

Load a previously saved object from file (also see save).

save(fname)

Save the object to file via pickling (also see load).
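
A minimal sketch of the intended usage; the Model class, its attribute and the filename below are hypothetical:

>>> from gensim import utils
>>> class Model(utils.SaveLoad):
...     def __init__(self, weights):
...         self.weights = weights  # plain picklable attribute
...
>>> Model([0.1, 0.2]).save('/tmp/model.pkl')  # pickle the whole object to disk
>>> model = Model.load('/tmp/model.pkl')  # and restore it later
>>> model.weights
[0.1, 0.2]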

gensim.utils.any2unicode(text, encoding='utf8', errors='strict')

Convert a string (bytestring in encoding, or unicode) to unicode.

gensim.utils.any2utf8(text, errors='strict', encoding='utf8')

Convert a string (unicode, or bytestring in encoding) to a bytestring in utf8.
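
For example (Python 2 semantics, matching the other doctests on this page):

>>> any2unicode('hello')  # bytestring in => unicode out
u'hello'
>>> any2utf8(u'hello')  # unicode in => utf8 bytestring out
'hello'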

gensim.utils.chunkize(corpus, chunksize, maxsize=0)

Split a stream of values into smaller chunks. Each chunk is of length chunksize, except the last one which may be smaller. A once-only input stream (corpus from a generator) is ok; chunking is done efficiently via itertools.

If maxsize > 1, don’t wait idly in between successive chunk yields, but rather keep filling a short queue (of size at most maxsize) with forthcoming chunks in advance. This is realized by starting a separate thread, and is meant to reduce I/O delays, which can be significant when corpus comes from a slow medium (like harddisk).

If maxsize==0, don’t fool around with threads and simply yield chunks via chunkize_serial() (no I/O optimizations).

>>> for chunk in chunkize(xrange(10), 4): print chunk
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]

gensim.utils.chunkize_serial(corpus, chunksize)

Split a stream of values into smaller chunks. Each chunk is of length chunksize, except the last one which may be smaller. A once-only input stream (corpus from a generator) is ok.

>>> for chunk in chunkize_serial(xrange(10), 4): print list(chunk)
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]

gensim.utils.deaccent(text)

Remove accentuation from the given string. Input text is either a unicode string or utf8 encoded bytestring.

Return input string with accents removed, as unicode.

>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
u'Sef chomutovskych komunistu dostal postou bily prasek'

gensim.utils.decode_htmlentities(text)

Decode HTML entities in text, coded as hex, decimal or named.

Adapted from http://github.com/sku/python-twitter-ircbot/blob/321d94e0e40d0acc92f5bf57d126b57369da70de/html_decode.py

>>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
>>> print decode_htmlentities(u).encode('UTF-8')
E tu vivrai nel terrore - L'aldilà (1981)
>>> print decode_htmlentities("l&#39;eau")
l'eau
>>> print decode_htmlentities("foo &lt; bar")
foo < bar

gensim.utils.dict_from_corpus(corpus)

Scan corpus for all word ids that appear in it, then construct and return a mapping which maps each wordId -> str(wordId).

This function is used whenever words need to be displayed (as opposed to just their ids) but no wordId->word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest wordId found.
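
For example (a toy corpus; each id simply maps to its string form):

>>> corpus = [[(1, 0.5), (3, 0.1)], [(2, 0.2)]]
>>> id2word = dict_from_corpus(corpus)
>>> id2word[3]
'3'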

gensim.utils.get_max_id(corpus)

Return highest feature id that appears in the corpus.

For empty corpora (no features at all), return -1.
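
For example:

>>> get_max_id([[(1, 0.5), (3, 0.1)], [(2, 0.2)]])
3
>>> get_max_id([[], []])  # documents exist, but contain no features
-1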

gensim.utils.get_my_ip()

Try to obtain our external IP (from the Pyro nameserver’s point of view).

This tries to sidestep the issue of bogus /etc/hosts entries and other local misconfigurations, which often mess up hostname resolution.

If all else fails, fall back to simple socket.gethostbyname() lookup.

gensim.utils.is_corpus(obj)

Check whether obj is a corpus. Return a (is_corpus, new) 2-tuple, where new is obj itself if obj was an iterable, or a replacement that yields the same sequence as obj if it was an iterator.

obj is a corpus if it supports iteration over documents, where a document is in turn anything that acts as a sequence of 2-tuples (int, float).

Note: An “empty” corpus (empty input sequence) is ambiguous, so in this case the result is forcefully defined as is_corpus=False.
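
A minimal sketch of this contract:

>>> corpus = [[(1, 0.5)], []]
>>> is_corpus(corpus)  # a restartable iterable is returned unchanged
(True, [[(1, 0.5)], []])
>>> flag, replacement = is_corpus(iter(corpus))  # an iterator gets a replacement
>>> flag
True
>>> list(replacement)  # ...which yields the same documents
[[(1, 0.5)], []]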

gensim.utils.pickle(obj, fname, protocol=-1)

Pickle object obj to file fname.

gensim.utils.revdict(d)

Reverse a dictionary mapping.

When two keys map to the same value, only one of them will be kept in the result (which one is kept is arbitrary).
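
For example (sorting the items only to make the output deterministic):

>>> sorted(revdict({'a': 1, 'b': 2}).items())
[(1, 'a'), (2, 'b')]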

gensim.utils.synchronous(tlockname)

A decorator to place an instance-based lock around a method.

Adapted from http://code.activestate.com/recipes/577105-synchronization-decorator-for-class-methods/
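
A minimal usage sketch, assuming the decorator looks the lock up as an attribute of the instance under the name given by tlockname (the Counter class below is hypothetical):

>>> import threading
>>> class Counter(object):
...     def __init__(self):
...         self.lock = threading.RLock()  # the attribute named in the decorator
...         self.value = 0
...     @synchronous('lock')
...     def increment(self):
...         self.value += 1  # only one thread at a time runs this per instance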

gensim.utils.to_unicode(text, encoding='utf8', errors='strict')

Convert a string (bytestring in encoding, or unicode) to unicode.

gensim.utils.to_utf8(text, errors='strict', encoding='utf8')

Convert a string (unicode, or bytestring in encoding) to a bytestring in utf8.

gensim.utils.tokenize(text, lowercase=False, deacc=False, errors='strict', to_lower=False, lower=False)

Iteratively yield tokens as unicode strings, optionally also lowercasing them and removing accent marks.

Input text may be either unicode or utf8-encoded byte string.

The tokens on output are maximal contiguous sequences of alphabetic characters (no digits!).

>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc=True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']

gensim.utils.toptexts(query, texts, index, n=10)

Debug function to help inspect the top n most similar documents (according to the similarity index index), to see whether they are actually related to the query.

texts is any object that can return something insightful for each document via texts[docid], such as its fulltext or snippet.

Return a list of 3-tuples (docid, doc’s similarity to the query, texts[docid]).
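
A minimal sketch; ToyIndex below is a hypothetical stand-in for a real similarity index (anything supporting index[query] -> one similarity score per document will do):

>>> class ToyIndex(object):
...     def __getitem__(self, query):
...         return [0.1, 0.9, 0.4]  # pretend similarities for 3 documents
...
>>> texts = ['first snippet', 'second snippet', 'third snippet']
>>> toptexts([(0, 1.0)], texts, ToyIndex(), n=2)
[(1, 0.9, 'second snippet'), (2, 0.4, 'third snippet')]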

gensim.utils.unpickle(fname)

Load a pickled object from fname.
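
A quick round trip with its counterpart gensim.utils.pickle() (the filename is arbitrary):

>>> obj = {'answer': 42}
>>> pickle(obj, '/tmp/obj.pkl')  # serialize to disk
>>> unpickle('/tmp/obj.pkl')  # and load it back
{'answer': 42}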