This module contains various general utility functions.
Objects of this class act as dictionaries that map integer -> str(integer), for a specified range of integers [0, num_terms).
This is meant to avoid allocating real dictionaries when num_terms is huge, which is a waste of memory.
Override the dict.keys() method, which is used to determine the maximum internal id of a corpus, i.e. the vocabulary dimensionality.
HACK: To avoid materializing the whole range(0, self.num_terms), this returns [self.num_terms - 1] only.
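For illustration, a minimal sketch of the behaviour described above, assuming the class is instantiated directly as FakeDict(num_terms):
>>> d = FakeDict(3)
>>> d[1]
'1'
>>> d.keys()   # only the largest id is returned, see the HACK note above
[2]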
Used in the tutorial on distributed computing and likely not useful anywhere else.
Wrap a corpus as another corpus of length reps. This is achieved by repeating documents from corpus over and over again, until the requested length len(result) == reps is reached. Repetition is done on the fly (efficiently), via itertools; an illustrative equivalent follows the example below.
>>> corpus = [[(1, 0.5)], []] # 2 documents
>>> list(RepeatCorpus(corpus, 5)) # repeat 2.5 times to get 5 documents
[[(1, 0.5)], [], [(1, 0.5)], [], [(1, 0.5)]]
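The on-the-fly repetition can be pictured with itertools; this is an illustrative equivalent, not necessarily the exact implementation:
>>> from itertools import cycle, islice
>>> list(islice(cycle(corpus), 5))
[[(1, 0.5)], [], [(1, 0.5)], [], [(1, 0.5)]]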
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
Objects which inherit from this class have save/load functions, which un/pickle them to disk.
This uses cPickle for de/serializing, so objects must not contain unpicklable attributes, such as lambda functions.
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
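A minimal usage sketch; MyModel is a hypothetical subclass of SaveLoad and the file name is arbitrary:
>>> model = MyModel()                        # MyModel inherits from SaveLoad
>>> model.save('/tmp/model.pkl')             # pickle the object to disk
>>> model = MyModel.load('/tmp/model.pkl')   # unpickle it back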
Convert a string (bytestring in encoding or unicode) to unicode.
Convert a string (unicode or bytestring in encoding) to a bytestring in utf8.
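A trivial sketch of the two conversions (plain ASCII is used so the example stays encoding-independent):
>>> any2unicode('text')
u'text'
>>> any2utf8(u'text')
'text'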
Split a stream of values into smaller chunks. Each chunk is of length chunksize, except the last one which may be smaller. A once-only input stream (corpus from a generator) is ok; chunking is done efficiently via itertools.
If maxsize > 1, don't wait idly in between successive chunk yields, but rather keep filling a short queue (of size at most maxsize) with forthcoming chunks in advance. This is realized by starting a separate thread, and is meant to reduce I/O delays, which can be significant when the corpus comes from a slow medium like a hard disk. A rough sketch of this pattern follows the example below.
If maxsize == 0, don't bother with threads and simply yield chunks serially via chunkize_serial() (no I/O optimizations).
>>> for chunk in chunkize(xrange(10), 4): print chunk
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]
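The maxsize > 1 behaviour can be pictured as a producer thread that keeps a bounded queue topped up with pre-fetched chunks while the consumer iterates. The following is a rough, illustrative sketch only, not gensim's exact implementation:

import threading
from Queue import Queue
from itertools import islice

def prefetch_chunks(stream, chunksize, maxsize):
    q = Queue(maxsize=maxsize)   # bounded queue of pre-fetched chunks
    it = iter(stream)

    def producer():
        while True:
            chunk = list(islice(it, chunksize))
            q.put(chunk)                 # blocks while the queue is full
            if len(chunk) < chunksize:   # a short (or empty) chunk means end of stream
                break

    t = threading.Thread(target=producer)
    t.daemon = True
    t.start()

    while True:
        chunk = q.get()
        if chunk:
            yield chunk
        if len(chunk) < chunksize:
            break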
Split a stream of values into smaller chunks. Each chunk is of length chunksize, except the last one which may be smaller. A once-only input stream (corpus from a generator) is ok.
>>> for chunk in chunkize_serial(xrange(10), 4): print list(chunk)
[0, 1, 2, 3]
[4, 5, 6, 7]
[8, 9]
Remove accentuation from the given string. Input text is either a unicode string or utf8 encoded bytestring.
Return input string with accents removed, as unicode.
>>> deaccent("Šéf chomutovských komunistů dostal poštou bílý prášek")
u'Sef chomutovskych komunistu dostal postou bily prasek'
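One common way to implement this, shown here purely as an illustration (gensim's own implementation may differ in detail), is Unicode normalization followed by dropping combining marks:
>>> import unicodedata
>>> norm = unicodedata.normalize('NFD', u'Šéf chomutovských komunistů')
>>> u''.join(ch for ch in norm if not unicodedata.combining(ch))
u'Sef chomutovskych komunistu'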
Decode HTML entities in text, coded as hex, decimal or named.
Adapted from http://github.com/sku/python-twitter-ircbot/blob/321d94e0e40d0acc92f5bf57d126b57369da70de/html_decode.py
>>> u = u'E tu vivrai nel terrore - L&#x27;aldil&#xE0; (1981)'
>>> print decode_htmlentities(u).encode('UTF-8')
E tu vivrai nel terrore - L'aldilà (1981)
>>> print decode_htmlentities("l'eau")
l'eau
>>> print decode_htmlentities("foo < bar")
foo < bar
Scan corpus for all word ids that appear in it, then construct and return a mapping which maps each wordId -> str(wordId).
This function is used whenever words need to be displayed (as opposed to just their ids) but no wordId->word mapping was provided. The resulting mapping only covers words actually used in the corpus, up to the highest wordId found.
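A hedged sketch of the behaviour described above:
>>> corpus = [[(0, 1.0), (3, 2.0)], [(2, 1.0)]]
>>> id2word = dict_from_corpus(corpus)
>>> id2word[3]
'3'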
Return highest feature id that appears in the corpus.
For empty corpora (no features at all), return -1.
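For example (the outputs follow directly from the description above):
>>> get_max_id([[(0, 1.0), (3, 2.0)], [(2, 1.0)]])
3
>>> get_max_id([[], []])
-1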
Try to obtain our external IP address (from the Pyro nameserver's point of view).
This tries to sidestep the issue of bogus /etc/hosts entries and other local misconfigurations, which often mess up hostname resolution.
If all else fails, fall back to simple socket.gethostbyname() lookup.
Check whether obj is a corpus. Return a (is_corpus, new) 2-tuple, where new is obj if obj was an iterable, or new yields the same sequence as obj if obj was an iterator.
obj is a corpus if it supports iteration over documents, where a document is in turn anything that acts as a sequence of 2-tuples (int, float).
Note: An “empty” corpus (empty input sequence) is ambiguous, so in this case the result is forcefully defined as is_corpus=False.
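A minimal usage sketch; only the plain-iterable case is shown, since the exact return value for the iterator case depends on how the stream is re-wrapped:
>>> corpus = [[(1, 0.5)], []]
>>> is_corpus(corpus)
(True, [[(1, 0.5)], []])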
Pickle object obj to file fname.
Reverse a dictionary mapping.
When two keys map to the same value, only one of them will be kept in the result (which one is kept is arbitrary).
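For example (printed dictionary order may vary):
>>> revdict({'a': 1, 'b': 2})
{1: 'a', 2: 'b'}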
A decorator to place an instance-based lock around a method.
Adapted from http://code.activestate.com/recipes/577105-synchronization-decorator-for-class-methods/
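A hedged usage sketch; the decorator is assumed to take the name of the lock attribute on the instance, as in the linked recipe:

import threading
# assuming: from gensim.utils import synchronous

class Worker(object):
    def __init__(self):
        self._lock = threading.RLock()   # per-instance lock, looked up by name

    @synchronous('_lock')                # at most one thread runs update() per instance
    def update(self, value):
        self.value = value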
Convert a string (bytestring in encoding or unicode) to unicode.
Convert a string (unicode or bytestring in encoding) to a bytestring in utf8.
Iteratively yield tokens as unicode strings, optionally also lowercasing them and removing accent marks.
Input text may be either unicode or utf8-encoded byte string.
The tokens on output are maximal contiguous sequences of alphabetic characters (no digits!).
>>> list(tokenize('Nic nemůže letět rychlostí vyšší, než 300 tisíc kilometrů za sekundu!', deacc=True))
[u'Nic', u'nemuze', u'letet', u'rychlosti', u'vyssi', u'nez', u'tisic', u'kilometru', u'za', u'sekundu']
Debug function to help inspect the top n most similar documents (according to a similarity index index), to see if they are actually related to the query.
texts is any object that can return something insightful for each document via texts[docid], such as its fulltext or snippet.
Return a list of 3-tuples (docid, doc’s similarity to the query, texts[docid]).
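A hedged usage sketch; query, index and texts stand in for an actual query vector, a similarity index and the document texts, and the keyword n is assumed from the description above:
>>> for docid, sim, text in toptexts(query, texts, index, n=5):
...     print docid, sim, text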
Load a pickled object from fname.