
models.hdpmodel – Hierarchical Dirichlet Process

This module encapsulates functionality for the online Hierarchical Dirichlet Process algorithm.

It allows both model estimation from a training corpus and inference of topic distribution on new, unseen documents.

The core estimation code is directly adapted from the onlinehdp.py script by C. Wang; see Wang, Paisley, Blei: Online Variational Inference for the Hierarchical Dirichlet Process, JMLR (2011).

http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf

The algorithm:

  • is streamed: training documents arrive sequentially, with no random access required,
  • runs in constant memory w.r.t. the number of documents: the size of the training corpus does not affect the memory footprint.
class gensim.models.hdpmodel.HdpModel(corpus, id2word, max_chunks=None, max_time=None, chunksize=256, kappa=1.0, tau=64.0, K=15, T=150, alpha=1, gamma=1, eta=0.01, scale=1.0, var_converge=0.0001, outputdir=None)

The constructor estimates Hierarchical Dirichlet Process model parameters based on a training corpus:

>>> hdp = HdpModel(corpus, id2word)
>>> hdp.print_topics(topics=20, topn=10)

The model doesn’t yet support inference of topic distributions on new, unseen documents.

Model persistency is achieved through its load/save methods.

  • gamma: first level concentration
  • alpha: second level concentration
  • eta: the topic Dirichlet
  • T: top level truncation level
  • K: second level truncation level
  • kappa: learning rate
  • tau: slow down parameter
  • max_time: stop training after this many seconds
  • max_chunks: stop after having processed this many chunks (wrap around to the corpus beginning for another pass, if there are not enough chunks in the corpus)

doc_e_step(doc, ss, Elogsticks_1st, word_list, unique_words, doc_word_ids, doc_word_counts, var_converge)

E step for a single document.

hdp_to_lda()

Compute an LDA model that is almost equivalent to this HDP model.
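The conversion relies on the standard stick-breaking identity for the expected top-level topic weights. A minimal sketch of that identity (the function name and Beta parameters are illustrative, not gensim's internals):

```python
import numpy as np

def stick_breaking_weights(a, b):
    """Expected topic weights from stick proportions v_k ~ Beta(a_k, b_k).

    E[v_k] = a_k / (a_k + b_k); the k-th weight is E[v_k] times the
    expected stick remaining after the first k - 1 breaks.
    """
    ev = a / (a + b)
    weights = np.empty(len(ev) + 1)
    remaining = 1.0
    for k, v in enumerate(ev):
        weights[k] = v * remaining
        remaining *= 1.0 - v
    weights[-1] = remaining  # mass left for the final (truncated) topic
    return weights
```

The weights always sum to one, so they can serve directly as an (unnormalized-prior) alpha vector for the LDA counterpart.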

classmethod load(fname)

Load a previously saved object from file (also see save).

optimal_ordering()

Reorder the topics.

save(fname)

Save the object to file via pickling (also see load).

save_options()

Legacy method; use self.save() instead.

save_topics(doc_count=None)

Legacy method; use self.save() instead.

update_expectations()

Since we’re doing lazy updates on lambda, at any given moment the current state of lambda may not be accurate. This function updates all of the elements of lambda and Elogbeta so that if (for example) we want to print out the topics we’ve learned we’ll get the correct behavior.

gensim.models.hdpmodel.dirichlet_expectation(alpha)

For a vector theta ~ Dir(alpha), compute E[log(theta)] given alpha.
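This is the standard digamma identity E[log theta_k] = psi(alpha_k) - psi(sum(alpha)). A self-contained sketch of that computation (not gensim's exact source):

```python
import numpy as np
from scipy.special import psi  # digamma function

def dirichlet_expectation(alpha):
    """For theta ~ Dir(alpha): E[log theta_k] = psi(alpha_k) - psi(sum(alpha)).

    Accepts a single parameter vector or a matrix of row-wise parameters.
    """
    if alpha.ndim == 1:
        return psi(alpha) - psi(np.sum(alpha))
    return psi(alpha) - psi(np.sum(alpha, axis=1))[:, np.newaxis]
```

For a symmetric alpha all components of the result are equal, e.g. alpha = (1, 1) gives psi(1) - psi(2) = -1 in each component.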

gensim.models.hdpmodel.expect_log_sticks(sticks)

For the stick-breaking HDP, return E[log(sticks)].
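The computation combines the digamma identity for Beta expectations with a cumulative sum over the "remaining stick" terms. A sketch, assuming sticks is a 2 x (T-1) array holding the Beta parameters row-wise (a on top, b below):

```python
import numpy as np
from scipy.special import psi  # digamma function

def expect_log_sticks(sticks):
    """E[log pi_k] for a truncated stick-breaking construction.

    sticks[0] holds the Beta 'a' parameters, sticks[1] the 'b' parameters,
    one column per stick proportion v_k ~ Beta(a_k, b_k).
    """
    dig_sum = psi(np.sum(sticks, axis=0))
    Elog_v = psi(sticks[0]) - dig_sum     # E[log v_k]
    Elog_1mv = psi(sticks[1]) - dig_sum   # E[log (1 - v_k)]

    n = sticks.shape[1] + 1               # truncation level T
    Elogsticks = np.zeros(n)
    Elogsticks[: n - 1] = Elog_v          # last topic keeps only remainders
    Elogsticks[1:] += np.cumsum(Elog_1mv)  # accumulate log(1 - v_l) for l < k
    return Elogsticks
```

By Jensen's inequality the exponentials of these values sum to less than one, even though the underlying stick weights themselves sum to one.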