Gensim – Python Framework for Vector Space Modelling

What’s new?

Version 0.7 is out!

  • Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) are now faster, consume less memory and can be run in distributed mode!
  • Optimizations to vocabulary generation.
  • The input corpus iterator can read from a compressed file (bzip2, gzip, ...), which saves disk space when dealing with very large corpora.
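
The compressed-input point can be illustrated with a minimal sketch that uses only the Python standard library. The class name `CompressedTextCorpus`, the one-document-per-line file layout, and the whitespace tokenization are assumptions made for illustration; they are not part of gensim's API.

```python
import bz2
import os
import tempfile


class CompressedTextCorpus:
    """Hypothetical helper: stream tokenized documents from a bzip2 file.

    Assumes one document per line; nothing here is part of gensim itself.
    """

    def __init__(self, path):
        self.path = path

    def __iter__(self):
        # Decompress on the fly, so the whole file is never loaded at once.
        with bz2.open(self.path, "rt", encoding="utf-8") as handle:
            for line in handle:
                yield line.split()


# Write a tiny compressed corpus to a temporary file for demonstration.
tmp_dir = tempfile.mkdtemp()
corpus_path = os.path.join(tmp_dir, "corpus.txt.bz2")
with bz2.open(corpus_path, "wt", encoding="utf-8") as out:
    out.write("human machine interface\n")
    out.write("graph of trees\n")

documents = list(CompressedTextCorpus(corpus_path))
```

Because the class defines `__iter__` rather than being a one-shot generator, the same object can be iterated several times, each pass re-reading the compressed file from the start.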

gensim now contains two algorithms for Latent Semantic Indexing:

  1. streamed two-pass algorithm: takes 2.5 hours on the English Wikipedia (3.2 million documents).
  2. streamed single-pass algorithm: slower (takes 8.5 hours), but accesses each document only once; use this if your input arrives as a stream and you cannot store it persistently.
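
The practical difference between the two access patterns can be sketched in plain Python. A generator can be consumed only once, which matches the single-pass setting; an object whose `__iter__` restarts the stream can be read repeatedly, which a two-pass algorithm needs. The names `stream_once` and `RestartableCorpus` are hypothetical and only illustrate the iteration contract, not gensim's internals.

```python
def stream_once(documents):
    """Generator: yields each document exactly once, then is exhausted."""
    for doc in documents:
        yield doc


class RestartableCorpus:
    """Iterable that restarts from the beginning on every pass."""

    def __init__(self, documents):
        self.documents = documents

    def __iter__(self):
        return iter(self.documents)


# Toy bag-of-words documents as lists of (term_id, weight) pairs.
docs = [[(0, 1.0), (1, 1.0)], [(1, 2.0)], [(0, 1.0), (2, 3.0)]]

one_shot = stream_once(docs)
first_pass = sum(1 for _ in one_shot)   # consumes the generator
second_pass = sum(1 for _ in one_shot)  # nothing is left to read

restartable = RestartableCorpus(docs)
pass_counts = [sum(1 for _ in restartable) for _ in range(2)]
```

If the input can only be read in the `stream_once` fashion, a second pass sees no documents at all, which is why a single-pass algorithm is the only option for data that cannot be stored.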

As in the previous version, new documents can be added incrementally to an existing decomposition, without recomputing everything from scratch.

For an overview of what gensim does (or does not do), go to the introduction.

To download and install gensim, consult the install page.

For examples of how to use it, try the tutorials.

Quick Reference Example

>>> from gensim import corpora, models, similarities
>>>
>>> # load corpus iterator from a Matrix Market file on disk
>>> corpus = corpora.MmCorpus('/path/to/corpus.mm')
>>>
>>> # initialize a transformation (Latent Semantic Indexing with 200 latent dimensions)
>>> lsi = models.LsiModel(corpus, numTopics=200)
>>>
>>> # convert the same corpus to latent space and index it
>>> index = similarities.MatrixSimilarity(lsi[corpus])
>>>
>>> # perform similarity query of another vector in LSI space against the whole corpus
>>> # (`query` is assumed to be a vector already transformed into LSI space)
>>> sims = index[query]
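
To make the last step concrete, here is a minimal pure-Python sketch of a cosine-similarity query against a small set of vectors, which is the kind of comparison the query above performs. The helper names `cosine` and `query_all`, and the toy vectors, are assumptions for illustration; the actual index is implemented differently and far more efficiently.

```python
import math


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse vectors given as (id, weight) pairs."""
    a, b = dict(vec_a), dict(vec_b)
    dot = sum(weight * b.get(idx, 0.0) for idx, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        # A zero vector has no direction; treat it as dissimilar to everything.
        return 0.0
    return dot / (norm_a * norm_b)


def query_all(indexed_vectors, query_vector):
    """Return one similarity score per indexed vector, in corpus order."""
    return [cosine(doc, query_vector) for doc in indexed_vectors]


# Hypothetical two-dimensional "latent space" vectors.
toy_index = [[(0, 1.0), (1, 0.0)], [(0, 0.0), (1, 2.0)], [(0, 3.0), (1, 3.0)]]
toy_query = [(0, 1.0), (1, 0.0)]
toy_sims = query_all(toy_index, toy_query)
```

The result has one score per indexed document, in the same order as the corpus: a vector pointing the same way as the query scores highest, and an orthogonal one scores zero.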