Objects of this class realize the transformation from a word-document co-occurrence matrix (integers) into a locally/globally weighted TF-IDF matrix (positive floats).
The main methods are:
>>> tfidf = TfidfModel(corpus)
>>> print(tfidf[some_doc])
>>> tfidf.save('/tmp/foo.tfidf_model')
Model persistence is provided by the load and save methods.
Compute tf-idf by multiplying a local component (term frequency) with a global component (inverse document frequency), and normalizing the resulting documents to unit length. The unnormalized weight of term i in document j, in a corpus of D documents, is:
weight_{i,j} = frequency_{i,j} * log_2(D / document_freq_{i})
or, more generally:
weight_{i,j} = wlocal(frequency_{i,j}) * wglobal(document_freq_{i}, D)
so you can plug in your own custom wlocal and wglobal functions.
Default for wlocal is identity (other options: math.sqrt, math.log1p, ...) and default for wglobal is log_2(total_docs / doc_freq), giving the formula above.
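The pluggable local/global weighting described above can be illustrated with a minimal pure-Python sketch. It is not the library's implementation: the names tfidf_weights and default_wglobal, and the toy document frequencies, are hypothetical and chosen only for illustration.

```python
import math

def default_wglobal(doc_freq, total_docs):
    # Default global weight from the formula above: log_2(total_docs / doc_freq).
    return math.log2(total_docs / doc_freq)

def tfidf_weights(doc, doc_freqs, total_docs,
                  wlocal=lambda tf: tf, wglobal=default_wglobal):
    # doc: list of (term_id, frequency) pairs.
    # doc_freqs: hypothetical mapping term_id -> number of documents containing the term.
    return [(term_id, wlocal(freq) * wglobal(doc_freqs[term_id], total_docs))
            for term_id, freq in doc]

doc_freqs = {0: 1, 1: 2, 2: 4}   # toy statistics for a 4-document corpus
doc = [(0, 3), (1, 1), (2, 2)]

# Identity wlocal and log_2 wglobal (the defaults described above).
weights = tfidf_weights(doc, doc_freqs, total_docs=4)

# Same document, with a square-root local weight plugged in.
sqrt_weights = tfidf_weights(doc, doc_freqs, total_docs=4, wlocal=math.sqrt)
```

A term that occurs in every document (term 2 here) gets a global weight of log_2(1) = 0, so its unnormalized weight is zero regardless of wlocal.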
normalize dictates how the final transformed vectors will be normalized. normalize=True means set to unit length (default); False means don’t normalize. You can also set normalize to your own function that accepts and returns a sparse vector.
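The three normalize options might behave roughly as in the sketch below. The helper names unit_length, apply_normalize and max_scale are assumptions made for this illustration, not part of the documented API.

```python
import math

def unit_length(vec):
    # Scale a sparse vector, given as (term_id, weight) pairs, to Euclidean length 1.
    norm = math.sqrt(sum(w * w for _, w in vec))
    if norm == 0.0:
        return list(vec)  # an all-zero vector cannot be scaled; return it unchanged
    return [(term_id, w / norm) for term_id, w in vec]

def apply_normalize(vec, normalize=True):
    # True -> unit length, False -> leave as is, callable -> custom normalization.
    if normalize is True:
        return unit_length(vec)
    if normalize is False:
        return list(vec)
    return normalize(vec)

def max_scale(vec):
    # Example custom normalizer: divide every weight by the largest weight.
    top = max(w for _, w in vec)
    return [(term_id, w / top) for term_id, w in vec]

vec = [(0, 3.0), (1, 4.0)]
normalized = apply_normalize(vec)                      # unit length
untouched = apply_normalize(vec, normalize=False)      # unchanged copy
custom = apply_normalize(vec, normalize=max_scale)     # custom function
```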
If dictionary is specified, it must be a corpora.Dictionary object; it is then used to construct the inverse document frequency mapping directly, and corpus, if specified, is ignored.
Compute inverse document weights, which will be used to modify term frequencies for documents.
Load a previously saved object from file (also see save).
Save the object to file via pickling (also see load).
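As a rough illustration of pickle-based persistence, the sketch below round-trips a plain dict standing in for the model state. The helpers save_state and load_state and the state layout are hypothetical; the actual save and load methods may store more than this.

```python
import os
import pickle
import tempfile

def save_state(state, path):
    # Serialize the object to a file using pickle.
    with open(path, "wb") as fout:
        pickle.dump(state, fout, protocol=pickle.HIGHEST_PROTOCOL)

def load_state(path):
    # Restore a previously pickled object from a file.
    with open(path, "rb") as fin:
        return pickle.load(fin)

# Hypothetical stand-in for the model's state.
state = {"num_docs": 8, "idfs": {0: 3.0, 1: 2.0, 2: 0.0}}

path = os.path.join(tempfile.mkdtemp(), "foo.tfidf_model")
save_state(state, path)
restored = load_state(path)
```

Only unpickle files from sources you trust, since loading a pickle can execute arbitrary code.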
Compute default inverse-document-frequency for a term with document frequency doc_freq:
idf = add + log(totaldocs / doc_freq)
Precompute the inverse document frequency mapping for all terms.
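The default idf and its precomputation over all terms could be sketched as follows. The function names and the log_base parameter are assumptions for this example; base 2 is used as the default to match the log_2 default stated for wglobal.

```python
import math

def df2idf_sketch(doc_freq, total_docs, log_base=2.0, add=0.0):
    # idf = add + log(totaldocs / doc_freq), with the logarithm taken in log_base.
    return add + math.log(total_docs / doc_freq, log_base)

def precompute_idfs_sketch(doc_freqs, total_docs, wglobal=df2idf_sketch):
    # Build the term_id -> idf mapping once, so that transforming a document
    # later only needs a dictionary lookup per term.
    return {term_id: wglobal(df, total_docs) for term_id, df in doc_freqs.items()}

# Toy document frequencies for an 8-document corpus.
idfs = precompute_idfs_sketch({0: 1, 1: 2, 2: 8}, total_docs=8)

# A positive `add` shifts every idf, so terms present in all documents keep a nonzero weight.
shifted = df2idf_sketch(8, 8, add=1.0)
```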