Posted to commits@mahout.apache.org by co...@apache.org on 2010/01/09 15:55:00 UTC

[CONF] Apache Lucene Mahout > TF-IDF - Term Frequency-Inverse Document Frequency

Space: Apache Lucene Mahout (http://cwiki.apache.org/confluence/display/MAHOUT)
Page: TF-IDF - Term Frequency-Inverse Document Frequency (http://cwiki.apache.org/confluence/display/MAHOUT/TF-IDF+-+Term+Frequency-Inverse+Document+Frequency)

Added by David Stuart:
---------------------------------------------------------------------
{excerpt}TF-IDF is a weight often used in information retrieval and text mining. It is a statistical measure of how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times the word appears in the document but is offset by the frequency of the word in the corpus.{excerpt} In other words, a term that appears many times in a document but also appears many times across the corpus as a whole gets a lower score. Examples are "the", "and", and "it", but depending on your source material there may be other words that are very common in the subject matter.
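
The weighting described above can be illustrated with a minimal Python sketch. It assumes one common textbook variant: term frequency is the term's count divided by the document length, and inverse document frequency is log(N / df), where N is the number of documents and df is the number of documents containing the term. The function name tf_idf and the three sample documents are made up for illustration; this is not Mahout's implementation, and other variants (smoothing, different normalization) are also widely used.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(docs)

    # Document frequency: the number of documents that contain each term.
    df = Counter()
    for doc in docs:
        df.update(set(doc))

    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical three-document corpus, tokenized on whitespace.
docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
weights = tf_idf(docs)
```

With this corpus, "the" occurs in every document, so its idf is log(3/3) = 0 and its weight is zero everywhere, even though it is the most frequent word in the first document. "mat" occurs in only one document and therefore scores higher than "cat", which occurs in two.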


 See Also:
 * http://en.wikipedia.org/wiki/Tf%E2%80%93idf
 * http://nlp.stanford.edu/IR-book/html/htmledition/tf-idf-weighting-1.html

