Posted to user@mahout.apache.org by Ani Tumanyan <an...@bnotions.com> on 2013/12/03 16:03:31 UTC

TF-IDF confusion

Hello everyone,

I'm working on a project where I'm trying to extract topics from news articles. I have around 500,000 articles as a dataset. Here are the steps that I'm following:

1. First of all I'm doing some preprocessing: I'm using Behemoth to annotate the documents and get rid of non-English ones.
2. Then I'm running Mahout's sparse vector command to generate TF-IDF vectors. The problem is that the number of terms in a document's TF-IDF vector is far greater than the number of terms in its TF vector. Moreover, there are some words/terms in the TF-IDF vector that don't appear in that specific document at all. Is this correct behaviour, or is there something wrong with my approach?

Thanks in advance!

Ani

Re: TF-IDF confusion

Posted by Ted Dunning <te...@gmail.com>.
Ani,

I really don't understand your second point.

Here is how I view things ... if you can phrase things in those terms, it
might help me understand your question.

The TF part of TF-IDF refers to the term frequencies in a document.
Typically, each possible word is assigned a positive integer that
represents a position in a vector. A term frequency vector is a sparse
vector with counts (or functions of counts) at the locations corresponding
to the words in a document.

If the document has words that do not have assigned positions in the
vector, they are either ignored or their counts are put into a special
"UNKNOWN-WORD" position.
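That mapping can be sketched like this (illustrative Python, not Mahout's actual code; the toy dictionary and the UNKNOWN slot are assumptions for the example):

```python
# Toy term-frequency vector: each known word has a fixed position in the
# vector; words outside the dictionary are lumped into an UNKNOWN slot.
from collections import Counter

dictionary = {"news": 0, "topic": 1, "article": 2}  # word -> vector position
UNKNOWN = len(dictionary)                           # slot for unseen words

def tf_vector(tokens):
    counts = Counter(dictionary.get(t, UNKNOWN) for t in tokens)
    vec = [0.0] * (len(dictionary) + 1)
    for pos, n in counts.items():
        vec[pos] = float(n)
    return vec

print(tf_vector(["news", "news", "article", "zebra"]))
# -> [2.0, 0.0, 1.0, 1.0]   (news=2, topic=0, article=1, UNKNOWN=1)
```

Note that the vector's length is fixed by the dictionary, not by the document, which is why a TF vector cannot be "too long" for a given document.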

By definition, there is no way that the term frequency vector can be too
long or too short. Likewise, a document's length only matters if the counts
get too large to store (which is completely implausible, since we use a
double).

The IDF part of TF-IDF refers to weights that are applied to these TF
vectors. These weights are conventionally computed as the log of the ratio
of the total number of documents to the number of documents that contain
the corresponding word. The IDF weighting has one weight for each position
in the term frequency vector, so length is again not a problem.
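A minimal sketch of that weighting, assuming the conventional log(N / df) form (Mahout's actual weighting may differ in smoothing details):

```python
import math

# IDF weights: one per dictionary position. df[i] is the number of documents
# containing word i, out of n_docs documents total.
def idf_weights(df, n_docs):
    return [math.log(n_docs / d) if d > 0 else 0.0 for d in df]

def tfidf(tf, idf):
    # Element-wise product: weight each term count by its IDF.
    return [t * w for t, w in zip(tf, idf)]

idf = idf_weights([10, 1, 100], n_docs=100)
print(tfidf([2.0, 0.0, 1.0], idf))
```

Because TF-IDF is an element-wise product, a position whose term frequency is zero stays zero after weighting: TF-IDF weighting cannot introduce terms a document never contained.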

This is why I don't understand your second point.

Do you mean that many of the words in the document do not have assigned
positions in the term frequency vector? If so, that means you didn't
analyze the corpus ahead of time to get a good dictionary of word
locations.
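That "analyze the corpus ahead of time" step can be sketched as a single pass that assigns every word in the corpus a fixed position (again illustrative Python, not Mahout's dictionary-building code):

```python
# One pass over the whole corpus: every distinct word gets the next free
# vector position, so no document's words are left without a slot.
def build_dictionary(corpus):
    dictionary = {}
    for tokens in corpus:
        for t in tokens:
            if t not in dictionary:
                dictionary[t] = len(dictionary)
    return dictionary

corpus = [["news", "article"], ["topic", "news"]]
print(build_dictionary(corpus))
# -> {'news': 0, 'article': 1, 'topic': 2}
```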

Or is it that you are worried that the counts would be large?

On Tue, Dec 3, 2013 at 7:03 AM, Ani Tumanyan <an...@bnotions.com> wrote:

> Hello everyone,
>
> I'm working on a project, where I'm trying to extract topics from news
> articles. I have around 500,000 articles as a dataset. Here are the steps
> that I'm following:
>
> 1. First of all I'm doing some sort of preprocessing. For this I'm using
> Behemoth to annotate the document and get rid of non-English documents,
> 2. Then I'm running Mahout's sparse vector command to generate TF-IDF
> vectors. The problem with TF-IDF vector is that the number of words for a
> document is far more than the number of words in TF vectors. Moreover there
> are some words/terms in TF-IDF vector that didn't appear in that specific
> document anyway. Is this a correct behaviour or there is something wrong
> with my approach?
>
> Thanks in advance!
>
> Ani