Posted to user@mahout.apache.org by Koert Kuipers <ko...@tresata.com> on 2013/01/10 18:21:20 UTC

vector encoding of text documents

i noticed that the mahout encoding of text documents in a k-means example
uses TF-IDF and then converts the documents into vectors (so one vector per
document) where every word gets mapped to an int for the vector index and
the word's TF-IDF score becomes the vector value at that index. it also
creates a dictionary to be able to get back from the index to the word. did
i get that right?
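
[For illustration only: a minimal plain-Java sketch of the dictionary-based
TF-IDF encoding described above. The class name, the toy corpus, and the
log-scaled IDF formula are assumptions made for the example, not Mahout's
actual implementation.]

    import java.util.*;

    public class TfIdfEncodingSketch {
      public static void main(String[] args) {
        // toy corpus standing in for the documents in the k-means example
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("big", "data", "clustering"),
            Arrays.asList("big", "text", "data", "data"));

        // dictionary: word -> int index, kept so an index can be mapped back to a word
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        // document frequency of each word, needed for IDF
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
          for (String w : new HashSet<>(doc)) {
            dictionary.putIfAbsent(w, dictionary.size());
            df.merge(w, 1, Integer::sum);
          }
        }

        int n = docs.size();
        for (List<String> doc : docs) {
          // term frequency within this document
          Map<String, Integer> tf = new HashMap<>();
          for (String w : doc) {
            tf.merge(w, 1, Integer::sum);
          }
          // one sparse vector per document: index -> TF * IDF (log-scaled IDF here)
          Map<Integer, Double> vector = new TreeMap<>();
          for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double idf = Math.log((double) n / df.get(e.getKey()));
            vector.put(dictionary.get(e.getKey()), e.getValue() * idf);
          }
          System.out.println(vector);
        }
      }
    }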

why this approach as opposed to the feature hashing that i saw in other
places in mahout?

thanks! koert

Re: vector encoding of text documents

Posted by Ted Dunning <te...@gmail.com>.
It has to do with a few things.

First, most classifiers can learn weights as good as or better than TF-IDF
weighting, but k-means really needs the extra help.

Second, there is an issue of when the code was developed.  Most of the
clustering code was developed before the feature hashing stuff was
available.  Thus, the mismatch in techniques.  Clearly IDF could have been
used with feature hashing, although reversibility would suffer.
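
[For illustration only: a minimal plain-Java sketch of a hashed encoding
combined with IDF weights, not Mahout's encoder classes. The hash scheme,
dimensionality, probe count, and IDF values are made up for the example; the
point is that no dictionary is kept, so the mapping from index back to word
is lost.]

    import java.util.*;

    public class HashedEncodingSketch {
      static final int DIMENSION = 1 << 10;  // vector size fixed up front, no dictionary

      // word -> index via hashing; nothing is stored, so the mapping is hard to invert
      static int indexFor(String word, int probe) {
        return Math.floorMod((word + "#" + probe).hashCode(), DIMENSION);
      }

      public static void main(String[] args) {
        List<String> doc = Arrays.asList("big", "data", "clustering", "data");

        // pretend these IDF weights came from an earlier counting pass over the corpus
        Map<String, Double> idf = new HashMap<>();
        idf.put("big", 0.3);
        idf.put("data", 0.1);
        idf.put("clustering", 1.2);

        double[] vector = new double[DIMENSION];
        int probes = 2;  // hashing each word into a couple of slots softens collisions
        for (String w : doc) {
          // adding once per token accumulates TF * IDF at the hashed positions
          for (int p = 0; p < probes; p++) {
            vector[indexFor(w, p)] += idf.get(w) / probes;
          }
        }

        // the non-zero entries say "something hashed here" but not which word it was
        for (int i = 0; i < DIMENSION; i++) {
          if (vector[i] != 0.0) {
            System.out.println(i + " -> " + vector[i]);
          }
        }
      }
    }

Without a dictionary there is no way to take an interesting dimension of a
cluster centroid and ask which word produced it, which is the loss of
reversibility mentioned above.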
