Posted to dev@mahout.apache.org by David Noel <da...@gmail.com> on 2014/06/03 08:37:22 UTC

TF-IDF vector persistence with normalization enabled

I noticed something similar to what was pointed out in this mailing
list post:
http://comments.gmane.org/gmane.comp.apache.mahout.user/17819
TF-IDF vectors do not seem to persist when they are generated with
normalization enabled.

According to Gokhan Capan:

"It seems to have tf-idf vectors later, you need to create tf vectors
(DictionaryVectorizer.createTermFrequencyVectors) with logNormalize option
set to false, and normPower option set to -1.0f."

Is there a reason for this? It would seem useful for them to persist.
Can someone explain the reasoning behind not persisting them? I figure
there's a perfectly good reason; I just can't work out what it is.
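
To make the question concrete, here is a minimal sketch in plain Python
(not Mahout code) of how the order of normalization and IDF weighting
changes the result. The toy term counts, the IDF weights, and the helper
names l2_normalize, apply_idf and l2_norm are all assumptions made up for
illustration. The sketch shows only that IDF-weighting an already
normalized TF vector gives a vector that is no longer unit length, while
weighting raw TF counts and normalizing afterwards does. Whether this is
the actual reason for the behaviour described above is not established
in this thread.

```python
import math


def l2_norm(vec):
    """Return the Euclidean (L2) length of vec."""
    return math.sqrt(sum(x * x for x in vec))


def l2_normalize(vec):
    """Scale vec to unit L2 length (an all-zero vector is returned as-is)."""
    norm = l2_norm(vec)
    return [x / norm for x in vec] if norm > 0 else list(vec)


def apply_idf(tf, idf):
    """Multiply each term frequency by the IDF weight of the same term."""
    return [t * w for t, w in zip(tf, idf)]


# Hypothetical toy data: raw term counts for one document and
# made-up IDF weights for a three-term dictionary.
raw_tf = [3.0, 1.0, 0.0]
idf = [1.0, 2.0, 3.0]

# Order A: normalize the TF vector first, then apply IDF weights.
normalize_then_idf = apply_idf(l2_normalize(raw_tf), idf)

# Order B: apply IDF weights to the raw counts, then normalize.
idf_then_normalize = l2_normalize(apply_idf(raw_tf, idf))

print("normalize, then IDF:", normalize_then_idf, l2_norm(normalize_then_idf))
print("IDF, then normalize:", idf_then_normalize, l2_norm(idf_then_normalize))
```

With these inputs only the second ordering ends up with unit length, which
is one way to see why a pipeline might prefer unnormalized TF vectors as
the input to the TF-IDF step.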

Re: TF-IDF vector persistence with normalization enabled

Posted by David Noel <da...@gmail.com>.
>> "It seems to have tf-idf vectors later, you need to create tf vectors
>> (DictionaryVectorizer.createTermFrequencyVectors) with logNormalize
>> option set to false, and normPower option set to -1.0f."
> That post implies that in order to have tf-idf vectors persisted, in the tf
> vectors creation phase you need those options set.

I've noticed that from playing around with DictionaryVectorizer and
TFIDFConverter. I'm just wondering why this is the case. I don't
understand the reasoning behind the vectors not persisting when
normalization is enabled.

Re: TF-IDF vector persistence with normalization enabled

Posted by Gokhan Capan <gk...@gmail.com>.
That post implies that in order to have tf-idf vectors persisted, in the tf
vectors creation phase you need those options set.

Or you can always run the Driver directly and easily, preferably from
Mahout's command line, i.e. bin/mahout seq2sparse.

Gokhan

