Posted to user@mahout.apache.org by Vicky <ja...@yahoo.com> on 2012/01/24 04:37:04 UTC

TF-IDF and vectors, Mahout clustering

Hello,

Can someone confirm whether the following understanding of TF-IDF, vectors, and Mahout's algorithms is correct?

When I create a vector file from a Lucene index using the Mahout lucene.vector command with TF-IDF weighting, is the following the process involved?

Let's say I have only three documents in the entire corpus in the Lucene index:
Doc1: High High High Low
Doc2: High High High Medium
Doc3: High High High High

When the vector file is created with TF-IDF weighting, will the following be the result after applying the TF-IDF formulas? In some Lucene documentation the term frequency calculation is stated without a square root, while elsewhere it is stated with one.

Term Frequency 

Word      Doc1     Doc2     Doc3
High      0.866    0.866    1
Medium    0        0.5      0
Low       0.5      0        0
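
The figures in this table are consistent with taking the square root of each term's count divided by the document length. A minimal Python sketch under that assumption follows; the names `docs` and `term_frequencies` are illustrative only and are not part of Mahout or Lucene.

```python
import math
from collections import Counter

# The three example documents from the post.
docs = {
    "Doc1": "High High High Low",
    "Doc2": "High High High Medium",
    "Doc3": "High High High High",
}

def term_frequencies(text):
    """Return sqrt(count / document_length) for every term in the text.

    Assumption: this normalisation is inferred from the table values,
    not taken from Mahout or Lucene source code.
    """
    tokens = text.split()
    counts = Counter(tokens)
    return {term: math.sqrt(n / len(tokens)) for term, n in counts.items()}

tf = {name: term_frequencies(text) for name, text in docs.items()}

for name, weights in tf.items():
    print(name, {term: round(w, 3) for term, w in weights.items()})
```

Terms that do not occur in a document are simply absent from its dictionary, which corresponds to the 0 entries in the table.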

IDF: log(numDocs/(docFreq+1)) + 1
Word      Doc1     Doc2     Doc3
High      1.124    1.124    1.124
Medium    0        1.176    0
Low       1.176    0        0

So the vectors will look like this (after TF*IDF):

(High,High,High,Low)       (0.973,0,0.588)
(High,High,High,Medium)    (0.973,0,0.588)
(High,High,High,High)      (1.124,0,0)
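
The TF*IDF step is an element-wise product of the two tables. The sketch below multiplies the posted table values as they stand and assumes the vector components are ordered (High, Medium, Low); that ordering is an assumption, since the post does not state it. Under that ordering, Doc2's nonzero Medium weight falls in the second component.

```python
# Assumed component order of each vector (not stated in the post).
terms = ["High", "Medium", "Low"]

# TF and IDF values copied from the tables in the post.
tf = {
    "Doc1": {"High": 0.866, "Medium": 0.0, "Low": 0.5},
    "Doc2": {"High": 0.866, "Medium": 0.5, "Low": 0.0},
    "Doc3": {"High": 1.0, "Medium": 0.0, "Low": 0.0},
}
idf = {
    "Doc1": {"High": 1.124, "Medium": 0.0, "Low": 1.176},
    "Doc2": {"High": 1.124, "Medium": 1.176, "Low": 0.0},
    "Doc3": {"High": 1.124, "Medium": 0.0, "Low": 0.0},
}

# Multiply TF by IDF term by term, rounding to three decimals as in the post.
vectors = {
    doc: [round(tf[doc][t] * idf[doc][t], 3) for t in terms]
    for doc in tf
}

for doc, vec in vectors.items():
    print(doc, vec)
```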


Once the vectors are created, Mahout can use different measures to calculate the distance between the points, e.g. EuclideanDistanceMeasure.
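
For reference, Euclidean distance between two such vectors can be sketched in plain Python as below. This only illustrates the formula; it is not Mahout's EuclideanDistanceMeasure class, and the helper name `euclidean_distance` is made up for this example. The two input vectors are the Doc1 and Doc3 vectors quoted in the post.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared component differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Vectors for Doc1 and Doc3 as given in the post.
doc1 = [0.973, 0.0, 0.588]
doc3 = [1.124, 0.0, 0.0]

d = euclidean_distance(doc1, doc3)
print(round(d, 3))
```

A smaller distance means the two documents are weighted more alike across the terms.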

Can you please confirm whether my understanding of the entire process is correct, including the TF-IDF calculations?

Thanks,

Re: TF-IDF and vectors, Mahout clustering

Posted by Ioan Eugen Stan <st...@gmail.com>.

Not sure if the formula is the one you mentioned, but you've got the idea:

- make vectors
- compute similarity
- group similar vectors in the same cluster
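
The three steps above can be sketched in a few lines of Python. This is a toy illustration only: it uses cosine similarity and a greedy threshold grouping, which is not k-means or any specific Mahout algorithm, and the vectors, the threshold, and the names `cosine_similarity` and `group_similar` are all hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def group_similar(vectors, threshold):
    """Greedy grouping: add each vector to the first cluster whose
    first member is at least `threshold` similar, else start a new one."""
    clusters = []
    for name, vec in vectors.items():
        for cluster in clusters:
            representative = vectors[cluster[0]]
            if cosine_similarity(vec, representative) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters

# Step 1: make vectors (hypothetical toy data).
vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}

# Steps 2 and 3: compute similarity and group similar vectors.
clusters = group_similar(vectors, threshold=0.9)
print(clusters)
```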



-- 
Ioan Eugen Stan
http://ieugen.blogspot.com

Re: TF-IDF and vectors, Mahout clustering

Posted by prasenjit mukherjee <pr...@gmail.com>.

I think IDF is defined per term (and hence should not vary with the document).

So the IDFs (log(numDocs/(docFreq+1)) + 1) should be:

High = log(3/(3+1)) + 1
Low = log(3/(1+1)) + 1
Medium = log(3/(1+1)) + 1
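
A small Python sketch of the per-term point: IDF is computed once per term from the whole corpus, so it is the same in every document. It assumes the formula is read as log(numDocs/(docFreq+1)) + 1 and that the logarithm is natural; the thread does not state the base, and a base-10 log gives different numbers. The function name `idf` is illustrative.

```python
import math

# The three example documents from the original post.
docs = {
    "Doc1": "High High High Low",
    "Doc2": "High High High Medium",
    "Doc3": "High High High High",
}

def idf(term, docs):
    """log(numDocs / (docFreq + 1)) + 1, using the natural log (assumed)."""
    num_docs = len(docs)
    # docFreq counts documents containing the term, not total occurrences.
    doc_freq = sum(1 for text in docs.values() if term in text.split())
    return math.log(num_docs / (doc_freq + 1)) + 1

idf_by_term = {term: idf(term, docs) for term in ("High", "Medium", "Low")}

for term, value in idf_by_term.items():
    print(term, round(value, 3))
```

Because "High" occurs in all three documents, it receives the lowest IDF of the three terms, while "Medium" and "Low" each occur in one document and share the same value.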

-P
