Posted to user@mahout.apache.org by Vicky <ja...@yahoo.com> on 2012/01/24 04:37:04 UTC
TF-IDF and vectors, Mahout clustering
Hello,
Can someone confirm whether the following understanding of TF-IDF, vectors, and Mahout's algorithms is correct?
When I create a vector file from a Lucene index using Mahout's lucene.vector command with TFIDF weighting, is the following the process involved?
Let's say I have only three documents in the entire corpus in the Lucene index:
Doc1: High High High Low
Doc2: High High High Medium
Doc3: High High High High
When the vector file is created with TF-IDF, will the following be the result of applying the TF-IDF formulas? Can someone please confirm? Some Lucene documentation states the term frequency calculation without a square root, while other documentation includes the square root.
Term Frequency
Word     Doc1    Doc2    Doc3
High     0.866   0.866   1
Medium   0       0.5     0
Low      0.5     0       0
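The numbers in the term frequency table are consistent with taking the square root of (term count / document length). That normalization is an assumption inferred from the values, not a confirmed description of what lucene.vector does. A minimal Python sketch of both variants (with and without the square root), using a hypothetical helper named term_frequencies:

```python
import math
from collections import Counter

docs = {
    "Doc1": "High High High Low",
    "Doc2": "High High High Medium",
    "Doc3": "High High High High",
}

def term_frequencies(text, use_sqrt=True):
    """Return count / document length per term, optionally square-rooted.

    Assumption: TF is normalized by document length; this is inferred from
    the table in the post, not taken from Lucene or Mahout source code.
    """
    tokens = text.split()
    counts = Counter(tokens)
    tf = {}
    for term, count in counts.items():
        ratio = count / len(tokens)
        tf[term] = math.sqrt(ratio) if use_sqrt else ratio
    return tf

for name, text in docs.items():
    print(name, "sqrt:", term_frequencies(text, use_sqrt=True))
    print(name, "raw: ", term_frequencies(text, use_sqrt=False))
```

With use_sqrt=True, "High" in Doc1 comes out at about 0.866 and "Low" at 0.5, matching the table; with use_sqrt=False they are 0.75 and 0.25.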
IDF: log(numDocs/docFrq+1) + 1
Word     Doc1    Doc2    Doc3
High     1.124   1.124   1.124
Medium   0       1.176   0
Low      1.176   0       0
So the vectors will look like this (after TF*IDF):
(High,High,High,Low) (0.973,0,0.588)
(High,High,High,Medium) (0.973,0,0.588)
(High,High,High,High) (1.124,0,0)
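As a rough check of the multiplication step, the sketch below multiplies the TF and IDF values from the tables in this post, term by term. The dimension order (High, Medium, Low) is an assumption taken from the row order of the tables, and tfidf_vector is a hypothetical helper name. With that order, the Medium weight of Doc2 lands in the second position rather than the third.

```python
# TF and IDF values copied from the tables in this post.
TERMS = ["High", "Medium", "Low"]  # assumed dimension order

tf = {
    "Doc1": {"High": 0.866, "Low": 0.5},
    "Doc2": {"High": 0.866, "Medium": 0.5},
    "Doc3": {"High": 1.0},
}
idf = {"High": 1.124, "Medium": 1.176, "Low": 1.176}

def tfidf_vector(doc_tf):
    """Multiply TF by IDF for each term; terms absent from the document get 0."""
    return [round(doc_tf.get(term, 0.0) * idf[term], 3) for term in TERMS]

vectors = {doc: tfidf_vector(weights) for doc, weights in tf.items()}
for doc, vec in vectors.items():
    print(doc, vec)
```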
Once the vectors are created, Mahout lets us use different measures to calculate the distance between the points, e.g. EuclideanDistanceMeasure.
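For illustration, Euclidean distance between two such vectors can be sketched in plain Python. This does not call Mahout's EuclideanDistanceMeasure class; it only shows the arithmetic, and the function name euclidean_distance is hypothetical. The two input vectors are the Doc1 and Doc3 vectors listed in this post.

```python
import math

def euclidean_distance(a, b):
    """Square root of the sum of squared per-dimension differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

doc1 = (0.973, 0.0, 0.588)  # vector listed for Doc1 in this post
doc3 = (1.124, 0.0, 0.0)    # vector listed for Doc3 in this post

print(euclidean_distance(doc1, doc3))
```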
Can you please confirm whether my understanding of the entire process is correct, including the TF-IDF calculations?
Thanks,
Re: TF-IDF and vectors, Mahout clustering
Posted by Ioan Eugen Stan <st...@gmail.com>.
Not sure if the formula is the one you mentioned, but you've got the idea:
- make vectors
- compute similarity
- group similar vectors in the same cluster
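The three steps above can be sketched in a few lines of Python. This is not one of Mahout's clustering algorithms: it is a naive single-pass grouping by cosine similarity against each cluster's first member, and the vectors, the threshold, and the function names are all made up for illustration. It also assumes no vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)  # assumes neither vector is all zeros

def group_by_similarity(vectors, threshold):
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    clusters = []  # each cluster is a list of names; the first name is the seed
    for name, vec in vectors.items():
        for cluster in clusters:
            seed = vectors[cluster[0]]
            if cosine_similarity(vec, seed) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])  # nothing similar enough: start a new cluster
    return clusters

# Step 1: vectors (hypothetical values, not taken from the thread).
vectors = {
    "a": (1.0, 0.0, 0.1),
    "b": (0.9, 0.0, 0.2),
    "c": (0.0, 1.0, 0.0),
}

# Steps 2 and 3: compare vectors and group the similar ones.
print(group_by_similarity(vectors, threshold=0.9))
```

Here "a" and "b" point in nearly the same direction and end up together, while "c" is orthogonal to both and forms its own group.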
--
Ioan Eugen Stan
http://ieugen.blogspot.com
Re: TF-IDF and vectors, Mahout clustering
Posted by prasenjit mukherjee <pr...@gmail.com>.
I think IDF is defined for a given term (hence it should not vary with the document).
So the IDFs ( log(numDocs/docFrq+1) +1 ) should be:
High = log(3/3)+1
Low = log(3/1) +1
Medium = log(3/1)+1
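To make the "one IDF per term" point concrete, here is a small Python sketch that counts document frequency per term and computes a single IDF value for each. The placement of the "+1" and the log base are ambiguous in this thread, so both are exposed as parameters: denominator_offset=1 reads the formula as log(numDocs / (docFrq + 1)) + 1, while denominator_offset=0 gives log(numDocs / docFrq) + 1 as in the lines above. The function and parameter names are hypothetical.

```python
import math

docs = [
    "High High High Low",
    "High High High Medium",
    "High High High High",
]

def document_frequencies(documents):
    """Count, for each term, how many documents contain it at least once."""
    df = {}
    for text in documents:
        for term in set(text.split()):
            df[term] = df.get(term, 0) + 1
    return df

def idf(num_docs, doc_freq, denominator_offset=1, base=math.e):
    """One IDF value per term: log(numDocs / (docFrq + offset)) + 1.

    Assumption: the offset and the log base are choices made for this sketch.
    """
    return math.log(num_docs / (doc_freq + denominator_offset), base) + 1

df = document_frequencies(docs)
for term, freq in sorted(df.items()):
    print(term, "docFrq =", freq, "idf =", round(idf(len(docs), freq), 3))
```

With base=10 and the offset of 1, a term that occurs in one of the three documents gets about 1.176, which matches the Low and Medium values in the original post.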
-P