Posted to user@mahout.apache.org by WangRamon <ra...@hotmail.com> on 2012/03/15 06:25:38 UTC

What will be a better value for T1 and T2 of a CosineDistanceMeasure




Hi All,

I'm tuning the number of clusters for some news input with CosineDistanceMeasure. The input data is about 11000 rows. I tried different settings for t1 and t2; here is a list:

1) with t1=0.6 and t2=0.9, I got Reduce output records=60
2) with t1=0.6 and t2=0.8, I got Reduce output records=868
3) with t1=0.6 and t2=0.7, I got Reduce output records=3374

I expected the reduce output (the number of clusters) to be less than 100, and the first setting matched that. What surprised me is the effect of t2. My understanding is that cos(25°) is about 0.9 and cos(35°) is about 0.8 (cos(90°) == 0.0), so if I set t2 to cos(35°), it should generate fewer clusters than t2 set to cos(25°), because it means the two vectors are more different: the angle between them is larger. Did I miss something? Thanks in advance.

Cheers
Ramon

RE: What will be a better value for T1 and T2 of a CosineDistanceMeasure

Posted by WangRamon <ra...@hotmail.com>.
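OK, I got it. This seems to be the answer: the measure returns a distance, not the cosine itself:

d = 1 - \frac{a_1 b_1 + a_2 b_2 + \cdots + a_n b_n}{\sqrt{a_1^2 + a_2^2 + \cdots + a_n^2} \, \sqrt{b_1^2 + b_2^2 + \cdots + b_n^2}}

So t1 and t2 are distances, where 0 means the same direction and 1 means orthogonal. A larger t2 corresponds to a larger angle, which matches the smaller number of clusters I got with t2=0.9.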
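To make the relation between angles and thresholds concrete, here is a minimal Python sketch of the formula above. It is not Mahout's code: the function names cosine_distance, angle_to_distance and distance_to_angle are hypothetical, and raising an error for a zero vector is an assumption of this sketch, not necessarily what CosineDistanceMeasure does.

```python
import math


def cosine_distance(a, b):
    """Return d = 1 - cos(angle between a and b), as in the formula above."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption of this sketch: treat the undefined case as an error.
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_a * norm_b)


def angle_to_distance(degrees):
    """Convert an angle between two vectors into a distance threshold."""
    return 1.0 - math.cos(math.radians(degrees))


def distance_to_angle(d):
    """Convert a distance threshold (0 <= d <= 2) back into degrees."""
    return math.degrees(math.acos(1.0 - d))


if __name__ == "__main__":
    for angle in (25, 35):
        print(angle, "degrees -> distance", round(angle_to_distance(angle), 3))
    for t2 in (0.9, 0.8, 0.7):
        print("t2 =", t2, "-> angle", round(distance_to_angle(t2), 1), "degrees")
```

Under this reading, angles of 25 and 35 degrees correspond to thresholds of roughly 0.09 and 0.18, while the t2 values 0.9, 0.8 and 0.7 from the original question correspond to angles of roughly 84, 78 and 73 degrees.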