Posted to user@mahout.apache.org by WangRamon <ra...@hotmail.com> on 2012/03/16 05:11:19 UTC

Why there is "Infinity" values for the vector of a K-Means cluster center point?

Hi guys,

I’m running the k-Means driver and find that most (95%) of my input vectors
end up in one cluster. I retrieved the big Cluster object and one of the
points that belongs to it, and used CosineDistanceMeasure to calculate the
distance between them. The distance is “NaN”. Debugging showed that the center
point vector of the Cluster object contains some “-Infinity” values, so when
the measure method is called, AbstractVector.dot returns “Infinity”. Is this
the reason most of my input vectors belong to one big cluster? And why are
there “-Infinity” values in the center point? Thanks in advance. BTW, I'm
using the Mahout 0.6 release.

Cheers

Ramon
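
A note on the symptom described above. The following is a minimal Python sketch, not Mahout's code, of why a single -Infinity component in a center turns the cosine distance into NaN under IEEE 754 arithmetic (which both Java doubles and Python floats follow). The helpers cosine_distance and nearest are hypothetical stand-ins for CosineDistanceMeasure and the nearest-cluster search; the assumption that the search uses a plain "<" comparison is not confirmed anywhere in this thread.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cos(angle) between two dense vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def nearest(point, centers):
    """Index of the closest center, using a plain '<' comparison (assumed)."""
    best_index, best_distance = None, None
    for i, center in enumerate(centers):
        d = cosine_distance(point, center)
        # Any comparison against NaN is False, so a NaN that is seen
        # first can never be displaced by a later, finite distance.
        if best_index is None or d < best_distance:
            best_index, best_distance = i, d
    return best_index


point = [2.0, 4.0, 2.0]
healthy_center = [1.0, 2.0, 1.0]            # same direction as `point`
broken_center = [1.0, float("-inf"), 1.0]   # one poisoned component

# dot is -inf and the norm product is +inf, so the ratio is NaN.
print(cosine_distance(point, broken_center))   # nan

# The broken center is examined first and "wins" despite the NaN.
print(nearest(point, [broken_center, healthy_center]))
```

In this sketch the broken center keeps every point whenever it is examined first, because no finite distance ever compares as smaller than NaN. Whether that is what produced the 95% cluster here depends on how the real driver compares distances.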

Re: Why there is "Infinity" values for the vector of a K-Means cluster center point?

Posted by Paritosh Ranjan <pr...@xebia.com>.
Top down clustering might be an option. I am not sure how it will
behave in terms of performance. However, the idea is to find
relatively bigger clusters first, then run another level of
clustering on the clusters identified, and so on for as long as needed.

https://cwiki.apache.org/confluence/display/MAHOUT/Top+Down+Clustering

It might help if Canopy is used in the top phase, as Canopy has only one
reducer, which keeps all the clusters in memory and compares every input
to them.
So finding the initial clusters using Canopy and then running any other,
faster clustering algorithm on the clusters found (the bottom phases) might
be a good idea. However, only the results can tell whether its
performance is better than Canopy or not.
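
The top-down idea can be sketched as follows. This is a hedged illustration in plain Python rather than Mahout code: it uses a small k-means at every level for brevity, whereas the suggestion above is Canopy in the top phase and a faster algorithm in the bottom phases. The names kmeans, top_down, max_size and max_depth are made up for the example.

```python
import math


def distance(a, b):
    """Euclidean distance between two dense points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, iterations=10):
    """Plain k-means, seeded with the first k points to stay deterministic."""
    centers = [list(p) for p in points[:k]]
    groups = []
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: distance(p, centers[j]))
            groups[i].append(p)
        for i, group in enumerate(groups):
            if group:  # an empty group keeps its previous center
                centers[i] = [sum(col) / len(group) for col in zip(*group)]
    return [g for g in groups if g]


def top_down(points, k, max_size, depth=0, max_depth=3):
    """Cluster coarsely, then re-cluster every group larger than max_size."""
    if len(points) <= max_size or depth >= max_depth:
        return [points]
    groups = kmeans(points, k)
    if len(groups) < 2:  # no split happened, so recursing would not help
        return [points]
    leaves = []
    for group in groups:
        leaves.extend(top_down(group, k, max_size, depth + 1, max_depth))
    return leaves


points = [[0, 0], [0, 1], [1, 0],
          [10, 10], [10, 11], [11, 10],
          [50, 50], [50, 51]]
for leaf in top_down(points, k=2, max_size=3):
    print(leaf)
```

Each level only has to compare a point against the few centers of its own parent cluster, which is the reason the approach might be cheaper than comparing every input against every cluster at once.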

On 17-03-2012 06:25, WangRamon wrote:
> Thanks Jeff, I’ve found the root cause: the input vectors contained some
> “Infinity” values because of a bug in our logic for generating tf-idf values.
> We fixed the bug, and now the K-Means process works as expected. Thank you.
>
> BTW, we plan to use Canopy to find the initial cluster centers, but the
> process seems very slow. Do you have any good ideas? Thanks.
>
> Cheers
>
> Ramon


RE: Why there is "Infinity" values for the vector of a K-Means cluster center point?

Posted by WangRamon <ra...@hotmail.com>.


Thanks Jeff, I’ve found the root cause: the input vectors contained some
“Infinity” values because of a bug in our logic for generating tf-idf values.
We fixed the bug, and now the K-Means process works as expected. Thank you.

BTW, we plan to use Canopy to find the initial cluster centers, but the
process seems very slow. Do you have any good ideas? Thanks.

 

Cheers

Ramon
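
Since the root cause turned out to be non-finite values in the input, a scan of the vectors before clustering would catch this kind of problem early. A minimal sketch, assuming each sparse vector is available as a dict from term index to weight; that representation and the helper names are assumptions for illustration, not Mahout's Vector API.

```python
import math


def non_finite_entries(vector):
    """Return (index, value) pairs whose value is NaN or +/-Infinity."""
    return [(i, v) for i, v in sorted(vector.items()) if not math.isfinite(v)]


def split_clean_and_bad(vectors):
    """Separate usable vectors from those with non-finite components."""
    clean, bad = {}, {}
    for key, vector in vectors.items():
        problems = non_finite_entries(vector)
        if problems:
            bad[key] = problems   # remember which components are broken
        else:
            clean[key] = vector
    return clean, bad


# Hypothetical documents keyed by id, with tf-idf weights per term index.
vectors = {
    "doc1": {0: 0.3, 7: 1.2},
    "doc2": {0: float("inf"), 3: 0.5},
    "doc3": {2: float("nan")},
}
clean, bad = split_clean_and_bad(vectors)
print(sorted(clean))   # ids that are safe to cluster
print(sorted(bad))     # ids to inspect in the tf-idf step
```

Reporting the offending index along with the document id makes it easier to trace the value back to the term whose weight was computed incorrectly.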


> Date: Fri, 16 Mar 2012 10:16:55 -0600
> From: jdog@windwardsolutions.com
> To: user@mahout.apache.org
> Subject: Re: Why there is "Infinity" values for the vector of a K-Means cluster center point?
> 
> Good question. The only way I can think of an infinity in a Kluster
> center is if there were some infinity values in the vectors it observed.
> The center (centroid) is calculated in each iteration after all points
> have been observed by dividing S1 by S0. If, for some reason, S0 was
> zero this would cause all of the center elements to be infinity. We
> check for that case so it is unlikely.
> 
> Can you narrow it down a bit more? How are you getting the kmeans prior?
> By sampling input vectors (-k) or using Canopy? Are there any infinity
> values in clusters-0?

Re: Why there is "Infinity" values for the vector of a K-Means cluster center point?

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Good question. The only way I can think of an infinity in a Kluster
center is if there were some infinity values in the vectors it observed.
The center (centroid) is calculated in each iteration after all points
have been observed by dividing S1 by S0. If, for some reason, S0 was
zero this would cause all of the center elements to be infinity. We
check for that case so it is unlikely.

Can you narrow it down a bit more? How are you getting the kmeans prior?
By sampling input vectors (-k) or using Canopy? Are there any infinity
values in clusters-0?
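
To make the S1 / S0 step concrete, here is a small Python sketch. It assumes S0 is the number of points observed and S1 their component-wise sum, which is how the message above reads; RunningCluster and its methods are illustrative names, not Mahout classes.

```python
class RunningCluster:
    """Accumulates observed points and derives the centroid as S1 / S0."""

    def __init__(self, dimensions):
        self.s0 = 0.0                  # assumed: count of observed points
        self.s1 = [0.0] * dimensions   # assumed: component-wise sum of points

    def observe(self, point):
        self.s0 += 1.0
        self.s1 = [total + x for total, x in zip(self.s1, point)]

    def compute_center(self, previous_center):
        # Guard for the "S0 was zero" case: with nothing observed,
        # keep the previous center instead of dividing by zero.
        if self.s0 == 0:
            return list(previous_center)
        return [total / self.s0 for total in self.s1]


cluster = RunningCluster(2)
for p in ([1.0, 2.0], [3.0, 4.0], [5.0, float("-inf")]):
    cluster.observe(p)

# One infinite observation is enough: the sum of the second component
# becomes -inf and stays -inf after the division by S0.
print(cluster.compute_center([0.0, 0.0]))   # [3.0, -inf]
```

A finite number of finite points cannot sum to infinity short of overflow, which fits the suspicion that the infinity came in with the observed vectors. Note that Python raises ZeroDivisionError on division by zero where Java doubles would yield Infinity or NaN, so the guard matters here for a slightly different reason.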


On 3/15/12 10:11 PM, WangRamon wrote:
> Hi guys,
>
> I’m running the k-Means driver and find that most (95%) of my input vectors
> end up in one cluster. I retrieved the big Cluster object and one of the
> points that belongs to it, and used CosineDistanceMeasure to calculate the
> distance between them. The distance is “NaN”. Debugging showed that the center
> point vector of the Cluster object contains some “-Infinity” values, so when
> the measure method is called, AbstractVector.dot returns “Infinity”. Is this
> the reason most of my input vectors belong to one big cluster? And why are
> there “-Infinity” values in the center point? Thanks in advance. BTW, I'm
> using the Mahout 0.6 release.
>
> Cheers
>
> Ramon