
Posted to user@mahout.apache.org by Lokendra Singh <ls...@gmail.com> on 2011/01/17 17:04:18 UTC

Difference in KMeans performance with Mahout-0.3 and Mahout-0.4

Hi all,

I am running the KMeans clustering algorithm to cluster about 60K points
(DenseVectors) into 4K clusters on my Hadoop cluster.
I initialized the clusters with the initial 'k' points as centroids (4000) and
kept the convergence threshold pretty low (0.001).

I tried running it with Mahout-0.3 and Mahout-0.4 and found a huge difference
in their performance.
The rate of convergence was pretty high with mahout-0.3 (in the 1st iteration
about 600 clusters (out of 4000) converged; by the 6th iteration almost 3500
clusters (out of 4000) had converged).
With mahout-0.4, in contrast, I observed just 10 clusters (out of 4000)
converging even after 10 iterations.

What architectural difference between the KMeans implementations of mahout-0.4
and mahout-0.3 might be causing this difference in performance?

Regards
Lokendra

Re: Difference in KMeans performance with Mahout-0.3 and Mahout-0.4

Posted by Lokendra Singh <ls...@gmail.com>.

@Jeff: Every parameter (convergence threshold, number of clusters (i.e. 4000),
input points and input clusters) is the same in both cases.
I did not generate the initial clusters randomly, but rather generated the
initial 'k' clusters with the first 'k' input points as their centroids.
Hence, the initial clusters are the same in both cases.
Each DenseVector is of cardinality 64 (all doubles).
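
To make the seeding concrete, here is a minimal sketch of "first 'k' input
points as centroids". It uses plain double[] arrays instead of Mahout vectors,
and the class and method names (FirstKSeeds, firstK) are hypothetical
illustrations, not Mahout API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Mahout: seeds k-means with the first k input points. */
final class FirstKSeeds {

    /** Returns copies of the first k points, to be used as the initial centroids. */
    static List<double[]> firstK(List<double[]> points, int k) {
        if (k < 1 || k > points.size()) {
            throw new IllegalArgumentException("k must be in [1, " + points.size() + "]");
        }
        List<double[]> centroids = new ArrayList<>(k);
        for (int i = 0; i < k; i++) {
            // Copy each point so that later centroid updates cannot mutate the input.
            centroids.add(points.get(i).clone());
        }
        return centroids;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            points.add(new double[] {i, 2.0 * i});
        }
        List<double[]> seeds = firstK(points, 4);
        System.out.println(seeds.size() + " seeds, last = ("
                + seeds.get(3)[0] + ", " + seeds.get(3)[1] + ")");
    }
}
```

Because the seeds depend only on the input order, feeding the same points in
the same order should give identical initial clusters in both versions.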

@Robin: I have been using the Euclidean distance measure in both cases.
Actually, I am not using the mahout command line, but rather directly
accessing the API through the KMeansDriver.runJob() (mahout-0.3) and
KMeansDriver.run() (mahout-0.4) methods, hence default values are not a
problem.

I will try random initialization of the clusters and report the behavior again.


Regards
Lokendra



Re: Difference in KMeans performance with Mahout-0.3 and Mahout-0.4

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Good call Robin,
IIRC the default distance measure did change from Euclidean to 
SquaredEuclidean. Try specifying the DM directly using the -dm option to 
force the same DistanceMeasure.
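
As a rough illustration of why the measure matters for the convergence count,
here is a minimal sketch. It assumes a cluster is counted as converged when
the distance between its old and new centroid, computed with the configured
measure, is at most the threshold. The class and method names are
hypothetical and the code does not use Mahout at all.

```java
/** Hypothetical illustration, not Mahout code: one centroid shift under two distance measures. */
final class ConvergenceCheck {

    /** Sum of squared coordinate differences. */
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Square root of the squared Euclidean distance. */
    static double euclidean(double[] a, double[] b) {
        return Math.sqrt(squaredEuclidean(a, b));
    }

    /** Assumed rule: a cluster has converged when its centroid moved no more than the threshold. */
    static boolean converged(double shift, double threshold) {
        return shift <= threshold;
    }

    public static void main(String[] args) {
        double threshold = 0.001;
        double[] oldCentroid = {0.0, 0.0};
        double[] newCentroid = {0.006, 0.008}; // Euclidean shift about 0.01, squared shift about 0.0001

        System.out.println("Euclidean:        "
                + converged(euclidean(oldCentroid, newCentroid), threshold));
        System.out.println("SquaredEuclidean: "
                + converged(squaredEuclidean(oldCentroid, newCentroid), threshold));
    }
}
```

The same shift fails the test under one measure and passes it under the other,
so the same threshold of 0.001 is only comparable between runs if both runs
use the same DistanceMeasure.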


Re: Difference in KMeans performance with Mahout-0.3 and Mahout-0.4

Posted by Robin Anil <ro...@gmail.com>.

Are the distance measure classes the same in both runs? There could be changes
in default values that are causing this. Run --help to see the default
values for the command-line flags.

Robin


Re: Difference in KMeans performance with Mahout-0.3 and Mahout-0.4

Posted by Ted Dunning <te...@gmail.com>.

4000 clusters is a lot as well.

Did the 0.3 solution have lots of clusters with single members?
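
With 60K points and 4000 clusters the average is only 15 points per cluster,
and a cluster that holds a single point has that point as its centroid, so it
stops moving as long as its membership stays the same. A minimal sketch of how
one might count such clusters, assuming the final assignment of each point is
available as a cluster index (the names below are hypothetical, not Mahout
API):

```java
/** Hypothetical helper, not part of Mahout: counts clusters that own exactly one point. */
final class SingletonClusters {

    /** assignments[i] is the cluster index (0..k-1) that point i was assigned to. */
    static int countSingletons(int[] assignments, int k) {
        int[] sizes = new int[k];
        for (int cluster : assignments) {
            sizes[cluster]++;
        }
        int singletons = 0;
        for (int size : sizes) {
            if (size == 1) {
                singletons++;
            }
        }
        return singletons;
    }

    public static void main(String[] args) {
        int[] assignments = {0, 0, 1, 2, 2, 2, 3}; // clusters 1 and 3 have a single member
        System.out.println("singleton clusters: " + countSingletons(assignments, 4));
    }
}
```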


Re: Difference in KMeans performance with Mahout-0.3 and Mahout-0.4

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

I can't think of any architectural changes which would cause the
convergence performance to change, but this is curious indeed. I see
you are using DenseVectors but you did not say what their cardinality
is. You also did not say how you generated the initial clusters (canopy
or random sample). Can you run the 0.4 k-means with the initial clusters
from your 0.3 run? That would tend to isolate the change to either
k-means itself or the sampling algorithm in RandomSeedGenerator. A
poor set of initial clusters could greatly impact the convergence, so
that is where I'd suggest starting.

Jeff
