Posted to user@mahout.apache.org by nikos <nk...@csd.auth.gr> on 2012/10/05 14:50:12 UTC
Re: Mahout K-means has different behavior based on the number of mapping tasks

Hello,
is there any update on this?
Does the answer I got here
http://stackoverflow.com/questions/12606701/mahout-k-means-has-different-behavior-based-on-the-number-of-mapping-tasks
sound reasonable to you? If it does, it seems that there is a rather
serious implementation error in k-means. What do you think?

Nikos

On 09/27/12 13:17, nikos wrote:
> Thank you for the answers,
> so how could we check whether there is a problem in the reducer? And if
> there indeed is one, could it also explain why some users experience
> slow executions of K-means
> (http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3CEED162F8-71C8-4DA1-9625-77295C827FB5@gmail.com%3E)?
> I should also mention that for a larger k (near 100), again on the same
> dataset with the same parameters and the same initial centroids, k-means
> converges in two iterations when it runs on one mapper, but when I
> split the dataset across two mappers it never converges and runs for all
> the allowed iterations (even if I set -x 100).
>
> On 09/26/12 23:51, Jeff Eastman wrote:
>> Very odd indeed. Each mapper will start with the same set of clusters 
>> and assign points to clusters (clusters observe the points) based 
>> upon the cluster centers (identical) and the chosen distance measure 
>> (also identical). At the end of the map step, each mapper sends its 
>> trained clusters (with observation statistics s0, s1 & s2) to the 
>> reducer(s) keyed by clusterId.
>>
>> In the reducer, the trained clusters are accumulated by taking the 
>> first and observing all the subsequent clusters (with the same 
>> clusterId) with it. This is done by adding the s0, s1 and s2 values 
>> from each observed cluster.
>>
>> Finally, each cluster is closed and a new center & radius is 
>> calculated before it is output to begin the next iteration. If there 
>> is a problem in the implementation, it would be in the reducer where 
>> the accumulations occur.
>>
>> On 9/26/12 3:16 PM, paritosh ranjan wrote:
>>> Each input split (containing vectors in this case) goes to a different
>>> mapper task, the clusters (models) are trained using the vectors
>>> present in each mapper task, and the models are updated in the reducer.
>>> This process is repeated until convergence or the maximum number of
>>> iterations. Since different vectors went to different mapper tasks when
>>> two mapper tasks were used, it took more iterations to converge, and the
>>> results after the first iteration were also different.
>>>
>>> Look into CIMapper and CIReducer classes for more/better explanation.
>>>
>>> On Thu, Sep 27, 2012 at 12:03 AM, paritosh ranjan <paritoshranjan5@gmail.com> wrote:
>>>> And was the same set of centroids used for both executions?
>>>>
>>>>
>>>> On Wed, Sep 26, 2012 at 11:22 PM, nikos <nk...@csd.auth.gr> wrote:
>>>>
>>>>> The centroids have been selected in a previous execution of Mahout
>>>>> K-means via randomSeed generator.
>>>>>
>>>>>
>>>>> On 09/26/2012 08:43 PM, paritosh ranjan wrote:
>>>>>
>>>>>> By saying "Using a pre-selected set of initial centroids", do you
>>>>>> mean that the initial centroids were the same in both executions?
>>>>>> In other words, how are you choosing your initial centroids?
>>>>>>
>>>>>> On Wed, Sep 26, 2012 at 10:40 PM, nikos <nk...@csd.auth.gr> wrote:
>>>>>>
>>>>>>> I experience a strange situation when running Mahout K-means. Using a
>>>>>>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>>>>>>> generated by lucene.vector. The run is for testing purposes, so the
>>>>>>> file is small (around 10MB, ~10000 vectors).
>>>>>>>
>>>>>>> When K-means is executed with a single mapper (the default, considering
>>>>>>> the Hadoop split size, which in my cluster is 128MB), it reaches a given
>>>>>>> clustering result in 2 iterations (Case A). However, I wanted to test
>>>>>>> whether there would be any improvement or deterioration in the
>>>>>>> algorithm's execution speed from firing more mapping tasks (the Hadoop
>>>>>>> cluster has 6 nodes in total). I therefore set the
>>>>>>> -Dmapred.max.split.size parameter to 5242880 bytes, in order to make
>>>>>>> Mahout fire 2 mapping tasks (Case B). I did succeed in starting two
>>>>>>> mappers, but the strange thing was that the job finished after 5
>>>>>>> iterations instead of 2, and that even at the first assignment of points
>>>>>>> to clusters, the mappers made different choices compared to the
>>>>>>> single-map execution. What I mean is that after close inspection of the
>>>>>>> clusterDump output for the first iteration in both cases, I found that
>>>>>>> in Case B some points were not assigned to their closest cluster.
>>>>>>>
>>>>>>>
>>>>>>> Could this behavior be justified by the existing K-means Mahout
>>>>>>> implementation?
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>
>
>
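
The map/reduce accumulation Jeff describes above can be sketched roughly as follows. This is only an illustration, not Mahout's actual code: the function names (observe, merge, centroid, nearest) are hypothetical, s0 and s1 are taken to be the point count and the per-dimension sum, s2 (the sum of squares used for the radius) is left out, and the distance is assumed to be squared Euclidean. Under those assumptions the merged statistics do not depend on how the points are split across mappers, so the new centers should agree up to floating-point rounding.

```python
from math import isclose

def nearest(point, centers):
    """Index of the closest center under squared Euclidean distance (assumed measure)."""
    return min(range(len(centers)),
               key=lambda k: sum((p - c) ** 2 for p, c in zip(point, centers[k])))

def observe(points, centers):
    """Map step: per-cluster partial statistics (s0 = count, s1 = vector sum)."""
    dim = len(centers[0])
    stats = {k: [0, [0.0] * dim] for k in range(len(centers))}
    for p in points:
        k = nearest(p, centers)
        stats[k][0] += 1
        stats[k][1] = [a + b for a, b in zip(stats[k][1], p)]
    return stats

def merge(partials):
    """Reduce step: add s0 and s1 of all partial clusters sharing a cluster id."""
    merged = {}
    for stats in partials:
        for k, (s0, s1) in stats.items():
            if k not in merged:
                merged[k] = [0, [0.0] * len(s1)]
            merged[k][0] += s0
            merged[k][1] = [a + b for a, b in zip(merged[k][1], s1)]
    return merged

def centroid(s0, s1):
    """New center = s1 / s0 (None for an empty cluster)."""
    return [x / s0 for x in s1] if s0 else None

# Made-up toy data, purely for illustration.
centers = [[0.0, 0.0], [10.0, 10.0]]
points = [[1.0, 2.0], [0.5, 0.5], [9.0, 11.0], [12.0, 8.0], [2.0, 1.0], [10.5, 9.5]]

# Case A: one mapper sees everything. Case B: two mappers see half each.
one_mapper = merge([observe(points, centers)])
two_mappers = merge([observe(points[:3], centers), observe(points[3:], centers)])

for k in one_mapper:
    a = centroid(*one_mapper[k])
    b = centroid(*two_mappers[k])
    assert all(isclose(x, y) for x, y in zip(a, b))
```

In this simplified model each point's assignment depends only on the point and the shared initial centers, so a difference in first-iteration assignments between one and two mappers would have to come from something the model does not capture.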
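
The mapper counts in Case A and Case B are consistent with the way Hadoop's FileInputFormat is generally understood to cut a file into splits: the split size is max(minSize, min(maxSize, blockSize)), and the last split may be up to 10% larger. The sketch below is a rough model of that rule, not Hadoop code. The function name count_splits is hypothetical, the file size of exactly 10 MiB is an assumption (the thread only says "around 10MB"), and the 128MB figure is treated as the block size.

```python
SPLIT_SLOP = 1.1  # the last split may be up to 10% larger than split_size

def count_splits(file_size, block_size, max_split_size, min_split_size=1):
    """Approximate number of input splits for one splittable file."""
    split_size = max(min_split_size, min(max_split_size, block_size))
    splits = 0
    remaining = file_size
    # Cut full-size splits while more than 1.1 x split_size remains.
    while remaining / split_size > SPLIT_SLOP:
        splits += 1
        remaining -= split_size
    # Whatever is left becomes the final split.
    if remaining > 0:
        splits += 1
    return splits

MIB = 1024 * 1024
file_size = 10 * MIB      # assumed; the thread says "around 10MB"
block_size = 128 * MIB    # value reported for the cluster

case_a = count_splits(file_size, block_size, max_split_size=2**63 - 1)
case_b = count_splits(file_size, block_size, max_split_size=5242880)
print(case_a, case_b)  # prints: 1 2
```

Under this model a file larger than about 10.5 MiB would already produce a third split with the 5242880-byte limit, so the exact file size matters when trying to reproduce Case B.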