Posted to user@mahout.apache.org by nikos <nk...@csd.auth.gr> on 2012/10/05 14:50:12 UTC
Re: Mahout K-means has different behavior based on the number of
mapping tasks
Hello,
is there any update on this?
Does the answer I got here
http://stackoverflow.com/questions/12606701/mahout-k-means-has-different-behavior-based-on-the-number-of-mapping-tasks
sound reasonable to you? If it does, it seems that there is a rather
serious implementation error in k-means. What do you think?
Nikos
On 09/27/12 13:17, nikos wrote:
> Thank you for the answers,
> so how could we check if there is a problem in the reducer? And if
> there is indeed one, could it also explain why some users
> experience slow executions of K-means (
> http://mail-archives.apache.org/mod_mbox/mahout-user/201209.mbox/%3CEED162F8-71C8-4DA1-9625-77295C827FB5@gmail.com%3E)?
> Also, I should mention that for a bigger k (near 100), on the same
> dataset with the same parameters and the same initial centroids, k-means
> converges in two iterations when it runs on one mapper, but when I
> split the dataset across two mappers it never converges and uses all
> the iterations until it finishes (even if I set -x 100).
>
> On 09/26/12 23:51, Jeff Eastman wrote:
>> Very odd indeed. Each mapper will start with the same set of clusters
>> and assign points to clusters (clusters observe the points) based
>> upon the cluster centers (identical) and the chosen distance measure
>> (also identical). At the end of the map step, each mapper sends its
>> trained clusters (with observation statistics s0, s1 & s2) to the
>> reducer(s) keyed by clusterId.
>>
>> In the reducer, the trained clusters are accumulated by taking the
>> first and observing all the subsequent clusters (with the same
>> clusterId) with it. This is done by adding the s0, s1 and s2 values
>> from each observed cluster.
>>
>> Finally, each cluster is closed and a new center & radius is
>> calculated before it is output to begin the next iteration. If there
>> is a problem in the implementation, it would be in the reducer where
>> the accumulations occur.
>>
>> On 9/26/12 3:16 PM, paritosh ranjan wrote:
>>> Each input split (containing vectors in this case) goes to a different
>>> mapper task. The clusters (models) are trained using the vectors
>>> present in each mapper task, and the models are updated in the reducer.
>>> This process is repeated until convergence or the maximum number of
>>> iterations. Because different vectors went to different mapper tasks
>>> when two mapper tasks were used, it took longer (more iterations) to
>>> converge, and the results after the first iteration were also different.
>>>
>>> Look into CIMapper and CIReducer classes for more/better explanation.
>>>
>>> On Thu, Sep 27, 2012 at 12:03 AM, paritosh ranjan
>>> <paritoshranjan5@gmail.com> wrote:
>>>> And was the same set of centroids used for both executions?
>>>>
>>>>
>>>> On Wed, Sep 26, 2012 at 11:22 PM, nikos <nk...@csd.auth.gr> wrote:
>>>>
>>>>> The centroids have been selected in a previous execution of Mahout
>>>>> K-means via randomSeed generator.
>>>>>
>>>>>
>>>>> On 09/26/2012 08:43 PM, paritosh ranjan wrote:
>>>>>
>>>>>> By saying "Using a pre-selected set of initial centroids", do you
>>>>>> mean that the initial centroids were the same in both executions?
>>>>>> In other words, how are you choosing your initial centroids?
>>>>>>
>>>>>> On Wed, Sep 26, 2012 at 10:40 PM, nikos <nk...@csd.auth.gr>
>>>>>> wrote:
>>>>>>
>>>>>>> I experience a strange situation when running Mahout K-means. Using a
>>>>>>> pre-selected set of initial centroids, I run K-means on a SequenceFile
>>>>>>> generated by lucene.vector. The run is for testing purposes, so the
>>>>>>> file is small (around 10MB, ~10000 vectors).
>>>>>>>
>>>>>>> When K-means is executed with a single mapper (the default, given the
>>>>>>> Hadoop split size, which in my cluster is 128MB), it reaches a given
>>>>>>> clustering result in 2 iterations (Case A). However, I wanted to test
>>>>>>> whether there would be any improvement/deterioration in the
>>>>>>> algorithm's execution speed when firing more mapping tasks (the Hadoop
>>>>>>> cluster has 6 nodes in total). I therefore set the
>>>>>>> -Dmapred.max.split.size parameter to 5242880 bytes, in order to make
>>>>>>> Mahout fire 2 mapping tasks (Case B). I did succeed in starting two
>>>>>>> mappers, but the strange thing was that the job finished after 5
>>>>>>> iterations instead of 2, and that even at the first assignment of
>>>>>>> points to clusters, the mappers made different choices compared to the
>>>>>>> single-map execution. What I mean is that after close inspection of
>>>>>>> the clusterDump for the first iteration in both cases, I found that in
>>>>>>> Case B some points were not assigned to their closest cluster.
>>>>>>>
>>>>>>> Could this behavior be justified by the existing K-means Mahout
>>>>>>> implementation?
>>>>>>>
>>>>>>> Thanks in advance.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>
>
>
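
The accumulation Jeff Eastman describes in the quoted message (mappers observe points into per-cluster s0, s1 and s2 totals, the reducer sums those totals per clusterId and then computes the new center) can be sketched roughly as follows. This is a minimal, self-contained illustration and not Mahout's actual code: the class KMeansStatsSketch, its methods, the toy data and the squared Euclidean distance are all assumptions made for the example. It only shows why, when every mapper starts from identical centers, one iteration would be expected to give the same centers regardless of how the input is split (up to floating-point rounding).

```java
import java.util.Arrays;

// Illustrative sketch only: class and method names are hypothetical, not Mahout's API.
final class KMeansStatsSketch {

    /** Per-cluster running totals: s0 = point count, s1 = sum of points, s2 = sum of squares. */
    static final class Stats {
        double s0;
        final double[] s1;
        final double[] s2; // would feed the radius calculation, which this sketch omits

        Stats(int dim) {
            s1 = new double[dim];
            s2 = new double[dim];
        }

        /** Map side: fold one point into the totals. */
        void observe(double[] p) {
            s0 += 1;
            for (int i = 0; i < p.length; i++) {
                s1[i] += p[i];
                s2[i] += p[i] * p[i];
            }
        }

        /** Reduce side: add another partial cluster's totals to this one. */
        void merge(Stats other) {
            s0 += other.s0;
            for (int i = 0; i < s1.length; i++) {
                s1[i] += other.s1[i];
                s2[i] += other.s2[i];
            }
        }
    }

    /** Index of the center closest to p under squared Euclidean distance (an assumed measure). */
    static int nearest(double[] p, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double d = 0;
            for (int i = 0; i < p.length; i++) {
                double diff = p[i] - centers[c][i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = c;
            }
        }
        return best;
    }

    /** One simulated mapper: assigns points[from, to) using the shared, identical centers. */
    static Stats[] map(double[][] points, int from, int to, double[][] centers) {
        Stats[] partial = new Stats[centers.length];
        for (int c = 0; c < centers.length; c++) {
            partial[c] = new Stats(centers[c].length);
        }
        for (int n = from; n < to; n++) {
            partial[nearest(points[n], centers)].observe(points[n]);
        }
        return partial;
    }

    /** One iteration with two simulated mappers; splitAt == points.length leaves the second mapper empty. */
    static double[][] iterate(double[][] points, double[][] centers, int splitAt) {
        Stats[] total = map(points, 0, splitAt, centers);
        Stats[] second = map(points, splitAt, points.length, centers);
        double[][] next = new double[centers.length][];
        for (int c = 0; c < centers.length; c++) {
            total[c].merge(second[c]);    // reducer: sum s0, s1, s2 per clusterId
            next[c] = centers[c].clone(); // an empty cluster keeps its old center
            for (int i = 0; total[c].s0 > 0 && i < next[c].length; i++) {
                next[c][i] = total[c].s1[i] / total[c].s0;
            }
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {10, 10}, {0, 2}, {10, 12}}; // toy data, not from the thread
        double[][] initial = {{0, 0}, {10, 10}};
        System.out.println(Arrays.deepToString(iterate(points, initial, points.length)));
        System.out.println(Arrays.deepToString(iterate(points, initial, 2)));
    }
}
```

Under these assumptions both calls in main compute the same centers, because summing s0, s1 and s2 is independent of how the points are grouped into splits. If a real run shows different first-iteration assignments between one and two mappers, as reported in this thread, the sketch suggests checking whether each mapper really starts from identical centers and whether the reducer's merge step sums every partial cluster for a given clusterId.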