Posted to user@mahout.apache.org by Reinis Vicups <ma...@orbit-x.de> on 2014/05/08 16:45:58 UTC

ClusterOutputPostProcessor: what is the purpose of clusterMappings

Hi,

In Mahout 0.8 I see that ClusterOutputPostProcessorMapper and
ClusterOutputPostProcessorReducer both use Map<Integer, Integer> *ClusterMappings =
ClusterCountReader.getClusterIDs(clusterOutputPath, conf, <true|false>).

This map allows mapping cluster IDs to an index from 0 to k-1, where k is the
number of clusters.
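
To make clear what kind of mapping I mean, here is a minimal sketch in plain Java. It is not the Mahout code: ClusterIndexSketch, toIndex and invert are hypothetical names, and assigning indices in encounter order is an assumption. The IDs are the sample values I quote further down in this message.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ClusterIndexSketch {
    // Assigns each distinct cluster ID a dense index 0..k-1 in encounter order
    // (the ordering is an assumption made for this illustration).
    public static Map<Integer, Integer> toIndex(int[] clusterIds) {
        Map<Integer, Integer> idToIndex = new LinkedHashMap<>();
        for (int id : clusterIds) {
            idToIndex.putIfAbsent(id, idToIndex.size());
        }
        return idToIndex;
    }

    // Inverts the mapping so an index can be turned back into a cluster ID.
    public static Map<Integer, Integer> invert(Map<Integer, Integer> idToIndex) {
        Map<Integer, Integer> indexToId = new LinkedHashMap<>();
        for (Map.Entry<Integer, Integer> e : idToIndex.entrySet()) {
            indexToId.put(e.getValue(), e.getKey());
        }
        return indexToId;
    }

    public static void main(String[] args) {
        int[] ids = {345, 37636, 14, 47699, 234576};
        Map<Integer, Integer> idToIndex = toIndex(ids);
        System.out.println("id -> index: " + idToIndex);
        System.out.println("index -> id: " + invert(idToIndex));
    }
}
```

The forward map plays the role of what the Mapper would use, and the inverted map the role of what the Reducer would use to recover the original cluster ID.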

What is the purpose of this mapping?

The cluster IDs themselves are ints, so mapping them to an index (and mapping
back from the index in the Reducer) seems useless to me.

Since clusterpp sets the number of reducers equal to k, I initially thought
this design was meant to ensure that each cluster goes to a separate
reducer, but that should be true even without the mapping.

With the mapping, the reducer gets keys like this: 0, 1, 2, 3,
4, 5, 6, ...
Without the mapping, the reducer gets keys like this: 345, 37636, 14, 47699,
234576, ...

But the clustered points will still be shuffled by cluster ID when
passed to the reducer.
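
One way to check the "even without the mapping" assumption is to look at how both kinds of keys would be spread over k reducers. The sketch below is plain Java, not Mahout or Hadoop code. It assumes the job uses hash partitioning with the same arithmetic as Hadoop's default HashPartitioner, (hashCode & Integer.MAX_VALUE) % numReduceTasks, and that the key hashes to its int value as IntWritable does. I have not confirmed either assumption for clusterpp. PartitionSketch and partition are hypothetical names.

```java
public class PartitionSketch {
    // Same arithmetic as Hadoop's default HashPartitioner, under the assumption
    // that the key's hashCode() equals its int value.
    public static int partition(int key, int numReducers) {
        return (key & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int k = 5; // number of clusters = number of reducers
        int[] rawIds = {345, 37636, 14, 47699, 234576};
        for (int i = 0; i < k; i++) {
            System.out.println("index " + i + " -> reducer " + partition(i, k)
                + " | raw id " + rawIds[i] + " -> reducer " + partition(rawIds[i], k));
        }
    }
}
```

Under these assumptions the indices 0..4 each land on their own reducer, while the five raw IDs land on reducers 0, 1, 4, 4 and 1, so two reducers would each receive two clusters and two would receive none.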

So what gives?

Thank you, guys, for your hints
reinis.