Posted to dev@mahout.apache.org by "Grant Ingersoll (JIRA)" <ji...@apache.org> on 2013/06/08 13:19:19 UTC

[jira] [Work started] (MAHOUT-1084) Kmeans for synthetic control example--there are 12 cluster during iterations.

     [ https://issues.apache.org/jira/browse/MAHOUT-1084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAHOUT-1084 started by Grant Ingersoll.

> Kmeans for synthetic control example--there are 12 cluster during iterations.
> -----------------------------------------------------------------------------
>
>                 Key: MAHOUT-1084
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1084
>             Project: Mahout
>          Issue Type: Bug
>            Reporter: liutengfei
>            Assignee: Grant Ingersoll
>             Fix For: 0.8
>
>
>        In the Mahout KMeans synthetic control example, the default parameters should produce 6 clusters. Why, then, are there 12 clusters during the KMeans iterations? From what I observed, before the first iteration the first 6 clusters and the last 6 clusters are identical; these 6 clusters are generated by RandomSeedGenerator.java. CIMapper then assigns its points to these 12 clusters. Is there a logic error here?
>        The 12 clusters are created by the "setup" function in CIMapper.java, specifically by the line "classifier.readFromSeqFiles(conf, new Path(priorClustersPath));". Here "priorClustersPath" is the HDFS directory "output/clusters-0/", which contains 8 files: "_policy", "part-randomSeed" (one file holding all six clusters), and "part-00000" to "part-00005" (six files, each holding one cluster). When this directory is read, "_policy" is filtered out, so the program reads "part-00000" to "part-00005" to create six clusters and then reads "part-randomSeed" to create another six. This is why there are 12 clusters before the first iteration.
>       Solution: remove the code that creates the duplicate clusters in "output/clusters-0/". Here I commented out the code in ClusterClassifier.java that creates the files "part-00000" to "part-00005":
>   public void writeToSeqFiles(Path path) throws IOException {
>     writePolicy(policy, path);
>     /*
>     Configuration config = new Configuration();
>     FileSystem fs = FileSystem.get(path.toUri(), config);
>     SequenceFile.Writer writer = null;
>     ClusterWritable cw = new ClusterWritable();
>     for (int i = 0; i < models.size(); i++) {
>       try {
>         Cluster cluster = models.get(i);
>         cw.setValue(cluster);
>         writer = new SequenceFile.Writer(fs, config,
>             new Path(path, "part-" + String.format(Locale.ENGLISH, "%05d", i)), IntWritable.class,
>             ClusterWritable.class);
>         Writable key = new IntWritable(i);
>         writer.append(key, cw);
>       } finally {
>         Closeables.closeQuietly(writer);
>       }
>     }
>     */
>   }
>     I don't know whether this is still okay for other programs that use this file, but for KMeans in the synthetic control example, the program now creates 6 clusters in every iteration, as I expected.
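
The double counting described in the report can be illustrated without Hadoop. The following is only a minimal sketch, not Mahout code: the class PriorClusterReadDemo and its methods clusters0 and readPriorClusters are hypothetical names, and the directory is modelled as an in-memory map from file name to the cluster ids stored in that file. It assumes, as the report states, that "_policy" is skipped and that every other file in "output/clusters-0/" contributes its clusters.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class PriorClusterReadDemo {

  // Hypothetical stand-in for output/clusters-0/: file name -> cluster ids in that file.
  // writePerClusterParts mimics whether part-00000..part-00005 were written.
  public static Map<String, List<Integer>> clusters0(boolean writePerClusterParts) {
    Map<String, List<Integer>> dir = new LinkedHashMap<>();
    dir.put("_policy", new ArrayList<>());
    if (writePerClusterParts) {
      for (int i = 0; i < 6; i++) {
        List<Integer> single = new ArrayList<>();
        single.add(i);
        // Same file-name pattern as in the reported writeToSeqFiles code.
        dir.put("part-" + String.format(Locale.ENGLISH, "%05d", i), single);
      }
    }
    List<Integer> seeds = new ArrayList<>();
    for (int i = 0; i < 6; i++) {
      seeds.add(i);
    }
    dir.put("part-randomSeed", seeds);
    return dir;
  }

  // Simplified model of the read step: skip "_policy", load clusters from every other file.
  public static List<Integer> readPriorClusters(Map<String, List<Integer>> dir) {
    List<Integer> models = new ArrayList<>();
    for (Map.Entry<String, List<Integer>> file : dir.entrySet()) {
      if ("_policy".equals(file.getKey())) {
        continue;
      }
      models.addAll(file.getValue());
    }
    return models;
  }

  public static void main(String[] args) {
    // With the per-cluster part files present, each seed cluster is loaded twice.
    System.out.println(readPriorClusters(clusters0(true)).size());  // prints 12
    // Without them, only the six clusters from part-randomSeed are loaded.
    System.out.println(readPriorClusters(clusters0(false)).size()); // prints 6
  }
}
```

In this model, dropping the per-cluster part files has the same effect as the commented-out loop above: only "part-randomSeed" remains to be read, so six clusters are loaded.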
