Posted to dev@mahout.apache.org by "Andrey Davydov (JIRA)" <ji...@apache.org> on 2012/12/17 16:00:29 UTC

[jira] [Created] (MAHOUT-1128) MAHOUT-999 issue still actual

Andrey Davydov created MAHOUT-1128:
--------------------------------------

             Summary:  MAHOUT-999 issue still actual
                 Key: MAHOUT-1128
                 URL: https://issues.apache.org/jira/browse/MAHOUT-1128
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.7
         Environment: I work on a Hadoop 1.0.3 cluster deployed on Amazon EC2 virtual machines running Ubuntu 11, with mahout-core.jar 0.7 from Maven Central.
I run my application from a separate "client" machine, which submits tasks to the cluster.


            Reporter: Andrey Davydov


Apologies, my English is not great and I'm new to Mahout, but it seems the MAHOUT-999 issue is still present.

I use mahout-core 0.7 downloaded from Maven Central and I hit the same failure.

I've investigated the sources and found the following in the org.apache.mahout.clustering.classify.ClusterClassifier class:

  public void writeToSeqFiles(Path path) throws IOException {
    writePolicy(policy, path);
    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(path.toUri(), config);
    SequenceFile.Writer writer = null;
    ClusterWritable cw = new ClusterWritable();
    for (int i = 0; i < models.size(); i++) {
...
      } finally {
        Closeables.closeQuietly(writer);
      }
    }
  }
  
  public void readFromSeqFiles(Configuration conf, Path path) throws IOException {
    Configuration config = new Configuration();
    List<Cluster> clusters = Lists.newArrayList();
    for (ClusterWritable cw : new SequenceFileDirValueIterable<ClusterWritable>(path, PathType.LIST,
        PathFilters.logsCRCFilter(), config)) {
...
    }
    this.models = clusters;
    modelClass = models.get(0).getClass().getName();
    this.policy = readPolicy(path);
  }

Both methods create a new default Configuration, so they end up working with the local file system: KMeansDriver writes the initial clusters to the local file system of the "client" machine, and CIMapper then tries to read them from the local file system of a cluster node.

It seems that the current implementation can only work on a pseudo-distributed Hadoop setup, where client and cluster share one file system. I think ClusterClassifier should store intermediate results in HDFS, using the Configuration passed in by the user through the API.
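To illustrate, here is a minimal sketch of the proposed change (hypothetical, not an actual Mahout patch; method bodies are elided as in the excerpt above). The idea is that both methods reuse the caller-supplied Configuration, so that FileSystem.get resolves to HDFS when fs.defaultFS points at the cluster instead of silently falling back to the local file system:

```java
  // Hypothetical sketch: take a Configuration from the caller instead of
  // constructing "new Configuration()" inside the method.
  public void writeToSeqFiles(Configuration conf, Path path) throws IOException {
    writePolicy(policy, path);
    // FileSystem.get(uri, conf) resolves the scheme from the path URI and,
    // failing that, from fs.defaultFS in conf - so with the cluster's conf
    // this yields HDFS rather than the client's local file system.
    FileSystem fs = FileSystem.get(path.toUri(), conf);
    ...
  }

  public void readFromSeqFiles(Configuration conf, Path path) throws IOException {
    List<Cluster> clusters = Lists.newArrayList();
    // Pass the supplied conf (not a fresh default Configuration) to the iterator.
    for (ClusterWritable cw : new SequenceFileDirValueIterable<ClusterWritable>(
        path, PathType.LIST, PathFilters.logsCRCFilter(), conf)) {
      ...
    }
    this.models = clusters;
    ...
  }
```

With this change, callers such as KMeansDriver could forward the job Configuration they already hold, and mappers would read the clusters from the same HDFS paths.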

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira