Posted to dev@mahout.apache.org by "Pallavi Palleti (JIRA)" <ji...@apache.org> on 2008/11/28 13:57:44 UTC

[jira] Created: (MAHOUT-99) r

r
-

                 Key: MAHOUT-99
                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
             Project: Mahout
          Issue Type: Improvement
            Reporter: Pallavi Palleti




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (MAHOUT-99) Improving speed of KMeans

Posted by Philippe Lamarche <ph...@gmail.com>.

Hi,

I just tried the patch and I am having trouble getting it to work. I noticed
that there are 2 new parameters to KMeansDriver.runJob, and I am probably not
setting them right. If I understand correctly, they seem to set the number of
mappers and reducers. How should I set them if I am running Mahout on a
one-node cluster?

This is what I am getting from the syntheticcontrol example:


hadoop@philippe-vaio:/usr/local/hadoop$ bin/hadoop dfs -put /home/philippe/synthetic_control.data testdata
hadoop@philippe-vaio:/usr/local/hadoop$ bin/hadoop jar /home/philippe/workspace/MahoutJava/examples/build/apache-mahout-examples-0.1-dev.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
08/11/28 12:01:04 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:05 INFO mapred.FileInputFormat: Total input paths to process : 1
08/11/28 12:01:05 INFO mapred.JobClient: Running job: job_200811281146_0008
08/11/28 12:01:06 INFO mapred.JobClient:  map 0% reduce 0%
08/11/28 12:01:13 INFO mapred.JobClient:  map 100% reduce 0%
08/11/28 12:01:14 INFO mapred.JobClient: Job complete: job_200811281146_0008
08/11/28 12:01:14 INFO mapred.JobClient: Counters: 7
08/11/28 12:01:14 INFO mapred.JobClient:   File Systems
08/11/28 12:01:14 INFO mapred.JobClient:     HDFS bytes read=291644
08/11/28 12:01:14 INFO mapred.JobClient:     HDFS bytes written=323660
08/11/28 12:01:14 INFO mapred.JobClient:   Job Counters
08/11/28 12:01:14 INFO mapred.JobClient:     Launched map tasks=2
08/11/28 12:01:14 INFO mapred.JobClient:     Data-local map tasks=2
08/11/28 12:01:14 INFO mapred.JobClient:   Map-Reduce Framework
08/11/28 12:01:14 INFO mapred.JobClient:     Map input records=600
08/11/28 12:01:14 INFO mapred.JobClient:     Map input bytes=288374
08/11/28 12:01:14 INFO mapred.JobClient:     Map output records=600
08/11/28 12:01:14 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:14 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:15 INFO mapred.JobClient: Running job: job_200811281146_0009
08/11/28 12:01:16 INFO mapred.JobClient:  map 0% reduce 0%
08/11/28 12:01:21 INFO mapred.JobClient:  map 50% reduce 0%
08/11/28 12:01:23 INFO mapred.JobClient:  map 100% reduce 0%
08/11/28 12:01:27 INFO mapred.JobClient:  map 100% reduce 100%
08/11/28 12:01:28 INFO mapred.JobClient: Job complete: job_200811281146_0009
08/11/28 12:01:28 INFO mapred.JobClient: Counters: 16
08/11/28 12:01:28 INFO mapred.JobClient:   File Systems
08/11/28 12:01:28 INFO mapred.JobClient:     HDFS bytes read=323660
08/11/28 12:01:28 INFO mapred.JobClient:     HDFS bytes written=9657
08/11/28 12:01:28 INFO mapred.JobClient:     Local bytes read=36119
08/11/28 12:01:28 INFO mapred.JobClient:     Local bytes written=72300
08/11/28 12:01:28 INFO mapred.JobClient:   Job Counters
08/11/28 12:01:28 INFO mapred.JobClient:     Launched reduce tasks=1
08/11/28 12:01:28 INFO mapred.JobClient:     Launched map tasks=2
08/11/28 12:01:28 INFO mapred.JobClient:     Data-local map tasks=2
08/11/28 12:01:28 INFO mapred.JobClient:   Map-Reduce Framework
08/11/28 12:01:28 INFO mapred.JobClient:     Reduce input groups=1
08/11/28 12:01:28 INFO mapred.JobClient:     Combine output records=28
08/11/28 12:01:28 INFO mapred.JobClient:     Map input records=600
08/11/28 12:01:28 INFO mapred.JobClient:     Reduce output records=7
08/11/28 12:01:28 INFO mapred.JobClient:     Map output bytes=943020
08/11/28 12:01:28 INFO mapred.JobClient:     Map input bytes=323660
08/11/28 12:01:28 INFO mapred.JobClient:     Combine input records=1732
08/11/28 12:01:28 INFO mapred.JobClient:     Map output records=1732
08/11/28 12:01:28 INFO mapred.JobClient:     Reduce input records=28
08/11/28 12:01:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:28 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:29 INFO mapred.JobClient: Running job: job_200811281146_0010
08/11/28 12:01:30 INFO mapred.JobClient:  map 0% reduce 0%
08/11/28 12:01:35 INFO mapred.JobClient:  map 50% reduce 0%
08/11/28 12:01:37 INFO mapred.JobClient:  map 100% reduce 0%
08/11/28 12:01:41 INFO mapred.JobClient:  map 100% reduce 100%
08/11/28 12:01:42 INFO mapred.JobClient: Job complete: job_200811281146_0010
08/11/28 12:01:42 INFO mapred.JobClient: Counters: 16
08/11/28 12:01:42 INFO mapred.JobClient:   File Systems
08/11/28 12:01:42 INFO mapred.JobClient:     HDFS bytes read=342974
08/11/28 12:01:42 INFO mapred.JobClient:     HDFS bytes written=3002539
08/11/28 12:01:42 INFO mapred.JobClient:     Local bytes read=3018455
08/11/28 12:01:42 INFO mapred.JobClient:     Local bytes written=6036972
08/11/28 12:01:42 INFO mapred.JobClient:   Job Counters
08/11/28 12:01:42 INFO mapred.JobClient:     Launched reduce tasks=1
08/11/28 12:01:42 INFO mapred.JobClient:     Launched map tasks=2
08/11/28 12:01:42 INFO mapred.JobClient:     Data-local map tasks=2
08/11/28 12:01:42 INFO mapred.JobClient:   Map-Reduce Framework
08/11/28 12:01:42 INFO mapred.JobClient:     Reduce input groups=7
08/11/28 12:01:42 INFO mapred.JobClient:     Combine output records=0
08/11/28 12:01:42 INFO mapred.JobClient:     Map input records=600
08/11/28 12:01:42 INFO mapred.JobClient:     Reduce output records=1591
08/11/28 12:01:42 INFO mapred.JobClient:     Map output bytes=3008903
08/11/28 12:01:42 INFO mapred.JobClient:     Map input bytes=323660
08/11/28 12:01:42 INFO mapred.JobClient:     Combine input records=0
08/11/28 12:01:42 INFO mapred.JobClient:     Map output records=1591
08/11/28 12:01:42 INFO mapred.JobClient:     Reduce input records=1591
08/11/28 12:01:42 INFO kmeans.KMeansDriver: Iteration 0
08/11/28 12:01:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
08/11/28 12:01:42 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:42 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:42 INFO mapred.JobClient: Running job: job_local_0001
08/11/28 12:01:42 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:42 INFO mapred.MapTask: numReduceTasks: 1
08/11/28 12:01:42 INFO mapred.MapTask: io.sort.mb = 100
08/11/28 12:01:42 INFO mapred.MapTask: data buffer = 79691776/99614720
08/11/28 12:01:42 INFO mapred.MapTask: record buffer = 262144/327680
08/11/28 12:01:42 WARN mapred.LocalJobRunner: job_local_0001
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
    at java.lang.String.substring(String.java:1938)
    at org.apache.mahout.clustering.kmeans.Cluster.decodeCluster(Cluster.java:81)
    at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:64)
    at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:66)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
    at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
    at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:328)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
08/11/28 12:01:43 WARN kmeans.KMeansDriver: java.io.IOException: Job failed!
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runIteration(KMeansDriver.java:129)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.runJob(KMeansDriver.java:80)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.runJob(Job.java:80)
    at org.apache.mahout.clustering.syntheticcontrol.kmeans.Job.main(Job.java:44)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
    at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
    at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)
08/11/28 12:01:43 INFO kmeans.KMeansDriver: Clustering
08/11/28 12:01:43 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
08/11/28 12:01:43 INFO mapred.FileInputFormat: Total input paths to process : 2
08/11/28 12:01:44 INFO mapred.JobClient: Running job: job_200811281146_0011
08/11/28 12:01:45 INFO mapred.JobClient:  map 0% reduce 0%


Thanks!
Philippe.
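
A minimal sketch of one way a StringIndexOutOfBoundsException with index -1 can come out of String.substring, as in the trace above. It assumes (this is not confirmed anywhere in the thread) that the decoder locates a delimiter with indexOf and passes the result straight to substring. The class name, the method names, and the "<id>:<payload>" record layout are hypothetical and are not Mahout's actual Cluster.decodeCluster.

```java
final class DecodeSketch {

    // Hypothetical record layout, for illustration only: "<id>:<payload>".
    // indexOf returns -1 when ':' is absent, and substring(0, -1) then throws
    // StringIndexOutOfBoundsException.
    static String unsafeId(String record) {
        int sep = record.indexOf(':');
        return record.substring(0, sep);
    }

    // Same parsing, but a record in an unexpected format is reported with a
    // message that names the real problem instead of a bare index error.
    static String safeId(String record) {
        int sep = record.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException(
                "Record is not in the expected <id>:<payload> format: " + record);
        }
        return record.substring(0, sep);
    }

    public static void main(String[] args) {
        System.out.println(safeId("C7:[1.0, 2.0]"));
        try {
            unsafeId("record without the delimiter");
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("unsafeId failed: " + e.getMessage());
        }
    }
}
```

If the real decoder works along these lines, the exception would point at input that is not in the format the decoder expects rather than at the number of map or reduce tasks.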

On Fri, Nov 28, 2008 at 8:05 AM, Pallavi Palleti (JIRA) <ji...@apache.org> wrote:

>
>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Pallavi Palleti updated MAHOUT-99:
> ----------------------------------
>
>     Attachment: MAHOUT-99.patch
>
> This patch takes care of the issues with speed. It also takes care of the
> cases where the combiner runs zero times or more than once.
>
> > Improving speed of KMeans
> > -------------------------
> >
> >                 Key: MAHOUT-99
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
> >             Project: Mahout
> >          Issue Type: Improvement
> >          Components: Clustering
> >            Reporter: Pallavi Palleti
> >         Attachments: MAHOUT-99.patch
> >
> >
> > Improved the speed of KMeans by passing only the cluster ID from mapper to
> > reducer. Previously, the whole cluster info was being sent as a formatted string.
> > Also removed the implicit assumption that the combiner runs exactly once;
> > the code is modified accordingly so that no bug arises when the combiner
> > runs zero times or more than once.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Uppuluri, Rohini" <ro...@corp.aol.com>.

Hi Grant, 

I am Rohini, and I work on the same team as Pallavi. Pallavi is out of the
office until the end of this month, so I will be taking care of this issue
for now.

I will look into the issue you have pointed out and get back to you. 

Thanks, 
-Rohini


-----Original Message-----
From: Grant Ingersoll (JIRA) [mailto:jira@apache.org] 
Sent: Sunday, December 07, 2008 7:32 AM
To: mahout-dev@lucene.apache.org
Subject: [jira] Commented: (MAHOUT-99) Improving speed of KMeans


    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654168#action_12654168 ]

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

Hi Pallavi,

The core code works, but the change to the KMeansDriver causes a compile
error in examples in the Kmeans demo code b/c it now asks for the number
of map tasks and the number of centroids.  Could you document these new
parameters and put in reasonable defaults and update the patch?

One thing I'm not certain of, though, is why we need to pass in the
number of map tasks. Isn't that already a config setting when you set up
Hadoop?

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>         Attachments: MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to
> reducer. Previously, the whole cluster info was being sent as a formatted
> string.
> Also removed the implicit assumption that the combiner runs exactly once;
> the code is modified accordingly so that no bug arises when the combiner
> runs zero times or more than once.


[jira] Work started: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Work on MAHOUT-99 started by Grant Ingersoll.

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>         Attachments: MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.
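
The combiner point in the description can be illustrated with a small standalone sketch. Hadoop may run a combiner zero, one, or several times, so the values passed from mapper to reducer need to be partial aggregates that can be merged any number of times without changing the result. This is not the code from MAHOUT-99.patch (the patch contents are not shown in this thread); the class and method names below are hypothetical, and plain Java lists stand in for the MapReduce framework.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CombinerSafeCentroid {

    /** Partial aggregate for one cluster: number of points and their coordinate-wise sum. */
    static final class Partial {
        final long count;
        final double[] sum;

        Partial(long count, double[] sum) {
            this.count = count;
            this.sum = sum;
        }
    }

    /** What a mapper would emit for a single point assigned to a cluster. */
    static Partial fromPoint(double[] point) {
        return new Partial(1, point.clone());
    }

    /** Merge step; safe to apply zero, one, or many times (combiner and reducer alike). */
    static Partial merge(List<Partial> partials) {
        if (partials.isEmpty()) {
            throw new IllegalArgumentException("no partials to merge");
        }
        double[] sum = new double[partials.get(0).sum.length];
        long count = 0;
        for (Partial p : partials) {
            for (int i = 0; i < sum.length; i++) {
                sum[i] += p.sum[i];
            }
            count += p.count;
        }
        return new Partial(count, sum);
    }

    /** Final reducer step: divide the merged sum by the merged count. */
    static double[] centroid(List<Partial> partials) {
        Partial total = merge(partials);
        double[] c = new double[total.sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = total.sum[i] / total.count;
        }
        return c;
    }

    public static void main(String[] args) {
        Partial p1 = fromPoint(new double[] {1, 2});
        Partial p2 = fromPoint(new double[] {3, 4});
        Partial p3 = fromPoint(new double[] {5, 6});

        // Combiner never runs: the reducer sees the raw mapper output.
        double[] noCombiner = centroid(Arrays.asList(p1, p2, p3));

        // Combiner runs twice over part of the output before the reducer sees it.
        Partial once = merge(Arrays.asList(p1, p2));
        Partial twice = merge(new ArrayList<Partial>(Arrays.asList(once)));
        double[] withCombiner = centroid(Arrays.asList(twice, p3));

        System.out.println(Arrays.toString(noCombiner));
        System.out.println(Arrays.toString(withCombiner));
    }
}
```

Because the count travels with the sum, the division happens only once, in the final step. Averaging inside the combiner instead would give a different answer depending on how many times it ran.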


Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Grant Ingersoll <gs...@apache.org>.

Never mind on this, I read some emails out of context and now realize  
this has been addressed.

On Mar 19, 2009, at 6:57 AM, Grant Ingersoll (JIRA) wrote:

>
>    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683426#action_12683426 ]
>
> Grant Ingersoll commented on MAHOUT-99:
> ---------------------------------------
>
> For the record, I ran Canopy independently, and that worked just fine.
>
>
>
>
>
>> Improving speed of KMeans
>> -------------------------
>>
>>                Key: MAHOUT-99
>>                URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>            Project: Mahout
>>         Issue Type: Improvement
>>         Components: Clustering
>>           Reporter: Pallavi Palleti
>>           Assignee: Grant Ingersoll
>>            Fix For: 0.1
>>
>>        Attachments: MAHOUT-99-1.patch, Mahout-99.patch,  
>> MAHOUT-99.patch
>>
>>
>> Improved the speed of KMeans by passing only the cluster ID from mapper
>> to reducer. Previously, the whole cluster info was being sent as a
>> formatted string.
>> Also removed the implicit assumption that the combiner runs exactly once;
>> the code is modified accordingly so that no bug arises when the combiner
>> runs zero times or more than once.
>
> -- 
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683426#action_12683426 ] 

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

For the record, I ran Canopy independently, and that worked just fine.





> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>             Fix For: 0.1
>
>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.


RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Palleti, Pallavi" <pa...@corp.aol.com>.

There is a test case in TestKMeansClustering.java which actually uses the output of Canopy as input, and it succeeded without any issue. However, it uses the local file system rather than HDFS, so that might be why it succeeded.

Thanks
Pallavi



-----Original Message-----
From: Jeff Eastman [mailto:jdog@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 10:14 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The unit tests don't care which format is used as long as it is consistent. The compiler helps enforce that. kMeans will run and its tests will pass. So will Canopy. When somebody runs the kMeans example it encounters the file format differences. Are all the examples run by the install? I'd be surprised.

Jeff


Palleti, Pallavi wrote:
> Yeah, but I am wondering how the test cases succeeded. I ran them using the "mvn clean install" command.
>
> Thanks
> Pallavi
>
> -----Original Message-----
> From: Jeff Eastman [mailto:jdog@windwardsolutions.com]
> Sent: Thursday, March 19, 2009 9:56 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
>
> The Synthetic Control kMeans job calls the Canopy job to build its initial clusters as is commonly done. If the kMeans record format was changed and the Canopy not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters.
>
> Jeff
>
>
> Richard Tomsett (JIRA) wrote:
>   
>>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>>
>> Richard Tomsett commented on MAHOUT-99:
>> ---------------------------------------
>>
>> Yup, I just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). I think that's the problem at least; I'll have a quick play.
>>
>>   
>>     
>>> Improving speed of KMeans
>>> -------------------------
>>>
>>>                 Key: MAHOUT-99
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Clustering
>>>            Reporter: Pallavi Palleti
>>>            Assignee: Grant Ingersoll
>>>             Fix For: 0.1
>>>
>>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
>>> MAHOUT-99.patch
>>>
>>>
>>> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
>>> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.
>>>     
>>>       
>>   
>>     
>
>

Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

The unit tests don't care which format is used as long as it is 
consistent. The compiler helps enforce that. kMeans will run and its 
tests will pass. So will Canopy. When somebody runs the kMeans example 
it encounters the file format differences. Are all the examples run by 
the install? I'd be surprised.

Jeff


Palleti, Pallavi wrote:
> Yeah, but I am wondering how the test cases succeeded. I ran them using the "mvn clean install" command.
>
> Thanks
> Pallavi
>
> -----Original Message-----
> From: Jeff Eastman [mailto:jdog@windwardsolutions.com] 
> Sent: Thursday, March 19, 2009 9:56 AM
> To: mahout-dev@lucene.apache.org
> Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans
>
> The Synthetic Control kMeans job calls the Canopy job to build its initial clusters as is commonly done. If the kMeans record format was changed and the Canopy not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters.
>
> Jeff
>
>
> Richard Tomsett (JIRA) wrote:
>   
>>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>>
>> Richard Tomsett commented on MAHOUT-99:
>> ---------------------------------------
>>
>> Yup, I just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). I think that's the problem at least; I'll have a quick play.
>>
>>   
>>     
>>> Improving speed of KMeans
>>> -------------------------
>>>
>>>                 Key: MAHOUT-99
>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>>             Project: Mahout
>>>          Issue Type: Improvement
>>>          Components: Clustering
>>>            Reporter: Pallavi Palleti
>>>            Assignee: Grant Ingersoll
>>>             Fix For: 0.1
>>>
>>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
>>> MAHOUT-99.patch
>>>
>>>
>>> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
>>> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.
>>>     
>>>       
>>   
>>     
>
>

RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Palleti, Pallavi" <pa...@corp.aol.com>.

Yeah, but I am wondering how the test cases succeeded. I ran them using the "mvn clean install" command.

Thanks
Pallavi

-----Original Message-----
From: Jeff Eastman [mailto:jdog@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 9:56 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

The Synthetic Control kMeans job calls the Canopy job to build its initial clusters as is commonly done. If the kMeans record format was changed and the Canopy not changed accordingly, then everything would still compile but there would be a mismatch when the kMeans mapper tried to read in the clusters.

Jeff


Richard Tomsett (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ]
>
> Richard Tomsett commented on MAHOUT-99:
> ---------------------------------------
>
> Yup, I just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). I think that's the problem at least; I'll have a quick play.
>
>   
>> Improving speed of KMeans
>> -------------------------
>>
>>                 Key: MAHOUT-99
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>             Project: Mahout
>>          Issue Type: Improvement
>>          Components: Clustering
>>            Reporter: Pallavi Palleti
>>            Assignee: Grant Ingersoll
>>             Fix For: 0.1
>>
>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, 
>> MAHOUT-99.patch
>>
>>
>> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
>> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.
>>     
>
>

Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

The Synthetic Control kMeans job calls the Canopy job to build its 
initial clusters as is commonly done. If the kMeans record format was 
changed and the Canopy not changed accordingly, then everything would 
still compile but there would be a mismatch when the kMeans mapper tried 
to read in the clusters.

Jeff


Richard Tomsett (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ] 
>
> Richard Tomsett commented on MAHOUT-99:
> ---------------------------------------
>
> Yup, I just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). I think that's the problem at least; I'll have a quick play.
>
>   
>> Improving speed of KMeans
>> -------------------------
>>
>>                 Key: MAHOUT-99
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>             Project: Mahout
>>          Issue Type: Improvement
>>          Components: Clustering
>>            Reporter: Pallavi Palleti
>>            Assignee: Grant Ingersoll
>>             Fix For: 0.1
>>
>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>>
>>
>> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
>> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.
>>     
>
>

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Richard Tomsett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683252#action_12683252 ] 

Richard Tomsett commented on MAHOUT-99:
---------------------------------------

Yup, I just downloaded the latest trunk and ran it with Hadoop 0.19.1, and I get the same error on the Synthetic Control example. It seems to be because the new KMeans code uses a KeyValueLineRecordReader object to read the input cluster centres from the canopy clustering output, but the canopy clustering job outputs a SequenceFile (and the old KMeans code read in a SequenceFile for the cluster centres). I think that's the problem at least; I'll have a quick play.
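
One quick way to tell which of the two formats a given file is in: Hadoop SequenceFiles begin with the three bytes 'S', 'E', 'Q' followed by a version byte, whereas a file meant for KeyValueLineRecordReader is plain text lines. The sketch below is a minimal, Hadoop-free illustration of that check. The class and method names are hypothetical, and it assumes the caller has already read the first few bytes of the file (for example from one of the part files in the canopy output directory).

```java
import java.nio.charset.StandardCharsets;

final class SequenceFileSniffer {

    // True when the given leading bytes of a file start with the SequenceFile
    // magic "SEQ". A text file of key/value lines will normally not match.
    static boolean hasSequenceFileMagic(byte[] head) {
        return head != null
            && head.length >= 3
            && head[0] == 'S'
            && head[1] == 'E'
            && head[2] == 'Q';
    }

    public static void main(String[] args) {
        // Leading bytes as they would appear in a SequenceFile (magic + version byte).
        byte[] sequenceHead = {'S', 'E', 'Q', 6};
        // Leading bytes of a hypothetical tab-separated text record.
        byte[] textHead = "C1\t[1.0, 2.0]".getBytes(StandardCharsets.UTF_8);

        System.out.println(hasSequenceFileMagic(sequenceHead));
        System.out.println(hasSequenceFileMagic(textHead));
    }
}
```

A check like this only identifies the container format; it does not say whether the records inside are in the layout the KMeans mapper expects.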

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>             Fix For: 0.1
>
>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.


[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683140#action_12683140 ] 

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

I seem to recall hitting something similar before, let me poke around...

Seems somewhat similar to the problems we were having on http://www.lucidimagination.com/search/document/31bd6ab8d94bb3e5/problems_with_kmeans_clustering#31bd6ab8d94bb3e5, but I'm not sure

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>             Fix For: 0.1
>
>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.


[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Pallavi Palleti (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683335#action_12683335 ] 

Pallavi Palleti commented on MAHOUT-99:
---------------------------------------

If we need to modify Canopy, we also need to modify all dependent classes wherever Canopy is being used.

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>             Fix For: 0.1
>
>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that no bug arises when the combiner runs zero times or more than once.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Reopened: (MAHOUT-99) Improving speed of KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Did you reopen this issue because of this error? I just ran the example 
and it ran without error.
Jeff

Grant Ingersoll (JIRA) wrote:
>      [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> Grant Ingersoll reopened MAHOUT-99:
> -----------------------------------
>
>
> Hi Pallavi,
>
> I'm getting: 
> 09/03/18 11:13:56 WARN mapred.LocalJobRunner: job_local_0001
> java.lang.StringIndexOutOfBoundsException: String index out of range: -1
>         at java.lang.String.substring(String.java:1938)
>         at org.apache.mahout.clustering.kmeans.Cluster.decodeCluster(Cluster.java:81)
>         at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:80)
>         at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:66)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>         at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
>         at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
>         at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)
>
> when running http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html

[jira] Reopened: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reopened MAHOUT-99:
-----------------------------------


Hi Pallavi,

I'm getting: 
09/03/18 11:13:56 WARN mapred.LocalJobRunner: job_local_0001
java.lang.StringIndexOutOfBoundsException: String index out of range: -1
        at java.lang.String.substring(String.java:1938)
        at org.apache.mahout.clustering.kmeans.Cluster.decodeCluster(Cluster.java:81)
        at org.apache.mahout.clustering.kmeans.KMeansUtil.configureWithClusterInfo(KMeansUtil.java:80)
        at org.apache.mahout.clustering.kmeans.KMeansMapper.configure(KMeansMapper.java:66)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:34)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:58)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:83)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:338)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:138)

when running http://cwiki.apache.org/MAHOUT/syntheticcontroldata.html
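
For readers following the trace: an index of -1 reaching String.substring is typical of passing an unchecked indexOf result, since indexOf returns -1 when the delimiter is absent. Whether that is what happens in Cluster.decodeCluster is not established in this message. The sketch below is only an illustration of the failure mode and a guard; the class name and the "id:payload" format are invented and are not Mahout's cluster encoding.

```java
// Hypothetical illustration; the "id:payload" format is invented for this sketch.
final class DecodeSketch {
    // Guarded variant: reports a descriptive error instead of letting
    // substring receive a -1 index.
    static String idOf(String encoded) {
        int sep = encoded.indexOf(':');
        if (sep < 0) {
            throw new IllegalArgumentException("no ':' separator in: " + encoded);
        }
        return encoded.substring(0, sep);
    }

    // Unguarded variant: indexOf returns -1 when the separator is missing,
    // so substring(0, -1) throws StringIndexOutOfBoundsException.
    static String idOfUnguarded(String encoded) {
        return encoded.substring(0, encoded.indexOf(':'));
    }
}
```

With the guard, input in an unexpected format fails with a message naming the offending string, which is usually easier to diagnose than a bare index error.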

[jira] Updated: (MAHOUT-99) Improving speed of KMeans

Posted by "Pallavi Palleti (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-99:
----------------------------------

    Component/s: Clustering
    Description: 
Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string.
Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.
        Summary: Improving speed of KMeans  (was: r)
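
The combiner point can be illustrated without Hadoop. What follows is a minimal sketch, assuming the value sent for each cluster ID is a partial sum plus a point count; the class and method names are hypothetical and this is not the code in the patch. Because merging partials is associative and commutative, the final centroid is the same whether the merge step runs zero times, once, or several times before the reduce.

```java
import java.util.List;

// Hypothetical illustration only; not taken from the MAHOUT-99 patch.
final class PartialCentroid {
    final double[] sum; // component-wise sum of the points seen so far
    final long count;   // number of points folded into sum

    PartialCentroid(double[] sum, long count) {
        this.sum = sum.clone();
        this.count = count;
    }

    // Wrap a single point as a partial with count 1 (what a mapper might emit).
    static PartialCentroid ofPoint(double[] point) {
        return new PartialCentroid(point, 1);
    }

    // Combine step: adds sums and counts. Input and output have the same
    // type, so applying it any number of times does not change the result.
    static PartialCentroid merge(List<PartialCentroid> parts) {
        double[] total = new double[parts.get(0).sum.length];
        long n = 0;
        for (PartialCentroid p : parts) {
            for (int i = 0; i < total.length; i++) {
                total[i] += p.sum[i];
            }
            n += p.count;
        }
        return new PartialCentroid(total, n);
    }

    // Final reduce step: divide the accumulated sum by the accumulated count.
    double[] centroid() {
        double[] c = new double[sum.length];
        for (int i = 0; i < c.length; i++) {
            c[i] = sum[i] / count;
        }
        return c;
    }
}
```

A combiner that emitted an average instead of a (sum, count) pair would only be correct if it ran exactly once, which is the kind of assumption the description says was removed.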

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683442#action_12683442 ] 

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

Trying it now

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>             Fix For: 0.1
>
>         Attachments: MAHOUT-99-1.patch, MAHOUT-99.patch, Mahout-99.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683451#action_12683451 ] 

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

OK, this works.  I will apply

[jira] Updated: (MAHOUT-99) Improving speed of KMeans

Posted by "Pallavi Palleti (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-99:
----------------------------------

    Attachment: Mahout-99.patch

Patch is modified to be compatible with latest trunk.

Thanks
Pallavi

[jira] Updated: (MAHOUT-99) Improving speed of KMeans

Posted by "Rohini Uppuluri (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Rohini Uppuluri updated MAHOUT-99:
----------------------------------

    Attachment: MAHOUT-99-1.patch

Hi Grant,

I have set them as optional arguments, with reasonable defaults in case they are not given as input. I will upload an updated patch reflecting the change.

The values are already configured in Hadoop, but exposing them gives us the flexibility to change them in case we want to increase the number of map tasks.




Thanks,
-Rohini
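
As an illustration of the optional-argument approach described above, here is a minimal sketch. The class name, argument positions, and default values are all hypothetical and are not taken from the patch. In Hadoop's old mapred API such values would typically end up in JobConf.setNumMapTasks and JobConf.setNumReduceTasks, and the map task count is only a hint to the framework.

```java
// Hypothetical sketch of optional positional arguments with defaults.
final class TaskCountArgs {
    static final int DEFAULT_MAP_TASKS = 10;   // assumed default, for illustration
    static final int DEFAULT_REDUCE_TASKS = 2; // assumed default, for illustration

    // Returns args[index] parsed as an int, or defaultValue when the
    // argument was not supplied on the command line.
    static int intArg(String[] args, int index, int defaultValue) {
        return index < args.length ? Integer.parseInt(args[index]) : defaultValue;
    }
}
```

Callers that pass only the required arguments get the defaults, so existing invocations keep working.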


> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>         Attachments: MAHOUT-99-1.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.

Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Grant Ingersoll <gs...@apache.org>.

On my Mac, I have:
$ echo $JAVA_HOME
/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home

-Grant

On Mar 18, 2009, at 2:10 PM, Jeff Eastman wrote:

> I'm running the example in Eclipse using the stand-alone mode in the  
> hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in  
> Eclipse. I cannot, however, get any hadoop stuff to work from the  
> command line. Even though my JAVA_HOME environment is set to  
> /Library/Java/Home and java -version yields:
>
> Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
> Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)
>
> ... the hadoop build script and the start-all.sh commands all  
> complain about class version errors. Can any other Mac users help me  
> out?
>
> Jeff
>
>
> Grant Ingersoll (JIRA) wrote:
>>    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683077#action_12683077 ]
>> Grant Ingersoll commented on MAHOUT-99:
>> ---------------------------------------
>>
>> Yeah, what version of Hadoop are you running?  I got it w/ 0.19.1,  
>> but maybe I didn't set something up right.
>>
>> {code}
>> bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
>> {code}

Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

I'm running the example in Eclipse using the stand-alone mode in the 
hadoop-0.19.1 jar file. It works fine, as does the hadoop compile in 
Eclipse. I cannot, however, get any hadoop stuff to work from the 
command line. Even though my JAVA_HOME environment is set to 
/Library/Java/Home and java -version yields:

Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

... the hadoop build script and the start-all.sh commands all complain 
about class version errors. Can any other Mac users help me out?

Jeff
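
Class version errors generally mean that class files were compiled for a newer class-file version than the running JVM accepts (Java 6 produces major version 50, while a Java 5 runtime accepts up to 49). One way to check which JVM a given launcher really starts is to run a tiny program through it and compare the output with JAVA_HOME. This is a minimal sketch; the class name is hypothetical, and whether a JVM mismatch is the cause here is an assumption.

```java
// Reports which JVM is actually executing, for comparison with JAVA_HOME.
final class JvmInfo {
    static String describe() {
        return "java.home=" + System.getProperty("java.home")
            + "\njava.version=" + System.getProperty("java.version")
            + "\njava.class.version=" + System.getProperty("java.class.version");
    }

    public static void main(String[] args) {
        System.out.println(describe());
        // The environment variable may point somewhere else entirely.
        System.out.println("JAVA_HOME=" + System.getenv("JAVA_HOME"));
    }
}
```

If java.home differs from JAVA_HOME, the scripts and the interactive shell are likely picking up different Java installations.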


Grant Ingersoll (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683077#action_12683077 ] 
>
> Grant Ingersoll commented on MAHOUT-99:
> ---------------------------------------
>
> Yeah, what version of Hadoop are you running?  I got it w/ 0.19.1, but maybe I didn't set something up right.
>
> {code}
>  bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
> {code}

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683077#action_12683077 ] 

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

Yeah, what version of Hadoop are you running?  I got it w/ 0.19.1, but maybe I didn't set something up right.

{code}
 bin/hadoop jar ~/projects/lucene/mahout/mahout-clean/examples/target/mahout-examples-0.2-SNAPSHOT.job org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
{code}

[jira] Assigned: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-99:
-------------------------------------

    Assignee: Grant Ingersoll

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>         Attachments: MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.

[jira] Updated: (MAHOUT-99) Improving speed of KMeans

Posted by "Pallavi Palleti (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-99:
----------------------------------

    Attachment: MAHOUT-99.patch

I have fixed the SequenceFile issue and modified the code to use SequenceFile wherever possible. Also, with the new KMeansClusterMapper, we no longer need the outputMapper code in Job.java in SyntheticControl, so I commented it out.

Thanks
Pallavi

[jira] Resolved: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-99.
-----------------------------------

    Resolution: Fixed

[jira] Resolved: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll resolved MAHOUT-99.
-----------------------------------

       Resolution: Fixed
    Fix Version/s: 0.1

Committed revision 755548.

Thanks!

RE: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Palleti, Pallavi" <pa...@corp.aol.com>.

It depends on the kind of output. If we are outputting only numeric values, SequenceFile is preferable because the data is written as binary. Otherwise, it is preferable to write plain text. A text file is human-readable, whereas a binary file is not.

Since we treat the data as text in the reducers of both Canopy and KMeans, I don't see any performance improvement from using SequenceFile. So I used TextInputFormat, which is easier to read.
 
Thanks
Pallavi
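
To make the binary-versus-text trade-off above concrete, here is a minimal sketch that compares the raw encoded size of a vector of doubles written in binary (8 bytes per value) and as comma-separated text. It uses plain java.io rather than Hadoop's SequenceFile, so it ignores SequenceFile's own header and record overhead; the class name is hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical comparison of raw encodings; not a SequenceFile benchmark.
final class EncodingSizeSketch {
    // Binary form: each double takes exactly 8 bytes.
    static int binarySize(double[] values) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            for (double v : values) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.size();
    }

    // Text form: size depends on how many digits each value prints with.
    static int textSize(double[] values) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(values[i]);
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8).length;
    }
}
```

Short values such as 1.0 are smaller as text, while values with many significant digits are smaller in binary, and binary also avoids parsing on read.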

-----Original Message-----
From: Jeff Eastman [mailto:jdog@windwardsolutions.com] 
Sent: Thursday, March 19, 2009 10:19 AM
To: mahout-dev@lucene.apache.org
Subject: Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Also why not consider just converting canopy? Which reader is better?


Jeff Eastman wrote:
> * PGP Signed: 03/18/09 at 21:37:36
>
> Sure, why don't you go ahead and post a patch?
>
>
> Pallavi Palleti (JIRA) wrote:
>>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ]
>> Pallavi Palleti commented on MAHOUT-99:
>> ---------------------------------------
>>
>> I have used KeyValueLineRecordReader internally for my code and 
>> forgot to revert back to SequenceFileReader. Will that be sufficient 
>> to add another patch on the latest code and modify only KMeansDriver 
>> to use SequenceFileReader? Kindly let me know.
>>
>> Thanks
>> Pallavi
>
> * Jeff Eastman <jd...@windwardsolutions.com>
> * 0x6BFF1277
>
> .
>

Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Also why not consider just converting canopy? Which reader is better?


Jeff Eastman wrote:
> * PGP Signed: 03/18/09 at 21:37:36
>
> Sure, why don't you go ahead and post a patch?
>
>
> Pallavi Palleti (JIRA) wrote:
>>     [ 
>> https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 
>> ]
>> Pallavi Palleti commented on MAHOUT-99:
>> ---------------------------------------
>>
>> I have used KeyValueLineRecordReader internally for my code and 
>> forgot to revert back to SequenceFileReader. Will that be sufficient 
>> to add another patch on the latest code and modify only KMeansDriver 
>> to use SequenceFileReader? Kindly let me know.
>>
>> Thanks
>> Pallavi
>
> * Jeff Eastman <jd...@windwardsolutions.com>
> * 0x6BFF1277
>
> .
>

Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Sure, why don't you go ahead and post a patch?


Pallavi Palleti (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ] 
>
> Pallavi Palleti commented on MAHOUT-99:
> ---------------------------------------
>
> I have used KeyValueLineRecordReader internally for my code and forgot to revert back to SequenceFileReader. Will that be sufficient to add another patch on the latest code and modify only KMeansDriver to use SequenceFileReader? Kindly let me know.
>
> Thanks
> Pallavi

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Pallavi Palleti (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683312#action_12683312 ] 

Pallavi Palleti commented on MAHOUT-99:
---------------------------------------

I used KeyValueLineRecordReader internally for my code and forgot to revert to SequenceFileReader. Will it be sufficient to add another patch on the latest code that modifies only KMeansDriver to use SequenceFileReader? Kindly let me know.

Thanks
Pallavi
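
For context on why the reader choice matters: Hadoop's KeyValueLineRecordReader treats each input line as text and splits it at the first separator (a tab by default) into a key and a value, whereas a SequenceFile reader expects binary key/value records. Feeding one format to code that expects the other gives the decoder strings it cannot parse. The sketch below only mimics the line-splitting behaviour in plain Java; it is not Hadoop's implementation and the class name is hypothetical.

```java
// Hypothetical imitation of tab-separated key/value line splitting.
final class KeyValueLineSketch {
    // Splits at the first tab. A line without a tab becomes the key,
    // with an empty value.
    static String[] split(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            return new String[] {line, ""};
        }
        return new String[] {line.substring(0, tab), line.substring(tab + 1)};
    }
}
```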

[jira] Updated: (MAHOUT-99) Improving speed of KMeans

Posted by "Pallavi Palleti (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Pallavi Palleti updated MAHOUT-99:
----------------------------------

    Attachment: MAHOUT-99.patch

This patch takes care of the speed issues. The issues that arise when the combiner runs zero times or more than once have also been taken care of.

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>         Attachments: MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified so that it does not introduce a bug when the combiner runs zero times or more than once.

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682844#action_12682844 ] 

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

I'd like to put this in 0.1. Is it ready to go w/ the current trunk?

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>         Attachments: MAHOUT-99-1.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that it does not introduce a bug when the combiner runs zero times or more than once.


Re: [jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by Jeff Eastman <jd...@windwardsolutions.com>.

Are the examples run automatically in the build?

Pallavi Palleti (JIRA) wrote:
>     [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683297#action_12683297 ] 
>
> Pallavi Palleti commented on MAHOUT-99:
> ---------------------------------------
>
> Yup. That must be the issue. But I am wondering how the test case succeeded?
>
>   
>> Improving speed of KMeans
>> -------------------------
>>
>>                 Key: MAHOUT-99
>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>>             Project: Mahout
>>          Issue Type: Improvement
>>          Components: Clustering
>>            Reporter: Pallavi Palleti
>>            Assignee: Grant Ingersoll
>>             Fix For: 0.1
>>
>>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>>
>>
>> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
>> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that it does not introduce a bug when the combiner runs zero times or more than once.
>>     
>
>

[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Pallavi Palleti (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683297#action_12683297 ] 

Pallavi Palleti commented on MAHOUT-99:
---------------------------------------

Yup. That must be the issue. But I am wondering how the test case succeeded?

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>             Fix For: 0.1
>
>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that it does not introduce a bug when the combiner runs zero times or more than once.


[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Richard Tomsett (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683376#action_12683376 ] 

Richard Tomsett commented on MAHOUT-99:
---------------------------------------

I tried just reverting to SequenceFiles in the test and KMeansUtil classes but couldn't get the test to complete correctly. I must admit I didn't work on it for long, as it was late... I'm not quite sure what the problem was, but I think I simply hadn't found all the relevant changes that needed to be made. I guess if you've found it in the Canopy class (I take it that was also modified in this patch NOT to output SequenceFiles), that would explain it :)

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>             Fix For: 0.1
>
>         Attachments: MAHOUT-99-1.patch, Mahout-99.patch, MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that it does not introduce a bug when the combiner runs zero times or more than once.


[jira] Commented: (MAHOUT-99) Improving speed of KMeans

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-99?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12654168#action_12654168 ] 

Grant Ingersoll commented on MAHOUT-99:
---------------------------------------

Hi Pallavi,

The core code works, but the change to KMeansDriver causes a compile error in the KMeans demo code in examples, because it now asks for the number of map tasks and the number of centroids. Could you document these new parameters, put in reasonable defaults, and update the patch?

One thing I'm not certain of, though, is why we need to pass in the number of map tasks. Isn't that already a config setting when you set up Hadoop?

> Improving speed of KMeans
> -------------------------
>
>                 Key: MAHOUT-99
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-99
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Clustering
>            Reporter: Pallavi Palleti
>            Assignee: Grant Ingersoll
>         Attachments: MAHOUT-99.patch
>
>
> Improved the speed of KMeans by passing only the cluster ID from mapper to reducer. Previously, the whole cluster info was being sent as a formatted string.
> Also removed the implicit assumption that the combiner runs exactly once; the code is modified accordingly so that it does not introduce a bug when the combiner runs zero times or more than once.
