Posted to user@mahout.apache.org by Pradhuman Jhala <Pr...@fox.com> on 2008/12/05 02:11:23 UTC

mahout & hadoop compatibility

 
Just wondering whether Mahout is compatible with hadoop-0.18 and later versions. From Hadoop version 0.18 onwards, the combiner execution policy has changed and the combiner now gets executed twice: first on the Mapper side (on the output of the Mapper) and then again on the Reducer side (on the output of the first Combiner).
 
For more details: http://issues.apache.org/jira/browse/HADOOP-3226
 
It seems to me that the k-means and canopy clustering code in Mahout assumes that the combiner gets executed on the Mapper side only. This is a major source of error: when the Combiner gets executed on the Reducer side, it cannot parse the output of the first Combiner correctly.
 
To fix this for hadoop-0.18.* only, if you want the combiner to run only on the output of the mapper (as in earlier Hadoop versions), add the following to your job config:
 
job.setCombineOnlyOnce(true); 
  
This method (setCombineOnlyOnce) is not available in the hadoop-0.19 release, so I think the Mahout code needs to be changed to take care of this issue.
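
As a rough illustration of the kind of change meant here, a combiner is safe to run more than once when its output has the same format as its input. The sketch below is plain Java with no Hadoop dependency. The names CombinerSketch, Partial and combine are hypothetical and are not Mahout's actual classes, and this is not necessarily how the Mahout patches address the problem. It assumes the mapper emits each point as a partial (count, sum) record, so the combiner can consume either mapper output or its own earlier output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Mahout code: a k-means style combiner whose
// input and output share one format, so it can be applied repeatedly.
class CombinerSketch {

    /** Partial result: number of points seen and their coordinate-wise sum. */
    static final class Partial {
        final long count;
        final double[] sum;

        Partial(long count, double[] sum) {
            this.count = count;
            this.sum = sum;
        }

        /** What the mapper is assumed to emit for one point: a partial with count 1. */
        static Partial ofPoint(double[] point) {
            return new Partial(1, point.clone());
        }

        /** Final centroid, computed only at the reducer from the merged partial. */
        double[] centroid() {
            double[] c = new double[sum.length];
            for (int i = 0; i < sum.length; i++) {
                c[i] = sum[i] / count;
            }
            return c;
        }
    }

    /** Combiner step: merges partials into one partial of the same type. */
    static Partial combine(List<Partial> values) {
        long count = 0;
        double[] sum = new double[values.get(0).sum.length];
        for (Partial p : values) {
            count += p.count;
            for (int i = 0; i < sum.length; i++) {
                sum[i] += p.sum[i];
            }
        }
        return new Partial(count, sum);
    }

    public static void main(String[] args) {
        List<Partial> mapOutput = new ArrayList<>();
        for (double[] pt : new double[][] {{1, 2}, {3, 4}, {5, 6}, {7, 8}}) {
            mapOutput.add(Partial.ofPoint(pt));
        }

        // Combiner applied once over all mapper output.
        Partial once = combine(mapOutput);

        // Combiner applied to two spills, then again on its own output.
        Partial twice = combine(Arrays.asList(
                combine(mapOutput.subList(0, 1)),
                combine(mapOutput.subList(1, 4))));

        System.out.println(Arrays.toString(once.centroid()));  // [4.0, 5.0]
        System.out.println(Arrays.toString(twice.centroid())); // [4.0, 5.0]
    }
}
```

Because the count travels with the sum, the result does not depend on how many times the combiner runs, or whether it runs at all.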
 
Pradhuman

RE: mahout & hadoop compatibility

Posted by "Palleti, Pallavi" <pa...@corp.aol.com>.

Sorry for the late reply. The issue number is 99. Please have a look at
https://issues.apache.org/jira/browse/MAHOUT-99 

Thanks
Pallavi

-----Original Message-----
From: Grant Ingersoll [mailto:gsingers@apache.org] 
Sent: Friday, December 05, 2008 7:56 AM
To: mahout-user@lucene.apache.org
Subject: Re: mahout & hadoop compatibility

Yeah, it should work with 0.18, with a few patches to fix the Combiner  
issue, if you are using the k-Means clustering stuff.  I committed one  
of them, but forget the Issue numbers (Pallavi?)  Have a look in JIRA.



--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: mahout & hadoop compatibility

Posted by Grant Ingersoll <gs...@apache.org>.

Yeah, it should work with 0.18, with a few patches to fix the Combiner  
issue, if you are using the k-Means clustering stuff.  I committed one  
of them, but forget the Issue numbers (Pallavi?)  Have a look in JIRA.



--------------------------
Grant Ingersoll
