Posted to user@mahout.apache.org by Lingxiang Cheng <li...@yahoo.com> on 2011/12/25 14:59:05 UTC

Mahout classifier on Hadoop

Hi,
 
    I am a newbie to Mahout. While reading the book "Mahout in Action", I found chapters explaining how clustering fits naturally into the Map/Reduce framework, but I did not see the same claim for classifiers. Does it take a lot of work to make classifiers like random forest work with Hadoop?
 
Thanks!
Lingxiang Cheng 

Re: Mahout classifier on Hadoop

Posted by Ted Dunning <te...@gmail.com>.
On Sun, Dec 25, 2011 at 4:08 PM, Lingxiang Cheng
<li...@yahoo.com> wrote:

>
>    Thanks for the answer. I am having some difficulty understanding why
> running random forest on top of Hadoop "does not produce arbitrary
> scalability". Could you elaborate?
>

The difficulty is that random forest training is hard to decompose in a way
that gives linear scaling.  For instance, if you shard the data by features,
you want the feature sets of different shards to overlap.  This means that
the total data processed during learning increases super-linearly with the
number of shards.
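
As a rough illustration of the point about overlap, here is a small sketch
that counts how many feature columns get read in total when features are
sharded.  The overlap model is an assumption made for the example and is not
stated in this thread: every shard owns an equal slice of the features and
additionally borrows `pairwise_overlap` features from each of the other
shards.  The function name and the numbers are hypothetical.

```python
def total_feature_columns(num_features, num_shards, pairwise_overlap):
    """Feature columns read across all shards under an assumed overlap model.

    Assumption (hypothetical): each shard owns num_features / num_shards
    features and also borrows `pairwise_overlap` features from every other
    shard.  Summed over shards, the owned slices add up to num_features and
    the borrowed ones add up to num_shards * (num_shards - 1) * overlap.
    """
    borrowed = num_shards * (num_shards - 1) * pairwise_overlap
    return num_features + borrowed


if __name__ == "__main__":
    # Print the total for a few shard counts to see how it grows.
    for shards in (1, 2, 4, 8):
        print(shards, total_feature_columns(1000, shards, 10))
```

Under this assumed model the borrowed part grows with the square of the
number of shards.  Other overlap schemes would give other growth rates.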

On the other hand, sharding by training data records leaves you with the
problem of how to combine the different models, and the question of whether
you get the kind of improved training that you want.  Just taking the union
of the trees from each shard's ensemble probably isn't that effective (based
on analogizing from other types of learning).
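
To make "taking the union of trees" concrete, here is a minimal sketch of
sharding by records.  It is not Mahout's implementation: one-split trees
(stumps) on a single numeric feature stand in for full decision trees, and
the data, shard count and function names are all made up for the example.
Each shard trains its own small ensemble, and the combined model is simply
the concatenation of those ensembles with majority voting.

```python
import random
from collections import Counter


def train_stump(rows, rng):
    """Fit a one-split tree on a bootstrap sample of (x, label) rows."""
    sample = [rng.choice(rows) for _ in rows]
    best = None
    for threshold in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x <= threshold]
        right = [y for x, y in sample if x > threshold]
        # Majority label on each side; reuse the left label if right is empty.
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0] if right else left_label
        errors = sum(y != left_label for y in left)
        errors += sum(y != right_label for y in right)
        if best is None or errors < best[0]:
            best = (errors, threshold, left_label, right_label)
    return best[1:]  # (threshold, left_label, right_label)


def train_forest(rows, num_trees, rng):
    """Train an ensemble of stumps on one shard of the training records."""
    return [train_stump(rows, rng) for _ in range(num_trees)]


def predict(forest, x):
    """Majority vote over all stumps in the ensemble."""
    votes = Counter(left if x <= t else right for t, left, right in forest)
    return votes.most_common(1)[0][0]


rng = random.Random(0)
data = [(x, int(x > 50)) for x in range(100)]  # toy data: label is 1 if x > 50
rng.shuffle(data)

shards = [data[i::4] for i in range(4)]  # shard by training records
per_shard = [train_forest(shard, 5, rng) for shard in shards]

# The "union of trees": concatenate the ensembles from every shard.
combined = [tree for forest in per_shard for tree in forest]
```

Note that each tree in the combined ensemble has only ever seen one shard's
records, which is one reason to doubt that the union matches a forest trained
on all of the data.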



> Also, are you aware of any work that involved developing random forest
> using map-reduce?
>

Well, Mahout has one.  There are fancier efforts as well.  Have you done a web
search?

Re: Mahout classifier on Hadoop

Posted by Lingxiang Cheng <li...@yahoo.com>.
Hi Ted,
 
   Thanks for the answer. I am having some difficulty understanding why running random forest on top of Hadoop "does not produce arbitrary scalability". Could you elaborate?
Also, are you aware of any work that involved developing random forest using map-reduce?
 
Thanks!
Lingxiang 


________________________________
From: Ted Dunning <te...@gmail.com>
To: user@mahout.apache.org; Lingxiang Cheng <li...@yahoo.com> 
Sent: Sunday, December 25, 2011 4:13 PM
Subject: Re: Mahout classifier on Hadoop

Random forest works as a map-reduce program, but that does not produce
arbitrary scalability.

The Naive Bayes classifier is relatively natural as a map-reduce program
and has a map-reduce version.

The linear classifiers like linear regression do not have map-reduce
versions (yet) since there is some difficulty in getting these to work well.

On Sun, Dec 25, 2011 at 5:59 AM, Lingxiang Cheng
<li...@yahoo.com> wrote:

> Hi,
>
>    I am a newbie to Mahout. When I was reading the book "Mahout in
> Action", I found chapters talking about how clustering naturally fit into
> Map/Reduce framework, but I did not see the same claim for classifiers.
> Does it involve a lot of work to make classifiers like random forest work
> with Hadoop?
>
> Thanks!
> Lingxiang Cheng

Re: Mahout classifier on Hadoop

Posted by Ted Dunning <te...@gmail.com>.
Random forest works as a map-reduce program, but that does not produce
arbitrary scalability.

The Naive Bayes classifier is relatively natural as a map-reduce program
and has a map-reduce version.
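
One way to see why Naive Bayes fits map-reduce: training mostly comes down to
counting labels and (label, feature) pairs, and counts from different
mappers can simply be added together.  The sketch below simulates that in
plain Python.  It is not Mahout's code; the key layout, function names and
toy records are assumptions made for illustration.

```python
from collections import defaultdict


def map_record(label, tokens):
    """Mapper: emit one count for the label and one per (label, token)."""
    yield ("label", label), 1
    for token in tokens:
        yield ("feature", label, token), 1


def reduce_counts(pairs):
    """Reducer: sum the emitted values for each key."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


# Toy training records (hypothetical data).
training = [
    ("spam", ["buy", "now"]),
    ("ham", ["hello", "now"]),
    ("spam", ["buy"]),
]

# Simulate the map phase over all records, then a single reduce phase.
emitted = [pair for label, tokens in training for pair in map_record(label, tokens)]
counts = reduce_counts(emitted)
```

The resulting counts are what a Naive Bayes model needs in order to estimate
its class priors and per-class feature probabilities.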

The linear classifiers like linear regression do not have map-reduce
versions (yet) since there is some difficulty in getting these to work well.

On Sun, Dec 25, 2011 at 5:59 AM, Lingxiang Cheng
<li...@yahoo.com> wrote:

> Hi,
>
>     I am a newbie to Mahout. When I was reading the book "Mahout in
> Action", I found chapters talking about how clustering naturally fit into
> Map/Reduce framework, but I did not see the same claim for classifiers.
> Does it involve a lot of work to make classifiers like random forest work
> with Hadoop?
>
> Thanks!
> Lingxiang Cheng