Posted to user@mahout.apache.org by Tim Bass <ti...@gmail.com> on 2009/05/19 12:59:23 UTC

Amazon Mahout Public AMI v. Mahout on EMR

Dear All,

A few months ago (on the developer's list) we briefly touched on the
idea of building a Mahout public AMI on EC2.

Subsequently, Amazon released EMR and a number of folks have
experimented with running sample Mahout jobs on EMR.

What are the pros and cons of creating a public Mahout AMI with Hadoop
and MapReduce configured with the versions that
are supported by the developers, in addition to Amazon's EMR implementation?

Should we revisit the AMI idea?  Pros and cons?

Re: Amazon Mahout Public AMI v. Mahout on EMR

Posted by Stephen Green <St...@sun.com>.
On May 19, 2009, at 7:11 AM, Grant Ingersoll wrote:

>
> On May 19, 2009, at 6:59 AM, Tim Bass wrote:
>
>> Dear All,
>>
>> A few months ago (on the developer's list) we briefly touched on the
>> idea of building a Mahout public AMI on EC2.
>>
>> Subsequently, Amazon released EMR and a number of folks have
>> experimented with running sample Mahout jobs on EMR.
>>
>> What are the pros and cons of creating a public Mahout AMI with  
>> Hadoop
>> and MapReduce configured with the versions that
>> are supported by the developers, in addition to Amazon's EMR  
>> implementation?
>
> AFAICT, one issue seems to be that EMR locks you into a specific  
> Hadoop instance.  Not sure if "locks" is too strong, maybe I should  
> say it "encourages" you to use a specific version?

Actually, I think "locks" is more appropriate.  They're using Hadoop  
0.18.3 with some feature backports (according to what they said to  
me), so if you want features from a newer Hadoop (isn't 0.20 the  
current release?  It looked like it had a lot of new stuff), you're  
pretty much done for.

Also, they charge extra for EMR jobs, which strikes me as a bit crazy  
(see Greg Linden's comments about variable pricing), and may strike  
some folks as a reason to run their own clusters.

> As Ted and others pointed out, I think we would benefit from tools  
> that make it easy to add Mahout to an AMI.

Perhaps you could base it off of one of the Cloudera Hadoop AMIs?   
They're publicly available, and they handle all the Hadoop  
business.  I have no idea what the redistribution license would be,  
and I am most definitely not a lawyer!

Steve
-- 
Stephen Green                      //   Stephen.Green@sun.com
Principal Investigator             \\   http://blogs.sun.com/searchguy
Aura Project                       //   Voice: +1 781-442-0926
Sun Microsystems Labs              \\   Fax:   +1 781-442-1692




Re: Amazon Mahout Public AMI v. Mahout on EMR

Posted by Tim Bass <ti...@gmail.com>.
One of my thoughts is that if Mahout AMIs had functionality that made
it easy to create underlying Hadoop clusters by running multiple
instances, that might be a tangible benefit.

If there were a public AMI that anyone could run under the Apache
license, this would be "a good thing", since the Amazon EMR version
seems to be out-of-step with Mahout requirements.

Re: Amazon Mahout Public AMI v. Mahout on EMR

Posted by Grant Ingersoll <gs...@apache.org>.
On May 19, 2009, at 6:59 AM, Tim Bass wrote:

> Dear All,
>
> A few months ago (on the developer's list) we briefly touched on the
> idea of building a Mahout public AMI on EC2.
>
> Subsequently, Amazon released EMR and a number of folks have
> experimented with running sample Mahout jobs on EMR.
>
> What are the pros and cons of creating a public Mahout AMI with Hadoop
> and MapReduce configured with the versions that
> are supported by the developers, in addition to Amazon's EMR  
> implementation?

AFAICT, one issue seems to be that EMR locks you into a specific  
Hadoop instance.  Not sure if "locks" is too strong, maybe I should  
say it "encourages" you to use a specific version?

As Ted and others pointed out, I think we would benefit from tools  
that make it easy to add Mahout to an AMI.

-Grant

Re: Amazon Mahout Public AMI v. Mahout on EMR

Posted by Sean Owen <sr...@gmail.com>.
I still don't see a point in producing an AMI -- it's like
distributing our .jar, which we already do -- plus a gigabyte of
operating system.

However, by all means I think we should produce the sort of runnable
.jar files that AEMR needs and post them in S3. That is, a .jar with
all the Mahout code, plus a proper Main-Class manifest entry, is all
you need to start your own instance of the job with AEMR (you supply
.jar location and program arguments).
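
As a rough sketch of what "a .jar with a proper Main-Class manifest entry"
amounts to: a .jar is a zip archive whose META-INF/MANIFEST.MF names the
entry-point class. The Python below builds such an archive in memory and
reads the entry back. The helper names and the class name
org.example.DemoJob are hypothetical placeholders, not Mahout classes, and
a real build would normally use the jar tool or a build system rather
than this.

```python
import io
import zipfile


def build_runnable_jar(main_class, entries):
    """Return the bytes of a jar whose manifest names main_class as entry point.

    entries maps archive paths (e.g. "org/example/DemoJob.class") to bytes.
    Assumes main_class is short enough that the manifest line stays within
    the 72-byte line limit of the JAR manifest format.
    """
    manifest = (
        "Manifest-Version: 1.0\r\n"
        f"Main-Class: {main_class}\r\n"
        "\r\n"
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as jar:
        # By convention the manifest is the first entry in the archive.
        jar.writestr("META-INF/MANIFEST.MF", manifest)
        for name, data in entries.items():
            jar.writestr(name, data)
    return buf.getvalue()


def read_main_class(jar_bytes):
    """Return the Main-Class value from a jar's manifest, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as jar:
        text = jar.read("META-INF/MANIFEST.MF").decode("utf-8")
    for line in text.splitlines():
        if line.startswith("Main-Class:"):
            return line.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    # Placeholder bytes stand in for compiled classes in this sketch.
    jar_bytes = build_runnable_jar(
        "org.example.DemoJob",
        {"org/example/DemoJob.class": b"\xca\xfe\xba\xbe"},
    )
    print(read_main_class(jar_bytes))
```

With the manifest in place, the job can be started by pointing at the .jar
and passing program arguments, with no class name needed on the command
line.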

I have this sort of ready to go for collaborative filtering, even as
I'm hitting snags farther down the road. AEMR is exactly what we want
to support.

On Tue, May 19, 2009 at 11:59 AM, Tim Bass <ti...@gmail.com> wrote:
> Dear All,
>
> A few months ago (on the developer's list) we briefly touched on the
> idea of building a Mahout public AMI on EC2.
>
> Subsequently, Amazon released EMR and a number of folks have
> experimented with running sample Mahout jobs on EMR.
>
> What are the pros and cons of creating a public Mahout AMI with Hadoop
> and MapReduce configured with the versions that
> are supported by the developers, in addition to Amazon's EMR implementation?
>
> Should we revisit the AMI idea?  Pros and cons?
>