Posted to dev@mahout.apache.org by deneche abdelhakim <a_...@yahoo.fr> on 2009/02/26 15:58:04 UTC

GSoC 2009 proposition

Hi,
I'm planning to participate in GSoC again, and once more I want to do it with Mahout.
This year, let's make Mahout run on Amazon EC2. This means building the proper AMIs, running some Mahout projects (the GA examples) on EC2, giving feedback, and writing simple, clear how-tos about running a Mahout project on EC2.

The Mahout.GA examples (TSP and CDGA) should be good real-world scenarios for how one might use Mahout.GA on EC2. The TSP example should be modified so that it can run from a console and load TSPLIB benchmarks, which would let us tackle more challenging TSP problems with the help of EC2. The CDGA example should run unmodified, provided of course that Hadoop is configured correctly on EC2 and the benchmark is on HDFS.
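
To make the TSPLIB idea concrete, here is a minimal sketch in Python of loading a benchmark and scoring a tour. It is not the Mahout TSP example's code: the function names are hypothetical, and it assumes the common EUC_2D layout, where a NODE_COORD_SECTION lists "id x y" lines and distances are rounded to the nearest integer.

```python
import math


def parse_tsplib(text):
    """Parse the NODE_COORD_SECTION of a TSPLIB file into {node_id: (x, y)}.

    Assumes an EUC_2D instance; other edge weight types are not handled.
    """
    coords = {}
    in_section = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "EOF":
            break
        if line.endswith("_SECTION"):
            # Only the coordinate section is read; any other section is skipped.
            in_section = line == "NODE_COORD_SECTION"
            continue
        if in_section:
            node_id, x, y = line.split()[:3]
            coords[int(node_id)] = (float(x), float(y))
    return coords


def tour_length(coords, tour):
    """Length of a closed tour, rounding each edge as TSPLIB's EUC_2D rule does."""
    total = 0
    for a, b in zip(tour, tour[1:] + tour[:1]):
        (x1, y1), (x2, y2) = coords[a], coords[b]
        total += int(math.hypot(x1 - x2, y1 - y2) + 0.5)
    return total
```

A console runner could read the file named on the command line, pass its contents to parse_tsplib, and use tour_length as the fitness function to minimize.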

These two examples give us three use cases for Mahout on EC2:

1. TSP can be run on a single, High-CPU, EC2 instance. In this case, Watchmaker's ConcurrentEvolutionEngine should take care of the multi-threading part (or at least I hope!) and there will be no need for Hadoop;

2. TSP can also be run over multiple EC2 instances with the help of Hadoop;

3. CDGA not only needs Hadoop to run, but its data should be on HDFS.
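
The first use case (one large instance, no Hadoop) comes down to scoring the candidates of a generation concurrently. The sketch below shows that idea in Python with a generic executor. It is not Watchmaker's API, and all names in it are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(population, fitness, executor):
    """Return (candidate, score) pairs, scoring candidates through the executor."""
    scores = executor.map(fitness, population)  # map() preserves input order
    return list(zip(population, scores))


def best_candidate(evaluated):
    """Pick the pair with the lowest score (for TSP, the shortest tour)."""
    return min(evaluated, key=lambda pair: pair[1])


def score_with_threads(population, fitness, workers=4):
    """Convenience wrapper that owns a thread pool for one generation."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return evaluate_population(population, fitness, pool)
```

Passing the executor in keeps the evaluation code independent of how the work is spread. In CPython a process pool, not a thread pool, would be needed to get a real speed-up on CPU-bound fitness functions.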


So what do you think, is the "elephant" ready for a walk on EC2?

Re: GSoC 2009 proposition

Posted by Tom White <to...@gmail.com>.

If you want to run Mahout on Hadoop in EC2, then have a look at the
EC2 scripts in Hadoop
(http://svn.apache.org/viewvc/hadoop/core/trunk/src/contrib/ec2/). The
way they work now is to allow an arbitrary script to run on instance
boot, so you could install Mahout at this point very easily, and do
any configuration etc. You can also easily build and customize the
AMI. More at http://wiki.apache.org/hadoop/AmazonEC2.

https://issues.apache.org/jira/browse/HADOOP-2409 might be relevant too.

Cheers,
Tom

Re: GSoC 2009 proposition

Posted by Tim Bass <ti...@gmail.com>.

It might be useful to consider some application-specific Mahout AMI(s)
that solve a particular domain problem.

In other words, instead of a generic AMI with Mahout capabilities,
tailor the AMI to solving a problem in a particular domain.

Cheers.


Re: GSoC 2009 proposition

Posted by deneche abdelhakim <a_...@yahoo.fr>.

Thanks for your fast answers :) I'll rethink this and post as soon as I have something.


Re: GSoC 2009 proposition

Posted by Grant Ingersoll <gs...@apache.org>.

You might have a look at
http://www.lucidimagination.com/search/document/5ab9ddafa19ee04b/thought_offering_ec2_s3_based_services#2d096f39b02ec289
for some background thoughts.

I think it's a nice idea and I've been meaning to use my Amazon  
credits for just such a thing for a while now, but not sure how high  
priority it is.

You might consider extending/altering this thought to have more of a  
focus on developing demos (including code) of Mahout with real data  
sets on larger scale systems.  Part of this might involve showing  
people how to do this on EC2, but the bigger focus to me should be on  
demoing/documenting Mahout's capabilities, versus showing how to run  
Mahout on any particular platform.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

Re: GSoC 2009 proposition

Posted by Sean Owen <sr...@gmail.com>.

Yes, I've already set up a recommender service using EC2. I copied my
documentation-in-progress on it which explains how it works, below.

As I think Ted said before, and I agree with, simply providing an
image with the libraries installed doesn't add value. A how-to is OK,
but is it much more than the concatenation of "how to get a machine
running on EC2" and "how to run Mahout stuff on a machine" which
already exist?

What I think is useful to offer (well, at least the most useful
thing to offer) on EC2 are AMIs that act almost like a big RPC. You
put data in a location, fire up the AMI, it crunches as fast as
possible, saves output, and quits. That's what the AMI I put together
does and it works quite nicely.

(I agree that it's not a bad idea to still pay attention to the
single-machine case and not just Hadoop. Hadoop is a lot of overhead
but necessary at a certain scale. Below that scale, if you can fit on
one machine, it's obviously a lot quicker. EC2 does offer pretty big
machines...)

Anyway, food for thought on this topic...

-----------
An AMI which employs Apache Mahout's 'Taste' collaborative filtering
engine (of which I am the developer) to efficiently generate
recommendations based on user preferences -- think of Amazon's book
recommendations for an idea of what this does. For example, if your
business sells CDs, this service could determine which CDs to
recommend to your users for purchase, based on ratings you have
already. This service makes it simple and cost-effective for
businesses to leverage this technology.

This AMI requires that you supply one run-time parameter:

dataBucket: A bucket where your input is stored and output will be stored

This is specified on the command line to ec2-run-instances as "-d
dataBucket=[data bucket name]"

The bucket must grant both read and write permission, and the
"in.txt.gz" file within it (described next) must grant read access, to
the following canonical user ID:

c8453526c3ec4d3c2d3b7ecc654c8e3e4fbf006d595d7310def17047c28c58ab

This enables the service to read your input and write output back to
the bucket. For security, do not store any other data in this
bucket.

The bucket named by dataBucket should contain an input file named
"in.txt.gz". This should be a GZip-compressed text file containing
comma-separated lines that specify user-item preferences. That is, each
line should be of the form:

[user ID],[item ID],[preference value]
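
As an illustration of the input format, the following Python sketch writes such a file locally before it is uploaded to the bucket. The function name is hypothetical and the upload step is left out.

```python
import gzip


def write_preferences(path, preferences):
    """Write (user_id, item_id, value) tuples as gzip-compressed CSV lines."""
    with gzip.open(path, "wt", encoding="utf-8", newline="") as out:
        for user_id, item_id, value in preferences:
            out.write(f"{user_id},{item_id},{value}\n")
```

For example, write_preferences("in.txt.gz", [(1, 101, 4.5), (2, 101, 5.0)]) produces a two-line file in the form described above.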

During operation, a file called iterations.txt in this bucket will be
updated with the number of users processed so far. At completion, the
bucket will contain log.txt, with output from the run, and out.txt.gz,
a GZip-compressed file containing comma-separated values, where each
line is of the form:

[user ID],[item ID],[estimated preference value]

All lines for a user ID will be grouped, and will be sorted,
descending, by preference value.
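
Reading the output back is the mirror image. This sketch, again with hypothetical function names, loads a downloaded out.txt.gz into a dictionary keyed by user ID. It relies on the ordering described above and does not re-sort.

```python
import gzip
from collections import defaultdict


def read_recommendations(path):
    """Load out.txt.gz into {user_id: [(item_id, estimated_value), ...]}."""
    recs = defaultdict(list)
    with gzip.open(path, "rt", encoding="utf-8") as src:
        for line in src:
            line = line.strip()
            if not line:
                continue
            user_id, item_id, value = line.split(",")
            recs[user_id].append((item_id, float(value)))
    return dict(recs)


def top_n(recs, user_id, n):
    """First n recommendations for a user, in file order (best first)."""
    return recs.get(user_id, [])[:n]
```

IDs are kept as strings because the format does not say whether they are numeric.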

This AMI is intended for use with 64-bit instance types: m1.large,
m1.xlarge, c1.xlarge
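
The launch step can be scripted along these lines. This Python sketch only assembles the ec2-run-instances argument list and checks the instance type against the list above. The AMI ID is a placeholder supplied by the caller, the helper name is hypothetical, and the use of the --instance-type option is an assumption about the EC2 command-line tools, not something stated in this description.

```python
SUPPORTED_TYPES = ("m1.large", "m1.xlarge", "c1.xlarge")


def build_run_command(ami_id, data_bucket, instance_type="c1.xlarge"):
    """Build the ec2-run-instances argument list for one run of the service."""
    if instance_type not in SUPPORTED_TYPES:
        raise ValueError(f"unsupported instance type: {instance_type}")
    return [
        "ec2-run-instances",
        ami_id,  # placeholder; the real AMI ID is not given in this description
        "--instance-type",
        instance_type,
        "-d",
        f"dataBucket={data_bucket}",  # run-time parameter passed as user data
    ]
```

The resulting list can be handed to subprocess.run on a machine where the EC2 tools and credentials are set up.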

This service is appropriate for small- and medium-sized businesses --
roughly speaking, up to 10M user-item preferences. As a rough guide,
on a c1.xlarge instance, using the GroupLens 10M rating data set,
recommendations can be generated for all users in about 4 hours, at a
cost of about $10.
--------------
