Posted to dev@mahout.apache.org by Sean Owen <sr...@gmail.com> on 2009/02/01 23:45:01 UTC

Thought: offering EC2/S3-based services

I had a thought. After looking at Amazon's most excellent EC2 system
again I realized how simple it would be to offer batch recommendations
via EC2. You upload your data to S3, run a machine image I provide
parameterized with the file location, it crunches, copies the results
back, shuts down. It's attractive since they offer 8-way 15GB machines
and the algorithms can easily exploit this to the limit, making it
really efficient too.
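To make the flow concrete, here is a rough sketch (in Java, since that is what Mahout is written in) of the kind of driver such an image could run at boot. Everything here is invented for illustration -- the parameter names, the bucket paths, and the parseParams helper -- and the actual S3 copy, Mahout job, and shutdown steps are left as comments:

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/**
 * Hypothetical entry point baked into the AMI. The instance is launched
 * with user-data naming the S3 input and output locations; the driver
 * parses them, would run the batch job, copy results back, and halt.
 */
public class BatchDriver {

  /** Parses "key=value" lines, e.g. from EC2 user-data. */
  static Properties parseParams(String userData) {
    Properties params = new Properties();
    try {
      params.load(new StringReader(userData));
    } catch (IOException e) { // cannot happen with a StringReader
      throw new IllegalStateException(e);
    }
    return params;
  }

  public static void main(String[] args) {
    // In a real image this string would be fetched from the EC2
    // instance metadata service rather than hard-coded.
    Properties params = parseParams(
        "input=s3://customer-bucket/ratings.csv\n"
            + "output=s3://customer-bucket/recommendations/\n");

    String input = params.getProperty("input");
    String output = params.getProperty("output");

    // 1. copy the input file from S3 to local disk
    // 2. run the batch recommender over it (the compute step)
    // 3. copy the results back to the output location
    // 4. shut the instance down so billing stops
    System.out.println("crunch " + input + " -> " + output);
  }
}
```

The real version would read the same key=value text from the EC2 user-data endpoint instead of a hard-coded string.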

I was thinking of developing an AMI for this separately and offering
it as a for-pay commercial service -- Amazon makes that pretty easy.
(It would hardly be a big money maker -- a couple dollars per hour is
probably the highest reasonable price to charge -- but would sorta pay
for its own development.)

I think it will be interesting to try as a proof of concept. It's a
solution that still doesn't scale to huge data sets, but I think a
15GB machine would still work for large-ish data sets (~100M ratings),
and it's exactly those small- to medium-sized applications for which it
might make sense to outsource this.

Sean

Re: Thought: offering EC2/S3-based services

Posted by Ted Dunning <te...@gmail.com>.
There are more and more small problems in the world.

     - corollary to Moore's law

On Tue, Feb 3, 2009 at 4:24 AM, Grant Ingersoll <gs...@apache.org> wrote:

>  I think we can get along just fine w/ non distributed as well.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: Thought: offering EC2/S3-based services

Posted by Grant Ingersoll <gs...@apache.org>.
Right, and I don't feel a particular need to be Hadoop-only.  We
should be pragmatic in our choices and use what we think works best.
Also, we don't have to be all about distributed, either.  I think we
can get along just fine w/ non-distributed as well.

-Grant

On Feb 3, 2009, at 3:02 AM, Ted Dunning wrote:

> Actually, you probably can do more things with Mahout not using Hadoop
> than you can using it.  The practice is to check in sequential versions
> of the algorithms first, then parallel.  Many algorithms don't have
> parallel versions yet.
>
> On Tue, Feb 3, 2009 at 12:00 AM, deneche abdelhakim <a_deneche@yahoo.fr 
> >wrote:
>
>> It's a silly question :P but can you use Mahout without using Hadoop?
>> Do you mean that when having one single "multi-core" machine, one can
>> use Mahout alone? (OK, that's two silly questions)
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
> 4600 Bohannon Drive, Suite 220
> Menlo Park, CA 94025
> www.deepdyve.com
> 650-324-0110, ext. 738
> 858-414-0013 (m)



Re: Thought: offering EC2/S3-based services

Posted by Ted Dunning <te...@gmail.com>.
Actually, you probably can do more things with Mahout not using Hadoop than
you can using it.  The practice is to check in sequential versions of the
algorithms first, then parallel.  Many algorithms don't have parallel
versions yet.
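As a toy illustration of that point, a sequential loop over users can be spread across the cores of one machine (the 8-way boxes mentioned earlier) with plain java.util.concurrent, no Hadoop involved. The score function below is a stand-in computation, not Mahout code, and the sketch uses present-day Java syntax for brevity:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy sequential-then-parallel pattern on a single multi-core machine. */
public class ParallelScorer {

  // Stand-in for the per-user work a recommender would do.
  static double score(int userId) {
    double s = 0;
    for (int i = 1; i <= 1000; i++) {
      s += Math.sin(userId * (double) i);
    }
    return s;
  }

  // The sequential version that would be checked in first.
  static double[] scoreSequential(int numUsers) {
    double[] out = new double[numUsers];
    for (int u = 0; u < numUsers; u++) {
      out[u] = score(u);
    }
    return out;
  }

  // The same loop spread over all available cores.
  static double[] scoreParallel(int numUsers) {
    ExecutorService pool =
        Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
    List<Future<Double>> futures = new ArrayList<>();
    for (int u = 0; u < numUsers; u++) {
      final int user = u;
      Callable<Double> task = () -> score(user);
      futures.add(pool.submit(task));
    }
    double[] out = new double[numUsers];
    try {
      for (int u = 0; u < numUsers; u++) {
        out[u] = futures.get(u).get();
      }
    } catch (Exception e) {
      throw new RuntimeException(e);
    } finally {
      pool.shutdown();
    }
    return out;
  }
}
```

Both methods return identical results; the parallel one simply farms the per-user calls out to a thread pool sized to the machine.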

On Tue, Feb 3, 2009 at 12:00 AM, deneche abdelhakim <a_...@yahoo.fr>wrote:

> It's a silly question :P but can you use Mahout without using Hadoop? Do
> you mean that when having one single "multi-core" machine, one can use
> Mahout alone? (OK, that's two silly questions)
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: Thought: offering EC2/S3-based services

Posted by deneche abdelhakim <a_...@yahoo.fr>.
It's a silly question :P but can you use Mahout without using Hadoop? Do you mean that when having one single "multi-core" machine, one can use Mahout alone? (OK, that's two silly questions)


--- On Mon, 2/2/09, Sean Owen <sr...@gmail.com> wrote:

> From: Sean Owen <sr...@gmail.com>
> Subject: Re: Thought: offering EC2/S3-based services
> To: mahout-dev@lucene.apache.org
> Date: Monday, February 2, 2009, 7:31 PM
> Agree, I am not sure if it adds value to publish an AMI that just adds
> a copy of the Mahout distro. What I was thinking of for my part is a
> minimal OS plus Java plus Mahout with a startup script that runs the
> processing, then shuts down automatically. The AMI would not be
> extended, but invoked with parameters, kind of like an RPC.
>
> Using Hadoop involves a large jump in overhead. It is of course
> necessary at some scale to move to this framework. But I would like to
> provide a one-big-machine solution for small- and medium-sized users
> since it will be a lot simpler and more cost effective.
>
> On Mon, Feb 2, 2009 at 5:57 PM, Ted Dunning <te...@gmail.com> wrote:
> > Based on my experience moving our search engine to work in the cloud, I
> > would say that it would be easier on users to not actually build a
> > specialized AMI, but rather to make some publicly available S3 resources
> > such as an installation script, jars and tars.
> >
> > That allows people to install and run mahout not just on a single AMI, but
> > also on any AMI they are running.  It also makes it easy for anybody else to
> > use Mahout fairly trivially.


      

Re: Thought: offering EC2/S3-based services

Posted by Sean Owen <sr...@gmail.com>.
Agree, I am not sure if it adds value to publish an AMI that just adds
a copy of the Mahout distro. What I was thinking of for my part is a
minimal OS plus Java plus Mahout with a startup script that runs the
processing, then shuts down automatically. The AMI would not be
extended, but invoked with parameters, kind of like an RPC.

Using Hadoop involves a large jump in overhead. It is of course
necessary at some scale to move to this framework. But I would like to
provide a one-big-machine solution for small- and medium-sized users
since it will be a lot simpler and more cost effective.

On Mon, Feb 2, 2009 at 5:57 PM, Ted Dunning <te...@gmail.com> wrote:
> Based on my experience moving our search engine to work in the cloud, I
> would say that it would be easier on users to not actually build a
> specialized AMI, but rather to make some publicly available S3 resources
> such as an installation script, jars and tars.
>
> That allows people to install and run mahout not just on a single AMI, but
> also on any AMI they are running.  It also makes it easy for anybody else to
> use Mahout fairly trivially.

Re: Thought: offering EC2/S3-based services

Posted by Tim Bass <ti...@gmail.com>.
>
> But what if you don't want Ubuntu (or do)?
>

Most folks don't care about the underlying OS in the AMI, as long as
it is well supported; at least that is my experience. According to a
long-running poll at The UNIX and Linux Forums, Ubuntu is the most
popular open/free Linux OS (voted by users), and in my experience on
EC2, Ubuntu Hardy seems to be one of the most popular AMIs as well.

No one will give Mahout any grief for a nice Ubuntu Hardy AMI running
a full Mahout service with an easy-to-configure web interface :-)

As a side note, configuring an AMI for an application is non-trivial.
Setting up a decent LAMP configuration can be challenging; I assume
that is why there is an Ubuntu Hardy LAMP AMI that can be easily
instantiated.

Anyway, I guess we are getting way ahead of ourselves :-)

Cheers.

Re: Thought: offering EC2/S3-based services

Posted by Ted Dunning <te...@gmail.com>.
On Mon, Feb 2, 2009 at 5:57 PM, Tim Bass <ti...@gmail.com> wrote:

>
> I don't agree with you that simply putting resources in S3 is better
> than having an AMI as Grant suggests. It is not "either or" but
> "both".


Fair point.


> It would be much easier for people if they could simply log into their
> Amazon account (I have one) and turn on a Mahout AMI, and just
> configure it.


But what if you don't want Ubuntu (or do)?

I would think it just as handy to have a script that I could shove into any
AMI that I happen to run to turn it into a Mahout-capable machine.

> You are talking about "Hosting files as a service", which is a great idea
> too.


Hmm... I would think of it more along the lines of hosting free software in
binary form.


> However, the SaaS and CaaS model for analytics is well established now
> and this is a perfect model for Mahout in EC2.


And that is not what either of us really proposed.

For that, a higher-level service would be needed, with a more
appropriate interface.


AMIs and files are things, not particularly a service.  The category error
is tempting, but I think misleading.

Re: Thought: offering EC2/S3-based services

Posted by Tim Bass <ti...@gmail.com>.
Hi Ted,

I don't agree with you that simply putting resources in S3 is better
than having an AMI as Grant suggests. It is not "either or" but
"both".

It would be much easier for people if they could simply log into their
Amazon account (I have one) and turn on a Mahout AMI, and just
configure it.

You are talking about "Hosting files as a service", which is a great idea too.

However, the SaaS and CaaS model for analytics is well established now
and this is a perfect model for Mahout in EC2.

In my opinion.....

Yours sincerely, Tim

On Tue, Feb 3, 2009 at 1:49 AM, Grant Ingersoll <gs...@apache.org> wrote:
> Good point.
>
>
> On Feb 2, 2009, at 12:57 PM, Ted Dunning wrote:
>
>> Based on my experience moving our search engine to work in the cloud, I
>> would say that it would be easier on users to not actually build a
>> specialized AMI, but rather to make some publicly available S3 resources
>> such as an installation script, jars and tars.
>>
>> That allows people to install and run mahout not just on a single AMI, but
>> also on any AMI they are running.  It also makes it easy for anybody else
>> to
>> use Mahout fairly trivially.
>>
>> On Mon, Feb 2, 2009 at 8:13 AM, Tim Bass <ti...@gmail.com> wrote:
>>
>>> Wow.  That is a great idea, Mahout on a Ubuntu Hardy AMI.
>>>
>>>
>>>
>>> On Mon, Feb 2, 2009 at 11:03 PM, Grant Ingersoll <gs...@apache.org>
>>> wrote:
>>>>
>>>> Sounds cool.  On a related note, it has always been my intent to put up
>>>> Mahout as an AMI, similar to what Hadoop does, to make it easy for people
>>>> to get started w/ Mahout.
>>>
>>
>>
>>
>> --
>> Ted Dunning, CTO
>> DeepDyve
>> 4600 Bohannon Drive, Suite 220
>> Menlo Park, CA 94025
>> www.deepdyve.com
>> 650-324-0110, ext. 738
>> 858-414-0013 (m)
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Lucene Helpful Hints:
> http://wiki.apache.org/lucene-java/BasicsOfPerformance
> http://wiki.apache.org/lucene-java/LuceneFAQ
>
>

Re: Thought: offering EC2/S3-based services

Posted by Grant Ingersoll <gs...@apache.org>.
Good point.


On Feb 2, 2009, at 12:57 PM, Ted Dunning wrote:

> Based on my experience moving our search engine to work in the cloud, I
> would say that it would be easier on users to not actually build a
> specialized AMI, but rather to make some publicly available S3 resources
> such as an installation script, jars and tars.
>
> That allows people to install and run mahout not just on a single AMI, but
> also on any AMI they are running.  It also makes it easy for anybody else to
> use Mahout fairly trivially.
>
> On Mon, Feb 2, 2009 at 8:13 AM, Tim Bass <ti...@gmail.com> wrote:
>
>> Wow.  That is a great idea, Mahout on a Ubuntu Hardy AMI.
>>
>> On Mon, Feb 2, 2009 at 11:03 PM, Grant Ingersoll <gs...@apache.org>
>> wrote:
>>> Sounds cool.  On a related note, it has always been my intent to put up
>>> Mahout as an AMI, similar to what Hadoop does, to make it easy for people
>>> to get started w/ Mahout.
>>>
>>
>
>
>
> -- 
> Ted Dunning, CTO
> DeepDyve
> 4600 Bohannon Drive, Suite 220
> Menlo Park, CA 94025
> www.deepdyve.com
> 650-324-0110, ext. 738
> 858-414-0013 (m)

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ

Re: Thought: offering EC2/S3-based services

Posted by Ted Dunning <te...@gmail.com>.
Based on my experience moving our search engine to work in the cloud, I
would say that it would be easier on users to not actually build a
specialized AMI, but rather to make some publicly available S3 resources
such as an installation script, jars and tars.

That allows people to install and run Mahout not just on a single AMI, but
also on any AMI they are running.  It also makes it easy for anybody else to
use Mahout fairly trivially.

On Mon, Feb 2, 2009 at 8:13 AM, Tim Bass <ti...@gmail.com> wrote:

> Wow.  That is a great idea, Mahout on a Ubuntu Hardy AMI.
>
>
>
> On Mon, Feb 2, 2009 at 11:03 PM, Grant Ingersoll <gs...@apache.org>
> wrote:
> > Sounds cool.  On a related note, it has always been my intent to put up
> > Mahout as an AMI, similar to what Hadoop does, to make it easy for people
> > to get started w/ Mahout.
> >
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)

Re: Thought: offering EC2/S3-based services

Posted by Tim Bass <ti...@gmail.com>.
Wow.  That is a great idea, Mahout on an Ubuntu Hardy AMI.



On Mon, Feb 2, 2009 at 11:03 PM, Grant Ingersoll <gs...@apache.org> wrote:
> Sounds cool.  On a related note, it has always been my intent to put up
> Mahout as an AMI, similar to what Hadoop does, to make it easy for people to
> get started w/ Mahout.
>

Re: Thought: offering EC2/S3-based services

Posted by Grant Ingersoll <gs...@apache.org>.
Sounds cool.  On a related note, it has always been my intent to put  
up Mahout as an AMI, similar to what Hadoop does, to make it easy for  
people to get started w/ Mahout.





Re: Thought: offering EC2/S3-based services

Posted by Tim Bass <ti...@gmail.com>.
Yes, this is a perfect fit for offering analytics as a service; folks
can call it A3S .... :-)

See this reference post:

http://www.thecepblog.com/2009/01/14/it-infrastructure-capability-as-a-service/

... where I also mention Mahout.

Yours sincerely, Tim
www.thecepblog.com
www.unix.com


On Mon, Feb 2, 2009 at 8:52 AM, Ted Dunning <te...@gmail.com> wrote:
> At Veoh, I was using 10 machines that were roughly equivalent to the small
> instances at amazon to crunch really large amounts of data.
>
> Giving up on in-memory computation was a huge win for me.
>
> On Sun, Feb 1, 2009 at 2:45 PM, Sean Owen <sr...@gmail.com> wrote:
>
>> I think it will be interesting to try as a proof of concept. It's a
>> solution that still doesn't scale to huge data sets, but I think a
>> 15GB machine would still work for large-ish data sets (~100M ratings)
>> and it's exactly those small- to medium-sized applications for which it
>> might make sense to outsource this.
>>
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
> 4600 Bohannon Drive, Suite 220
> Menlo Park, CA 94025
> www.deepdyve.com
> 650-324-0110, ext. 738
> 858-414-0013 (m)
>

Re: Thought: offering EC2/S3-based services

Posted by Ted Dunning <te...@gmail.com>.
At Veoh, I was using 10 machines that were roughly equivalent to the small
instances at Amazon to crunch really large amounts of data.

Giving up on in-memory computation was a huge win for me.
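A minimal sketch of what giving up on in-memory computation can look like: make one streaming pass over the ratings file and keep only small aggregates, so memory grows with the number of items rather than the number of ratings. The class name and the user,item,rating input format below are invented for illustration (again in present-day Java):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/**
 * Toy streaming aggregation: one pass over the data, memory proportional
 * to the number of distinct items, not the number of ratings.
 */
public class StreamingCounts {

  // Counts ratings per item from "user,item,rating" lines.
  static Map<String, Integer> countPerItem(BufferedReader in) {
    Map<String, Integer> counts = new HashMap<>();
    try {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split(",");
        if (fields.length < 3) {
          continue; // skip malformed lines
        }
        counts.merge(fields[1], 1, Integer::sum);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return counts;
  }
}
```

The same one-pass shape works whether the reader wraps a local file, an S3 download, or a pipe, which is what makes it friendly to small cloud instances.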

On Sun, Feb 1, 2009 at 2:45 PM, Sean Owen <sr...@gmail.com> wrote:

> I think it will be interesting to try as a proof of concept. It's a
> solution that still doesn't scale to huge data sets, but I think a
> 15GB machine would still work for large-ish data sets (~100M ratings)
> and it's exactly those small- to medium-sized applications for which it
> might make sense to outsource this.
>



-- 
Ted Dunning, CTO
DeepDyve
4600 Bohannon Drive, Suite 220
Menlo Park, CA 94025
www.deepdyve.com
650-324-0110, ext. 738
858-414-0013 (m)