Posted to common-user@hadoop.apache.org by Chris K Wensel <ch...@wensel.net> on 2009/04/02 09:47:46 UTC

Amazon Elastic MapReduce

FYI

Amazon's new Hadoop offering:
http://aws.amazon.com/elasticmapreduce/

And Cascading 1.0 supports it:
http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html

cheers,
ckw

--
Chris K Wensel
chris@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/


Re: Amazon Elastic MapReduce

Posted by Peter Skomoroch <pe...@gmail.com>.
Kevin,

The API accepts any arguments you can pass in the standard JobConf for
Hadoop 0.18.3; it is pretty easy to convert an existing job flow over to a
JSON job description that will run on the service.
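
In case it is useful, here is a rough sketch of the kind of JSON job flow
description I mean, with a single streaming step. The field names are from
memory of the RunJobFlow request, and the bucket, paths, and script names are
placeholders, so treat it as illustrative rather than as the exact schema:

    {
      "Name": "wordcount-example",
      "Instances": {
        "MasterInstanceType": "m1.small",
        "SlaveInstanceType": "m1.small",
        "InstanceCount": 4
      },
      "Steps": [
        {
          "Name": "streaming-step",
          "HadoopJarStep": {
            "Jar": "/home/hadoop/contrib/streaming/hadoop-streaming.jar",
            "Args": [
              "-input",   "s3n://my-bucket/wordcount/input/",
              "-output",  "s3n://my-bucket/wordcount/output/",
              "-mapper",  "s3n://my-bucket/wordcount/my-mapper.py",
              "-reducer", "s3n://my-bucket/wordcount/my-reducer.py"
            ]
          }
        }
      ]
    }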

-Pete

On Thu, Apr 2, 2009 at 2:44 PM, Kevin Peterson <kp...@biz360.com> wrote:

> So if I understand correctly, this is an automated system to bring up a
> hadoop cluster on EC2, import some data from S3, run a job flow, write the
> data back to S3, and bring down the cluster?
>
> This seems like a pretty good deal. At the pricing they are offering,
> unless
> I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be
> cheaper to use this new service.
>
> Does this use an existing Hadoop job control API, or do I need to write my
> flows to conform to Amazon's API?
>



-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Amazon Elastic MapReduce

Posted by Kevin Peterson <kp...@biz360.com>.
So if I understand correctly, this is an automated system to bring up a
hadoop cluster on EC2, import some data from S3, run a job flow, write the
data back to S3, and bring down the cluster?

This seems like a pretty good deal. At the pricing they are offering, unless
I'm able to keep a cluster at more than about 80% capacity 24/7, it'll be
cheaper to use this new service.

Does this use an existing Hadoop job control API, or do I need to write my
flows to conform to Amazon's API?

Re: Amazon Elastic MapReduce

Posted by Lukáš Vlček <lu...@gmail.com>.
I may be wrong, but I would welcome this. As far as I understand, the hot
topic in cloud computing these days is standardization ... and I would be
happy if Hadoop could be considered a standard for cloud computing
architecture. So the more Amazon pushes Hadoop, the more it could be accepted
by other players in this market (and the better for customers switching
from one cloud provider to another). Just my 2 cents.
Regards,
Lukas

On Fri, Apr 3, 2009 at 4:36 PM, Stuart Sierra
<th...@gmail.com> wrote:

> On Thu, Apr 2, 2009 at 4:13 AM, zhang jianfeng <zj...@gmail.com> wrote:
> > It seems like I would have to pay additional money, so why not configure a
> > Hadoop cluster on EC2 by myself? This has already been automated using
> > scripts.
>
> Personally, I'm excited about this.  They're charging a tiny fraction
> above the standard EC2 rate.  I like that the cluster shuts down
> automatically when the job completes -- you don't have to sit around
> and watch it.  Yeah, you can automate that, but it's one more thing to
> think about.
>
> -Stuart
>



-- 
http://blog.lukas-vlcek.com/

Re: Amazon Elastic MapReduce

Posted by Stuart Sierra <th...@gmail.com>.
On Thu, Apr 2, 2009 at 4:13 AM, zhang jianfeng <zj...@gmail.com> wrote:
> It seems like I would have to pay additional money, so why not configure a
> Hadoop cluster on EC2 by myself? This has already been automated using scripts.

Personally, I'm excited about this.  They're charging a tiny fraction
above the standard EC2 rate.  I like that the cluster shuts down
automatically when the job completes -- you don't have to sit around
and watch it.  Yeah, you can automate that, but it's one more thing to
think about.

-Stuart

Re: Amazon Elastic MapReduce

Posted by Chris K Wensel <ch...@wensel.net>.
You should check out the new pricing.

On Apr 2, 2009, at 1:13 AM, zhang jianfeng wrote:

> It seems like I would have to pay additional money, so why not configure a
> Hadoop cluster on EC2 by myself? This has already been automated using
> scripts.
>
>
>
>
>
> On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne <mi...@inf.ed.ac.uk>  
> wrote:
>
>> ... and only in the US
>>
>> Miles
>>
>> 2009/4/2 zhang jianfeng <zj...@gmail.com>:
>>> Does it support pig ?
>>>
>>>
>>> On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel <ch...@wensel.net>  
>>> wrote:
>>>
>>>>
>>>> FYI
>>>>
>>>> Amazons new Hadoop offering:
>>>> http://aws.amazon.com/elasticmapreduce/
>>>>
>>>> And Cascading 1.0 supports it:
>>>> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
>>>>
>>>> cheers,
>>>> ckw
>>>>
>>>> --
>>>> Chris K Wensel
>>>> chris@wensel.net
>>>> http://www.cascading.org/
>>>> http://www.scaleunlimited.com/
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>

--
Chris K Wensel
chris@wensel.net
http://www.cascading.org/
http://www.scaleunlimited.com/


Re: How many people is using Hadoop Streaming ?

Posted by Owen O'Malley <om...@apache.org>.
On Apr 3, 2009, at 10:35 AM, Ricky Ho wrote:

> I assume that the keys are still sorted, right? That means I will get
> all the "key1, valueX" entries before getting any of the "key2,
> valueY" entries, and key2 is always bigger than key1.

Yes.

-- Owen

RE: How many people is using Hadoop Streaming ?

Posted by Ricky Ho <rh...@adobe.com>.
Owen, thanks for your elaboration; the data point is very useful.

On your point ...
====================================================
In java you get
          key1, (value1, value2, ...)
          key2, (value3, ...)
in streaming you get
          key1 value1
          key1 value2
          key2 value3
and your application needs to detect the key changes.
=====================================================

I assume that the keys are still sorted, right?  That means I will get all the "key1, valueX" entries before getting any of the "key2, valueY" entries, and key2 is always bigger than key1.

Is this correct?

Rgds,
Ricky


-----Original Message-----
From: Owen O'Malley [mailto:omalley@apache.org] 
Sent: Friday, April 03, 2009 8:59 AM
To: core-user@hadoop.apache.org
Subject: Re: How many people is using Hadoop Streaming ?


On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:

> Has anyone benchmarked the performance difference of using Hadoop?
>  1) Java vs C++
>  2) Java vs Streaming

Yes, a while ago. When I tested it using sort, Java and C++ were  
roughly equal and streaming was 10-20% slower. Most of the cost with  
streaming came from the stringification.

>  1) I can pick a language that offers a different programming
> paradigm (e.g. I may choose a functional language, or logic
> programming, if it suits the problem better).  In fact, I could even
> choose Erlang for the map() and Prolog for the reduce().  Mixing and
> matching gives me more room to optimize.
>  2) I can pick the language that I am familiar with, or one that I  
> like.
>  3) Easy to switch to another language in a fine-grain incremental  
> way if I choose to do so in future.

Additionally, the interface to streaming is very stable. *smile* It  
also supports legacy applications well.

The downsides are that:
   1. The interface is very thin and has minimal functionality.
   2. Streaming combiners don't work very well. Many streaming
       applications buffer in the map and run the combiner internally.
   3. Streaming doesn't group the values in the reducer. In Java or
       C++, you get:
          key1, (value1, value2, ...)
          key2, (value3, ...)
       in streaming you get
          key1 value1
          key1 value2
          key2 value3
       and your application needs to detect the key changes.
   4. Binary data support has only recently been added to streaming.

> Am I missing something here?  Or are the majority of Hadoop
> applications written in Hadoop Streaming?

On Yahoo's research clusters, typically 1/3 of the applications are  
streaming, 1/3 pig, and 1/3 java.

-- Owen

Re: How many people is using Hadoop Streaming ?

Posted by Owen O'Malley <om...@apache.org>.
On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:

> Has anyone benchmarked the performance difference of using Hadoop?
>  1) Java vs C++
>  2) Java vs Streaming

Yes, a while ago. When I tested it using sort, Java and C++ were  
roughly equal and streaming was 10-20% slower. Most of the cost with  
streaming came from the stringification.

>  1) I can pick a language that offers a different programming
> paradigm (e.g. I may choose a functional language, or logic
> programming, if it suits the problem better).  In fact, I could even
> choose Erlang for the map() and Prolog for the reduce().  Mixing and
> matching gives me more room to optimize.
>  2) I can pick the language that I am familiar with, or one that I  
> like.
>  3) Easy to switch to another language in a fine-grain incremental  
> way if I choose to do so in future.

Additionally, the interface to streaming is very stable. *smile* It  
also supports legacy applications well.

The downsides are that:
   1. The interface is very thin and has minimal functionality.
   2. Streaming combiners don't work very well. Many streaming
       applications buffer in the map and run the combiner internally.
   3. Streaming doesn't group the values in the reducer. In Java or
       C++, you get:
          key1, (value1, value2, ...)
          key2, (value3, ...)
       in streaming you get
          key1 value1
          key1 value2
          key2 value3
       and your application needs to detect the key changes (see the
       sketch after this list).
   4. Binary data support has only recently been added to streaming.
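
As a minimal sketch of the key-change detection in point 3: a stand-alone
streaming reducer could be nothing more than a main() reading sorted
"key<TAB>count" lines from stdin. The class name and the summing logic here
are just an example, not part of streaming itself:

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Sketch of a streaming reducer: sum counts per key by watching for
    // the point where the (sorted) key changes on stdin.
    public class KeyChangeSumReducer {
      public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String currentKey = null;
        long sum = 0;
        String line;
        while ((line = in.readLine()) != null) {
          int tab = line.indexOf('\t');
          String key = (tab >= 0) ? line.substring(0, tab) : line;
          long count = (tab >= 0) ? Long.parseLong(line.substring(tab + 1).trim()) : 0;
          if (currentKey != null && !currentKey.equals(key)) {
            System.out.println(currentKey + "\t" + sum);  // key changed: flush
            sum = 0;
          }
          currentKey = key;
          sum += count;
        }
        if (currentKey != null) {
          System.out.println(currentKey + "\t" + sum);    // flush the last key
        }
      }
    }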

> Am I missing something here?  Or are the majority of Hadoop
> applications written in Hadoop Streaming?

On Yahoo's research clusters, typically 1/3 of the applications are  
streaming, 1/3 pig, and 1/3 java.

-- Owen

Re: How many people is using Hadoop Streaming ?

Posted by Steve Loughran <st...@apache.org>.
Tim Wintle wrote:
> On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
>>   1) I can pick a language that offers a different programming
>> paradigm (e.g. I may choose a functional language, or logic programming,
>> if it suits the problem better).  In fact, I could even choose Erlang
>> for the map() and Prolog for the reduce().  Mixing and matching gives me
>> more room to optimize.
> 
> Agreed (as someone who has written mappers/reducers in Python, perl,
> shell script and Scheme before).
> 

Sounds like a good argument for adding scripting support for in-JVM MR
jobs: use the Java 6 scripting APIs with any of the supported
languages - JavaScript out of the box, other languages (Jython, Scala) with
the right JARs.
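
As a rough sketch of what that could look like, using only the standard
javax.script API (the map() function and the wiring into a real Mapper are
hypothetical, just to show the engine lookup and call):

    import javax.script.Invocable;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    // Sketch: evaluate a user-supplied script once, then call its map()
    // function from Java for each input record.
    public class ScriptedMapSketch {
      public static void main(String[] args) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("javascript");
        engine.eval("function map(line) { return line.toUpperCase(); }");
        Invocable invocable = (Invocable) engine;
        Object result = invocable.invokeFunction("map", "hello streaming");
        System.out.println(result);  // prints: HELLO STREAMING
      }
    }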

Re: How many people is using Hadoop Streaming ?

Posted by Aaron Kimball <aa...@cloudera.com>.
Excellent. Thanks
- A

On Tue, Apr 7, 2009 at 2:16 PM, Owen O'Malley <om...@apache.org> wrote:

>
> On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:
>
>  Owen,
>>
>> Is binary streaming actually readily available?
>>
>
> https://issues.apache.org/jira/browse/HADOOP-1722
>
>

Re: How many people is using Hadoop Streaming ?

Posted by Owen O'Malley <om...@apache.org>.
On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:

> Owen,
>
> Is binary streaming actually readily available?

https://issues.apache.org/jira/browse/HADOOP-1722


Re: How many people is using Hadoop Streaming ?

Posted by Aaron Kimball <aa...@cloudera.com>.
Owen,

Is binary streaming actually readily available? Looking at
http://issues.apache.org/jira/browse/HADOOP-3227, it appears uncommitted.

- Aaron


On Fri, Apr 3, 2009 at 8:37 PM, Tim Wintle <ti...@teamrubber.com> wrote:

> On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
> >   1) I can pick a language that offers a different programming
> > paradigm (e.g. I may choose a functional language, or logic programming,
> > if it suits the problem better).  In fact, I could even choose Erlang
> > for the map() and Prolog for the reduce().  Mixing and matching gives me
> > more room to optimize.
>
> Agreed (as someone who has written mappers/reducers in Python, perl,
> shell script and Scheme before).
>
>

Re: How many people is using Hadoop Streaming ?

Posted by Tim Wintle <ti...@teamrubber.com>.
On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
>   1) I can pick a language that offers a different programming
> paradigm (e.g. I may choose a functional language, or logic programming,
> if it suits the problem better).  In fact, I could even choose Erlang
> for the map() and Prolog for the reduce().  Mixing and matching gives me
> more room to optimize.

Agreed (as someone who has written mappers/reducers in Python, perl,
shell script and Scheme before).


How many people is using Hadoop Streaming ?

Posted by Ricky Ho <rh...@adobe.com>.
Has anyone benchmarked the performance difference of using Hadoop?
  1) Java vs C++
  2) Java vs Streaming

Looking at the Hadoop architecture, since the TaskTracker will fork a separate process anyway to run the user-supplied map() and reduce() functions, I don't see a performance overhead to using Hadoop Streaming (of course the efficiency of the chosen script will be a factor, but I think that is orthogonal).  On the other hand, I see a lot of benefits to using Streaming, including ...

  1) I can pick a language that offers a different programming paradigm (e.g. I may choose a functional language, or logic programming, if it suits the problem better).  In fact, I could even choose Erlang for the map() and Prolog for the reduce().  Mixing and matching gives me more room to optimize.
  2) I can pick the language that I am familiar with, or one that I like.
  3) It is easy to switch to another language in a fine-grained, incremental way if I choose to do so in the future.

Even as a Java programmer, I can still write a main() method that reads from standard input and writes to standard output, and I don't see that I am losing much by doing that.  The benefit is that my code can easily be moved to another language in the future.
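
For example, a bare-bones streaming mapper written as a plain Java main()
might look like the sketch below (the class name and whitespace tokenization
are just illustrative):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;

    // Sketch of a streaming mapper: read raw lines from stdin and emit
    // tab-separated "word<TAB>1" pairs on stdout for the framework to sort.
    public class StdInOutMapper {
      public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
          for (String word : line.split("\\s+")) {
            if (word.length() > 0) {
              System.out.println(word + "\t1");
            }
          }
        }
      }
    }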

Am I missing something here?  Or are the majority of Hadoop applications written in Hadoop Streaming?

Rgds,
Ricky

RE: Amazon Elastic MapReduce

Posted by Ricky Ho <rh...@adobe.com>.
I disagree.  This is like arguing that everyone should learn everything, or else they won't know how to do everything.

A better situation is having the algorithm designers just focus on how to break down their algorithms into Map/Reduce form and test them out immediately, rather than requiring them to learn all the admin aspects of Hadoop, which becomes a hurdle to moving fast.

Rgds,
Ricky

-----Original Message-----
From: Steve Loughran [mailto:stevel@apache.org] 
Sent: Friday, April 03, 2009 2:19 AM
To: core-user@hadoop.apache.org
Subject: Re: Amazon Elastic MapReduce

Brian Bockelman wrote:
> 
> On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:
> 
>> It seems like I would have to pay additional money, so why not configure a
>> Hadoop cluster on EC2 by myself? This has already been automated using
>> scripts.
>>
>>
> 
> Not everyone has a support team or an operations team or enough time to 
> learn how to do it themselves.  You're basically paying for the fact 
> that the only thing you need to know to use Hadoop is:
> 1) Be able to write the Java classes.
> 2) Press the "go" button on a webpage somewhere.
> 
> You could use Hadoop with little-to-zero systems knowledge (and without 
> institutional support), which would always make some researchers happy.
> 
> Brian

True, but this way nobody gets the opportunity to learn how to do it 
themselves, which can be a tactical error one comes to regret further 
down the line. By learning the pain of cluster management today, you get 
to keep it under control as your data grows.

I am curious what bug patches AWS will supply, for they have been very 
silent on their hadoop work to date.

Re: Amazon Elastic MapReduce

Posted by Tim Wintle <ti...@teamrubber.com>.
On Fri, 2009-04-03 at 11:19 +0100, Steve Loughran wrote:
> True, but this way nobody gets the opportunity to learn how to do it 
> themselves, which can be a tactical error one comes to regret further 
> down the line. By learning the pain of cluster management today, you get 
> to keep it under control as your data grows.

Personally I don't want to have to learn (and especially not support in
production) the EC2 / S3 part, so it does sound appealing.

On a side note, I'd hope that at some point they give some control over
the priority of the overall job - on the level of "you can boot up these
machines whenever you want" versus "boot up these machines now" - which
should let them manage the load on their hardware and reduce costs
(savings which I'd obviously expect them to pass on to the users of
low-priority jobs). I'm not sure how that would fit into the "give me 10
nodes" method at the moment.

> 
> I am curious what bug patches AWS will supply, for they have been very 
> silent on their hadoop work to date.

I'm hoping it will involve the security of EC2 images, but I'm not expecting it.




Re: Amazon Elastic MapReduce

Posted by Steve Loughran <st...@apache.org>.
Brian Bockelman wrote:
> 
> On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:
> 
>> It seems like I would have to pay additional money, so why not configure a
>> Hadoop cluster on EC2 by myself? This has already been automated using
>> scripts.
>>
>>
> 
> Not everyone has a support team or an operations team or enough time to 
> learn how to do it themselves.  You're basically paying for the fact 
> that the only thing you need to know to use Hadoop is:
> 1) Be able to write the Java classes.
> 2) Press the "go" button on a webpage somewhere.
> 
> You could use Hadoop with little-to-zero systems knowledge (and without 
> institutional support), which would always make some researchers happy.
> 
> Brian

True, but this way nobody gets the opportunity to learn how to do it 
themselves, which can be a tactical error one comes to regret further 
down the line. By learning the pain of cluster management today, you get 
to keep it under control as your data grows.

I am curious what bug patches AWS will supply, for they have been very 
silent on their hadoop work to date.

Re: Amazon Elastic MapReduce

Posted by Brian Bockelman <bb...@cse.unl.edu>.
On Apr 2, 2009, at 3:13 AM, zhang jianfeng wrote:

> It seems like I would have to pay additional money, so why not configure a
> Hadoop cluster on EC2 by myself? This has already been automated using
> scripts.
>
>

Not everyone has a support team or an operations team or enough time  
to learn how to do it themselves.  You're basically paying for the  
fact that the only thing you need to know to use Hadoop is:
1) Be able to write the Java classes.
2) Press the "go" button on a webpage somewhere.

You could use Hadoop with little-to-zero systems knowledge (and  
without institutional support), which would always make some  
researchers happy.

Brian

>
>
>
> On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne <mi...@inf.ed.ac.uk>  
> wrote:
>
>> ... and only in the US
>>
>> Miles
>>
>> 2009/4/2 zhang jianfeng <zj...@gmail.com>:
>>> Does it support pig ?
>>>
>>>
>>> On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel <ch...@wensel.net>  
>>> wrote:
>>>
>>>>
>>>> FYI
>>>>
>>>> Amazons new Hadoop offering:
>>>> http://aws.amazon.com/elasticmapreduce/
>>>>
>>>> And Cascading 1.0 supports it:
>>>> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
>>>>
>>>> cheers,
>>>> ckw
>>>>
>>>> --
>>>> Chris K Wensel
>>>> chris@wensel.net
>>>> http://www.cascading.org/
>>>> http://www.scaleunlimited.com/
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> The University of Edinburgh is a charitable body, registered in
>> Scotland, with registration number SC005336.
>>


Re: Amazon Elastic MapReduce

Posted by zhang jianfeng <zj...@gmail.com>.
It seems like I would have to pay additional money, so why not configure a
Hadoop cluster on EC2 by myself? This has already been automated using scripts.





On Thu, Apr 2, 2009 at 4:09 PM, Miles Osborne <mi...@inf.ed.ac.uk> wrote:

> ... and only in the US
>
> Miles
>
> 2009/4/2 zhang jianfeng <zj...@gmail.com>:
> > Does it support pig ?
> >
> >
> > On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel <ch...@wensel.net> wrote:
> >
> >>
> >> FYI
> >>
> >> Amazons new Hadoop offering:
> >> http://aws.amazon.com/elasticmapreduce/
> >>
> >> And Cascading 1.0 supports it:
> >> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
> >>
> >> cheers,
> >> ckw
> >>
> >> --
> >> Chris K Wensel
> >> chris@wensel.net
> >> http://www.cascading.org/
> >> http://www.scaleunlimited.com/
> >>
> >>
> >
>
>
>
> --
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>

Re: Amazon Elastic MapReduce

Posted by Miles Osborne <mi...@inf.ed.ac.uk>.
... and only in the US

Miles

2009/4/2 zhang jianfeng <zj...@gmail.com>:
> Does it support pig ?
>
>
> On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel <ch...@wensel.net> wrote:
>
>>
>> FYI
>>
>> Amazons new Hadoop offering:
>> http://aws.amazon.com/elasticmapreduce/
>>
>> And Cascading 1.0 supports it:
>> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
>>
>> cheers,
>> ckw
>>
>> --
>> Chris K Wensel
>> chris@wensel.net
>> http://www.cascading.org/
>> http://www.scaleunlimited.com/
>>
>>
>



-- 
The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.

Re: Amazon Elastic MapReduce

Posted by zhang jianfeng <zj...@gmail.com>.
Does it support pig ?


On Thu, Apr 2, 2009 at 3:47 PM, Chris K Wensel <ch...@wensel.net> wrote:

>
> FYI
>
> Amazons new Hadoop offering:
> http://aws.amazon.com/elasticmapreduce/
>
> And Cascading 1.0 supports it:
> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
>
> cheers,
> ckw
>
> --
> Chris K Wensel
> chris@wensel.net
> http://www.cascading.org/
> http://www.scaleunlimited.com/
>
>

Re: Amazon Elastic MapReduce

Posted by Peter Skomoroch <pe...@gmail.com>.
Intermediate results can be stored in HDFS on the EC2 machines, or in S3
using s3n... performance is better if you store them in HDFS:

                 "-input",  "s3n://elasticmapreduce/samples/similarity/lastfm/input/",
                 "-output", "hdfs:///home/hadoop/output2/",



On Mon, Apr 6, 2009 at 11:27 AM, Patrick A. <pa...@gmail.com> wrote:

>
> Are intermediate results stored in S3 as well?
>
> Also, any plans to support HTable?
>
>
>
> Chris K Wensel-2 wrote:
> >
> >
> > FYI
> >
> > Amazons new Hadoop offering:
> > http://aws.amazon.com/elasticmapreduce/
> >
> > And Cascading 1.0 supports it:
> > http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
> >
> > cheers,
> > ckw
> >
> > --
> > Chris K Wensel
> > chris@wensel.net
> > http://www.cascading.org/
> > http://www.scaleunlimited.com/
> >
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Amazon-Elastic-MapReduce-tp22842658p22911128.html
> Sent from the Hadoop core-user mailing list archive at Nabble.com.
>
>


-- 
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
http://delicious.com/pskomoroch
http://twitter.com/peteskomoroch

Re: Amazon Elastic MapReduce

Posted by "Patrick A." <pa...@gmail.com>.
Are intermediate results stored in S3 as well?

Also, any plans to support HTable?



Chris K Wensel-2 wrote:
> 
> 
> FYI
> 
> Amazons new Hadoop offering:
> http://aws.amazon.com/elasticmapreduce/
> 
> And Cascading 1.0 supports it:
> http://www.cascading.org/2009/04/amazon-elastic-mapreduce.html
> 
> cheers,
> ckw
> 
> --
> Chris K Wensel
> chris@wensel.net
> http://www.cascading.org/
> http://www.scaleunlimited.com/
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Amazon-Elastic-MapReduce-tp22842658p22911128.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.