Posted to common-user@hadoop.apache.org by Ricky Ho <rh...@adobe.com> on 2009/04/03 18:42:38 UTC

How many people is using Hadoop Streaming ?

Has anyone benchmarked the performance differences when using Hadoop?
  1) Java vs C++
  2) Java vs Streaming

From looking at the Hadoop architecture, since the TaskTracker forks a separate process anyway to run the user-supplied map() and reduce() functions, I don't see where the performance overhead of Hadoop Streaming would come from (of course, the efficiency of the chosen scripting language will be a factor, but I think that is orthogonal).  On the other hand, I see a lot of benefits in using Streaming, including:

  1) I can pick a language that offers a different programming paradigm (e.g. I may choose a functional language, or logic programming, if it suits the problem better).  In fact, I can even choose Erlang for map() and Prolog for reduce().  Mixing and matching lets me optimize more.
  2) I can pick the language that I am familiar with, or one that I like.
  3) It is easy to switch to another language in a fine-grained, incremental way if I choose to do so in the future.

Even as a Java programmer, I can still write a main() method that reads from standard input and writes to standard output, and I don't see that I lose much by doing so.  The benefit is that my code can easily be moved to another language in the future.
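
To make the stdin/stdout contract concrete, here is a minimal sketch of a streaming-style mapper, written in Python rather than Java purely for brevity. The word-count logic and the names map_words and run are illustrative assumptions, not anything from Hadoop itself; the only thing assumed about Streaming is that it feeds input lines on standard input and reads tab-separated key/value lines from standard output.

```python
import sys


def map_words(lines):
    """Yield one 'word<TAB>1' record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Wire the mapper to the streams a streaming task is started with."""
    for record in map_words(stdin):
        stdout.write(record + "\n")
```

A real script would end by calling run() and would be passed to the streaming jar through its -mapper option. Because the logic only sees lines of text, the same program can be tested locally with a shell pipe.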

Am I missing something here?  Or are the majority of Hadoop applications written with Hadoop Streaming?

Rgds,
Ricky

Re: How many people is using Hadoop Streaming ?

Posted by Owen O'Malley <om...@apache.org>.
On Apr 3, 2009, at 10:35 AM, Ricky Ho wrote:

> I assume that the keys are still sorted, right?  That means I will get
> all the "key1 valueX" entries before getting any of the "key2
> valueY" entries, and key2 is always bigger than key1.

Yes.

-- Owen

RE: How many people is using Hadoop Streaming ?

Posted by Ricky Ho <rh...@adobe.com>.
Owen, thanks for the elaboration; the data point is very useful.

On your point ...
====================================================
In Java you get
          key1, (value1, value2, ...)
          key2, (value3, ...)
in streaming you get
          key1 value1
          key1 value2
          key2 value3
and your application needs to detect the key changes.
=====================================================

I assume that the keys are still sorted, right?  That means I will get all the "key1 valueX" entries before getting any of the "key2 valueY" entries, and key2 is always bigger than key1.

Is this correct ?
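
Given sorted input, detecting the key changes can be sketched as below. This is a minimal example under stated assumptions: records are tab-separated "key<TAB>value" lines, values are integers to be summed, and the names parse, reduce_sorted and run are hypothetical. Note that itertools.groupby only needs equal keys to be adjacent, which the sort guarantees.

```python
import sys
from itertools import groupby


def parse(lines):
    """Split each 'key<TAB>value' line into a (key, value) pair of strings."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        yield key, value


def reduce_sorted(lines):
    """Sum the integer values of each run of identical, adjacent keys."""
    for key, group in groupby(parse(lines), key=lambda kv: kv[0]):
        yield key, sum(int(value) for _, value in group)


def run(stdin=sys.stdin, stdout=sys.stdout):
    """Read sorted records from stdin and write one total per key."""
    for key, total in reduce_sorted(stdin):
        stdout.write(f"{key}\t{total}\n")
```

If the same key showed up in two separate runs, this sketch would emit it twice, so it relies entirely on the framework's sort.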

Rgds,
Ricky


-----Original Message-----
From: Owen O'Malley [mailto:omalley@apache.org] 
Sent: Friday, April 03, 2009 8:59 AM
To: core-user@hadoop.apache.org
Subject: Re: How many people is using Hadoop Streaming ?


On Apr 3, 2009, at 9:42 AM, Ricky Ho wrote:

> Has anyone benchmarked the performance differences when using Hadoop?
>  1) Java vs C++
>  2) Java vs Streaming

Yes, a while ago. When I tested it using sort, Java and C++ were  
roughly equal and streaming was 10-20% slower. Most of the cost with  
streaming came from the stringification.
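
As a rough illustration of what the stringification step involves (one reading of the term, not a reproduction of the benchmark), every typed record has to be rendered as a text line on one side of the pipe and parsed back on the other. The record shape (integer key, float value) and the function names are assumptions made for the example.

```python
def to_line(key, value):
    """Render a typed (int, float) record as a tab-separated text line."""
    return f"{key}\t{value!r}\n"


def from_line(line):
    """Parse a tab-separated text line back into an (int, float) record."""
    key_text, value_text = line.rstrip("\n").split("\t", 1)
    return int(key_text), float(value_text)
```

Both conversions run once per record in each direction, work that a job passing typed objects directly does not have to do.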

>  1) I can pick a language that offers a different programming
> paradigm (e.g. I may choose a functional language, or logic
> programming, if it suits the problem better).  In fact, I can even
> choose Erlang for map() and Prolog for reduce().  Mixing and
> matching lets me optimize more.
>  2) I can pick the language that I am familiar with, or one that I
> like.
>  3) It is easy to switch to another language in a fine-grained,
> incremental way if I choose to do so in the future.

Additionally, the interface to streaming is very stable. *smile* It  
also supports legacy applications well.

The downsides are that:
   1. The interface is very thin and has minimal functionality.
   2. Streaming combiners don't work very well. Many streaming
       applications buffer in the map and run the combiner internally.
   3. Streaming doesn't group the values in the reducer. In Java or C++,
       you get:
          key1, (value1, value2, ...)
          key2, (value3, ...)
       in streaming you get
          key1 value1
          key1 value2
          key2 value3
       and your application needs to detect the key changes.
   4. Binary data support has only recently been added to streaming.
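
The "buffer in the map and run the combiner internally" pattern from point 2 might look like the sketch below. It assumes a word-count style job; the function names and the max_keys threshold are hypothetical and only there to keep the buffer bounded.

```python
from collections import Counter


def drain(counts):
    """Turn the buffered counts into output records and empty the buffer."""
    records = [f"{word}\t{n}" for word, n in counts.items()]
    counts.clear()
    return records


def map_with_local_combine(lines, max_keys=100_000):
    """Count words in memory and emit partial sums instead of one record per word."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
        if len(counts) >= max_keys:
            # Flush early so memory use stays bounded on large inputs.
            yield from drain(counts)
    yield from drain(counts)
```

The reducer still has to add up the partial sums, because a key can be flushed more than once by one mapper and will also arrive from other mappers.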

> Am I missing something here?  Or are the majority of Hadoop
> applications written with Hadoop Streaming?

On Yahoo's research clusters, typically 1/3 of the applications are
streaming, 1/3 Pig, and 1/3 Java.

-- Owen

Re: How many people is using Hadoop Streaming ?

Posted by Steve Loughran <st...@apache.org>.
Tim Wintle wrote:
> On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
>>   1) I can pick a language that offers a different programming
>> paradigm (e.g. I may choose a functional language, or logic programming,
>> if it suits the problem better).  In fact, I can even choose Erlang
>> for map() and Prolog for reduce().  Mixing and matching lets me
>> optimize more.
>
> Agreed (as someone who has written mappers/reducers in Python, Perl,
> shell script and Scheme before).
> 

Sounds like a good argument for adding scripting support for in-JVM MR
jobs: use the Java 6 scripting APIs with any of the supported
languages. JavaScript works out of the box; other languages (Jython,
Scala) need the right JARs.

Re: How many people is using Hadoop Streaming ?

Posted by Aaron Kimball <aa...@cloudera.com>.
Excellent. Thanks
- A

On Tue, Apr 7, 2009 at 2:16 PM, Owen O'Malley <om...@apache.org> wrote:

> On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:
>
>> Owen,
>>
>> Is binary streaming actually readily available?
>
> https://issues.apache.org/jira/browse/HADOOP-1722

Re: How many people is using Hadoop Streaming ?

Posted by Owen O'Malley <om...@apache.org>.
On Apr 7, 2009, at 11:41 AM, Aaron Kimball wrote:

> Owen,
>
> Is binary streaming actually readily available?

https://issues.apache.org/jira/browse/HADOOP-1722


Re: How many people is using Hadoop Streaming ?

Posted by Aaron Kimball <aa...@cloudera.com>.
Owen,

Is binary streaming actually readily available? Looking at
http://issues.apache.org/jira/browse/HADOOP-3227, it appears uncommitted.

- Aaron


On Fri, Apr 3, 2009 at 8:37 PM, Tim Wintle <ti...@teamrubber.com> wrote:

> On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
> >   1) I can pick a language that offers a different programming
> > paradigm (e.g. I may choose a functional language, or logic programming,
> > if it suits the problem better).  In fact, I can even choose Erlang
> > for map() and Prolog for reduce().  Mixing and matching lets me
> > optimize more.
>
> Agreed (as someone who has written mappers/reducers in Python, Perl,
> shell script and Scheme before).
>
>

Re: How many people is using Hadoop Streaming ?

Posted by Tim Wintle <ti...@teamrubber.com>.
On Fri, 2009-04-03 at 09:42 -0700, Ricky Ho wrote:
>   1) I can pick a language that offers a different programming
> paradigm (e.g. I may choose a functional language, or logic programming,
> if it suits the problem better).  In fact, I can even choose Erlang
> for map() and Prolog for reduce().  Mixing and matching lets me
> optimize more.

Agreed (as someone who has written mappers/reducers in Python, Perl,
shell script and Scheme before).