Posted to common-user@hadoop.apache.org by Mark question <ma...@gmail.com> on 2012/04/05 20:54:23 UTC

Hadoop streaming or pipes ..

Hi guys,
  quick question:
   Are there any performance gains from Hadoop streaming or pipes over
Java? From what I've read, they only make testing easier by letting you use
your favorite language. So I guess the code is eventually translated to
bytecode and then executed. Is that true?

Thank you,
Mark

Re: Hadoop streaming or pipes ..

Posted by Robert Evans <ev...@yahoo-inc.com>.

It is a regular process, unless you explicitly tell it to run java, which would be a bit odd but is possible.

--Bobby

On 4/5/12 3:14 PM, "Mark question" <ma...@gmail.com> wrote:

> Thanks for the response, Robert. So the overhead will be in read/write
> and communication. But is the newly spawned process a JVM or a regular
> process?
>
> Thanks,
> Mark

Re: Hadoop streaming or pipes ..

Posted by Mark question <ma...@gmail.com>.

Thanks for the response, Robert. So the overhead will be in read/write
and communication. But is the newly spawned process a JVM or a regular
process?

Thanks,
Mark

Re: Hadoop streaming or pipes ..

Posted by Mark question <ma...@gmail.com>.

Thanks, all. Charles, you pointed me to the Baidu slides titled
Introduction to *Hadoop C++
Extension*<http://hic2010.hadooper.cn/dct/attach/Y2xiOmNsYjpwZGY6ODI5>,
which describe their experience; the sixth slide shows exactly what I was
looking for. Besides offering no performance gain, pipes still makes memory
management hard, hence the development of HCE.

Thanks,
Mark

On Thu, Apr 5, 2012 at 2:23 PM, Charles Earl <ch...@gmail.com> wrote:

> Also bear in mind that there is a kind of detour involved, in the sense
> that a pipes map must send key,value data back to the Java process and then
> to reduce (more or less).
> I think that the Hadoop C++ Extension (HCE, there is a patch) is supposed to
> be faster.
> Would be interested to know if the community has any experience with HCE
> performance.
> C

Re: Hadoop streaming or pipes ..

Posted by Charles Earl <ch...@gmail.com>.

Also bear in mind that there is a kind of detour involved, in the sense that a pipes map must send its key/value data back to the Java process, which then passes it on to reduce (more or less).
I think that the Hadoop C++ Extension (HCE, there is a patch) is supposed to be faster.
I would be interested to know if the community has any experience with HCE performance.
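
The detour can be pictured with the streaming case (pipes uses its own protocol, so this is only an analogy): map output goes back to the Java framework, which sorts it by key and then feeds it to the reduce process. A streaming reducer therefore sees sorted "key<TAB>value" lines and can group consecutive lines. The sketch below assumes tab-separated records with integer values; the name reduce_lines is hypothetical.

```python
from itertools import groupby


def reduce_lines(lines):
    """Sum integer values per key, assuming lines arrive sorted by key."""
    # Split each 'key<TAB>value' line into a (key, value) pair.
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    # Consecutive pairs with the same key form one reduce group.
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(value) for _, value in group)}"
```

Every record in such a job crosses the process boundary twice on the map side and twice again on the reduce side, which is the detour in question.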
C

Re: Hadoop streaming or pipes ..

Posted by Robert Evans <ev...@yahoo-inc.com>.

Streaming and pipes do very similar things.  Each will fork/exec a separate process that runs whatever you want it to run.  The JVM that is running Hadoop then communicates with this process to send the data over and get the processing results back.  The difference is that streaming uses stdin/stdout for this communication, so preexisting tools like grep, sed, and awk can be used directly.  Pipes uses a custom protocol with a C++ library to communicate.  The C++ library is tagged with SWIG-compatible data so that it can be wrapped to provide APIs in other languages like Python or Perl.
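
As a rough illustration of the stdin/stdout contract, a streaming mapper can be any executable that reads records from standard input and writes key/value lines to standard output. The sketch below is a word-count mapper in Python. The names map_lines and main are illustrative, and the tab-separated "word<TAB>1" output assumes the usual tab separator between key and value; check the setting for your own job.

```python
import sys


def map_lines(lines):
    """Yield one 'word<TAB>1' record for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def main(stdin=sys.stdin, stdout=sys.stdout):
    """Read input records from stdin and write mapped records to stdout."""
    for record in map_lines(stdin):
        stdout.write(record + "\n")
```

To use this as an actual mapper script, call main() at the bottom of the file. Because it only touches stdin and stdout, it can be tried without a cluster by piping a text file into it from the shell.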

I am not sure what the performance difference is between the two, but in my own work I have seen a significant performance penalty from using either of them, because there is a somewhat large overhead of sending all of the data out to a separate process just to read it back in again.
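
A minimal way to see where that overhead comes from, outside Hadoop entirely: the sketch below runs the same identity "map" once inside the current process and once by writing every byte to a child process and reading it back. The child is a hypothetical stand-in for an external mapper, not Hadoop code, and the function names are made up for this example.

```python
import subprocess
import sys

# Child program that copies stdin to stdout unchanged, standing in for an
# external mapper. (Hypothetical stand-in, not Hadoop code.)
ECHO_CHILD = "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read())"


def process_in_process(data: bytes) -> bytes:
    """Identity 'map' done in the current process: nothing crosses a pipe."""
    return data


def process_via_child(data: bytes) -> bytes:
    """Same identity 'map', but every byte is sent to a child and read back."""
    result = subprocess.run(
        [sys.executable, "-c", ECHO_CHILD],
        input=data,
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

Both functions return the same bytes; timing them on a large payload (for example with time.perf_counter) shows the extra cost of the round trip. The absolute numbers depend on the machine and say nothing precise about a real Hadoop job.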

--Bobby Evans

