Posted to common-user@hadoop.apache.org by lin <no...@gmail.com> on 2008/03/31 22:00:59 UTC

Hadoop streaming performance problem

Hi,

I am looking into using Hadoop streaming to parallelize some simple
programs. So far the performance has been pretty disappointing.

The cluster contains 5 nodes. Each node has two CPU cores. The task capacity
of each node is 2. The Hadoop version is 0.15.

Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes standalone
(on a single CPU core). Program 2 runs for 5 minutes on the Hadoop cluster and
4.5 minutes standalone. Both programs run as map-only jobs.

I understand that there is some overhead in starting up tasks and in reading
from and writing to the distributed file system. But that does not seem to
explain all of the overhead. Most map tasks are data-local. I modified program
1 to output nothing and saw the same magnitude of overhead.

The output of top shows that the majority of the CPU time is consumed by
Hadoop Java processes (e.g. org.apache.hadoop.mapred.TaskTracker$Child). So
I added a profiling option (-agentlib:hprof=cpu=samples) to
mapred.child.java.opts.
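
For reference, the option can be passed per job on the streaming command line.
The jar path, input/output paths and mapper name below are only illustrative,
and note that the value replaces any existing child JVM options, so those have
to be repeated in it:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
        -jobconf mapred.child.java.opts="-Xmx300m -agentlib:hprof=cpu=samples" \
        -jobconf mapred.reduce.tasks=0 \
        -input input -output output \
        -mapper ./program1 -file ./program1

By default hprof writes its report to java.hprof.txt in each task's working
directory.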

The profile results show that most of the CPU time is spent in the following
methods:

   rank   self  accum   count trace method

   1 23.76% 23.76%    1246 300472 java.lang.UNIXProcess.waitForProcessExit

   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes

   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes

   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes

And their stack traces show that these methods are for interacting with the
map program.


TRACE 300472:
        java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknown line)
        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)

TRACE 300474:
        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
        java.io.FileInputStream.read(FileInputStream.java:199)
        java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
        java.io.BufferedInputStream.read(BufferedInputStream.java:317)
        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        java.io.FilterInputStream.read(FilterInputStream.java:66)
        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(PipeMapRed.java:348)

TRACE 300479:
        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown line)
        java.io.FileInputStream.read(FileInputStream.java:199)
        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
        java.io.FilterInputStream.read(FilterInputStream.java:66)
        org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:136)
        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(UTF8ByteArrayUtils.java:157)
        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(PipeMapRed.java:399)

TRACE 300478:
        java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknown line)
        java.io.FileOutputStream.write(FileOutputStream.java:260)
        java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
        java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
        java.io.DataOutputStream.flush(DataOutputStream.java:106)
        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
        org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1760)


I don't understand why Hadoop streaming needs so much CPU time to read from
and write to the map program. Note that 23.67% of the time goes to reading
from the standard error of the map program, even though the program does not
write any error output at all!

Does anyone know any way to get rid of this seemingly unnecessary overhead
in Hadoop streaming?

Thanks,

Lin

Re: Hadoop streaming performance problem

Posted by lin <no...@gmail.com>.
My previous java.opts was actually "-server -Xmx512m". I increased the heap
size to 1024M and the running time was about the same. The resident sizes of
the Java processes seem to be no greater than 150M.

On Mon, Mar 31, 2008 at 1:56 PM, Theodore Van Rooy <mu...@gmail.com>
wrote:

> try extending the java heap size as well.. I'd be interested to see what
> kind of an effect that has on time (if any).
>
>
>
> On Mon, Mar 31, 2008 at 2:30 PM, lin <no...@gmail.com> wrote:
>
> > I'm running custom map programs written in C++. What the programs do is
> > very
> > simple. For example, in program 2, for each input line        ID node1
> > node2
> > ... nodeN
> > the program outputs
> >        node1 ID
> >        node2 ID
> >        ...
> >        nodeN ID
> >
> > Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> >
> > I agree that it depends on the scripts. I tried replicating the
> > computation
> > for each input line by 10 times and saw significantly better speedup.
> But
> > it
> > is still pretty bad that Hadoop streaming has such big overhead for
> simple
> > programs.
> >
> > I also tried writing program 1 with Hadoop Java API. I got almost 1000%
> > speed up on the cluster.
> >
> > Lin
> >
> > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <mu...@gmail.com>
> > wrote:
> >
> > > are you running a custom map script or a standard linux command like
> WC?
> > >  If
> > > custom, what does your script do?
> > >
> > > How much ram do you have?  what are you Java memory settings?
> > >
> > > I used the following setup
> > >
> > > 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
> > task
> > > max.
> > >
> > > I got the following results
> > >
> > > WC 30-40% speedup
> > > Sort 40% speedup
> > > Grep 5X slowdown (turns out this was due to what you described
> above...
> > > Grep
> > > is just very highly optimized for command line)
> > > Custom perl script which is essentially a For loop which matches each
> > row
> > > of
> > > a dataset to a set of 100 categories) 60% speedup.
> > >
> > > So I do think that it depends on your script... and some other
> settings
> > of
> > > yours.
> > >
> > > Theo
> > >
> > > On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am looking into using Hadoop streaming to parallelize some simple
> > > > programs. So far the performance has been pretty disappointing.
> > > >
> > > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > > capacity
> > > > of each node is 2. The Hadoop version is 0.15.
> > > >
> > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> in
> > > > standalone (on a single CPU core). Program runs for 5 minutes on the
> > > > Hadoop
> > > > cluster and 4.5 minutes in standalone. Both programs run as map-only
> > > jobs.
> > > >
> > > > I understand that there is some overhead in starting up tasks,
> reading
> > > to
> > > > and writing from the distributed file system. But they do not seem
> to
> > > > explain all the overhead. Most map tasks are data-local. I modified
> > > > program
> > > > 1 to output nothing and saw the same magnitude of overhead.
> > > >
> > > > The output of top shows that the majority of the CPU time is
> consumed
> > by
> > > > Hadoop java processes (e.g.
> > org.apache.hadoop.mapred.TaskTracker$Child).
> > > > So
> > > > I added a profile option (-agentlib:hprof=cpu=samples) to
> > > > mapred.child.java.opts.
> > > >
> > > > The profile results show that most of CPU time is spent in the
> > following
> > > > methods
> > > >
> > > >   rank   self  accum   count trace method
> > > >
> > > >   1 23.76% 23.76%    1246 300472
> > > java.lang.UNIXProcess.waitForProcessExit
> > > >
> > > >   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> > > >
> > > >   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> > > >
> > > >   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> > > >
> > > > And their stack traces show that these methods are for interacting
> > with
> > > > the
> > > > map program.
> > > >
> > > >
> > > > TRACE 300472:
> > > >
> > > >
> > > >
> >  java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknownline)
> > > >
> > > >        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > >
> > > >        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > >
> > > > TRACE 300474:
> > > >
> > > >        java.io.FileInputStream.readBytes(
> FileInputStream.java:Unknown
> > > > line)
> > > >
> > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > >
> > > >        java.io.BufferedInputStream.read1(BufferedInputStream.java
> :256)
> > > >
> > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> :317)
> > > >
> > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> :218)
> > > >
> > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> :237)
> > > >
> > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > >
> > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > LineRecordReader.java:136)
> > > >
> > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > UTF8ByteArrayUtils.java:157)
> > > >
> > > >        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> > > > PipeMapRed.java:348)
> > > >
> > > > TRACE 300479:
> > > >
> > > >        java.io.FileInputStream.readBytes(
> FileInputStream.java:Unknown
> > > > line)
> > > >
> > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > >
> > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> :218)
> > > >
> > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> :237)
> > > >
> > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > >
> > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > LineRecordReader.java:136)
> > > >
> > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > UTF8ByteArrayUtils.java:157)
> > > >
> > > >        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> > > > PipeMapRed.java:399)
> > > >
> > > > TRACE 300478:
> > > >
> > > >
> > > >
> >  java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknownline)
> > > >
> > > >        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > > >
> > > >        java.io.BufferedOutputStream.flushBuffer(
> > > BufferedOutputStream.java
> > > > :65)
> > > >
> > > >
> >  java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > > >
> > > >
> >  java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > > >
> > > >        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > > >
> > > >        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
> :96)
> > > >
> > > >        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > >
> > > >        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > > >
> >  org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java
> > > > :1760)
> > > >
> > > >
> > > > I don't understand why Hadoop streaming needs so much CPU time to
> read
> > > > from
> > > > and write to the map program. Note it takes 23.67% time to read from
> > the
> > > > standard error of the map program while the program does not output
> > any
> > > > error at all!
> > > >
> > > > Does anyone know any way to get rid of this seemingly unnecessary
> > > overhead
> > > > in Hadoop streaming?
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > >
> > >
> > >
> > > --
> > > Theodore Van Rooy
> > > http://greentheo.scroggles.com
> > >
> >
>
>
>
> --
> Theodore Van Rooy
> http://greentheo.scroggles.com
>

Re: Hadoop streaming performance problem

Posted by Theodore Van Rooy <mu...@gmail.com>.
Try extending the Java heap size as well. I'd be interested to see what
kind of effect that has on the running time (if any).
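
(One way to do that, purely as an illustration, is on the streaming command
line: -jobconf mapred.child.java.opts=-Xmx1024m.)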



On Mon, Mar 31, 2008 at 2:30 PM, lin <no...@gmail.com> wrote:

> I'm running custom map programs written in C++. What the programs do is
> very
> simple. For example, in program 2, for each input line        ID node1
> node2
> ... nodeN
> the program outputs
>        node1 ID
>        node2 ID
>        ...
>        nodeN ID
>
> Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
>
> I agree that it depends on the scripts. I tried replicating the
> computation
> for each input line by 10 times and saw significantly better speedup. But
> it
> is still pretty bad that Hadoop streaming has such big overhead for simple
> programs.
>
> I also tried writing program 1 with Hadoop Java API. I got almost 1000%
> speed up on the cluster.
>
> Lin
>
> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <mu...@gmail.com>
> wrote:
>
> > are you running a custom map script or a standard linux command like WC?
> >  If
> > custom, what does your script do?
> >
> > How much ram do you have?  what are you Java memory settings?
> >
> > I used the following setup
> >
> > 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
> task
> > max.
> >
> > I got the following results
> >
> > WC 30-40% speedup
> > Sort 40% speedup
> > Grep 5X slowdown (turns out this was due to what you described above...
> > Grep
> > is just very highly optimized for command line)
> > Custom perl script which is essentially a For loop which matches each
> row
> > of
> > a dataset to a set of 100 categories) 60% speedup.
> >
> > So I do think that it depends on your script... and some other settings
> of
> > yours.
> >
> > Theo
> >
> > On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> >
> > > Hi,
> > >
> > > I am looking into using Hadoop streaming to parallelize some simple
> > > programs. So far the performance has been pretty disappointing.
> > >
> > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > capacity
> > > of each node is 2. The Hadoop version is 0.15.
> > >
> > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes in
> > > standalone (on a single CPU core). Program runs for 5 minutes on the
> > > Hadoop
> > > cluster and 4.5 minutes in standalone. Both programs run as map-only
> > jobs.
> > >
> > > I understand that there is some overhead in starting up tasks, reading
> > to
> > > and writing from the distributed file system. But they do not seem to
> > > explain all the overhead. Most map tasks are data-local. I modified
> > > program
> > > 1 to output nothing and saw the same magnitude of overhead.
> > >
> > > The output of top shows that the majority of the CPU time is consumed
> by
> > > Hadoop java processes (e.g.
> org.apache.hadoop.mapred.TaskTracker$Child).
> > > So
> > > I added a profile option (-agentlib:hprof=cpu=samples) to
> > > mapred.child.java.opts.
> > >
> > > The profile results show that most of CPU time is spent in the
> following
> > > methods
> > >
> > >   rank   self  accum   count trace method
> > >
> > >   1 23.76% 23.76%    1246 300472
> > java.lang.UNIXProcess.waitForProcessExit
> > >
> > >   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> > >
> > >   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> > >
> > >   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> > >
> > > And their stack traces show that these methods are for interacting
> with
> > > the
> > > map program.
> > >
> > >
> > > TRACE 300472:
> > >
> > >
> > >
>  java.lang.UNIXProcess.waitForProcessExit(UNIXProcess.java:Unknownline)
> > >
> > >        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >
> > >        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >
> > > TRACE 300474:
> > >
> > >        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown
> > > line)
> > >
> > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > >
> > >        java.io.BufferedInputStream.read1(BufferedInputStream.java:256)
> > >
> > >        java.io.BufferedInputStream.read(BufferedInputStream.java:317)
> > >
> > >        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >
> > >        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >
> > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >
> > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > LineRecordReader.java:136)
> > >
> > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > UTF8ByteArrayUtils.java:157)
> > >
> > >        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> > > PipeMapRed.java:348)
> > >
> > > TRACE 300479:
> > >
> > >        java.io.FileInputStream.readBytes(FileInputStream.java:Unknown
> > > line)
> > >
> > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > >
> > >        java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
> > >
> > >        java.io.BufferedInputStream.read(BufferedInputStream.java:237)
> > >
> > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >
> > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > LineRecordReader.java:136)
> > >
> > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > UTF8ByteArrayUtils.java:157)
> > >
> > >        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> > > PipeMapRed.java:399)
> > >
> > > TRACE 300478:
> > >
> > >
> > >
>  java.io.FileOutputStream.writeBytes(FileOutputStream.java:Unknownline)
> > >
> > >        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > >
> > >        java.io.BufferedOutputStream.flushBuffer(
> > BufferedOutputStream.java
> > > :65)
> > >
> > >
>  java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
> > >
> > >
>  java.io.BufferedOutputStream.flush(BufferedOutputStream.java:124)
> > >
> > >        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > >
> > >        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java:96)
> > >
> > >        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >
> > >        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >
>  org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java
> > > :1760)
> > >
> > >
> > > I don't understand why Hadoop streaming needs so much CPU time to read
> > > from
> > > and write to the map program. Note it takes 23.67% time to read from
> the
> > > standard error of the map program while the program does not output
> any
> > > error at all!
> > >
> > > Does anyone know any way to get rid of this seemingly unnecessary
> > overhead
> > > in Hadoop streaming?
> > >
> > > Thanks,
> > >
> > > Lin
> > >
> >
> >
> >
> > --
> > Theodore Van Rooy
> > http://greentheo.scroggles.com
> >
>



-- 
Theodore Van Rooy
http://greentheo.scroggles.com

Re: Hadoop streaming performance problem

Posted by Travis Brady <tr...@mochimedia.com>.
>
> A set of reasonable performance tests and results would be very helpful
> in helping people decide whether to go with streaming or not. Hopefully
> we can get some numbers from this thread and publish them? Anyone else
> compared streaming with native java?
>

I think this is a great idea. I think it'd also be nice to have a small
cookbook-ish thing showing equivalent programs in each style: perhaps a
simple streaming example in Python or something and its equivalent using
Java. I'd have written this already, but I don't know Java.



On Mon, Mar 31, 2008 at 3:42 PM, Colin Freas <co...@gmail.com> wrote:

> Really?
>
> I would expect the opposite: for compressed files to process slower.
>
> You're saying that is not the case, and that compressed files actually
> increase the speed of jobs?
>
> -Colin
>
> On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <an...@kostyrka.org>
> wrote:
>
> > Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > provide the input files gzipped. Not great difference (e.g. 50% slower
> > when not gzipped, plus it took more than twice as long to upload the
> > data to HDFS-on-S3 in the first place), but still probably relevant.
> >
> > Andreas
> >
> > Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
> > > I'm running custom map programs written in C++. What the programs do
> is
> > very
> > > simple. For example, in program 2, for each input line        ID node1
> > node2
> > > ... nodeN
> > > the program outputs
> > >         node1 ID
> > >         node2 ID
> > >         ...
> > >         nodeN ID
> > >
> > > Each node has 4GB to 8GB of memory. The java memory setting is
> -Xmx300m.
> > >
> > > I agree that it depends on the scripts. I tried replicating the
> > computation
> > > for each input line by 10 times and saw significantly better speedup.
> > But it
> > > is still pretty bad that Hadoop streaming has such big overhead for
> > simple
> > > programs.
> > >
> > > I also tried writing program 1 with Hadoop Java API. I got almost
> 1000%
> > > speed up on the cluster.
> > >
> > > Lin
> > >
> > > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <
> munkey906@gmail.com>
> > > wrote:
> > >
> > > > are you running a custom map script or a standard linux command like
> > WC?
> > > >  If
> > > > custom, what does your script do?
> > > >
> > > > How much ram do you have?  what are you Java memory settings?
> > > >
> > > > I used the following setup
> > > >
> > > > 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a
> 4
> > task
> > > > max.
> > > >
> > > > I got the following results
> > > >
> > > > WC 30-40% speedup
> > > > Sort 40% speedup
> > > > Grep 5X slowdown (turns out this was due to what you described
> > above...
> > > > Grep
> > > > is just very highly optimized for command line)
> > > > Custom perl script which is essentially a For loop which matches
> each
> > row
> > > > of
> > > > a dataset to a set of 100 categories) 60% speedup.
> > > >
> > > > So I do think that it depends on your script... and some other
> > settings of
> > > > yours.
> > > >
> > > > Theo
> > > >
> > > > On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am looking into using Hadoop streaming to parallelize some
> simple
> > > > > programs. So far the performance has been pretty disappointing.
> > > > >
> > > > > The cluster contains 5 nodes. Each node has two CPU cores. The
> task
> > > > > capacity
> > > > > of each node is 2. The Hadoop version is 0.15.
> > > > >
> > > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> > in
> > > > > standalone (on a single CPU core). Program runs for 5 minutes on
> the
> > > > > Hadoop
> > > > > cluster and 4.5 minutes in standalone. Both programs run as
> map-only
> > > > jobs.
> > > > >
> > > > > I understand that there is some overhead in starting up tasks,
> > reading
> > > > to
> > > > > and writing from the distributed file system. But they do not seem
> > to
> > > > > explain all the overhead. Most map tasks are data-local. I
> modified
> > > > > program
> > > > > 1 to output nothing and saw the same magnitude of overhead.
> > > > >
> > > > > The output of top shows that the majority of the CPU time is
> > consumed by
> > > > > Hadoop java processes (e.g.
> > org.apache.hadoop.mapred.TaskTracker$Child).
> > > > > So
> > > > > I added a profile option (-agentlib:hprof=cpu=samples) to
> > > > > mapred.child.java.opts.
> > > > >
> > > > > The profile results show that most of CPU time is spent in the
> > following
> > > > > methods
> > > > >
> > > > >   rank   self  accum   count trace method
> > > > >
> > > > >   1 23.76% 23.76%    1246 300472
> > > > java.lang.UNIXProcess.waitForProcessExit
> > > > >
> > > > >   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> > > > >
> > > > >   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> > > > >
> > > > >   4 16.15% 87.32%     847 300478
> java.io.FileOutputStream.writeBytes
> > > > >
> > > > > And their stack traces show that these methods are for interacting
> > with
> > > > > the
> > > > > map program.
> > > > >
> > > > >
> > > > > TRACE 300472:
> > > > >
> > > > >
> > > > >  java.lang.UNIXProcess.waitForProcessExit(
> > UNIXProcess.java:Unknownline)
> > > > >
> > > > >        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > > >
> > > > >        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > > >
> > > > > TRACE 300474:
> > > > >
> > > > >        java.io.FileInputStream.readBytes(
> > FileInputStream.java:Unknown
> > > > > line)
> > > > >
> > > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >
> > > > >        java.io.BufferedInputStream.read1(BufferedInputStream.java
> > :256)
> > > > >
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> > :317)
> > > > >
> > > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> > :218)
> > > > >
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> > :237)
> > > > >
> > > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >
> > > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > > LineRecordReader.java:136)
> > > > >
> > > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > > UTF8ByteArrayUtils.java:157)
> > > > >
> > > > >        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> > > > > PipeMapRed.java:348)
> > > > >
> > > > > TRACE 300479:
> > > > >
> > > > >        java.io.FileInputStream.readBytes(
> > FileInputStream.java:Unknown
> > > > > line)
> > > > >
> > > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >
> > > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> > :218)
> > > > >
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> > :237)
> > > > >
> > > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >
> > > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > > LineRecordReader.java:136)
> > > > >
> > > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > > UTF8ByteArrayUtils.java:157)
> > > > >
> > > > >        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> > > > > PipeMapRed.java:399)
> > > > >
> > > > > TRACE 300478:
> > > > >
> > > > >
> > > > >  java.io.FileOutputStream.writeBytes(
> > FileOutputStream.java:Unknownline)
> > > > >
> > > > >        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > > > >
> > > > >        java.io.BufferedOutputStream.flushBuffer(
> > > > BufferedOutputStream.java
> > > > > :65)
> > > > >
> > > > >        java.io.BufferedOutputStream.flush(
> BufferedOutputStream.java
> > :123)
> > > > >
> > > > >        java.io.BufferedOutputStream.flush(
> BufferedOutputStream.java
> > :124)
> > > > >
> > > > >        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > > > >
> > > > >        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
> > :96)
> > > > >
> > > > >        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > > >
> > > > >        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > > > >        org.apache.hadoop.mapred.TaskTracker$Child.main(
> > TaskTracker.java
> > > > > :1760)
> > > > >
> > > > >
> > > > > I don't understand why Hadoop streaming needs so much CPU time to
> > read
> > > > > from
> > > > > and write to the map program. Note it takes 23.67% time to read
> from
> > the
> > > > > standard error of the map program while the program does not
> output
> > any
> > > > > error at all!
> > > > >
> > > > > Does anyone know any way to get rid of this seemingly unnecessary
> > > > overhead
> > > > > in Hadoop streaming?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Lin
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Theodore Van Rooy
> > > > http://greentheo.scroggles.com
> > > >
> >
>



-- 
Travis Brady
www.mochiads.com

Re: Hadoop streaming performance problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Yes. I would have to look up the exact numbers in my Sent folder (I've
been reporting them to my boss all along), but it has been a sizeable
portion of the runtime (on an EC2 + HDFS-on-S3 cluster), especially when
one includes the time to load the data into HDFS.

Ah, I went ahead and looked it up:

        93 minutes for memberfind on the unzipped input files (+ 45 minutes
        to upload the 15 GB of unzipped data).

        70 minutes for memberfind on the gzipped files (critically important:
        you need to name the files with a .gz suffix or Hadoop won't
        recognize and unzip them, which trashes the runs :( )
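
(Concretely, the .gz naming detail amounts to something like this; the file
and path names are made up for illustration:

        gzip input-part-000                  # produces input-part-000.gz; keep the suffix
        bin/hadoop dfs -put input-part-000.gz /user/hadoop/input/

Without the .gz suffix Hadoop doesn't pick a decompression codec, so the
mappers see raw gzip bytes.)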
        
So if I add up the difference, it's almost 100% of the runtime of the
gzipped run (the input set is 15 GB unzipped, 24.8M lines, 1.6 GB gzipped).

One interesting detail is that I also took the time to measure the runtime
without Hadoop streaming for my input set, and the Hadoop run took less than
5% of the time of running it directly. (I guess if I had run two processes to
amortize the time the VM's "disc" spent loading, I might have managed to speed
up the throughput of the "native" run slightly.) The relevant thing here is
that the performance hit from Hadoop seemed low enough that I had no case for
dropping Hadoop completely. ;)

Andreas


Am Montag, den 31.03.2008, 18:42 -0400 schrieb Colin Freas:
> Really?
> 
> I would expect the opposite: for compressed files to process slower.
> 
> You're saying that is not the case, and that compressed files actually
> increase the speed of jobs?
> 
> -Colin
> 
> On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <an...@kostyrka.org>
> wrote:
> 
> > Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > provide the input files gzipped. Not great difference (e.g. 50% slower
> > when not gzipped, plus it took more than twice as long to upload the
> > data to HDFS-on-S3 in the first place), but still probably relevant.
> >
> > Andreas
> >
> > Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
> > > I'm running custom map programs written in C++. What the programs do is
> > very
> > > simple. For example, in program 2, for each input line        ID node1
> > node2
> > > ... nodeN
> > > the program outputs
> > >         node1 ID
> > >         node2 ID
> > >         ...
> > >         nodeN ID
> > >
> > > Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> > >
> > > I agree that it depends on the scripts. I tried replicating the
> > computation
> > > for each input line by 10 times and saw significantly better speedup.
> > But it
> > > is still pretty bad that Hadoop streaming has such big overhead for
> > simple
> > > programs.
> > >
> > > I also tried writing program 1 with Hadoop Java API. I got almost 1000%
> > > speed up on the cluster.
> > >
> > > Lin
> > >
> > > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <mu...@gmail.com>
> > > wrote:
> > >
> > > > are you running a custom map script or a standard linux command like
> > WC?
> > > >  If
> > > > custom, what does your script do?
> > > >
> > > > How much ram do you have?  what are you Java memory settings?
> > > >
> > > > I used the following setup
> > > >
> > > > 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
> > task
> > > > max.
> > > >
> > > > I got the following results
> > > >
> > > > WC 30-40% speedup
> > > > Sort 40% speedup
> > > > Grep 5X slowdown (turns out this was due to what you described
> > above...
> > > > Grep
> > > > is just very highly optimized for command line)
> > > > Custom perl script which is essentially a For loop which matches each
> > row
> > > > of
> > > > a dataset to a set of 100 categories) 60% speedup.
> > > >
> > > > So I do think that it depends on your script... and some other
> > settings of
> > > > yours.
> > > >
> > > > Theo
> > > >
> > > > On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I am looking into using Hadoop streaming to parallelize some simple
> > > > > programs. So far the performance has been pretty disappointing.
> > > > >
> > > > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > > > capacity
> > > > > of each node is 2. The Hadoop version is 0.15.
> > > > >
> > > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> > in
> > > > > standalone (on a single CPU core). Program runs for 5 minutes on the
> > > > > Hadoop
> > > > > cluster and 4.5 minutes in standalone. Both programs run as map-only
> > > > jobs.
> > > > >
> > > > > I understand that there is some overhead in starting up tasks,
> > reading
> > > > to
> > > > > and writing from the distributed file system. But they do not seem
> > to
> > > > > explain all the overhead. Most map tasks are data-local. I modified
> > > > > program
> > > > > 1 to output nothing and saw the same magnitude of overhead.
> > > > >
> > > > > The output of top shows that the majority of the CPU time is
> > consumed by
> > > > > Hadoop java processes (e.g.
> > org.apache.hadoop.mapred.TaskTracker$Child).
> > > > > So
> > > > > I added a profile option (-agentlib:hprof=cpu=samples) to
> > > > > mapred.child.java.opts.
> > > > >
> > > > > The profile results show that most of CPU time is spent in the
> > following
> > > > > methods
> > > > >
> > > > >   rank   self  accum   count trace method
> > > > >
> > > > >   1 23.76% 23.76%    1246 300472
> > > > java.lang.UNIXProcess.waitForProcessExit
> > > > >
> > > > >   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> > > > >
> > > > >   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> > > > >
> > > > >   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> > > > >
> > > > > And their stack traces show that these methods are for interacting
> > with
> > > > > the
> > > > > map program.
> > > > >
> > > > >
> > > > > TRACE 300472:
> > > > >
> > > > >
> > > > >  java.lang.UNIXProcess.waitForProcessExit(
> > UNIXProcess.java:Unknownline)
> > > > >
> > > > >        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > > >
> > > > >        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > > >
> > > > > TRACE 300474:
> > > > >
> > > > >        java.io.FileInputStream.readBytes(
> > FileInputStream.java:Unknown
> > > > > line)
> > > > >
> > > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >
> > > > >        java.io.BufferedInputStream.read1(BufferedInputStream.java
> > :256)
> > > > >
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> > :317)
> > > > >
> > > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> > :218)
> > > > >
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> > :237)
> > > > >
> > > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >
> > > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > > LineRecordReader.java:136)
> > > > >
> > > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > > UTF8ByteArrayUtils.java:157)
> > > > >
> > > > >        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> > > > > PipeMapRed.java:348)
> > > > >
> > > > > TRACE 300479:
> > > > >
> > > > >        java.io.FileInputStream.readBytes(
> > FileInputStream.java:Unknown
> > > > > line)
> > > > >
> > > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > > >
> > > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> > :218)
> > > > >
> > > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> > :237)
> > > > >
> > > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > > >
> > > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > > LineRecordReader.java:136)
> > > > >
> > > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > > UTF8ByteArrayUtils.java:157)
> > > > >
> > > > >        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> > > > > PipeMapRed.java:399)
> > > > >
> > > > > TRACE 300478:
> > > > >
> > > > >
> > > > >  java.io.FileOutputStream.writeBytes(
> > FileOutputStream.java:Unknownline)
> > > > >
> > > > >        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > > > >
> > > > >        java.io.BufferedOutputStream.flushBuffer(
> > > > BufferedOutputStream.java
> > > > > :65)
> > > > >
> > > > >        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> > :123)
> > > > >
> > > > >        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> > :124)
> > > > >
> > > > >        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > > > >
> > > > >        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
> > :96)
> > > > >
> > > > >        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > > >
> > > > >        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > > > >        org.apache.hadoop.mapred.TaskTracker$Child.main(
> > TaskTracker.java
> > > > > :1760)
> > > > >
> > > > >
> > > > > I don't understand why Hadoop streaming needs so much CPU time to
> > read
> > > > > from
> > > > > and write to the map program. Note it takes 23.67% time to read from
> > the
> > > > > standard error of the map program while the program does not output
> > any
> > > > > error at all!
> > > > >
> > > > > Does anyone know any way to get rid of this seemingly unnecessary
> > > > overhead
> > > > > in Hadoop streaming?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Lin
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Theodore Van Rooy
> > > > http://greentheo.scroggles.com
> > > >
> >

Re: Hadoop streaming performance problem

Posted by Colin Freas <co...@gmail.com>.
Really?

I would expect the opposite: for compressed files to process slower.

You're saying that is not the case, and that compressed files actually
increase the speed of jobs?

-Colin

On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka <an...@kostyrka.org>
wrote:

> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> provide the input files gzipped. Not great difference (e.g. 50% slower
> when not gzipped, plus it took more than twice as long to upload the
> data to HDFS-on-S3 in the first place), but still probably relevant.
>
> Andreas
>
> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
> > I'm running custom map programs written in C++. What the programs do is
> very
> > simple. For example, in program 2, for each input line        ID node1
> node2
> > ... nodeN
> > the program outputs
> >         node1 ID
> >         node2 ID
> >         ...
> >         nodeN ID
> >
> > Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> >
> > I agree that it depends on the scripts. I tried replicating the
> computation
> > for each input line by 10 times and saw significantly better speedup.
> But it
> > is still pretty bad that Hadoop streaming has such big overhead for
> simple
> > programs.
> >
> > I also tried writing program 1 with Hadoop Java API. I got almost 1000%
> > speed up on the cluster.
> >
> > Lin
> >
> > On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <mu...@gmail.com>
> > wrote:
> >
> > > are you running a custom map script or a standard linux command like
> WC?
> > >  If
> > > custom, what does your script do?
> > >
> > > How much ram do you have?  what are you Java memory settings?
> > >
> > > I used the following setup
> > >
> > > 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
> task
> > > max.
> > >
> > > I got the following results
> > >
> > > WC 30-40% speedup
> > > Sort 40% speedup
> > > Grep 5X slowdown (turns out this was due to what you described
> above...
> > > Grep
> > > is just very highly optimized for command line)
> > > Custom perl script which is essentially a For loop which matches each
> row
> > > of
> > > a dataset to a set of 100 categories) 60% speedup.
> > >
> > > So I do think that it depends on your script... and some other
> settings of
> > > yours.
> > >
> > > Theo
> > >
> > > On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> > >
> > > > Hi,
> > > >
> > > > I am looking into using Hadoop streaming to parallelize some simple
> > > > programs. So far the performance has been pretty disappointing.
> > > >
> > > > The cluster contains 5 nodes. Each node has two CPU cores. The task
> > > > capacity
> > > > of each node is 2. The Hadoop version is 0.15.
> > > >
> > > > Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> in
> > > > standalone (on a single CPU core). Program runs for 5 minutes on the
> > > > Hadoop
> > > > cluster and 4.5 minutes in standalone. Both programs run as map-only
> > > jobs.
> > > >
> > > > I understand that there is some overhead in starting up tasks,
> reading
> > > to
> > > > and writing from the distributed file system. But they do not seem
> to
> > > > explain all the overhead. Most map tasks are data-local. I modified
> > > > program
> > > > 1 to output nothing and saw the same magnitude of overhead.
> > > >
> > > > The output of top shows that the majority of the CPU time is
> consumed by
> > > > Hadoop java processes (e.g.
> org.apache.hadoop.mapred.TaskTracker$Child).
> > > > So
> > > > I added a profile option (-agentlib:hprof=cpu=samples) to
> > > > mapred.child.java.opts.
> > > >
> > > > The profile results show that most of CPU time is spent in the
> following
> > > > methods
> > > >
> > > >   rank   self  accum   count trace method
> > > >
> > > >   1 23.76% 23.76%    1246 300472
> > > java.lang.UNIXProcess.waitForProcessExit
> > > >
> > > >   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> > > >
> > > >   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> > > >
> > > >   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> > > >
> > > > And their stack traces show that these methods are for interacting
> with
> > > > the
> > > > map program.
> > > >
> > > >
> > > > TRACE 300472:
> > > >
> > > >
> > > >  java.lang.UNIXProcess.waitForProcessExit(
> UNIXProcess.java:Unknownline)
> > > >
> > > >        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > > >
> > > >        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > > >
> > > > TRACE 300474:
> > > >
> > > >        java.io.FileInputStream.readBytes(
> FileInputStream.java:Unknown
> > > > line)
> > > >
> > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > >
> > > >        java.io.BufferedInputStream.read1(BufferedInputStream.java
> :256)
> > > >
> > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> :317)
> > > >
> > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> :218)
> > > >
> > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> :237)
> > > >
> > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > >
> > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > LineRecordReader.java:136)
> > > >
> > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > UTF8ByteArrayUtils.java:157)
> > > >
> > > >        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> > > > PipeMapRed.java:348)
> > > >
> > > > TRACE 300479:
> > > >
> > > >        java.io.FileInputStream.readBytes(
> FileInputStream.java:Unknown
> > > > line)
> > > >
> > > >        java.io.FileInputStream.read(FileInputStream.java:199)
> > > >
> > > >        java.io.BufferedInputStream.fill(BufferedInputStream.java
> :218)
> > > >
> > > >        java.io.BufferedInputStream.read(BufferedInputStream.java
> :237)
> > > >
> > > >        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > > >
> > > >        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > > > LineRecordReader.java:136)
> > > >
> > > >        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > > > UTF8ByteArrayUtils.java:157)
> > > >
> > > >        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> > > > PipeMapRed.java:399)
> > > >
> > > > TRACE 300478:
> > > >
> > > >
> > > >  java.io.FileOutputStream.writeBytes(
> FileOutputStream.java:Unknownline)
> > > >
> > > >        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > > >
> > > >        java.io.BufferedOutputStream.flushBuffer(
> > > BufferedOutputStream.java
> > > > :65)
> > > >
> > > >        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> :123)
> > > >
> > > >        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> :124)
> > > >
> > > >        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > > >
> > > >        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
> :96)
> > > >
> > > >        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > > >
> > > >        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > > >        org.apache.hadoop.mapred.TaskTracker$Child.main(
> TaskTracker.java
> > > > :1760)
> > > >
> > > >
> > > > I don't understand why Hadoop streaming needs so much CPU time to
> read
> > > > from
> > > > and write to the map program. Note it takes 23.67% time to read from
> the
> > > > standard error of the map program while the program does not output
> any
> > > > error at all!
> > > >
> > > > Does anyone know any way to get rid of this seemingly unnecessary
> > > overhead
> > > > in Hadoop streaming?
> > > >
> > > > Thanks,
> > > >
> > > > Lin
> > > >
> > >
> > >
> > >
> > > --
> > > Theodore Van Rooy
> > > http://greentheo.scroggles.com
> > >
>

Re: Hadoop streaming performance problem

Posted by lin <no...@gmail.com>.
You're right. Java isn't really that slow. I re-examined the Java code for
the standalone program and found I was using an unbuffered output method.
After I changed it to a buffered method, the Java code's running time was
comparable to the C++ one. This also means the 1000% speed-up I reported was
quite wrong. Hadoop runs Java programs a little more efficiently than Hadoop
streaming, but not significantly.
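
For anyone who hits the same thing, the difference boils down to something
like the following. This is a minimal sketch, not the actual program, and the
file names are made up:

    import java.io.*;

    public class BufferDemo {
        public static void main(String[] args) throws IOException {
            // Unbuffered: every write() is a separate system call.
            OutputStream unbuffered = new FileOutputStream("unbuffered.txt");
            // Buffered: writes accumulate in an 8 KB buffer and reach the OS in large chunks.
            OutputStream buffered =
                new BufferedOutputStream(new FileOutputStream("buffered.txt"));
            byte[] line = "node42\tid7\n".getBytes();
            for (int i = 0; i < 1000000; i++) {
                unbuffered.write(line);   // slow: one syscall per line
                buffered.write(line);     // fast: mostly a memory copy
            }
            unbuffered.close();
            buffered.close();             // close() flushes whatever is left in the buffer
        }
    }

With output of any real size, the buffered version is dramatically faster
because it issues far fewer system calls.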

On Mon, Mar 31, 2008 at 3:30 PM, Ted Dunning <td...@veoh.com> wrote:

>
>
> This seems a bit surprising.  In my experience well-written Java is
> generally just about as fast as C++, especially for I/O bound work.  The
> exceptions are:
>
> - java startup is still slow.  This shouldn't matter much here because you
> are using streaming anyway so you have java startup + C startup.
>
> - java memory use is larger than a perfectly written C++ program and
> probably always will be.  Of course, getting that perfectly written C++
> program is pretty difficult.
>
> If the program in question is the one you posted earlier, it is very hard
> to
> imagine that java would not saturate the I/O channels.  Can you confirm
> that
> you are literally doing:
>
>    for (node : line.split(" ")) {
>      out.collect(node, key);
>    }
>
> And finding a serious difference in speed?
>
> On 3/31/08 3:23 PM, "lin" <no...@gmail.com> wrote:
>
> > Java seems to be too slow. I rewrote the
> > first program in Java and it runs 4 to 5 times slower than the C++ one.
>
>

Re: Hadoop streaming performance problem

Posted by Jeff Hammerbacher <je...@gmail.com>.
We have 40 or so engineers who only use streaming at Facebook, so no
matter how jury-rigged the solution might be, it's been immensely
valuable for developer productivity.  As we found with Thrift, letting
developers write code in their language of choice has many benefits,
including development speed, reuse of common libraries, and using the
right language for the right task.  It would be interesting to bake
something like streaming into the Hadoop core rather than forcing
tab-delimited data through pipes.  I know, I know: it's open source,
go ahead and add it.  Just sayin...

On Mon, Mar 31, 2008 at 3:31 PM, Travis Brady <tr...@mochimedia.com> wrote:
> This brings up two interesting issues:
>
>  1. Hadoop streaming is a potentially very powerful tool, especially for
>  those of us who don't work in Java for whatever reason
>  2. If Hadoop streaming is "at best a jury rigged solution" then that should
>  be made known somewhere on the wiki.  If it's really not supposed to be
>  used, why is it provided at all?
>
>  thanks,
>  Travis
>
>
>
>  On Mon, Mar 31, 2008 at 3:23 PM, lin <no...@gmail.com> wrote:
>
>  > Well, we would like to use hadoop streaming because our current system is
>  > in
>  > C++ and it is easier to migrate to hadoop streaming. Also we have very
>  > strict performance requirements. Java seems to be too slow. I rewrote the
>  > first program in Java and it runs 4 to 5 times slower than the C++ one.
>  >
>  > On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <td...@veoh.com> wrote:
>  >
>  > >
>  > >
>  > > Hadoop can't split a gzipped file so you will only get as many maps as
>  > you
>  > > have files.
>  > >
>  > > Why the obsession with hadoop streaming?  It is at best a jury rigged
>  > > solution.
>  > >
>  > >
>  > > On 3/31/08 3:12 PM, "lin" <no...@gmail.com> wrote:
>  > >
>  > > > Does Hadoop automatically decompress the gzipped file? I only have a
>  > > single
>  > > > input file. Does it have to be splitted and then gzipped?
>  > > >
>  > > > I gzipped the input file and Hadoop only created one map task. Still
>  > > java is
>  > > > using more than 90% CPU.
>  > > >
>  > > > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <
>  > andreas@kostyrka.org>
>  > > > wrote:
>  > > >
>  > > >> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
>  > > >> provide the input files gzipped. Not great difference (e.g. 50%
>  > slower
>  > > >> when not gzipped, plus it took more than twice as long to upload the
>  > > >> data to HDFS-on-S3 in the first place), but still probably relevant.
>  > > >>
>  > > >> Andreas
>  > > >>
>  > > >> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
>  > > >>> I'm running custom map programs written in C++. What the programs do
>  > > is
>  > > >> very
>  > > >>> simple. For example, in program 2, for each input line        ID
>  > node1
>  > > >> node2
>  > > >>> ... nodeN
>  > > >>> the program outputs
>  > > >>>         node1 ID
>  > > >>>         node2 ID
>  > > >>>         ...
>  > > >>>         nodeN ID
>  > > >>>
>  > > >>> Each node has 4GB to 8GB of memory. The java memory setting is
>  > > -Xmx300m.
>  > > >>>
>  > > >>> I agree that it depends on the scripts. I tried replicating the
>  > > >> computation
>  > > >>> for each input line by 10 times and saw significantly better
>  > speedup.
>  > > >> But it
>  > > >>> is still pretty bad that Hadoop streaming has such big overhead for
>  > > >> simple
>  > > >>> programs.
>  > > >>>
>  > > >>> I also tried writing program 1 with Hadoop Java API. I got almost
>  > > 1000%
>  > > >>> speed up on the cluster.
>  > > >>>
>  > > >>> Lin
>  > > >>>
>  > > >>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <
>  > > munkey906@gmail.com>
>  > > >>> wrote:
>  > > >>>
>  > > >>>> are you running a custom map script or a standard linux command
>  > like
>  > > >> WC?
>  > > >>>>  If
>  > > >>>> custom, what does your script do?
>  > > >>>>
>  > > >>>> How much ram do you have?  what are you Java memory settings?
>  > > >>>>
>  > > >>>> I used the following setup
>  > > >>>>
>  > > >>>> 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a
>  > 4
>  > > >> task
>  > > >>>> max.
>  > > >>>>
>  > > >>>> I got the following results
>  > > >>>>
>  > > >>>> WC 30-40% speedup
>  > > >>>> Sort 40% speedup
>  > > >>>> Grep 5X slowdown (turns out this was due to what you described
>  > > >> above...
>  > > >>>> Grep
>  > > >>>> is just very highly optimized for command line)
>  > > >>>> Custom perl script which is essentially a For loop which matches
>  > each
>  > > >> row
>  > > >>>> of
>  > > >>>> a dataset to a set of 100 categories) 60% speedup.
>  > > >>>>
>  > > >>>> So I do think that it depends on your script... and some other
>  > > >> settings of
>  > > >>>> yours.
>  > > >>>>
>  > > >>>> Theo
>  > > >>>>
>  > > >>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
>  > > >>>>
>  > > >>>>> Hi,
>  > > >>>>>
>  > > >>>>> I am looking into using Hadoop streaming to parallelize some
>  > simple
>  > > >>>>> programs. So far the performance has been pretty disappointing.
>  > > >>>>>
>  > > >>>>> The cluster contains 5 nodes. Each node has two CPU cores. The
>  > task
>  > > >>>>> capacity
>  > > >>>>> of each node is 2. The Hadoop version is 0.15.
>  > > >>>>>
>  > > >>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
>  > > >> in
>  > > >>>>> standalone (on a single CPU core). Program runs for 5 minutes on
>  > the
>  > > >>>>> Hadoop
>  > > >>>>> cluster and 4.5 minutes in standalone. Both programs run as
>  > map-only
>  > > >>>> jobs.
>  > > >>>>>
>  > > >>>>> I understand that there is some overhead in starting up tasks,
>  > > >> reading
>  > > >>>> to
>  > > >>>>> and writing from the distributed file system. But they do not seem
>  > > >> to
>  > > >>>>> explain all the overhead. Most map tasks are data-local. I
>  > modified
>  > > >>>>> program
>  > > >>>>> 1 to output nothing and saw the same magnitude of overhead.
>  > > >>>>>
>  > > >>>>> The output of top shows that the majority of the CPU time is
>  > > >> consumed by
>  > > >>>>> Hadoop java processes (e.g.
>  > > >> org.apache.hadoop.mapred.TaskTracker$Child).
>  > > >>>>> So
>  > > >>>>> I added a profile option (-agentlib:hprof=cpu=samples) to
>  > > >>>>> mapred.child.java.opts.
>  > > >>>>>
>  > > >>>>> The profile results show that most of CPU time is spent in the
>  > > >> following
>  > > >>>>> methods
>  > > >>>>>
>  > > >>>>>   rank   self  accum   count trace method
>  > > >>>>>
>  > > >>>>>   1 23.76% 23.76%    1246 300472
>  > > >>>> java.lang.UNIXProcess.waitForProcessExit
>  > > >>>>>
>  > > >>>>>   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
>  > > >>>>>
>  > > >>>>>   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
>  > > >>>>>
>  > > >>>>>   4 16.15% 87.32%     847 300478
>  > java.io.FileOutputStream.writeBytes
>  > > >>>>>
>  > > >>>>> And their stack traces show that these methods are for interacting
>  > > >> with
>  > > >>>>> the
>  > > >>>>> map program.
>  > > >>>>>
>  > > >>>>>
>  > > >>>>> TRACE 300472:
>  > > >>>>>
>  > > >>>>>
>  > > >>>>>  java.lang.UNIXProcess.waitForProcessExit(
>  > > >> UNIXProcess.java:Unknownline)
>  > > >>>>>
>  > > >>>>>        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>  > > >>>>>
>  > > >>>>>        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>  > > >>>>>
>  > > >>>>> TRACE 300474:
>  > > >>>>>
>  > > >>>>>        java.io.FileInputStream.readBytes(
>  > > >> FileInputStream.java:Unknown
>  > > >>>>> line)
>  > > >>>>>
>  > > >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedInputStream.read1(BufferedInputStream.java
>  > > >> :256)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
>  > > >> :317)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
>  > > >> :218)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
>  > > >> :237)
>  > > >>>>>
>  > > >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
>  > > >>>>> LineRecordReader.java:136)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
>  > > >>>>> UTF8ByteArrayUtils.java:157)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
>  > > >>>>> PipeMapRed.java:348)
>  > > >>>>>
>  > > >>>>> TRACE 300479:
>  > > >>>>>
>  > > >>>>>        java.io.FileInputStream.readBytes(
>  > > >> FileInputStream.java:Unknown
>  > > >>>>> line)
>  > > >>>>>
>  > > >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
>  > > >> :218)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
>  > > >> :237)
>  > > >>>>>
>  > > >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
>  > > >>>>> LineRecordReader.java:136)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
>  > > >>>>> UTF8ByteArrayUtils.java:157)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
>  > > >>>>> PipeMapRed.java:399)
>  > > >>>>>
>  > > >>>>> TRACE 300478:
>  > > >>>>>
>  > > >>>>>
>  > > >>>>>  java.io.FileOutputStream.writeBytes(
>  > > >> FileOutputStream.java:Unknownline)
>  > > >>>>>
>  > > >>>>>        java.io.FileOutputStream.write(FileOutputStream.java:260)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedOutputStream.flushBuffer(
>  > > >>>> BufferedOutputStream.java
>  > > >>>>> :65)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedOutputStream.flush(
>  > BufferedOutputStream.java
>  > > >> :123)
>  > > >>>>>
>  > > >>>>>        java.io.BufferedOutputStream.flush(
>  > BufferedOutputStream.java
>  > > >> :124)
>  > > >>>>>
>  > > >>>>>        java.io.DataOutputStream.flush(DataOutputStream.java:106)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
>  > > >> :96)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>  > > >>>>>
>  > > >>>>>        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>  > > >>>>>        org.apache.hadoop.mapred.TaskTracker$Child.main(
>  > > >> TaskTracker.java
>  > > >>>>> :1760)
>  > > >>>>>
>  > > >>>>>
>  > > >>>>> I don't understand why Hadoop streaming needs so much CPU time to
>  > > >> read
>  > > >>>>> from
>  > > >>>>> and write to the map program. Note it takes 23.67% time to read
>  > from
>  > > >> the
>  > > >>>>> standard error of the map program while the program does not
>  > output
>  > > >> any
>  > > >>>>> error at all!
>  > > >>>>>
>  > > >>>>> Does anyone know any way to get rid of this seemingly unnecessary
>  > > >>>> overhead
>  > > >>>>> in Hadoop streaming?
>  > > >>>>>
>  > > >>>>> Thanks,
>  > > >>>>>
>  > > >>>>> Lin
>  > > >>>>>
>  > > >>>>
>  > > >>>>
>  > > >>>>
>  > > >>>> --
>  > > >>>> Theodore Van Rooy
>  > > >>>> http://greentheo.scroggles.com
>  > > >>>>
>  > > >>
>  > >
>  > >
>  >
>
>
>
>  --
>  Travis Brady
>  www.mochiads.com
>

Re: Hadoop streaming performance problem

Posted by Theodore Van Rooy <mu...@gmail.com>.
I agree... as I said in one of the earlier emails, I saw a 50% speedup in a
Perl script that categorizes O(10^9) rows at a time.  Also, I wrote a very
simple Python script (something like a 'cat') and saw a similar speedup.
These tests were with 1 GB files.

We were testing this here at DoubleClick (though it's kind of pointless now
given that we have access to Google's MapReduce cluster :-), and we
regularly process 25-100 GB datasets... the best part of which is that we
don't have to rewrite much of our Perl, R, or bash code.



On Mon, Mar 31, 2008 at 7:21 PM, Ted Dunning <td...@veoh.com> wrote:

>
> My experiences with Groovy are similar.  Noticeable slowdown, but quite
> bearable (almost always better than 50% of best attainable speed).
>
> The highest virtue is that simple programs become simple again.  Word
> count
> is < 5 lines of code.
>
>
>
>
> On 3/31/08 6:10 PM, "Colin Evans" <co...@metaweb.com> wrote:
>
> > At Metaweb, we did a lot of comparisons between streaming (using Python)
> > and native Java, and in general streaming performance was not much
> > slower than the native java -- most of the slowdown was from Python
> > being a slow language.
> >
> > The main problems with streaming apps that we found are that they are
> > hard to write and there are many ways that you can make simple mistakes
> > in streaming that slow down performance.
> >
> > We've been experimenting with embedding JavaScript (Rhino) and Jython
> > for writing jobs, and have found that performance is good and the apps
> > are much easier to write.  The tight Java integration means that
> > performance bottlenecks get rewritten in Java with little sacrifice to
> > development speed.  One of these days we'll open source these
> frameworks.
> >
> >
> >
> > Parand Darugar wrote:
> >> Travis Brady wrote:
> >>> This brings up two interesting issues:
> >>>
> >>> 1. Hadoop streaming is a potentially very powerful tool, especially
> for
> >>> those of us who don't work in Java for whatever reason
> >>> 2. If Hadoop streaming is "at best a jury rigged solution" then that
> >>> should
> >>> be made known somewhere on the wiki.  If it's really not supposed to
> be
> >>> used, why is it provided at all?
> >>>
> >>
> >> A set of reasonable performance tests and results would be very
> >> helpful in helping people decide whether to go with streaming or not.
> >> Hopefully we can get some numbers from this thread and publish them?
> >> Anyone else compared streaming with native java?
> >>
> >> Best,
> >>
> >> Parand
> >
>
>


-- 
Theodore Van Rooy
http://greentheo.scroggles.com

Re: Hadoop streaming performance problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Beg your pardon, but Python is a fast language. Simple operations are
usually quite a bit more expensive than in lower-level languages, but in
the hands of somebody with enough experience that doesn't matter too much.
Actually, in many practical cases, because of project deadlines, the C++
(and to a lesser extent Java) implementations end up with the more naive
algorithms and designs, and get beaten by Python. In some cases the effect
gets even stronger when the Python guys cheat and implement the inner-loop
stuff with Pyrex/Cython. (That observation is not limited to Python; it
usually applies to all higher-level languages. OTOH, I've beaten C++ in a
pretty much 1:1 development race, so I tend to write Python.)

Andreas

Am Montag, den 31.03.2008, 18:10 -0700 schrieb Colin Evans:
> At Metaweb, we did a lot of comparisons between streaming (using Python) 
> and native Java, and in general streaming performance was not much 
> slower than the native java -- most of the slowdown was from Python 
> being a slow language. 
> 
> The main problems with streaming apps that we found are that they are 
> hard to write and there are many ways that you can make simple mistakes 
> in streaming that slow down performance.
> 
> We've been experimenting with embedding JavaScript (Rhino) and Jython 
> for writing jobs, and have found that performance is good and the apps 
> are much easier to write.  The tight Java integration means that 
> performance bottlenecks get rewritten in Java with little sacrifice to 
> development speed.  One of these days we'll open source these frameworks.
> 
> 
> 
> Parand Darugar wrote:
> > Travis Brady wrote:
> >> This brings up two interesting issues:
> >>
> >> 1. Hadoop streaming is a potentially very powerful tool, especially for
> >> those of us who don't work in Java for whatever reason
> >> 2. If Hadoop streaming is "at best a jury rigged solution" then that 
> >> should
> >> be made known somewhere on the wiki.  If it's really not supposed to be
> >> used, why is it provided at all?
> >>   
> >
> > A set of reasonable performance tests and results would be very 
> > helpful in helping people decide whether to go with streaming or not. 
> > Hopefully we can get some numbers from this thread and publish them? 
> > Anyone else compared streaming with native java?
> >
> > Best,
> >
> > Parand

Re: Hadoop streaming performance problem

Posted by Ted Dunning <td...@veoh.com>.
My experiences with Groovy are similar.  Noticeable slowdown, but quite
bearable (almost always better than 50% of best attainable speed).

The highest virtue is that simple programs become simple again.  Word count
is < 5 lines of code.

 


On 3/31/08 6:10 PM, "Colin Evans" <co...@metaweb.com> wrote:

> At Metaweb, we did a lot of comparisons between streaming (using Python)
> and native Java, and in general streaming performance was not much
> slower than the native java -- most of the slowdown was from Python
> being a slow language.
> 
> The main problems with streaming apps that we found are that they are
> hard to write and there are many ways that you can make simple mistakes
> in streaming that slow down performance.
> 
> We've been experimenting with embedding JavaScript (Rhino) and Jython
> for writing jobs, and have found that performance is good and the apps
> are much easier to write.  The tight Java integration means that
> performance bottlenecks get rewritten in Java with little sacrifice to
> development speed.  One of these days we'll open source these frameworks.
> 
> 
> 
> Parand Darugar wrote:
>> Travis Brady wrote:
>>> This brings up two interesting issues:
>>> 
>>> 1. Hadoop streaming is a potentially very powerful tool, especially for
>>> those of us who don't work in Java for whatever reason
>>> 2. If Hadoop streaming is "at best a jury rigged solution" then that
>>> should
>>> be made known somewhere on the wiki.  If it's really not supposed to be
>>> used, why is it provided at all?
>>>   
>> 
>> A set of reasonable performance tests and results would be very
>> helpful in helping people decide whether to go with streaming or not.
>> Hopefully we can get some numbers from this thread and publish them?
>> Anyone else compared streaming with native java?
>> 
>> Best,
>> 
>> Parand
> 


Re: Hadoop streaming performance problem

Posted by Colin Evans <co...@metaweb.com>.
At Metaweb, we did a lot of comparisons between streaming (using Python)
and native Java, and in general streaming performance was not much
slower than native Java -- most of the slowdown was from Python
being a slow language.

The main problems with streaming apps that we found are that they are
hard to write, and there are many ways to make simple mistakes
in streaming that slow down performance.

We've been experimenting with embedding JavaScript (Rhino) and Jython
for writing jobs, and have found that performance is good and the apps
are much easier to write.  The tight Java integration means that
performance bottlenecks get rewritten in Java with little sacrifice of
development speed.  One of these days we'll open source these frameworks.
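
As a concrete illustration of the embedding idea, the sketch below shows one
way a Java job could hand per-record logic to a Rhino JavaScript function
through javax.script. This is a made-up example, not Metaweb's framework: the
ScriptedTransform class and the transform(line) contract are invented here
purely to show the mechanism.

    // Hypothetical sketch: delegate a record transform to embedded JavaScript.
    // Assumes a JRE that ships a "JavaScript" (Rhino) script engine.
    import javax.script.Invocable;
    import javax.script.ScriptEngine;
    import javax.script.ScriptEngineManager;

    public class ScriptedTransform {
      private final Invocable invocable;

      public ScriptedTransform(String script) throws Exception {
        ScriptEngine engine = new ScriptEngineManager().getEngineByName("JavaScript");
        engine.eval(script);              // defines a transform(line) function
        this.invocable = (Invocable) engine;
      }

      public String apply(String line) throws Exception {
        // Call the script function for each record; a real mapper would do this in map().
        return String.valueOf(invocable.invokeFunction("transform", line));
      }

      public static void main(String[] args) throws Exception {
        ScriptedTransform t = new ScriptedTransform(
            "function transform(line) { return line.toUpperCase(); }");
        System.out.println(t.apply("hello streaming"));   // prints HELLO STREAMING
      }
    }

Because the script engine lives inside the task JVM, there is no extra process
or pipe to cross per record, which is where the tight Java integration pays off.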



Parand Darugar wrote:
> Travis Brady wrote:
>> This brings up two interesting issues:
>>
>> 1. Hadoop streaming is a potentially very powerful tool, especially for
>> those of us who don't work in Java for whatever reason
>> 2. If Hadoop streaming is "at best a jury rigged solution" then that 
>> should
>> be made known somewhere on the wiki.  If it's really not supposed to be
>> used, why is it provided at all?
>>   
>
> A set of reasonable performance tests and results would be very 
> helpful in helping people decide whether to go with streaming or not. 
> Hopefully we can get some numbers from this thread and publish them? 
> Anyone else compared streaming with native java?
>
> Best,
>
> Parand


Re: Hadoop streaming performance problem

Posted by Parand Darugar <da...@yahoo-inc.com>.
Travis Brady wrote:
> This brings up two interesting issues:
>
> 1. Hadoop streaming is a potentially very powerful tool, especially for
> those of us who don't work in Java for whatever reason
> 2. If Hadoop streaming is "at best a jury rigged solution" then that should
> be made known somewhere on the wiki.  If it's really not supposed to be
> used, why is it provided at all?
>   

A set of reasonable performance tests and results would be very helpful
for people deciding whether to go with streaming or not. Hopefully
we can get some numbers from this thread and publish them? Has anyone else
compared streaming with native Java?

Best,

Parand

Re: Hadoop streaming performance problem

Posted by Ted Dunning <td...@veoh.com>.
It is provided because of point (2).

But that doesn't necessarily make it a good thing to do.  The basic idea has
real problems.

Hive is likely to resolve many of these issues (when it becomes publicly
available), but some are inherent in the basic idea of moving data across
language barriers too many times.


On 3/31/08 3:31 PM, "Travis Brady" <tr...@mochimedia.com> wrote:

> This brings up two interesting issues:
> 
> 1. Hadoop streaming is a potentially very powerful tool, especially for
> those of us who don't work in Java for whatever reason
> 2. If Hadoop streaming is "at best a jury rigged solution" then that should
> be made known somewhere on the wiki.  If it's really not supposed to be
> used, why is it provided at all?
> 
> thanks,
> Travis
> 
> On Mon, Mar 31, 2008 at 3:23 PM, lin <no...@gmail.com> wrote:
> 
>> Well, we would like to use hadoop streaming because our current system is
>> in
>> C++ and it is easier to migrate to hadoop streaming. Also we have very
>> strict performance requirements. Java seems to be too slow. I rewrote the
>> first program in Java and it runs 4 to 5 times slower than the C++ one.
>> 
>> On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <td...@veoh.com> wrote:
>> 
>>> 
>>> 
>>> Hadoop can't split a gzipped file so you will only get as many maps as
>> you
>>> have files.
>>> 
>>> Why the obsession with hadoop streaming?  It is at best a jury rigged
>>> solution.
>>> 
>>> 
>>> On 3/31/08 3:12 PM, "lin" <no...@gmail.com> wrote:
>>> 
>>>> Does Hadoop automatically decompress the gzipped file? I only have a
>>> single
>>>> input file. Does it have to be splitted and then gzipped?
>>>> 
>>>> I gzipped the input file and Hadoop only created one map task. Still
>>> java is
>>>> using more than 90% CPU.
>>>> 
>>>> On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <
>> andreas@kostyrka.org>
>>>> wrote:
>>>> 
>>>>> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
>>>>> provide the input files gzipped. Not great difference (e.g. 50%
>> slower
>>>>> when not gzipped, plus it took more than twice as long to upload the
>>>>> data to HDFS-on-S3 in the first place), but still probably relevant.
>>>>> 
>>>>> Andreas
>>>>> 
>>>>> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
>>>>>> I'm running custom map programs written in C++. What the programs do
>>> is
>>>>> very
>>>>>> simple. For example, in program 2, for each input line        ID
>> node1
>>>>> node2
>>>>>> ... nodeN
>>>>>> the program outputs
>>>>>>         node1 ID
>>>>>>         node2 ID
>>>>>>         ...
>>>>>>         nodeN ID
>>>>>> 
>>>>>> Each node has 4GB to 8GB of memory. The java memory setting is
>>> -Xmx300m.
>>>>>> 
>>>>>> I agree that it depends on the scripts. I tried replicating the
>>>>> computation
>>>>>> for each input line by 10 times and saw significantly better
>> speedup.
>>>>> But it
>>>>>> is still pretty bad that Hadoop streaming has such big overhead for
>>>>> simple
>>>>>> programs.
>>>>>> 
>>>>>> I also tried writing program 1 with Hadoop Java API. I got almost
>>> 1000%
>>>>>> speed up on the cluster.
>>>>>> 
>>>>>> Lin
>>>>>> 
>>>>>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <
>>> munkey906@gmail.com>
>>>>>> wrote:
>>>>>> 
>>>>>>> are you running a custom map script or a standard linux command
>> like
>>>>> WC?
>>>>>>>  If
>>>>>>> custom, what does your script do?
>>>>>>> 
>>>>>>> How much ram do you have?  what are you Java memory settings?
>>>>>>> 
>>>>>>> I used the following setup
>>>>>>> 
>>>>>>> 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a
>> 4
>>>>> task
>>>>>>> max.
>>>>>>> 
>>>>>>> I got the following results
>>>>>>> 
>>>>>>> WC 30-40% speedup
>>>>>>> Sort 40% speedup
>>>>>>> Grep 5X slowdown (turns out this was due to what you described
>>>>> above...
>>>>>>> Grep
>>>>>>> is just very highly optimized for command line)
>>>>>>> Custom perl script which is essentially a For loop which matches
>> each
>>>>> row
>>>>>>> of
>>>>>>> a dataset to a set of 100 categories) 60% speedup.
>>>>>>> 
>>>>>>> So I do think that it depends on your script... and some other
>>>>> settings of
>>>>>>> yours.
>>>>>>> 
>>>>>>> Theo
>>>>>>> 
>>>>>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> I am looking into using Hadoop streaming to parallelize some
>> simple
>>>>>>>> programs. So far the performance has been pretty disappointing.
>>>>>>>> 
>>>>>>>> The cluster contains 5 nodes. Each node has two CPU cores. The
>> task
>>>>>>>> capacity
>>>>>>>> of each node is 2. The Hadoop version is 0.15.
>>>>>>>> 
>>>>>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
>>>>> in
>>>>>>>> standalone (on a single CPU core). Program runs for 5 minutes on
>> the
>>>>>>>> Hadoop
>>>>>>>> cluster and 4.5 minutes in standalone. Both programs run as
>> map-only
>>>>>>> jobs.
>>>>>>>> 
>>>>>>>> I understand that there is some overhead in starting up tasks,
>>>>> reading
>>>>>>> to
>>>>>>>> and writing from the distributed file system. But they do not seem
>>>>> to
>>>>>>>> explain all the overhead. Most map tasks are data-local. I
>> modified
>>>>>>>> program
>>>>>>>> 1 to output nothing and saw the same magnitude of overhead.
>>>>>>>> 
>>>>>>>> The output of top shows that the majority of the CPU time is
>>>>> consumed by
>>>>>>>> Hadoop java processes (e.g.
>>>>> org.apache.hadoop.mapred.TaskTracker$Child).
>>>>>>>> So
>>>>>>>> I added a profile option (-agentlib:hprof=cpu=samples) to
>>>>>>>> mapred.child.java.opts.
>>>>>>>> 
>>>>>>>> The profile results show that most of CPU time is spent in the
>>>>> following
>>>>>>>> methods
>>>>>>>> 
>>>>>>>>   rank   self  accum   count trace method
>>>>>>>> 
>>>>>>>>   1 23.76% 23.76%    1246 300472
>>>>>>> java.lang.UNIXProcess.waitForProcessExit
>>>>>>>> 
>>>>>>>>   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
>>>>>>>> 
>>>>>>>>   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
>>>>>>>> 
>>>>>>>>   4 16.15% 87.32%     847 300478
>> java.io.FileOutputStream.writeBytes
>>>>>>>> 
>>>>>>>> And their stack traces show that these methods are for interacting
>>>>> with
>>>>>>>> the
>>>>>>>> map program.
>>>>>>>> 
>>>>>>>> 
>>>>>>>> TRACE 300472:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  java.lang.UNIXProcess.waitForProcessExit(
>>>>> UNIXProcess.java:Unknownline)
>>>>>>>> 
>>>>>>>>        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
>>>>>>>> 
>>>>>>>>        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
>>>>>>>> 
>>>>>>>> TRACE 300474:
>>>>>>>> 
>>>>>>>>        java.io.FileInputStream.readBytes(
>>>>> FileInputStream.java:Unknown
>>>>>>>> line)
>>>>>>>> 
>>>>>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>>>> 
>>>>>>>>        java.io.BufferedInputStream.read1(BufferedInputStream.java
>>>>> :256)
>>>>>>>> 
>>>>>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
>>>>> :317)
>>>>>>>> 
>>>>>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
>>>>> :218)
>>>>>>>> 
>>>>>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
>>>>> :237)
>>>>>>>> 
>>>>>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
>>>>>>>> LineRecordReader.java:136)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
>>>>>>>> UTF8ByteArrayUtils.java:157)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
>>>>>>>> PipeMapRed.java:348)
>>>>>>>> 
>>>>>>>> TRACE 300479:
>>>>>>>> 
>>>>>>>>        java.io.FileInputStream.readBytes(
>>>>> FileInputStream.java:Unknown
>>>>>>>> line)
>>>>>>>> 
>>>>>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
>>>>>>>> 
>>>>>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
>>>>> :218)
>>>>>>>> 
>>>>>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
>>>>> :237)
>>>>>>>> 
>>>>>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
>>>>>>>> LineRecordReader.java:136)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
>>>>>>>> UTF8ByteArrayUtils.java:157)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
>>>>>>>> PipeMapRed.java:399)
>>>>>>>> 
>>>>>>>> TRACE 300478:
>>>>>>>> 
>>>>>>>> 
>>>>>>>>  java.io.FileOutputStream.writeBytes(
>>>>> FileOutputStream.java:Unknownline)
>>>>>>>> 
>>>>>>>>        java.io.FileOutputStream.write(FileOutputStream.java:260)
>>>>>>>> 
>>>>>>>>        java.io.BufferedOutputStream.flushBuffer(
>>>>>>> BufferedOutputStream.java
>>>>>>>> :65)
>>>>>>>> 
>>>>>>>>        java.io.BufferedOutputStream.flush(
>> BufferedOutputStream.java
>>>>> :123)
>>>>>>>> 
>>>>>>>>        java.io.BufferedOutputStream.flush(
>> BufferedOutputStream.java
>>>>> :124)
>>>>>>>> 
>>>>>>>>        java.io.DataOutputStream.flush(DataOutputStream.java:106)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
>>>>> :96)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
>>>>>>>> 
>>>>>>>>        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
>>>>>>>>        org.apache.hadoop.mapred.TaskTracker$Child.main(
>>>>> TaskTracker.java
>>>>>>>> :1760)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> I don't understand why Hadoop streaming needs so much CPU time to
>>>>> read
>>>>>>>> from
>>>>>>>> and write to the map program. Note it takes 23.67% time to read
>> from
>>>>> the
>>>>>>>> standard error of the map program while the program does not
>> output
>>>>> any
>>>>>>>> error at all!
>>>>>>>> 
>>>>>>>> Does anyone know any way to get rid of this seemingly unnecessary
>>>>>>> overhead
>>>>>>>> in Hadoop streaming?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> 
>>>>>>>> Lin
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> --
>>>>>>> Theodore Van Rooy
>>>>>>> http://greentheo.scroggles.com
>>>>>>> 
>>>>> 
>>> 
>>> 
>> 
> 
> 


Re: Hadoop streaming performance problem

Posted by Travis Brady <tr...@mochimedia.com>.
This brings up two interesting issues:

1. Hadoop streaming is a potentially very powerful tool, especially for
those of us who don't work in Java for whatever reason.
2. If Hadoop streaming is "at best a jury rigged solution", then that should
be made known somewhere on the wiki.  If it's really not supposed to be
used, why is it provided at all?

thanks,
Travis

On Mon, Mar 31, 2008 at 3:23 PM, lin <no...@gmail.com> wrote:

> Well, we would like to use hadoop streaming because our current system is
> in
> C++ and it is easier to migrate to hadoop streaming. Also we have very
> strict performance requirements. Java seems to be too slow. I rewrote the
> first program in Java and it runs 4 to 5 times slower than the C++ one.
>
> On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <td...@veoh.com> wrote:
>
> >
> >
> > Hadoop can't split a gzipped file so you will only get as many maps as
> you
> > have files.
> >
> > Why the obsession with hadoop streaming?  It is at best a jury rigged
> > solution.
> >
> >
> > On 3/31/08 3:12 PM, "lin" <no...@gmail.com> wrote:
> >
> > > Does Hadoop automatically decompress the gzipped file? I only have a
> > single
> > > input file. Does it have to be splitted and then gzipped?
> > >
> > > I gzipped the input file and Hadoop only created one map task. Still
> > java is
> > > using more than 90% CPU.
> > >
> > > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <
> andreas@kostyrka.org>
> > > wrote:
> > >
> > >> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> > >> provide the input files gzipped. Not great difference (e.g. 50%
> slower
> > >> when not gzipped, plus it took more than twice as long to upload the
> > >> data to HDFS-on-S3 in the first place), but still probably relevant.
> > >>
> > >> Andreas
> > >>
> > >> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
> > >>> I'm running custom map programs written in C++. What the programs do
> > is
> > >> very
> > >>> simple. For example, in program 2, for each input line        ID
> node1
> > >> node2
> > >>> ... nodeN
> > >>> the program outputs
> > >>>         node1 ID
> > >>>         node2 ID
> > >>>         ...
> > >>>         nodeN ID
> > >>>
> > >>> Each node has 4GB to 8GB of memory. The java memory setting is
> > -Xmx300m.
> > >>>
> > >>> I agree that it depends on the scripts. I tried replicating the
> > >> computation
> > >>> for each input line by 10 times and saw significantly better
> speedup.
> > >> But it
> > >>> is still pretty bad that Hadoop streaming has such big overhead for
> > >> simple
> > >>> programs.
> > >>>
> > >>> I also tried writing program 1 with Hadoop Java API. I got almost
> > 1000%
> > >>> speed up on the cluster.
> > >>>
> > >>> Lin
> > >>>
> > >>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <
> > munkey906@gmail.com>
> > >>> wrote:
> > >>>
> > >>>> are you running a custom map script or a standard linux command
> like
> > >> WC?
> > >>>>  If
> > >>>> custom, what does your script do?
> > >>>>
> > >>>> How much ram do you have?  what are you Java memory settings?
> > >>>>
> > >>>> I used the following setup
> > >>>>
> > >>>> 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a
> 4
> > >> task
> > >>>> max.
> > >>>>
> > >>>> I got the following results
> > >>>>
> > >>>> WC 30-40% speedup
> > >>>> Sort 40% speedup
> > >>>> Grep 5X slowdown (turns out this was due to what you described
> > >> above...
> > >>>> Grep
> > >>>> is just very highly optimized for command line)
> > >>>> Custom perl script which is essentially a For loop which matches
> each
> > >> row
> > >>>> of
> > >>>> a dataset to a set of 100 categories) 60% speedup.
> > >>>>
> > >>>> So I do think that it depends on your script... and some other
> > >> settings of
> > >>>> yours.
> > >>>>
> > >>>> Theo
> > >>>>
> > >>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> > >>>>
> > >>>>> Hi,
> > >>>>>
> > >>>>> I am looking into using Hadoop streaming to parallelize some
> simple
> > >>>>> programs. So far the performance has been pretty disappointing.
> > >>>>>
> > >>>>> The cluster contains 5 nodes. Each node has two CPU cores. The
> task
> > >>>>> capacity
> > >>>>> of each node is 2. The Hadoop version is 0.15.
> > >>>>>
> > >>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> > >> in
> > >>>>> standalone (on a single CPU core). Program runs for 5 minutes on
> the
> > >>>>> Hadoop
> > >>>>> cluster and 4.5 minutes in standalone. Both programs run as
> map-only
> > >>>> jobs.
> > >>>>>
> > >>>>> I understand that there is some overhead in starting up tasks,
> > >> reading
> > >>>> to
> > >>>>> and writing from the distributed file system. But they do not seem
> > >> to
> > >>>>> explain all the overhead. Most map tasks are data-local. I
> modified
> > >>>>> program
> > >>>>> 1 to output nothing and saw the same magnitude of overhead.
> > >>>>>
> > >>>>> The output of top shows that the majority of the CPU time is
> > >> consumed by
> > >>>>> Hadoop java processes (e.g.
> > >> org.apache.hadoop.mapred.TaskTracker$Child).
> > >>>>> So
> > >>>>> I added a profile option (-agentlib:hprof=cpu=samples) to
> > >>>>> mapred.child.java.opts.
> > >>>>>
> > >>>>> The profile results show that most of CPU time is spent in the
> > >> following
> > >>>>> methods
> > >>>>>
> > >>>>>   rank   self  accum   count trace method
> > >>>>>
> > >>>>>   1 23.76% 23.76%    1246 300472
> > >>>> java.lang.UNIXProcess.waitForProcessExit
> > >>>>>
> > >>>>>   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> > >>>>>
> > >>>>>   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> > >>>>>
> > >>>>>   4 16.15% 87.32%     847 300478
> java.io.FileOutputStream.writeBytes
> > >>>>>
> > >>>>> And their stack traces show that these methods are for interacting
> > >> with
> > >>>>> the
> > >>>>> map program.
> > >>>>>
> > >>>>>
> > >>>>> TRACE 300472:
> > >>>>>
> > >>>>>
> > >>>>>  java.lang.UNIXProcess.waitForProcessExit(
> > >> UNIXProcess.java:Unknownline)
> > >>>>>
> > >>>>>        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> > >>>>>
> > >>>>>        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> > >>>>>
> > >>>>> TRACE 300474:
> > >>>>>
> > >>>>>        java.io.FileInputStream.readBytes(
> > >> FileInputStream.java:Unknown
> > >>>>> line)
> > >>>>>
> > >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> > >>>>>
> > >>>>>        java.io.BufferedInputStream.read1(BufferedInputStream.java
> > >> :256)
> > >>>>>
> > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> > >> :317)
> > >>>>>
> > >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
> > >> :218)
> > >>>>>
> > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> > >> :237)
> > >>>>>
> > >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >>>>>
> > >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > >>>>> LineRecordReader.java:136)
> > >>>>>
> > >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > >>>>> UTF8ByteArrayUtils.java:157)
> > >>>>>
> > >>>>>        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> > >>>>> PipeMapRed.java:348)
> > >>>>>
> > >>>>> TRACE 300479:
> > >>>>>
> > >>>>>        java.io.FileInputStream.readBytes(
> > >> FileInputStream.java:Unknown
> > >>>>> line)
> > >>>>>
> > >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> > >>>>>
> > >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
> > >> :218)
> > >>>>>
> > >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> > >> :237)
> > >>>>>
> > >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> > >>>>>
> > >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
> > >>>>> LineRecordReader.java:136)
> > >>>>>
> > >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> > >>>>> UTF8ByteArrayUtils.java:157)
> > >>>>>
> > >>>>>        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> > >>>>> PipeMapRed.java:399)
> > >>>>>
> > >>>>> TRACE 300478:
> > >>>>>
> > >>>>>
> > >>>>>  java.io.FileOutputStream.writeBytes(
> > >> FileOutputStream.java:Unknownline)
> > >>>>>
> > >>>>>        java.io.FileOutputStream.write(FileOutputStream.java:260)
> > >>>>>
> > >>>>>        java.io.BufferedOutputStream.flushBuffer(
> > >>>> BufferedOutputStream.java
> > >>>>> :65)
> > >>>>>
> > >>>>>        java.io.BufferedOutputStream.flush(
> BufferedOutputStream.java
> > >> :123)
> > >>>>>
> > >>>>>        java.io.BufferedOutputStream.flush(
> BufferedOutputStream.java
> > >> :124)
> > >>>>>
> > >>>>>        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> > >>>>>
> > >>>>>        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
> > >> :96)
> > >>>>>
> > >>>>>        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> > >>>>>
> > >>>>>        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> > >>>>>        org.apache.hadoop.mapred.TaskTracker$Child.main(
> > >> TaskTracker.java
> > >>>>> :1760)
> > >>>>>
> > >>>>>
> > >>>>> I don't understand why Hadoop streaming needs so much CPU time to
> > >> read
> > >>>>> from
> > >>>>> and write to the map program. Note it takes 23.67% time to read
> from
> > >> the
> > >>>>> standard error of the map program while the program does not
> output
> > >> any
> > >>>>> error at all!
> > >>>>>
> > >>>>> Does anyone know any way to get rid of this seemingly unnecessary
> > >>>> overhead
> > >>>>> in Hadoop streaming?
> > >>>>>
> > >>>>> Thanks,
> > >>>>>
> > >>>>> Lin
> > >>>>>
> > >>>>
> > >>>>
> > >>>>
> > >>>> --
> > >>>> Theodore Van Rooy
> > >>>> http://greentheo.scroggles.com
> > >>>>
> > >>
> >
> >
>



-- 
Travis Brady
www.mochiads.com

Re: Hadoop streaming performance problem

Posted by Ted Dunning <td...@veoh.com>.

This seems a bit surprising.  In my experience, well-written Java is
generally just about as fast as C++, especially for I/O-bound work.  The
exceptions are:

- Java startup is still slow.  This shouldn't matter much here because you
are using streaming anyway, so you have Java startup + C startup.

- Java memory use is larger than that of a perfectly written C++ program and
probably always will be.  Of course, getting that perfectly written C++
program is pretty difficult.

If the program in question is the one you posted earlier, it is very hard to
imagine that java would not saturate the I/O channels.  Can you confirm that
you are literally doing:

    for (String node : line.split(" ")) {
      out.collect(node, key);
    }

And finding a serious difference in speed?
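
For reference, a minimal sketch of that map-only job against the native Java
API might look like the following. It is written against the later generic
org.apache.hadoop.mapred interfaces, so the exact signatures differ slightly
from 0.15, and the class name is invented for the example.

    // Hypothetical mapper: for an input line "ID node1 node2 ... nodeN",
    // emit one (nodeK, ID) pair per node.
    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class InvertEdgesMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> out, Reporter reporter)
          throws IOException {
        String[] fields = line.toString().split("\\s+");
        if (fields.length < 2) {
          return;                                // skip empty or malformed lines
        }
        Text id = new Text(fields[0]);
        for (int i = 1; i < fields.length; i++) {
          out.collect(new Text(fields[i]), id);  // node -> ID
        }
      }
    }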

On 3/31/08 3:23 PM, "lin" <no...@gmail.com> wrote:

> Java seems to be too slow. I rewrote the
> first program in Java and it runs 4 to 5 times slower than the C++ one.


Re: Hadoop streaming performance problem

Posted by lin <no...@gmail.com>.
Well, we would like to use Hadoop streaming because our current system is in
C++ and it is easier to migrate to Hadoop streaming. Also, we have very
strict performance requirements, and Java seems to be too slow: I rewrote the
first program in Java and it runs 4 to 5 times slower than the C++ one.

On Mon, Mar 31, 2008 at 3:15 PM, Ted Dunning <td...@veoh.com> wrote:

>
>
> Hadoop can't split a gzipped file so you will only get as many maps as you
> have files.
>
> Why the obsession with hadoop streaming?  It is at best a jury rigged
> solution.
>
>
> On 3/31/08 3:12 PM, "lin" <no...@gmail.com> wrote:
>
> > Does Hadoop automatically decompress the gzipped file? I only have a
> single
> > input file. Does it have to be splitted and then gzipped?
> >
> > I gzipped the input file and Hadoop only created one map task. Still
> java is
> > using more than 90% CPU.
> >
> > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org>
> > wrote:
> >
> >> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> >> provide the input files gzipped. Not great difference (e.g. 50% slower
> >> when not gzipped, plus it took more than twice as long to upload the
> >> data to HDFS-on-S3 in the first place), but still probably relevant.
> >>
> >> Andreas
> >>
> >> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
> >>> I'm running custom map programs written in C++. What the programs do
> is
> >> very
> >>> simple. For example, in program 2, for each input line        ID node1
> >> node2
> >>> ... nodeN
> >>> the program outputs
> >>>         node1 ID
> >>>         node2 ID
> >>>         ...
> >>>         nodeN ID
> >>>
> >>> Each node has 4GB to 8GB of memory. The java memory setting is
> -Xmx300m.
> >>>
> >>> I agree that it depends on the scripts. I tried replicating the
> >> computation
> >>> for each input line by 10 times and saw significantly better speedup.
> >> But it
> >>> is still pretty bad that Hadoop streaming has such big overhead for
> >> simple
> >>> programs.
> >>>
> >>> I also tried writing program 1 with Hadoop Java API. I got almost
> 1000%
> >>> speed up on the cluster.
> >>>
> >>> Lin
> >>>
> >>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <
> munkey906@gmail.com>
> >>> wrote:
> >>>
> >>>> are you running a custom map script or a standard linux command like
> >> WC?
> >>>>  If
> >>>> custom, what does your script do?
> >>>>
> >>>> How much ram do you have?  what are you Java memory settings?
> >>>>
> >>>> I used the following setup
> >>>>
> >>>> 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
> >> task
> >>>> max.
> >>>>
> >>>> I got the following results
> >>>>
> >>>> WC 30-40% speedup
> >>>> Sort 40% speedup
> >>>> Grep 5X slowdown (turns out this was due to what you described
> >> above...
> >>>> Grep
> >>>> is just very highly optimized for command line)
> >>>> Custom perl script which is essentially a For loop which matches each
> >> row
> >>>> of
> >>>> a dataset to a set of 100 categories) 60% speedup.
> >>>>
> >>>> So I do think that it depends on your script... and some other
> >> settings of
> >>>> yours.
> >>>>
> >>>> Theo
> >>>>
> >>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> I am looking into using Hadoop streaming to parallelize some simple
> >>>>> programs. So far the performance has been pretty disappointing.
> >>>>>
> >>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
> >>>>> capacity
> >>>>> of each node is 2. The Hadoop version is 0.15.
> >>>>>
> >>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> >> in
> >>>>> standalone (on a single CPU core). Program runs for 5 minutes on the
> >>>>> Hadoop
> >>>>> cluster and 4.5 minutes in standalone. Both programs run as map-only
> >>>> jobs.
> >>>>>
> >>>>> I understand that there is some overhead in starting up tasks,
> >> reading
> >>>> to
> >>>>> and writing from the distributed file system. But they do not seem
> >> to
> >>>>> explain all the overhead. Most map tasks are data-local. I modified
> >>>>> program
> >>>>> 1 to output nothing and saw the same magnitude of overhead.
> >>>>>
> >>>>> The output of top shows that the majority of the CPU time is
> >> consumed by
> >>>>> Hadoop java processes (e.g.
> >> org.apache.hadoop.mapred.TaskTracker$Child).
> >>>>> So
> >>>>> I added a profile option (-agentlib:hprof=cpu=samples) to
> >>>>> mapred.child.java.opts.
> >>>>>
> >>>>> The profile results show that most of CPU time is spent in the
> >> following
> >>>>> methods
> >>>>>
> >>>>>   rank   self  accum   count trace method
> >>>>>
> >>>>>   1 23.76% 23.76%    1246 300472
> >>>> java.lang.UNIXProcess.waitForProcessExit
> >>>>>
> >>>>>   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> >>>>>
> >>>>>   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> >>>>>
> >>>>>   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> >>>>>
> >>>>> And their stack traces show that these methods are for interacting
> >> with
> >>>>> the
> >>>>> map program.
> >>>>>
> >>>>>
> >>>>> TRACE 300472:
> >>>>>
> >>>>>
> >>>>>  java.lang.UNIXProcess.waitForProcessExit(
> >> UNIXProcess.java:Unknownline)
> >>>>>
> >>>>>        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >>>>>
> >>>>>        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >>>>>
> >>>>> TRACE 300474:
> >>>>>
> >>>>>        java.io.FileInputStream.readBytes(
> >> FileInputStream.java:Unknown
> >>>>> line)
> >>>>>
> >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> >>>>>
> >>>>>        java.io.BufferedInputStream.read1(BufferedInputStream.java
> >> :256)
> >>>>>
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> >> :317)
> >>>>>
> >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
> >> :218)
> >>>>>
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> >> :237)
> >>>>>
> >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> >>>>>
> >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
> >>>>> LineRecordReader.java:136)
> >>>>>
> >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> >>>>> UTF8ByteArrayUtils.java:157)
> >>>>>
> >>>>>        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> >>>>> PipeMapRed.java:348)
> >>>>>
> >>>>> TRACE 300479:
> >>>>>
> >>>>>        java.io.FileInputStream.readBytes(
> >> FileInputStream.java:Unknown
> >>>>> line)
> >>>>>
> >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> >>>>>
> >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
> >> :218)
> >>>>>
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> >> :237)
> >>>>>
> >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> >>>>>
> >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
> >>>>> LineRecordReader.java:136)
> >>>>>
> >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> >>>>> UTF8ByteArrayUtils.java:157)
> >>>>>
> >>>>>        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> >>>>> PipeMapRed.java:399)
> >>>>>
> >>>>> TRACE 300478:
> >>>>>
> >>>>>
> >>>>>  java.io.FileOutputStream.writeBytes(
> >> FileOutputStream.java:Unknownline)
> >>>>>
> >>>>>        java.io.FileOutputStream.write(FileOutputStream.java:260)
> >>>>>
> >>>>>        java.io.BufferedOutputStream.flushBuffer(
> >>>> BufferedOutputStream.java
> >>>>> :65)
> >>>>>
> >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> >> :123)
> >>>>>
> >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> >> :124)
> >>>>>
> >>>>>        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> >>>>>
> >>>>>        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
> >> :96)
> >>>>>
> >>>>>        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>>
> >>>>>        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >>>>>        org.apache.hadoop.mapred.TaskTracker$Child.main(
> >> TaskTracker.java
> >>>>> :1760)
> >>>>>
> >>>>>
> >>>>> I don't understand why Hadoop streaming needs so much CPU time to
> >> read
> >>>>> from
> >>>>> and write to the map program. Note it takes 23.67% time to read from
> >> the
> >>>>> standard error of the map program while the program does not output
> >> any
> >>>>> error at all!
> >>>>>
> >>>>> Does anyone know any way to get rid of this seemingly unnecessary
> >>>> overhead
> >>>>> in Hadoop streaming?
> >>>>>
> >>>>> Thanks,
> >>>>>
> >>>>> Lin
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Theodore Van Rooy
> >>>> http://greentheo.scroggles.com
> >>>>
> >>
>
>

Re: Hadoop streaming performance problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Because many, many people do not enjoy that verbose language, you know.
(I just replaced an old, 754-line task with a ported one that takes 89
lines.)

So, as crazy as it might sound to some here, Hadoop streaming is the primary
interface for probably a sizeable part of the "user population". (Users
being developers writing workloads for Hadoop.)

Andreas

Am Montag, den 31.03.2008, 15:15 -0700 schrieb Ted Dunning:
> 
> Hadoop can't split a gzipped file so you will only get as many maps as you
> have files.
> 
> Why the obsession with hadoop streaming?  It is at best a jury rigged
> solution.
> 
> 
> On 3/31/08 3:12 PM, "lin" <no...@gmail.com> wrote:
> 
> > Does Hadoop automatically decompress the gzipped file? I only have a single
> > input file. Does it have to be splitted and then gzipped?
> > 
> > I gzipped the input file and Hadoop only created one map task. Still java is
> > using more than 90% CPU.
> > 
> > On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org>
> > wrote:
> > 
> >> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> >> provide the input files gzipped. Not great difference (e.g. 50% slower
> >> when not gzipped, plus it took more than twice as long to upload the
> >> data to HDFS-on-S3 in the first place), but still probably relevant.
> >> 
> >> Andreas
> >> 
> >> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
> >>> I'm running custom map programs written in C++. What the programs do is
> >> very
> >>> simple. For example, in program 2, for each input line        ID node1
> >> node2
> >>> ... nodeN
> >>> the program outputs
> >>>         node1 ID
> >>>         node2 ID
> >>>         ...
> >>>         nodeN ID
> >>> 
> >>> Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> >>> 
> >>> I agree that it depends on the scripts. I tried replicating the
> >> computation
> >>> for each input line by 10 times and saw significantly better speedup.
> >> But it
> >>> is still pretty bad that Hadoop streaming has such big overhead for
> >> simple
> >>> programs.
> >>> 
> >>> I also tried writing program 1 with Hadoop Java API. I got almost 1000%
> >>> speed up on the cluster.
> >>> 
> >>> Lin
> >>> 
> >>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <mu...@gmail.com>
> >>> wrote:
> >>> 
> >>>> are you running a custom map script or a standard linux command like
> >> WC?
> >>>>  If
> >>>> custom, what does your script do?
> >>>> 
> >>>> How much ram do you have?  what are you Java memory settings?
> >>>> 
> >>>> I used the following setup
> >>>> 
> >>>> 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
> >> task
> >>>> max.
> >>>> 
> >>>> I got the following results
> >>>> 
> >>>> WC 30-40% speedup
> >>>> Sort 40% speedup
> >>>> Grep 5X slowdown (turns out this was due to what you described
> >> above...
> >>>> Grep
> >>>> is just very highly optimized for command line)
> >>>> Custom perl script which is essentially a For loop which matches each
> >> row
> >>>> of
> >>>> a dataset to a set of 100 categories) 60% speedup.
> >>>> 
> >>>> So I do think that it depends on your script... and some other
> >> settings of
> >>>> yours.
> >>>> 
> >>>> Theo
> >>>> 
> >>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
> >>>> 
> >>>>> Hi,
> >>>>> 
> >>>>> I am looking into using Hadoop streaming to parallelize some simple
> >>>>> programs. So far the performance has been pretty disappointing.
> >>>>> 
> >>>>> The cluster contains 5 nodes. Each node has two CPU cores. The task
> >>>>> capacity
> >>>>> of each node is 2. The Hadoop version is 0.15.
> >>>>> 
> >>>>> Program 1 runs for 3.5 minutes on the Hadoop cluster and 2 minutes
> >> in
> >>>>> standalone (on a single CPU core). Program runs for 5 minutes on the
> >>>>> Hadoop
> >>>>> cluster and 4.5 minutes in standalone. Both programs run as map-only
> >>>> jobs.
> >>>>> 
> >>>>> I understand that there is some overhead in starting up tasks,
> >> reading
> >>>> to
> >>>>> and writing from the distributed file system. But they do not seem
> >> to
> >>>>> explain all the overhead. Most map tasks are data-local. I modified
> >>>>> program
> >>>>> 1 to output nothing and saw the same magnitude of overhead.
> >>>>> 
> >>>>> The output of top shows that the majority of the CPU time is
> >> consumed by
> >>>>> Hadoop java processes (e.g.
> >> org.apache.hadoop.mapred.TaskTracker$Child).
> >>>>> So
> >>>>> I added a profile option (-agentlib:hprof=cpu=samples) to
> >>>>> mapred.child.java.opts.
> >>>>> 
> >>>>> The profile results show that most of CPU time is spent in the
> >> following
> >>>>> methods
> >>>>> 
> >>>>>   rank   self  accum   count trace method
> >>>>> 
> >>>>>   1 23.76% 23.76%    1246 300472
> >>>> java.lang.UNIXProcess.waitForProcessExit
> >>>>> 
> >>>>>   2 23.74% 47.50%    1245 300474 java.io.FileInputStream.readBytes
> >>>>> 
> >>>>>   3 23.67% 71.17%    1241 300479 java.io.FileInputStream.readBytes
> >>>>> 
> >>>>>   4 16.15% 87.32%     847 300478 java.io.FileOutputStream.writeBytes
> >>>>> 
> >>>>> And their stack traces show that these methods are for interacting
> >> with
> >>>>> the
> >>>>> map program.
> >>>>> 
> >>>>> 
> >>>>> TRACE 300472:
> >>>>> 
> >>>>> 
> >>>>>  java.lang.UNIXProcess.waitForProcessExit(
> >> UNIXProcess.java:Unknownline)
> >>>>> 
> >>>>>        java.lang.UNIXProcess.access$900(UNIXProcess.java:20)
> >>>>> 
> >>>>>        java.lang.UNIXProcess$1$1.run(UNIXProcess.java:132)
> >>>>> 
> >>>>> TRACE 300474:
> >>>>> 
> >>>>>        java.io.FileInputStream.readBytes(
> >> FileInputStream.java:Unknown
> >>>>> line)
> >>>>> 
> >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> >>>>> 
> >>>>>        java.io.BufferedInputStream.read1(BufferedInputStream.java
> >> :256)
> >>>>> 
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> >> :317)
> >>>>> 
> >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
> >> :218)
> >>>>> 
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> >> :237)
> >>>>> 
> >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> >>>>> 
> >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
> >>>>> LineRecordReader.java:136)
> >>>>> 
> >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> >>>>> UTF8ByteArrayUtils.java:157)
> >>>>> 
> >>>>>        org.apache.hadoop.streaming.PipeMapRed$MROutputThread.run(
> >>>>> PipeMapRed.java:348)
> >>>>> 
> >>>>> TRACE 300479:
> >>>>> 
> >>>>>        java.io.FileInputStream.readBytes(
> >> FileInputStream.java:Unknown
> >>>>> line)
> >>>>> 
> >>>>>        java.io.FileInputStream.read(FileInputStream.java:199)
> >>>>> 
> >>>>>        java.io.BufferedInputStream.fill(BufferedInputStream.java
> >> :218)
> >>>>> 
> >>>>>        java.io.BufferedInputStream.read(BufferedInputStream.java
> >> :237)
> >>>>> 
> >>>>>        java.io.FilterInputStream.read(FilterInputStream.java:66)
> >>>>> 
> >>>>>        org.apache.hadoop.mapred.LineRecordReader.readLine(
> >>>>> LineRecordReader.java:136)
> >>>>> 
> >>>>>        org.apache.hadoop.streaming.UTF8ByteArrayUtils.readLine(
> >>>>> UTF8ByteArrayUtils.java:157)
> >>>>> 
> >>>>>        org.apache.hadoop.streaming.PipeMapRed$MRErrorThread.run(
> >>>>> PipeMapRed.java:399)
> >>>>> 
> >>>>> TRACE 300478:
> >>>>> 
> >>>>> 
> >>>>>  java.io.FileOutputStream.writeBytes(
> >> FileOutputStream.java:Unknownline)
> >>>>> 
> >>>>>        java.io.FileOutputStream.write(FileOutputStream.java:260)
> >>>>> 
> >>>>>        java.io.BufferedOutputStream.flushBuffer(
> >>>> BufferedOutputStream.java
> >>>>> :65)
> >>>>> 
> >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> >> :123)
> >>>>> 
> >>>>>        java.io.BufferedOutputStream.flush(BufferedOutputStream.java
> >> :124)
> >>>>> 
> >>>>>        java.io.DataOutputStream.flush(DataOutputStream.java:106)
> >>>>> 
> >>>>>        org.apache.hadoop.streaming.PipeMapper.map(PipeMapper.java
> >> :96)
> >>>>> 
> >>>>>        org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:50)
> >>>>> 
> >>>>>        org.apache.hadoop.mapred.MapTask.run(MapTask.java:192)
> >>>>>        org.apache.hadoop.mapred.TaskTracker$Child.main(
> >> TaskTracker.java
> >>>>> :1760)
> >>>>> 
> >>>>> 
> >>>>> I don't understand why Hadoop streaming needs so much CPU time to
> >> read
> >>>>> from
> >>>>> and write to the map program. Note it takes 23.67% time to read from
> >> the
> >>>>> standard error of the map program while the program does not output
> >> any
> >>>>> error at all!
> >>>>> 
> >>>>> Does anyone know any way to get rid of this seemingly unnecessary
> >>>> overhead
> >>>>> in Hadoop streaming?
> >>>>> 
> >>>>> Thanks,
> >>>>> 
> >>>>> Lin
> >>>>> 
> >>>> 
> >>>> 
> >>>> 
> >>>> --
> >>>> Theodore Van Rooy
> >>>> http://greentheo.scroggles.com
> >>>> 
> >> 

Re: Hadoop streaming performance problem

Posted by Ted Dunning <td...@veoh.com>.

Hadoop can't split a gzipped file, so you will only get as many maps as you
have files.

Why the obsession with Hadoop streaming?  It is at best a jury rigged
solution.


On 3/31/08 3:12 PM, "lin" <no...@gmail.com> wrote:

> Does Hadoop automatically decompress the gzipped file? I only have a single
> input file. Does it have to be splitted and then gzipped?
> 
> I gzipped the input file and Hadoop only created one map task. Still java is
> using more than 90% CPU.
> 
> On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org>
> wrote:
> 
>> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
>> provide the input files gzipped. Not great difference (e.g. 50% slower
>> when not gzipped, plus it took more than twice as long to upload the
>> data to HDFS-on-S3 in the first place), but still probably relevant.
>> 
>> Andreas
>> 
>> Am Montag, den 31.03.2008, 13:30 -0700 schrieb lin:
>>> I'm running custom map programs written in C++. What the programs do is
>> very
>>> simple. For example, in program 2, for each input line        ID node1
>> node2
>>> ... nodeN
>>> the program outputs
>>>         node1 ID
>>>         node2 ID
>>>         ...
>>>         nodeN ID
>>> 
>>> Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
>>> 
>>> I agree that it depends on the scripts. I tried replicating the
>> computation
>>> for each input line by 10 times and saw significantly better speedup.
>> But it
>>> is still pretty bad that Hadoop streaming has such big overhead for
>> simple
>>> programs.
>>> 
>>> I also tried writing program 1 with Hadoop Java API. I got almost 1000%
>>> speed up on the cluster.
>>> 
>>> Lin
>>> 
>>> On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <mu...@gmail.com>
>>> wrote:
>>> 
>>>> are you running a custom map script or a standard linux command like
>> WC?
>>>>  If
>>>> custom, what does your script do?
>>>> 
>>>> How much ram do you have?  what are you Java memory settings?
>>>> 
>>>> I used the following setup
>>>> 
>>>> 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4
>> task
>>>> max.
>>>> 
>>>> I got the following results
>>>> 
>>>> WC 30-40% speedup
>>>> Sort 40% speedup
>>>> Grep 5X slowdown (turns out this was due to what you described
>> above...
>>>> Grep
>>>> is just very highly optimized for command line)
>>>> Custom perl script which is essentially a For loop which matches each
>> row
>>>> of
>>>> a dataset to a set of 100 categories) 60% speedup.
>>>> 
>>>> So I do think that it depends on your script... and some other
>> settings of
>>>> yours.
>>>> 
>>>> Theo
>>>> 
>>>> On Mon, Mar 31, 2008 at 2:00 PM, lin <no...@gmail.com> wrote:
>>>> 
>>>>> Hi,
>>>>> 


Re: Hadoop streaming performance problem

Posted by lin <no...@gmail.com>.
Does Hadoop automatically decompress the gzipped file? I only have a single
input file. Does it have to be split into pieces and then gzipped?

I gzipped the input file and Hadoop created only one map task. Java is still
using more than 90% of the CPU.
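For what it's worth, Hadoop's text input does decompress gzipped files transparently, but gzip is not a splittable format, so a single .gz input always becomes a single split and therefore a single map task; to get parallelism the data has to be divided into several files that are gzipped individually. The small sketch below illustrates that behaviour. It is written against the later (1.x-era) org.apache.hadoop.mapred API rather than the 0.15 release discussed here, and the input paths are placeholders, so treat it as an illustration only.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Counts how many splits (and therefore map tasks) an input produces.
    // The paths are hypothetical; point them at real local files before running.
    public class SplitCount {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SplitCount.class);

        TextInputFormat format = new TextInputFormat();
        format.configure(conf);   // initializes the compression codec lookup

        // A single gzipped file: gzip cannot be split, so this yields one split.
        FileInputFormat.setInputPaths(conf, new Path("file:///tmp/input.gz"));
        InputSplit[] gz = format.getSplits(conf, 8);   // ask for up to 8 splits
        System.out.println("gzipped input    -> " + gz.length + " split(s)");

        // The same data as plain text can be divided into multiple splits.
        FileInputFormat.setInputPaths(conf, new Path("file:///tmp/input.txt"));
        InputSplit[] txt = format.getSplits(conf, 8);
        System.out.println("plain-text input -> " + txt.length + " split(s)");
      }
    }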

On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka <an...@kostyrka.org>
wrote:

> Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
> provide the input files gzipped. Not great difference (e.g. 50% slower
> when not gzipped, plus it took more than twice as long to upload the
> data to HDFS-on-S3 in the first place), but still probably relevant.
>
> Andreas

Re: Hadoop streaming performance problem

Posted by Andreas Kostyrka <an...@kostyrka.org>.
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
provide the input files gzipped. Not a huge difference (roughly 50% slower
when the input is not gzipped, plus it took more than twice as long to
upload the uncompressed data to HDFS-on-S3 in the first place), but still
probably relevant.

Andreas

On Monday, 31.03.2008, at 13:30 -0700, lin wrote:
> I'm running custom map programs written in C++. What the programs do is very
> simple. For example, in program 2, for each input line        ID node1 node2
> ... nodeN
> the program outputs
>         node1 ID
>         node2 ID
>         ...
>         nodeN ID
> 
> Each node has 4GB to 8GB of memory. The java memory setting is -Xmx300m.
> 
> I agree that it depends on the scripts. I tried replicating the computation
> for each input line by 10 times and saw significantly better speedup. But it
> is still pretty bad that Hadoop streaming has such big overhead for simple
> programs.
> 
> I also tried writing program 1 with Hadoop Java API. I got almost 1000%
> speed up on the cluster.
> 
> Lin

Re: Hadoop streaming performance problem

Posted by lin <no...@gmail.com>.
I'm running custom map programs written in C++. What the programs do is very
simple. For example, program 2 reads each input line of the form

        ID node1 node2 ... nodeN

and outputs

        node1 ID
        node2 ID
        ...
        nodeN ID

Each node has 4 GB to 8 GB of memory. The Java heap setting is -Xmx300m.

I agree that it depends on the scripts. I tried replicating the computation
for each input line 10 times and saw a significantly better speedup, but it
is still disappointing that Hadoop streaming has such a large overhead for
simple programs.

I also tried writing program 1 with the Hadoop Java API and got almost a
1000% speedup on the cluster.
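To make that comparison concrete, here is a minimal sketch of what a native Java mapper for the transformation described above (program 2's line inversion) might look like. It is written against the classic org.apache.hadoop.mapred API in its later, generic form (the 0.15 signatures differ slightly), the class and field names are purely illustrative, and it is not the code actually benchmarked in this thread.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical mapper for program 2's transformation:
    // input "ID node1 node2 ... nodeN" -> output (node1, ID), ..., (nodeN, ID)
    public class InvertLineMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      private final Text outKey = new Text();
      private final Text outValue = new Text();

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, Text> output, Reporter reporter)
          throws IOException {
        StringTokenizer tokens = new StringTokenizer(line.toString());
        if (!tokens.hasMoreTokens()) {
          return;                            // ignore empty lines
        }
        outValue.set(tokens.nextToken());    // the first token is the ID
        while (tokens.hasMoreTokens()) {
          outKey.set(tokens.nextToken());    // every remaining token becomes a key
          output.collect(outKey, outValue);
        }
      }
    }

Run as a map-only job (zero reduce tasks), this keeps every record inside the task JVM instead of round-tripping it through the stdin/stdout of an external process, which is presumably where most of the reported difference comes from.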

Lin

On Mon, Mar 31, 2008 at 1:10 PM, Theodore Van Rooy <mu...@gmail.com>
wrote:

> are you running a custom map script or a standard linux command like WC?
>  If
> custom, what does your script do?
>
> How much ram do you have?  what are you Java memory settings?
>
> I used the following setup
>
> 2 dual core, 16 G ram, 1000MB Java heap size on an empty box with a 4 task
> max.
>
> I got the following results
>
> WC 30-40% speedup
> Sort 40% speedup
> Grep 5X slowdown (turns out this was due to what you described above...
> Grep
> is just very highly optimized for command line)
> Custom perl script which is essentially a For loop which matches each row
> of
> a dataset to a set of 100 categories) 60% speedup.
>
> So I do think that it depends on your script... and some other settings of
> yours.
>
> Theo

Re: Hadoop streaming performance problem

Posted by Theodore Van Rooy <mu...@gmail.com>.
Are you running a custom map script or a standard Linux command like wc? If
custom, what does your script do?

How much RAM do you have? What are your Java memory settings?

I used the following setup:

2 dual-core CPUs, 16 GB of RAM, and a 1000 MB Java heap size on an otherwise
empty box with a 4-task maximum.

I got the following results:

wc: 30-40% speedup
Sort: 40% speedup
Grep: 5x slowdown (it turns out this was due to what you described above...
grep is just very highly optimized for the command line)
Custom Perl script (essentially a for loop that matches each row of a dataset
against a set of 100 categories): 60% speedup

So I do think that it depends on your script... and on some other settings of
yours.

Theo

-- 
Theodore Van Rooy
http://greentheo.scroggles.com

Re: Hadoop streaming performance problem

Posted by Amareshwari Sriramadasu <am...@yahoo-inc.com>.
LineRecordReader.readLine() was deprecated by HADOOP-2285 (http://issues.apache.org/jira/browse/HADOOP-2285) because it was slow,
but streaming still uses the method. HADOOP-2826 (http://issues.apache.org/jira/browse/HADOOP-2826) will remove that usage in streaming.
This change should improve streaming performance. When I ran a simple cat from streaming, it took 33 seconds with HADOOP-2826 applied,
whereas with trunk it took 52 seconds.
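The scale of that effect is easy to reproduce outside Hadoop. The snippet below is a small, self-contained illustration (not the HADOOP-2826 patch itself, and not Hadoop code at all) of why assembling lines one byte at a time through a stream costs so much more CPU than a bulk line reader; the file name and line count are arbitrary test values.

    import java.io.BufferedInputStream;
    import java.io.BufferedReader;
    import java.io.ByteArrayOutputStream;
    import java.io.FileInputStream;
    import java.io.FileReader;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.io.InputStream;

    // Rough comparison of two ways to read lines from a file: one byte per
    // read() call through a buffered stream, versus BufferedReader.readLine().
    public class LineReadCost {
      public static void main(String[] args) throws IOException {
        String path = "/tmp/lines.txt";
        try (FileWriter w = new FileWriter(path)) {         // generate test input
          for (int i = 0; i < 2000000; i++) {
            w.write("id" + i + " node1 node2 node3\n");
          }
        }

        long t0 = System.currentTimeMillis();
        long lines = 0;
        try (InputStream in = new BufferedInputStream(new FileInputStream(path))) {
          ByteArrayOutputStream line = new ByteArrayOutputStream();
          int b;
          while ((b = in.read()) != -1) {                    // one call per byte
            if (b == '\n') { lines++; line.reset(); } else { line.write(b); }
          }
        }
        System.out.println("byte-at-a-time: " + lines + " lines in "
            + (System.currentTimeMillis() - t0) + " ms");

        t0 = System.currentTimeMillis();
        lines = 0;
        try (BufferedReader r = new BufferedReader(new FileReader(path))) {
          while (r.readLine() != null) {                     // one call per line
            lines++;
          }
        }
        System.out.println("readLine():     " + lines + " lines in "
            + (System.currentTimeMillis() - t0) + " ms");
      }
    }

On a typical machine the byte-at-a-time loop is several times slower, which lines up with the improvement Amareshwari reports for HADOOP-2826.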

Thanks
Amareshwari.
