Posted to common-user@hadoop.apache.org by Theodore Van Rooy <mu...@gmail.com> on 2008/03/18 23:09:21 UTC

Fastest way to do grep via hadoop streaming

I've been benchmarking hadoop streaming against just regular old command
line grep.

I set the job to run 4 tasks at a time per box, with one box (with 4
processors).  The file is a 54 GB file with <100 bytes per line (DFS block
size 128 MB).  I grep an item that shows up in about 2% of the lines in the
data set.

And then I set
-mapper "/bin/grep myregexp"
-numReduceTasks 0
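For context, a complete invocation along those lines might look like the sketch below; the streaming jar location and HDFS paths are assumptions, not taken from the post. The runnable part shows what each map task effectively does: pipe its input split through grep on stdin.

```shell
# Hypothetical full streaming command (jar and HDFS paths are made up):
#   hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming.jar \
#       -input /logs/big.log -output /tmp/grep-out \
#       -mapper "/bin/grep myregexp" -numReduceTasks 0
#
# With -numReduceTasks 0, each map task just feeds its split to grep:
printf 'foo\nbar myregexp baz\nqux\n' > /tmp/split.txt
/bin/grep myregexp < /tmp/split.txt    # prints: bar myregexp baz
```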

MapReduce gives me a time to complete on average of about 45 minutes.

Command Line Unix gives me a time to complete of about 7 minutes.

Then I did the same with a much smaller file (1 GB) and still got MR = 3 minutes,
Linux = 7 seconds.

Does anyone know of a better/faster way to do grep via streaming?

Is there a better, more optimized version written in Java or Python?

Last, why would the method I am using take so long?  I've determined that
some of the time is write time (output) from the mappers... but could it
really be that much overhead due to read time?

Thanks for your help!
-- 
Theodore Van Rooy
http://greentheo.scroggles.com

Re: Fastest way to do grep via hadoop streaming

Posted by Ted Dunning <td...@veoh.com>.
Also, streaming is not likely to be the fastest way to solve your problem
because it introduces quite a bit more copying and, even worse, context
switches into the process (java moves the data, passes it to the mapper,
reads the results).  I have seen a comment that there were flushes being
done for every line of input to the mapper, for instance.
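The extra-copying point can be felt even locally: putting one more process and a pipe in front of grep moves every byte through the kernel an extra time, which is roughly the position a streaming mapper is in. A rough sketch (file names are arbitrary; the assertion is only that both paths find the same matches -- the timing difference is machine-dependent):

```shell
# Build a ~1.3 MB text sample:
head -c 1000000 /dev/urandom | base64 > /tmp/sample.txt

# Direct read: grep reads the file itself.
grep -c 'AA' /tmp/sample.txt > /tmp/direct.out

# Piped read: cat copies every byte through a pipe first.
cat /tmp/sample.txt | grep -c 'AA' > /tmp/piped.out

# Same result either way; the pipe only adds copies and context switches.
cmp /tmp/direct.out /tmp/piped.out && echo "same match count either way"
```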


On 3/19/08 8:26 AM, "Theodore Van Rooy" <mu...@gmail.com> wrote:

> [...]


Re: Fastest way to do grep via hadoop streaming

Posted by Theodore Van Rooy <mu...@gmail.com>.
Thanks for the response, very informative!  I'll spend some time looking
through the streaming code and try to get an even better understanding of
the streaming process.

On Tue, Mar 18, 2008 at 4:44 PM, Joydeep Sen Sarma <js...@facebook.com>
wrote:

> [...]


-- 
Theodore Van Rooy
http://greentheo.scroggles.com

RE: Fastest way to do grep via hadoop streaming

Posted by Joydeep Sen Sarma <js...@facebook.com>.
i hope this is not an error in setup - but many multiples worse is not surprising (but not nice).

just think about the number of times hadoop will copy/scan data around (as opposed to 'grep' - which is probably ultra optimized by this time) ..

- starting from getting bytes out of a file - they will first be buffered in a java buffered stream (copy #1)
- then the buffered stream will be scanned for lines worth of data and then copied into a Text (#2)
- the Text will then be written out to a buffered output stream (#3) to the streaming script.
- perhaps someone can tell me why the buffered output stream is flushed on every iteration by Streaming - but it is:
        clientOut_.flush();
  in any case - that's likely a system call for every single line of input data, each one copying into kernel space (#4)

once the data comes out of grep - we get another bunch - but who cares - it's 2% of the data.

i don't know the dfs stack well enough to count copies there - but we can probably bet that there's quite a few there as well. (for one - we will be scanning the data at least once to do the crc check)

with 4 threads pounding the cpu and so much copying going around (and this is not counting that java itself is reputedly memory intensive) - we are probably memory bound by this time (which shows up as cpu bound).

sigh.
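The per-line flush cost above can be sketched locally. Using dd's block size as a stand-in for the flush granularity shows how many write syscalls "one flush per ~100-byte line" implies, compared with moving the same bytes in large buffered chunks (file paths are arbitrary):

```shell
# 10 MB of input, standing in for a stream of ~100-byte lines:
head -c 10000000 /dev/zero > /tmp/data.bin

# One read/write syscall pair per 100-byte record -- 100000 of them,
# which is what flushing after every line amounts to:
dd if=/tmp/data.bin of=/dev/null bs=100 2>/tmp/small.log
grep 'records in' /tmp/small.log       # 100000+0 records in

# The same data moved in 1 MB chunks -- only 10 syscall pairs:
dd if=/tmp/data.bin of=/dev/null bs=1000000 2>/tmp/big.log
grep 'records in' /tmp/big.log         # 10+0 records in
```

The byte count is identical in both runs; only the number of kernel crossings changes, and that is the overhead a per-line `clientOut_.flush()` buys.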




-----Original Message-----
From: Theodore Van Rooy [mailto:munkey906@gmail.com]
Sent: Tue 3/18/2008 3:09 PM
To: core-user@hadoop.apache.org
Subject: Fastest way to do grep via hadoop streaming
 
[...]