Posted to mapreduce-user@hadoop.apache.org by Manoj Babu <ma...@gmail.com> on 2012/10/17 04:27:10 UTC

Reg LZO compression

Hi All,

When using LZO compression the file size is drastically reduced and the
number of mappers goes down, but the overall execution time increases. I
assume this is because the mappers still deal with the same total amount
of uncompressed data, only spread across fewer tasks.

Is this the expected behavior?

Cheers!
Manoj.

Re: Reg LZO compression

Posted by lohit <lo...@gmail.com>.
As Robert said, if your job is mainly IO-intensive and the CPUs are idle,
then using LZO should improve your overall job performance.
In your case it looks like the job you are running is not IO-bound and
instead spends its CPU time compressing and decompressing the data.
It also depends on the kind of data. Some datasets are barely compressible
(e.g. random data); in those cases you end up wasting CPU cycles, and it
is better to turn off compression for such jobs.
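
For example, something like the following (an untested sketch using the
stock Hadoop job API; double-check the property and method names against
your Hadoop version) turns compression off for such a job:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class NoCompressionJobSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "no-compression-job");
        // Do not compress the final job output; for near-random data the
        // CPU cost of compressing outweighs the small IO saving.
        FileOutputFormat.setCompressOutput(job, false);
        // Intermediate map output compression is controlled separately.
        job.getConfiguration().setBoolean("mapred.compress.map.output", false);
      }
    }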

2012/10/16 Robert Dyer <ps...@gmail.com>

> Hi Manoj,
>
> If the data is the same for both tests and the number of mappers is
> fewer, then each mapper has more (uncompressed) data to process. Thus
> each mapper should take longer and the overall execution time should
> increase.
>
> As a simple example: if your data is 128MB uncompressed it may use 2
> mappers, each processing 64MB of data (1 HDFS block per map task).
> However, if you compress the data and it is now, say, 60MB, then one map
> task will get the entire input file (a plain .lzo file is not splittable,
> so it cannot be divided among mappers), decompress it back to 128MB, and
> process it.
>
> On Tue, Oct 16, 2012 at 9:27 PM, Manoj Babu <ma...@gmail.com> wrote:
> > Hi All,
> >
> > When using LZO compression the file size is drastically reduced and
> > the number of mappers goes down, but the overall execution time
> > increases. I assume this is because the mappers still deal with the
> > same total amount of uncompressed data, only spread across fewer
> > tasks.
> >
> > Is this the expected behavior?
> >
> > Cheers!
> > Manoj.
> >
>



-- 
Have a Nice Day!
Lohit

Re: Reg LZO compression

Posted by Robert Dyer <ps...@gmail.com>.
Hi Manoj,

If the data is the same for both tests and the number of mappers is
fewer, then each mapper has more (uncompressed) data to process. Thus
each mapper should take longer and the overall execution time should
increase.

As a simple example: if your data is 128MB uncompressed it may use 2
mappers, each processing 64MB of data (1 HDFS block per map task).
However, if you compress the data and it is now, say, 60MB, then one map
task will get the entire input file (a plain .lzo file is not splittable,
so it cannot be divided among mappers), decompress it back to 128MB, and
process it.
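
If you want the compressed input split across several mappers again, the
usual trick is to index the .lzo files. Below is a rough sketch assuming
the third-party hadoop-lzo package (the com.hadoop.* class names come
from that project, so treat them as an assumption and check your
version):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import com.hadoop.compression.lzo.LzoIndexer;    // from hadoop-lzo
    import com.hadoop.mapreduce.LzoTextInputFormat;  // from hadoop-lzo

    public class LzoSplitJobSetup {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // 1) Write a .index file next to the .lzo file so the block
        //    boundaries are known and the file becomes splittable.
        new LzoIndexer(conf).index(new Path("/data/input/big.lzo"));

        // 2) Use the LZO-aware input format so each split gets its own
        //    map task instead of one mapper reading the whole file.
        Job job = new Job(conf, "lzo-split-job");
        job.setInputFormatClass(LzoTextInputFormat.class);
      }
    }

With the index in place, an .lzo file spanning several HDFS blocks can
again be read by several mappers, at the cost of the extra indexing pass.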

On Tue, Oct 16, 2012 at 9:27 PM, Manoj Babu <ma...@gmail.com> wrote:
> Hi All,
>
> When using LZO compression the file size is drastically reduced and the
> number of mappers goes down, but the overall execution time increases. I
> assume this is because the mappers still deal with the same total amount
> of uncompressed data, only spread across fewer tasks.
>
> Is this the expected behavior?
>
> Cheers!
> Manoj.
>
