Posted to mapreduce-user@hadoop.apache.org by Sean Bigdatafun <se...@gmail.com> on 2011/01/31 09:26:10 UTC

Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

GZIP is not splittable. Does that mean a GZIP block compressed sequencefile
can't take advantage of MR parallelism?

How can I control the size of the block to be compressed in a SequenceFile?

-- 
--Sean

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

Posted by Harsh J <qw...@gmail.com>.

On Mon, Jan 31, 2011 at 1:56 PM, Sean Bigdatafun
<se...@gmail.com> wrote:
> How can I control the size of the block to be compressed in a SequenceFile?

It is specified when creating a SequenceFile.Writer object. See the various
SequenceFile.createWriter() overloads.

-- 
Harsh J
www.harshj.com

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

Posted by Harsh J <qw...@gmail.com>.

Hello,

On Mon, Jan 31, 2011 at 10:41 PM, Sean Bigdatafun
<se...@gmail.com> wrote:
>
>
> On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes <Ni...@basjes.nl> wrote:
>>
>> Hi,
>>
>> 2011/1/31 Sean Bigdatafun <se...@gmail.com>:
>> > GZIP is not splittable.
>>
>> Correct, gzip is a stream compression system, which effectively means
>> you can only start decompressing at the beginning of the data.
>>
>> > Does that mean a GZIP block compressed sequencefile can't take advantage
>> > of MR parallelism?
>>
>> AFAIK it should be splittable in the same blocks as the compression was
>> done.
>
> Splittable within the same block?

> Normally, each mapper would pick one HDFS block (64MB in an HDFS with default
> configuration) of a 1GB file for map processing, if the file were not GZIP
> compressed; that is the scenario for an uncompressed file.
> But as GZIP is not splittable, how (if at all) can a mapper pick a block? (If it
> can't, then we can't use the MapReduce framework for parallelism.)
> Can you elaborate?
>

The base fact is that GZip is not a splittable compression algorithm,
but SequenceFiles can be written with a set 'block size' for their
records, and can also be block-compressed with a chosen algorithm.
A SequenceFile draws its own 'block' boundaries, so GZip is applied to
each of those blocks independently and the file as a whole remains
splittable at the block boundaries.
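
To make the idea concrete, here is a minimal sketch in plain Java
(java.util.zip only, no Hadoop classes) of why per-block compression keeps
a file splittable. The class and method names (BlockGzipSketch, writeBlocks,
readBlock) and the fixed records-per-block count are made up for
illustration. The real SequenceFile format, as far as I understand it, cuts
blocks by buffered byte size and writes sync markers between them so that a
reader can find the next block from an arbitrary offset.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration only; this is not the SequenceFile on-disk format.
class BlockGzipSketch {

    // Compress one batch of records as its own, complete gzip stream.
    static byte[] compressBlock(List<String> records) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(raw)) {
            for (String r : records) {
                gz.write((r + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
        return raw.toByteArray();
    }

    // Group records into blocks of at most recordsPerBlock and compress each
    // block independently (a record count stands in for a byte threshold).
    static List<byte[]> writeBlocks(List<String> records, int recordsPerBlock)
            throws IOException {
        List<byte[]> blocks = new ArrayList<>();
        for (int i = 0; i < records.size(); i += recordsPerBlock) {
            int end = Math.min(i + recordsPerBlock, records.size());
            blocks.add(compressBlock(records.subList(i, end)));
        }
        return blocks;
    }

    // Decompress a single block without reading any other block.
    static List<String> readBlock(byte[] block) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(block))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        List<String> records = new ArrayList<>();
        for (String line : out.toString("UTF-8").split("\n")) {
            records.add(line);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        List<String> records = new ArrayList<>();
        for (int i = 0; i < 10; i++) {
            records.add("record-" + i);
        }
        List<byte[]> blocks = writeBlocks(records, 4);
        // A "mapper" assigned the second block starts decoding right there.
        System.out.println(readBlock(blocks.get(1)));
    }
}
```

Run as written, main prints [record-4, record-5, record-6, record-7]: the
second block decodes without the first one ever being read, which is what
lets separate mappers work on separate parts of one file. In Hadoop itself
the block threshold is, I believe, taken from the
io.seqfile.compress.blocksize configuration property rather than a record
count; please check that against your version.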

-- 
Harsh J
www.harshj.com

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

Posted by Sean Bigdatafun <se...@gmail.com>.

On Mon, Jan 31, 2011 at 12:36 AM, Niels Basjes <Ni...@basjes.nl> wrote:

> Hi,
>
> 2011/1/31 Sean Bigdatafun <se...@gmail.com>:
> > GZIP is not splittable.
>
> Correct, gzip is a stream compression system, which effectively means
> you can only start decompressing at the beginning of the data.
>
> > Does that mean a GZIP block compressed sequencefile can't take advantage
> > of MR parallelism?
>
> AFAIK it should be splittable in the same blocks as the compression was
> done.
>
Splittable within the same block?

Normally, each mapper would pick one HDFS block (64MB in an HDFS with default
configuration) of a 1GB file for map processing, if the file were not GZIP
compressed; that is the scenario for an uncompressed file.

But as GZIP is not splittable, how (if at all) can a mapper pick a block? (If it
can't, then we can't use the MapReduce framework for parallelism.)

Can you elaborate?

>
> > How can I control the size of the block to be compressed in a SequenceFile?
>
> Can't help you with that one.
>
> --
> Met vriendelijke groeten,
>
> Niels Basjes
>

-- 
--Sean

Re: Is a Block compressed (GZIP) SequenceFile splittable in MR operation?

Posted by Niels Basjes <Ni...@basjes.nl>.

Hi,

2011/1/31 Sean Bigdatafun <se...@gmail.com>:
> GZIP is not splittable.

Correct, gzip is a stream compression system, which effectively means
you can only start decompressing at the beginning of the data.

> Does that mean a GZIP block compressed sequencefile can't take advantage of MR parallelism?

AFAIK it should be splittable in the same blocks as the compression was done.

> How can I control the size of the block to be compressed in a SequenceFile?

Can't help you with that one.

-- 
Met vriendelijke groeten,

Niels Basjes