Posted to user@hive.apache.org by Saurabh Nanda <sa...@gmail.com> on 2010/02/01 06:03:17 UTC

SequenceFile compression on Amazon EMR not very good

Hi,

The size of my gzipped weblog files is about 35MB. However, after enabling
block compression and inserting the logs into another Hive table
(sequencefile), the file size bloats to about 233MB. I've done similar
processing on a local Hadoop/Hive cluster, and while the compression there is
not as good as plain gzip, it is nowhere near this bad. What could be going
wrong?

I looked at the header of the resulting file and here's what it says:

SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec

Does Amazon Elastic MapReduce behave differently or am I doing something
wrong?

Saurabh.
-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: SequenceFile compression on Amazon EMR not very good

Posted by Zheng Shao <zs...@gmail.com>.
hive.exec.compress.output controls whether or not Hive compresses its
output. (Inside Hive, it overrides mapred.output.compress.)

All other compression flags come from Hadoop. Please see
http://hadoop.apache.org/common/docs/r0.18.0/hadoop-default.html
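
For example, a minimal combination that touches all three layers would look
roughly like this (a sketch, not a tested recipe; on some Hadoop/Hive
versions the Hadoop-level mapred.output.compression.type is the setting the
sequencefile writer actually reads, so setting it alongside
io.seqfile.compression.type is a cheap way to rule that out):

-- Hive layer: compress query output at all
SET hive.exec.compress.output=true;

-- Hadoop layer: which codec to use for the output files
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- SequenceFile layer: compress blocks of records rather than single records
SET io.seqfile.compression.type=BLOCK;
SET mapred.output.compression.type=BLOCK;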

Zheng

On Fri, Feb 19, 2010 at 5:53 AM, Saurabh Nanda <sa...@gmail.com> wrote:
> And also hive.exec.compress.*. So that makes it three sets of configuration
> variables:
>
> mapred.output.compress.*
> io.seqfile.compress.*
> hive.exec.compress.*
>
> What's the relationship between these configuration parameters and which
> ones should I set to achieve a well compress output table?
>
> Saurabh.
>
> On Fri, Feb 19, 2010 at 7:16 PM, Saurabh Nanda <sa...@gmail.com>
> wrote:
>>
>> I'm confused here Zheng. There are two sets of configuration variables.
>> Those starting with io.* and those starting with mapred.*. For making sure
>> that the final output table is compressed, which ones do I have to set?
>>
>> Saurabh.
>>
>> On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao <zs...@gmail.com> wrote:
>>>
>>> Did you also:
>>>
>>> SET mapred.output.compression.codec=org.apache....GZipCode;
>>>
>>> Zheng
>>>
>>> On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda <sa...@gmail.com>
>>> wrote:
>>> > Hi Zheng,
>>> >
>>> > I cross checked. I am setting the following in my Hive script before
>>> > the
>>> > INSERT command:
>>> >
>>> > SET io.seqfile.compression.type=BLOCK;
>>> > SET hive.exec.compress.output=true;
>>> >
>>> > A 132 MB (gzipped) input file going through a cleanup and getting
>>> > populated
>>> > in a sequencefile table is growing to 432 MB. What could be going
>>> > wrong?
>>> >
>>> > Saurabh.
>>> >
>>> > On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <sa...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Thanks, Zheng. Will do some more tests and get back.
>>> >>
>>> >> Saurabh.
>>> >>
>>> >> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <zs...@gmail.com> wrote:
>>> >>>
>>> >>> I would first check whether it is really the block compression or
>>> >>> record compression.
>>> >>> Also maybe the block size is too small but I am not sure that is
>>> >>> tunable in SequenceFile or not.
>>> >>>
>>> >>> Zheng
>>> >>>
>>> >>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda
>>> >>> <sa...@gmail.com>
>>> >>> wrote:
>>> >>> > Hi,
>>> >>> >
>>> >>> > The size of my Gzipped weblog files is about 35MB. However, upon
>>> >>> > enabling
>>> >>> > block compression, and inserting the logs into another Hive table
>>> >>> > (sequencefile), the file size bloats up to about 233MB. I've done
>>> >>> > similar
>>> >>> > processing on a local Hadoop/Hive cluster, and while the
>>> >>> > compressions
>>> >>> > is not
>>> >>> > as good as gzipping, it still is not this bad. What could be going
>>> >>> > wrong?
>>> >>> >
>>> >>> > I looked at the header of the resulting file and here's what it
>>> >>> > says:
>>> >>> >
>>> >>> >
>>> >>> >
>>> >>> > SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>>> >>> >
>>> >>> > Does Amazon Elastic MapReduce behave differently or am I doing
>>> >>> > something
>>> >>> > wrong?
>>> >>> >
>>> >>> > Saurabh.
>>> >>> > --
>>> >>> > http://nandz.blogspot.com
>>> >>> > http://foodieforlife.blogspot.com
>>> >>> >
>>> >>>
>>> >>>
>>> >>>
>>> >>> --
>>> >>> Yours,
>>> >>> Zheng
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> http://nandz.blogspot.com
>>> >> http://foodieforlife.blogspot.com
>>> >
>>> >
>>> >
>>> > --
>>> > http://nandz.blogspot.com
>>> > http://foodieforlife.blogspot.com
>>> >
>>>
>>>
>>>
>>> --
>>> Yours,
>>> Zheng
>>
>>
>>
>> --
>> http://nandz.blogspot.com
>> http://foodieforlife.blogspot.com
>
>
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: SequenceFile compression on Amazon EMR not very good

Posted by Saurabh Nanda <sa...@gmail.com>.
And also hive.exec.compress.*. So that makes it three sets of configuration
variables:

mapred.output.compress.*
io.seqfile.compress.*
hive.exec.compress.*

What's the relationship between these configuration parameters, and which
ones should I set to get a well-compressed output table?

Saurabh.

On Fri, Feb 19, 2010 at 7:16 PM, Saurabh Nanda <sa...@gmail.com> wrote:

> I'm confused here Zheng. There are two sets of configuration variables.
> Those starting with io.* and those starting with mapred.*. For making sure
> that the final output table is compressed, which ones do I have to set?
>
> Saurabh.
>
>
> On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao <zs...@gmail.com> wrote:
>
>> Did you also:
>>
>> SET mapred.output.compression.codec=org.apache....GZipCode;
>>
>> Zheng
>>
>> On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda <sa...@gmail.com>
>> wrote:
>> > Hi Zheng,
>> >
>> > I cross checked. I am setting the following in my Hive script before the
>> > INSERT command:
>> >
>> > SET io.seqfile.compression.type=BLOCK;
>> > SET hive.exec.compress.output=true;
>> >
>> > A 132 MB (gzipped) input file going through a cleanup and getting
>> populated
>> > in a sequencefile table is growing to 432 MB. What could be going wrong?
>> >
>> > Saurabh.
>> >
>> > On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <sa...@gmail.com>
>> > wrote:
>> >>
>> >> Thanks, Zheng. Will do some more tests and get back.
>> >>
>> >> Saurabh.
>> >>
>> >> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <zs...@gmail.com> wrote:
>> >>>
>> >>> I would first check whether it is really the block compression or
>> >>> record compression.
>> >>> Also maybe the block size is too small but I am not sure that is
>> >>> tunable in SequenceFile or not.
>> >>>
>> >>> Zheng
>> >>>
>> >>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <
>> saurabhnanda@gmail.com>
>> >>> wrote:
>> >>> > Hi,
>> >>> >
>> >>> > The size of my Gzipped weblog files is about 35MB. However, upon
>> >>> > enabling
>> >>> > block compression, and inserting the logs into another Hive table
>> >>> > (sequencefile), the file size bloats up to about 233MB. I've done
>> >>> > similar
>> >>> > processing on a local Hadoop/Hive cluster, and while the
>> compressions
>> >>> > is not
>> >>> > as good as gzipping, it still is not this bad. What could be going
>> >>> > wrong?
>> >>> >
>> >>> > I looked at the header of the resulting file and here's what it
>> says:
>> >>> >
>> >>> >
>> >>> >
>> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>> >>> >
>> >>> > Does Amazon Elastic MapReduce behave differently or am I doing
>> >>> > something
>> >>> > wrong?
>> >>> >
>> >>> > Saurabh.
>> >>> > --
>> >>> > http://nandz.blogspot.com
>> >>> > http://foodieforlife.blogspot.com
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Yours,
>> >>> Zheng
>> >>
>> >>
>> >>
>> >> --
>> >> http://nandz.blogspot.com
>> >> http://foodieforlife.blogspot.com
>> >
>> >
>> >
>> > --
>> > http://nandz.blogspot.com
>> > http://foodieforlife.blogspot.com
>> >
>>
>>
>>
>> --
>> Yours,
>> Zheng
>>
>
>
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: SequenceFile compression on Amazon EMR not very good

Posted by Saurabh Nanda <sa...@gmail.com>.
I'm confused here, Zheng. There are two sets of configuration variables:
those starting with io.* and those starting with mapred.*. To make sure that
the final output table is compressed, which ones do I have to set?

Saurabh.

On Fri, Feb 19, 2010 at 12:37 AM, Zheng Shao <zs...@gmail.com> wrote:

> Did you also:
>
> SET mapred.output.compression.codec=org.apache....GZipCode;
>
> Zheng
>
> On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda <sa...@gmail.com>
> wrote:
> > Hi Zheng,
> >
> > I cross checked. I am setting the following in my Hive script before the
> > INSERT command:
> >
> > SET io.seqfile.compression.type=BLOCK;
> > SET hive.exec.compress.output=true;
> >
> > A 132 MB (gzipped) input file going through a cleanup and getting
> populated
> > in a sequencefile table is growing to 432 MB. What could be going wrong?
> >
> > Saurabh.
> >
> > On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <sa...@gmail.com>
> > wrote:
> >>
> >> Thanks, Zheng. Will do some more tests and get back.
> >>
> >> Saurabh.
> >>
> >> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <zs...@gmail.com> wrote:
> >>>
> >>> I would first check whether it is really the block compression or
> >>> record compression.
> >>> Also maybe the block size is too small but I am not sure that is
> >>> tunable in SequenceFile or not.
> >>>
> >>> Zheng
> >>>
> >>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <saurabhnanda@gmail.com
> >
> >>> wrote:
> >>> > Hi,
> >>> >
> >>> > The size of my Gzipped weblog files is about 35MB. However, upon
> >>> > enabling
> >>> > block compression, and inserting the logs into another Hive table
> >>> > (sequencefile), the file size bloats up to about 233MB. I've done
> >>> > similar
> >>> > processing on a local Hadoop/Hive cluster, and while the compressions
> >>> > is not
> >>> > as good as gzipping, it still is not this bad. What could be going
> >>> > wrong?
> >>> >
> >>> > I looked at the header of the resulting file and here's what it says:
> >>> >
> >>> >
> >>> >
> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
> >>> >
> >>> > Does Amazon Elastic MapReduce behave differently or am I doing
> >>> > something
> >>> > wrong?
> >>> >
> >>> > Saurabh.
> >>> > --
> >>> > http://nandz.blogspot.com
> >>> > http://foodieforlife.blogspot.com
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Yours,
> >>> Zheng
> >>
> >>
> >>
> >> --
> >> http://nandz.blogspot.com
> >> http://foodieforlife.blogspot.com
> >
> >
> >
> > --
> > http://nandz.blogspot.com
> > http://foodieforlife.blogspot.com
> >
>
>
>
> --
> Yours,
> Zheng
>



-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: SequenceFile compression on Amazon EMR not very good

Posted by Zheng Shao <zs...@gmail.com>.
Did you also:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
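
In the Hive CLI, SET with just a property name echoes its current value, so a
quick way to confirm what the job will actually see is:

SET hive.exec.compress.output;
SET mapred.output.compression.codec;
SET io.seqfile.compression.type;

(Only a sanity check; a property that has never been set will show the
cluster default, or report that it is undefined.)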

Zheng

On Thu, Feb 18, 2010 at 8:25 AM, Saurabh Nanda <sa...@gmail.com> wrote:
> Hi Zheng,
>
> I cross checked. I am setting the following in my Hive script before the
> INSERT command:
>
> SET io.seqfile.compression.type=BLOCK;
> SET hive.exec.compress.output=true;
>
> A 132 MB (gzipped) input file going through a cleanup and getting populated
> in a sequencefile table is growing to 432 MB. What could be going wrong?
>
> Saurabh.
>
> On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <sa...@gmail.com>
> wrote:
>>
>> Thanks, Zheng. Will do some more tests and get back.
>>
>> Saurabh.
>>
>> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <zs...@gmail.com> wrote:
>>>
>>> I would first check whether it is really the block compression or
>>> record compression.
>>> Also maybe the block size is too small but I am not sure that is
>>> tunable in SequenceFile or not.
>>>
>>> Zheng
>>>
>>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <sa...@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > The size of my Gzipped weblog files is about 35MB. However, upon
>>> > enabling
>>> > block compression, and inserting the logs into another Hive table
>>> > (sequencefile), the file size bloats up to about 233MB. I've done
>>> > similar
>>> > processing on a local Hadoop/Hive cluster, and while the compressions
>>> > is not
>>> > as good as gzipping, it still is not this bad. What could be going
>>> > wrong?
>>> >
>>> > I looked at the header of the resulting file and here's what it says:
>>> >
>>> >
>>> > SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>>> >
>>> > Does Amazon Elastic MapReduce behave differently or am I doing
>>> > something
>>> > wrong?
>>> >
>>> > Saurabh.
>>> > --
>>> > http://nandz.blogspot.com
>>> > http://foodieforlife.blogspot.com
>>> >
>>>
>>>
>>>
>>> --
>>> Yours,
>>> Zheng
>>
>>
>>
>> --
>> http://nandz.blogspot.com
>> http://foodieforlife.blogspot.com
>
>
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng

Re: SequenceFile compression on Amazon EMR not very good

Posted by Saurabh Nanda <sa...@gmail.com>.
Hi Zheng,

I cross-checked. I am setting the following in my Hive script before the
INSERT command:

SET io.seqfile.compression.type=BLOCK;
SET hive.exec.compress.output=true;
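
In the script these sit directly above the INSERT itself, which is roughly of
this shape (the table names here are placeholders, not the real ones):

INSERT OVERWRITE TABLE weblogs_seq
SELECT * FROM weblogs_raw;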

A 132 MB (gzipped) input file, after going through a cleanup step and being
loaded into a sequencefile table, grows to 432 MB. What could be going wrong?

Saurabh.

On Wed, Feb 3, 2010 at 2:26 PM, Saurabh Nanda <sa...@gmail.com> wrote:

> Thanks, Zheng. Will do some more tests and get back.
>
> Saurabh.
>
>
> On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <zs...@gmail.com> wrote:
>
>> I would first check whether it is really the block compression or
>> record compression.
>> Also maybe the block size is too small but I am not sure that is
>> tunable in SequenceFile or not.
>>
>> Zheng
>>
>> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <sa...@gmail.com>
>> wrote:
>> > Hi,
>> >
>> > The size of my Gzipped weblog files is about 35MB. However, upon
>> enabling
>> > block compression, and inserting the logs into another Hive table
>> > (sequencefile), the file size bloats up to about 233MB. I've done
>> similar
>> > processing on a local Hadoop/Hive cluster, and while the compressions is
>> not
>> > as good as gzipping, it still is not this bad. What could be going
>> wrong?
>> >
>> > I looked at the header of the resulting file and here's what it says:
>> >
>> >
>> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>> >
>> > Does Amazon Elastic MapReduce behave differently or am I doing something
>> > wrong?
>> >
>> > Saurabh.
>> > --
>> > http://nandz.blogspot.com
>> > http://foodieforlife.blogspot.com
>> >
>>
>>
>>
>> --
>> Yours,
>> Zheng
>>
>
>
>
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: SequenceFile compression on Amazon EMR not very good

Posted by Saurabh Nanda <sa...@gmail.com>.
Thanks, Zheng. Will do some more tests and get back.

Saurabh.

On Mon, Feb 1, 2010 at 1:22 PM, Zheng Shao <zs...@gmail.com> wrote:

> I would first check whether it is really the block compression or
> record compression.
> Also maybe the block size is too small but I am not sure that is
> tunable in SequenceFile or not.
>
> Zheng
>
> On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <sa...@gmail.com>
> wrote:
> > Hi,
> >
> > The size of my Gzipped weblog files is about 35MB. However, upon enabling
> > block compression, and inserting the logs into another Hive table
> > (sequencefile), the file size bloats up to about 233MB. I've done similar
> > processing on a local Hadoop/Hive cluster, and while the compressions is
> not
> > as good as gzipping, it still is not this bad. What could be going wrong?
> >
> > I looked at the header of the resulting file and here's what it says:
> >
> >
> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
> >
> > Does Amazon Elastic MapReduce behave differently or am I doing something
> > wrong?
> >
> > Saurabh.
> > --
> > http://nandz.blogspot.com
> > http://foodieforlife.blogspot.com
> >
>
>
>
> --
> Yours,
> Zheng
>



-- 
http://nandz.blogspot.com
http://foodieforlife.blogspot.com

Re: SequenceFile compression on Amazon EMR not very good

Posted by Zheng Shao <zs...@gmail.com>.
I would first check whether it is really using block compression or
record compression.
It is also possible that the block size is too small, but I am not sure
whether that is tunable in SequenceFile.
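
Assuming the standard SequenceFile header layout (SEQ, a version byte, the
key and value class names, one byte each for the "compressed" and "block
compressed" flags, then the codec class name), the header pasted in the
original message already hints at the answer:

SEQ ^F                                        magic + version 6
"   org.apache.hadoop.io.BytesWritable        key class (length 0x22 = 34)
^Y  org.apache.hadoop.io.Text                 value class (length 0x19 = 25)
^A                                            compressed = true
^@                                            block compressed = false
'   org.apache.hadoop.io.compress.GzipCodec   codec class (length 0x27 = 39)

If that reading is right, the files are being written with per-record gzip
compression rather than block compression. Gzipping each small log record on
its own (a fixed gzip header/trailer per record, and almost no window in
which to find repeated strings) would explain output several times larger
than one gzip stream over the whole file. And if it does turn out to be block
compression with blocks that are too small, io.seqfile.compress.blocksize is,
as far as I know, the setting that controls how many bytes are buffered
before each block is compressed.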

Zheng

On Sun, Jan 31, 2010 at 9:03 PM, Saurabh Nanda <sa...@gmail.com> wrote:
> Hi,
>
> The size of my Gzipped weblog files is about 35MB. However, upon enabling
> block compression, and inserting the logs into another Hive table
> (sequencefile), the file size bloats up to about 233MB. I've done similar
> processing on a local Hadoop/Hive cluster, and while the compressions is not
> as good as gzipping, it still is not this bad. What could be going wrong?
>
> I looked at the header of the resulting file and here's what it says:
>
> SEQ^F"org.apache.hadoop.io.BytesWritable^Yorg.apache.hadoop.io.Text^A^@'org.apache.hadoop.io.compress.GzipCodec
>
> Does Amazon Elastic MapReduce behave differently or am I doing something
> wrong?
>
> Saurabh.
> --
> http://nandz.blogspot.com
> http://foodieforlife.blogspot.com
>



-- 
Yours,
Zheng