Posted to user@hive.apache.org by Brent Miller <br...@gmail.com> on 2010/02/16 22:43:52 UTC

Help with Compressed Storage

Hello, I've seen issues similar to this one come up once or twice before,
but I've never seen a solution to the problem that I'm having. I was
following the Compressed Storage page on the Hive Wiki
http://wiki.apache.org/hadoop/CompressedStorage and realized that the
sequence files created in the warehouse directory are actually
uncompressed and larger than the originals.

For example, I have a table 'test1' whose input data looks something like:

0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
...

And after creating a second table 'test1_comp' that was created with the
STORED AS SEQUENCEFILE directive and the compression options SET as
described in the wiki, I can look at the resultant sequence files and see
that they're just plain (uncompressed) text:

SEQ"org.apache.hadoop.io.BytesWritableorg.apache.hadoop.io.Text+�c�!Y�M
��Z^��=80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
...

I've tried messing around with different org.apache.hadoop.io.compress.*
options, but the sequence files always come out uncompressed. Has anybody
ever seen this or know of a way to keep the data compressed? Since the input
text is so uniform, we get huge space savings from compression and would
like to store the data this way if possible. I'm using Hadoop 0.20.1 and Hive
that I checked out from SVN about a week ago.

Thanks,
Brent

Re: Help with Compressed Storage

Posted by Zheng Shao <zs...@gmail.com>.
Try this one to see if it works:

hive -hiveconf io.compression.codecs=xxx,yyy,zzz
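
For example, assuming the stock Hadoop codec classes (the exact list depends
on what your install ships with), the full command might look something like:

hive -hiveconf io.compression.codecs=org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec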

Zheng

On Wed, Feb 17, 2010 at 11:33 PM, prasenjit mukherjee
<pr...@gmail.com> wrote:
> Thanks for the pointer  that was indeed the problem.  The specific AMI I was
> using didnt include bzip2 codecs in their hadoop-site.xml.  Is there a way I
> can pass those parameters from hive, so that I dont need to manually change
> the file  ?
>
> -Thanks,
> Prasen
>
> On Thu, Feb 18, 2010 at 12:54 PM, Zheng Shao <zs...@gmail.com> wrote:
>>
>> Just remember that we need to have the BZipCodec class in the
>> following hadoop configuration:
>> Can you check?
>>
>> io.compression.codecs
>>
>> Zheng
>>
>>
>
>



-- 
Yours,
Zheng

Re: Help with Compressed Storage

Posted by prasenjit mukherjee <pr...@gmail.com>.
Thanks for the pointer - that was indeed the problem. The specific AMI I was
using didn't include the bzip2 codec in its hadoop-site.xml. Is there a way I
can pass those parameters from hive, so that I don't need to manually change
the file?

-Thanks,
Prasen

On Thu, Feb 18, 2010 at 12:54 PM, Zheng Shao <zs...@gmail.com> wrote:

> Just remember that we need to have the BZipCodec class in the
> following hadoop configuration:
> Can you check?
>
> io.compression.codecs
>
> Zheng
>
>
>

Re: Help with Compressed Storage

Posted by Zheng Shao <zs...@gmail.com>.
Just remember that we need to have the BZip2Codec class listed in the
following hadoop configuration property - can you check?

io.compression.codecs
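
A minimal sketch of what that property might look like in hadoop-site.xml -
the codec list here is just an example, keep whatever codecs are already there
and append the bzip2 one:

  <property>
        <name>io.compression.codecs</name>
        <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec</value>
        <description>Compression codecs available to Hadoop/Hive</description>
  </property>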

Zheng

On Wed, Feb 17, 2010 at 11:21 PM, prasenjit mukherjee
<pr...@gmail.com> wrote:
> So this is the command I ran, first with  with small.gz (which worked fine)
> and  then with small.bz2 ( which didnt work )  :
>
> drop table small_table;
> CREATE  TABLE small_table(id1 string, id2 string, id3 string) ROW FORMAT
> DELIMITED FIELDS TERMINATED BY ',';
> LOAD DATA LOCAL INPATH '/root/data/small.gz' OVERWRITE INTO TABLE
> small_table;
> select * from small_table limit 1;
>
> For gz files I do see the following lines in hive_debug :
> 10/02/18 01:59:23 DEBUG ipc.RPC: Call: getBlockLocations 1
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Trying to load the
> custom-built native-hadoop library...
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Failed to load native-hadoop
> with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
> 10/02/18 01:59:23 DEBUG util.NativeCodeLoader:
> java.library.path=/usr/java/jdk1.6.0_14/jre/lib/amd64/server:/usr/java/jdk1.6.0_14/jre/lib/amd64:/usr/java/jdk1.6.0_14/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
> 10/02/18 01:59:23 WARN util.NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 10/02/18 01:59:23 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0
> offsetInBlock 0 lastPacketInBlock true packetLen 88
> aid1     bid2     cid3
>
> But for bzip files there is none :
> 10/02/18 01:57:18 DEBUG ipc.RPC: Call: getBlockLocations 2
> 10/02/18 01:57:18 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0
> offsetInBlock 0 lastPacketInBlock true packetLen 85
> 10/02/18 01:57:18 WARN lazy.LazyStruct: Missing fields! Expected 3 fields
> but only got 1! Ignoring similar problems.
> BZh91AY&SYǧ    �"Y @ ><  TP?* �"��SFL� c����ѶѶ�$� �
>                                                      �w��U�)„�=8O�
> NULL    NULL
>
>
> Let me know if you still need the debug files. Attached are the small.gz and
> small.bzip2 files.
>
> Thanks and appreciate,
> -Prasen
>
> On Thu, Feb 18, 2010 at 11:52 AM, Zheng Shao <zs...@gmail.com> wrote:
>>
>> There is no special setting for bz2.
>>
>> Can you get the debug log?
>>
>> Zheng
>>
>> On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee
>> <pm...@quattrowireless.com> wrote:
>> > So I tried the same with  .gz files and it worked. I am using the
>> > following
>> > hadoop version :Hadoop 0.20.1+169.56 with cloudera's ami-2359bf4a. I
>> > thought
>> > that hadoop0.20 does support bz2 compression, hence same should work
>> > with
>> > hive as well.
>> >
>> > Interesting note is that Pig works fine on the same bz2 data.  Is there
>> > any
>> > tweaking/config setup I need to do for hive to take bz2 files as input ?
>> >
>> > On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee
>> > <pm...@quattrowireless.com> wrote:
>> >>
>> >> I have a similar issue with bz2 files. I have the hadoop directories :
>> >>
>> >> /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt )
>> >> /ip/datacompressed/ : containing same files bzipped (  foo1.bz2,
>> >> foo2.bz2
>> >> )
>> >>
>> >> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
>> >>    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
>> >>    LOCATION '/ip/datacompressed/';
>> >> SELECT *  FROM tx_log limit 1;
>> >>
>> >> The command works fine with LOCATION '/ip/data/' but doesnt work with
>> >> LOCATION '/ip/datacompressed/'
>> >>
>> >> Any pointers ? I thought ( like Pig  ) hive automatically detects .bz2
>> >> extensions and applies appropriate decompression. Am I wrong ?
>> >>
>> >> -Prasen
>> >>
>> >>
>> >> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <zs...@gmail.com> wrote:
>> >>>
>> >>> I just corrected the wiki page. It will also be a good idea to support
>> >>> case-insensitive boolean values in the code.
>> >>>
>> >>> Zheng
>> >>>
>> >>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller
>> >>> <br...@gmail.com>
>> >>> wrote:
>> >>> > Thanks Adam, that works for me as well.
>> >>> > It seems that the property for hive.exec.compress.output is case
>> >>> > sensitive,
>> >>> > and when it is set to TRUE (as it is on the compressed storage page
>> >>> > on
>> >>> > the
>> >>> > wiki) it is ignored by hive.
>> >>> >
>> >>> > -Brent
>> >>> >
>> >>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <ad...@immunet.com>
>> >>> > wrote:
>> >>> >>
>> >>> >> Adding these to my hive-site.xml file worked fine:
>> >>> >>
>> >>> >>  <property>
>> >>> >>        <name>hive.exec.compress.output</name>
>> >>> >>        <value>true</value>
>> >>> >>        <description>Compress output</description>
>> >>> >>  </property>
>> >>> >>
>> >>> >>  <property>
>> >>> >>        <name>mapred.output.compression.type</name>
>> >>> >>        <value>BLOCK</value>
>> >>> >>        <description>Block compression</description>
>> >>> >>  </property>
>> >>> >>
>> >>> >>
>> >>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller
>> >>> >> <br...@gmail.com>
>> >>> >> wrote:
>> >>> >> > Hello, I've seen issues similar to this one come up once or twice
>> >>> >> > before,
>> >>> >> > but I haven't ever seen a solution to the problem that I'm
>> >>> >> > having. I
>> >>> >> > was
>> >>> >> > following the Compressed Storage page on the Hive
>> >>> >> > Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized
>> >>> >> > that
>> >>> >> > the
>> >>> >> > sequence files that are created in the warehouse directory are
>> >>> >> > actually
>> >>> >> > uncompressed and larger than than the originals.
>> >>> >> > For example, I have a table 'test1' who's input data looks
>> >>> >> > something
>> >>> >> > like:
>> >>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> >>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> >>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> >>> >> > ...
>> >>> >> > And after creating a second table 'test1_comp' that was crated
>> >>> >> > with
>> >>> >> > the
>> >>> >> > STORED AS SEQUENCEFILE directive and the compression options SET
>> >>> >> > as
>> >>> >> > described in the wiki, I can look at the resultant sequence files
>> >>> >> > and
>> >>> >> > see
>> >>> >> > that they're just plain (uncompressed) text:
>> >>> >> > SEQ "org.apache.hadoop.io.BytesWritable
>> >>> >> > org.apache.hadoop.io.Text+�c�!Y�M ��
>> >>> >> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>> >>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>> >>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>> >>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>> >>> >> > ...
>> >>> >> > I've tried messing around with
>> >>> >> > different org.apache.hadoop.io.compress.*
>> >>> >> > options, but the sequence files always come out uncompressed. Has
>> >>> >> > anybody
>> >>> >> > ever seen this or know away to keep the data compressed? Since
>> >>> >> > the
>> >>> >> > input
>> >>> >> > text is so uniform, we get huge space savings from compression
>> >>> >> > and
>> >>> >> > would
>> >>> >> > like to store the data this way if possible. I'm using Hadoop
>> >>> >> > 20.1
>> >>> >> > and
>> >>> >> > Hive
>> >>> >> > that I checked out from SVN about a week ago.
>> >>> >> > Thanks,
>> >>> >> > Brent
>> >>> >>
>> >>> >>
>> >>> >>
>> >>> >> --
>> >>> >> Adam J. O'Donnell, Ph.D.
>> >>> >> Immunet Corporation
>> >>> >> Cell: +1 (267) 251-0070
>> >>> >
>> >>> >
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Yours,
>> >>> Zheng
>> >>
>> >
>> >
>>
>>
>>
>> --
>> Yours,
>> Zheng
>
>



-- 
Yours,
Zheng

Re: Help with Compressed Storage

Posted by prasenjit mukherjee <pr...@gmail.com>.
So this is the command I ran, first with small.gz (which worked fine)
and then with small.bz2 (which didn't work):

drop table small_table;
CREATE TABLE small_table(id1 string, id2 string, id3 string)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH '/root/data/small.gz' OVERWRITE INTO TABLE small_table;
select * from small_table limit 1;
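
For the bz2 run the only difference, presumably, was the input file in the
LOAD statement, i.e. something like:

LOAD DATA LOCAL INPATH '/root/data/small.bz2' OVERWRITE INTO TABLE small_table;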

For gz files I do see the following lines in hive_debug:
10/02/18 01:59:23 DEBUG ipc.RPC: Call: getBlockLocations 1
10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Trying to load the
custom-built native-hadoop library...
10/02/18 01:59:23 DEBUG util.NativeCodeLoader: Failed to load native-hadoop
with error: java.lang.UnsatisfiedLinkError: no hadoop in java.library.path
10/02/18 01:59:23 DEBUG util.NativeCodeLoader:
java.library.path=/usr/java/jdk1.6.0_14/jre/lib/amd64/server:/usr/java/jdk1.6.0_14/jre/lib/amd64:/usr/java/jdk1.6.0_14/jre/../lib/amd64:/usr/java/packages/lib/amd64:/lib:/usr/lib
10/02/18 01:59:23 WARN util.NativeCodeLoader: Unable to load native-hadoop
library for your platform... using builtin-java classes where applicable
10/02/18 01:59:23 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0
offsetInBlock 0 lastPacketInBlock true packetLen 88
aid1     bid2     cid3

But for the bz2 file there are none:
10/02/18 01:57:18 DEBUG ipc.RPC: Call: getBlockLocations 2
10/02/18 01:57:18 DEBUG fs.FSInputChecker: DFSClient readChunk got seqno 0
offsetInBlock 0 lastPacketInBlock true packetLen 85
10/02/18 01:57:18 WARN lazy.LazyStruct: Missing fields! Expected 3 fields
but only got 1! Ignoring similar problems.
BZh91AY&SYǧ    �"Y@><  TP?*�"��SFL�c����ѶѶ�$��
                                                     �w��U�)„�=8O�
NULL    NULL


Let me know if you still need the debug files. Attached are the small.gz and
small.bzip2 files.

Thanks, much appreciated,
-Prasen

On Thu, Feb 18, 2010 at 11:52 AM, Zheng Shao <zs...@gmail.com> wrote:

> There is no special setting for bz2.
>
> Can you get the debug log?
>
> Zheng
>
> On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee
> <pm...@quattrowireless.com> wrote:
> > So I tried the same with  .gz files and it worked. I am using the
> following
> > hadoop version :Hadoop 0.20.1+169.56 with cloudera's ami-2359bf4a. I
> thought
> > that hadoop0.20 does support bz2 compression, hence same should work with
> > hive as well.
> >
> > Interesting note is that Pig works fine on the same bz2 data.  Is there
> any
> > tweaking/config setup I need to do for hive to take bz2 files as input ?
> >
> > On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee
> > <pm...@quattrowireless.com> wrote:
> >>
> >> I have a similar issue with bz2 files. I have the hadoop directories :
> >>
> >> /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt )
> >> /ip/datacompressed/ : containing same files bzipped (  foo1.bz2,
> foo2.bz2
> >> )
> >>
> >> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
> >>    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
> >>    LOCATION '/ip/datacompressed/';
> >> SELECT *  FROM tx_log limit 1;
> >>
> >> The command works fine with LOCATION '/ip/data/' but doesnt work with
> >> LOCATION '/ip/datacompressed/'
> >>
> >> Any pointers ? I thought ( like Pig  ) hive automatically detects .bz2
> >> extensions and applies appropriate decompression. Am I wrong ?
> >>
> >> -Prasen
> >>
> >>
> >> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <zs...@gmail.com> wrote:
> >>>
> >>> I just corrected the wiki page. It will also be a good idea to support
> >>> case-insensitive boolean values in the code.
> >>>
> >>> Zheng
> >>>
> >>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <
> brentalanmiller@gmail.com>
> >>> wrote:
> >>> > Thanks Adam, that works for me as well.
> >>> > It seems that the property for hive.exec.compress.output is case
> >>> > sensitive,
> >>> > and when it is set to TRUE (as it is on the compressed storage page
> on
> >>> > the
> >>> > wiki) it is ignored by hive.
> >>> >
> >>> > -Brent
> >>> >
> >>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <ad...@immunet.com>
> >>> > wrote:
> >>> >>
> >>> >> Adding these to my hive-site.xml file worked fine:
> >>> >>
> >>> >>  <property>
> >>> >>        <name>hive.exec.compress.output</name>
> >>> >>        <value>true</value>
> >>> >>        <description>Compress output</description>
> >>> >>  </property>
> >>> >>
> >>> >>  <property>
> >>> >>        <name>mapred.output.compression.type</name>
> >>> >>        <value>BLOCK</value>
> >>> >>        <description>Block compression</description>
> >>> >>  </property>
> >>> >>
> >>> >>
> >>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller
> >>> >> <br...@gmail.com>
> >>> >> wrote:
> >>> >> > Hello, I've seen issues similar to this one come up once or twice
> >>> >> > before,
> >>> >> > but I haven't ever seen a solution to the problem that I'm having.
> I
> >>> >> > was
> >>> >> > following the Compressed Storage page on the Hive
> >>> >> > Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized
> >>> >> > that
> >>> >> > the
> >>> >> > sequence files that are created in the warehouse directory are
> >>> >> > actually
> >>> >> > uncompressed and larger than than the originals.
> >>> >> > For example, I have a table 'test1' who's input data looks
> something
> >>> >> > like:
> >>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> >>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> >>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> >>> >> > ...
> >>> >> > And after creating a second table 'test1_comp' that was crated
> with
> >>> >> > the
> >>> >> > STORED AS SEQUENCEFILE directive and the compression options SET
> as
> >>> >> > described in the wiki, I can look at the resultant sequence files
> >>> >> > and
> >>> >> > see
> >>> >> > that they're just plain (uncompressed) text:
> >>> >> > SEQ "org.apache.hadoop.io.BytesWritable
> >>> >> > org.apache.hadoop.io.Text+�c�!Y�M ��
> >>> >> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> >>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> >>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> >>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> >>> >> > ...
> >>> >> > I've tried messing around with
> >>> >> > different org.apache.hadoop.io.compress.*
> >>> >> > options, but the sequence files always come out uncompressed. Has
> >>> >> > anybody
> >>> >> > ever seen this or know away to keep the data compressed? Since the
> >>> >> > input
> >>> >> > text is so uniform, we get huge space savings from compression and
> >>> >> > would
> >>> >> > like to store the data this way if possible. I'm using Hadoop 20.1
> >>> >> > and
> >>> >> > Hive
> >>> >> > that I checked out from SVN about a week ago.
> >>> >> > Thanks,
> >>> >> > Brent
> >>> >>
> >>> >>
> >>> >>
> >>> >> --
> >>> >> Adam J. O'Donnell, Ph.D.
> >>> >> Immunet Corporation
> >>> >> Cell: +1 (267) 251-0070
> >>> >
> >>> >
> >>>
> >>>
> >>>
> >>> --
> >>> Yours,
> >>> Zheng
> >>
> >
> >
>
>
>
> --
> Yours,
> Zheng
>

Re: Help with Compressed Storage

Posted by Zheng Shao <zs...@gmail.com>.
There is no special setting for bz2.

Can you get the debug log?

Zheng

On Wed, Feb 17, 2010 at 9:02 PM, prasenjit mukherjee
<pm...@quattrowireless.com> wrote:
> So I tried the same with  .gz files and it worked. I am using the following
> hadoop version :Hadoop 0.20.1+169.56 with cloudera's ami-2359bf4a. I thought
> that hadoop0.20 does support bz2 compression, hence same should work with
> hive as well.
>
> Interesting note is that Pig works fine on the same bz2 data.  Is there any
> tweaking/config setup I need to do for hive to take bz2 files as input ?
>
> On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee
> <pm...@quattrowireless.com> wrote:
>>
>> I have a similar issue with bz2 files. I have the hadoop directories :
>>
>> /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt )
>> /ip/datacompressed/ : containing same files bzipped (  foo1.bz2, foo2.bz2
>> )
>>
>> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
>>    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
>>    LOCATION '/ip/datacompressed/';
>> SELECT *  FROM tx_log limit 1;
>>
>> The command works fine with LOCATION '/ip/data/' but doesnt work with
>> LOCATION '/ip/datacompressed/'
>>
>> Any pointers ? I thought ( like Pig  ) hive automatically detects .bz2
>> extensions and applies appropriate decompression. Am I wrong ?
>>
>> -Prasen
>>
>>
>> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <zs...@gmail.com> wrote:
>>>
>>> I just corrected the wiki page. It will also be a good idea to support
>>> case-insensitive boolean values in the code.
>>>
>>> Zheng
>>>
>>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <br...@gmail.com>
>>> wrote:
>>> > Thanks Adam, that works for me as well.
>>> > It seems that the property for hive.exec.compress.output is case
>>> > sensitive,
>>> > and when it is set to TRUE (as it is on the compressed storage page on
>>> > the
>>> > wiki) it is ignored by hive.
>>> >
>>> > -Brent
>>> >
>>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <ad...@immunet.com>
>>> > wrote:
>>> >>
>>> >> Adding these to my hive-site.xml file worked fine:
>>> >>
>>> >>  <property>
>>> >>        <name>hive.exec.compress.output</name>
>>> >>        <value>true</value>
>>> >>        <description>Compress output</description>
>>> >>  </property>
>>> >>
>>> >>  <property>
>>> >>        <name>mapred.output.compression.type</name>
>>> >>        <value>BLOCK</value>
>>> >>        <description>Block compression</description>
>>> >>  </property>
>>> >>
>>> >>
>>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller
>>> >> <br...@gmail.com>
>>> >> wrote:
>>> >> > Hello, I've seen issues similar to this one come up once or twice
>>> >> > before,
>>> >> > but I haven't ever seen a solution to the problem that I'm having. I
>>> >> > was
>>> >> > following the Compressed Storage page on the Hive
>>> >> > Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized
>>> >> > that
>>> >> > the
>>> >> > sequence files that are created in the warehouse directory are
>>> >> > actually
>>> >> > uncompressed and larger than than the originals.
>>> >> > For example, I have a table 'test1' who's input data looks something
>>> >> > like:
>>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>>> >> > ...
>>> >> > And after creating a second table 'test1_comp' that was crated with
>>> >> > the
>>> >> > STORED AS SEQUENCEFILE directive and the compression options SET as
>>> >> > described in the wiki, I can look at the resultant sequence files
>>> >> > and
>>> >> > see
>>> >> > that they're just plain (uncompressed) text:
>>> >> > SEQ "org.apache.hadoop.io.BytesWritable
>>> >> > org.apache.hadoop.io.Text+�c�!Y�M ��
>>> >> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>>> >> > ...
>>> >> > I've tried messing around with
>>> >> > different org.apache.hadoop.io.compress.*
>>> >> > options, but the sequence files always come out uncompressed. Has
>>> >> > anybody
>>> >> > ever seen this or know away to keep the data compressed? Since the
>>> >> > input
>>> >> > text is so uniform, we get huge space savings from compression and
>>> >> > would
>>> >> > like to store the data this way if possible. I'm using Hadoop 20.1
>>> >> > and
>>> >> > Hive
>>> >> > that I checked out from SVN about a week ago.
>>> >> > Thanks,
>>> >> > Brent
>>> >>
>>> >>
>>> >>
>>> >> --
>>> >> Adam J. O'Donnell, Ph.D.
>>> >> Immunet Corporation
>>> >> Cell: +1 (267) 251-0070
>>> >
>>> >
>>>
>>>
>>>
>>> --
>>> Yours,
>>> Zheng
>>
>
>



-- 
Yours,
Zheng

Re: Help with Compressed Storage

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
So I tried the same with .gz files and it worked. I am using the following
hadoop version: Hadoop 0.20.1+169.56 with Cloudera's ami-2359bf4a. I thought
that hadoop 0.20 does support bz2 compression, hence the same should work with
hive as well.

An interesting note is that Pig works fine on the same bz2 data. Is there any
tweaking/config setup I need to do for hive to take bz2 files as input?

On Thu, Feb 18, 2010 at 8:31 AM, prasenjit mukherjee <
pmukherjee@quattrowireless.com> wrote:

> I have a similar issue with bz2 files. I have the hadoop directories :
>
> /ip/data/ : containing unzipped text files ( foo1.txt, foo2.txt )
> /ip/datacompressed/ : containing same files bzipped (  foo1.bz2, foo2.bz2 )
>
>
> CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
>    ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
>    LOCATION '/ip/datacompressed/';
> SELECT *  FROM tx_log limit 1;
>
> The command works fine with LOCATION '/ip/data/' but doesnt work with
> LOCATION '/ip/datacompressed/'
>
> Any pointers ? I thought ( like Pig  ) hive automatically detects .bz2
> extensions and applies appropriate decompression. Am I wrong ?
>
> -Prasen
>
>
>
> On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <zs...@gmail.com> wrote:
>
>> I just corrected the wiki page. It will also be a good idea to support
>> case-insensitive boolean values in the code.
>>
>> Zheng
>>
>> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <br...@gmail.com>
>> wrote:
>> > Thanks Adam, that works for me as well.
>> > It seems that the property for hive.exec.compress.output is case
>> sensitive,
>> > and when it is set to TRUE (as it is on the compressed storage page on
>> the
>> > wiki) it is ignored by hive.
>> >
>> > -Brent
>> >
>> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <ad...@immunet.com>
>> wrote:
>> >>
>> >> Adding these to my hive-site.xml file worked fine:
>> >>
>> >>  <property>
>> >>        <name>hive.exec.compress.output</name>
>> >>        <value>true</value>
>> >>        <description>Compress output</description>
>> >>  </property>
>> >>
>> >>  <property>
>> >>        <name>mapred.output.compression.type</name>
>> >>        <value>BLOCK</value>
>> >>        <description>Block compression</description>
>> >>  </property>
>> >>
>> >>
>> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <
>> brentalanmiller@gmail.com>
>> >> wrote:
>> >> > Hello, I've seen issues similar to this one come up once or twice
>> >> > before,
>> >> > but I haven't ever seen a solution to the problem that I'm having. I
>> was
>> >> > following the Compressed Storage page on the Hive
>> >> > Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized
>> that
>> >> > the
>> >> > sequence files that are created in the warehouse directory are
>> actually
>> >> > uncompressed and larger than than the originals.
>> >> > For example, I have a table 'test1' who's input data looks something
>> >> > like:
>> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> >> > ...
>> >> > And after creating a second table 'test1_comp' that was crated with
>> the
>> >> > STORED AS SEQUENCEFILE directive and the compression options SET as
>> >> > described in the wiki, I can look at the resultant sequence files and
>> >> > see
>> >> > that they're just plain (uncompressed) text:
>> >> > SEQ "org.apache.hadoop.io.BytesWritable
>> >> > org.apache.hadoop.io.Text+�c�!Y�M ��
>> >> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>> >> > ...
>> >> > I've tried messing around with
>> different org.apache.hadoop.io.compress.*
>> >> > options, but the sequence files always come out uncompressed. Has
>> >> > anybody
>> >> > ever seen this or know away to keep the data compressed? Since the
>> input
>> >> > text is so uniform, we get huge space savings from compression and
>> would
>> >> > like to store the data this way if possible. I'm using Hadoop 20.1
>> and
>> >> > Hive
>> >> > that I checked out from SVN about a week ago.
>> >> > Thanks,
>> >> > Brent
>> >>
>> >>
>> >>
>> >> --
>> >> Adam J. O'Donnell, Ph.D.
>> >> Immunet Corporation
>> >> Cell: +1 (267) 251-0070
>> >
>> >
>>
>>
>>
>> --
>> Yours,
>> Zheng
>>
>
>

Re: Help with Compressed Storage

Posted by prasenjit mukherjee <pm...@quattrowireless.com>.
I have a similar issue with bz2 files. I have the hadoop directories:

/ip/data/ : containing uncompressed text files (foo1.txt, foo2.txt)
/ip/datacompressed/ : containing the same files bzipped (foo1.bz2, foo2.bz2)

CREATE EXTERNAL TABLE tx_log(id1 string, id2 string, id3 string)
   ROW FORMAT DELIMITED FIELDS TERMINATED BY '\002'
   LOCATION '/ip/datacompressed/';
SELECT *  FROM tx_log limit 1;

The command works fine with LOCATION '/ip/data/' but doesn't work with
LOCATION '/ip/datacompressed/'.

Any pointers? I thought that (like Pig) hive automatically detects the .bz2
extension and applies the appropriate decompression. Am I wrong?

-Prasen


On Thu, Feb 18, 2010 at 3:04 AM, Zheng Shao <zs...@gmail.com> wrote:

> I just corrected the wiki page. It will also be a good idea to support
> case-insensitive boolean values in the code.
>
> Zheng
>
> On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <br...@gmail.com>
> wrote:
> > Thanks Adam, that works for me as well.
> > It seems that the property for hive.exec.compress.output is case
> sensitive,
> > and when it is set to TRUE (as it is on the compressed storage page on
> the
> > wiki) it is ignored by hive.
> >
> > -Brent
> >
> > On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <ad...@immunet.com>
> wrote:
> >>
> >> Adding these to my hive-site.xml file worked fine:
> >>
> >>  <property>
> >>        <name>hive.exec.compress.output</name>
> >>        <value>true</value>
> >>        <description>Compress output</description>
> >>  </property>
> >>
> >>  <property>
> >>        <name>mapred.output.compression.type</name>
> >>        <value>BLOCK</value>
> >>        <description>Block compression</description>
> >>  </property>
> >>
> >>
> >> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <
> brentalanmiller@gmail.com>
> >> wrote:
> >> > Hello, I've seen issues similar to this one come up once or twice
> >> > before,
> >> > but I haven't ever seen a solution to the problem that I'm having. I
> was
> >> > following the Compressed Storage page on the Hive
> >> > Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized
> that
> >> > the
> >> > sequence files that are created in the warehouse directory are
> actually
> >> > uncompressed and larger than than the originals.
> >> > For example, I have a table 'test1' who's input data looks something
> >> > like:
> >> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> >> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> >> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> >> > ...
> >> > And after creating a second table 'test1_comp' that was crated with
> the
> >> > STORED AS SEQUENCEFILE directive and the compression options SET as
> >> > described in the wiki, I can look at the resultant sequence files and
> >> > see
> >> > that they're just plain (uncompressed) text:
> >> > SEQ "org.apache.hadoop.io.BytesWritable
> >> > org.apache.hadoop.io.Text+�c�!Y�M ��
> >> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> >> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> >> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> >> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> >> > ...
> >> > I've tried messing around with
> different org.apache.hadoop.io.compress.*
> >> > options, but the sequence files always come out uncompressed. Has
> >> > anybody
> >> > ever seen this or know away to keep the data compressed? Since the
> input
> >> > text is so uniform, we get huge space savings from compression and
> would
> >> > like to store the data this way if possible. I'm using Hadoop 20.1 and
> >> > Hive
> >> > that I checked out from SVN about a week ago.
> >> > Thanks,
> >> > Brent
> >>
> >>
> >>
> >> --
> >> Adam J. O'Donnell, Ph.D.
> >> Immunet Corporation
> >> Cell: +1 (267) 251-0070
> >
> >
>
>
>
> --
> Yours,
> Zheng
>

Re: Help with Compressed Storage

Posted by Zheng Shao <zs...@gmail.com>.
I just corrected the wiki page. It would also be a good idea to support
case-insensitive boolean values in the code.

Zheng

On Wed, Feb 17, 2010 at 9:27 AM, Brent Miller <br...@gmail.com> wrote:
> Thanks Adam, that works for me as well.
> It seems that the property for hive.exec.compress.output is case sensitive,
> and when it is set to TRUE (as it is on the compressed storage page on the
> wiki) it is ignored by hive.
>
> -Brent
>
> On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <ad...@immunet.com> wrote:
>>
>> Adding these to my hive-site.xml file worked fine:
>>
>>  <property>
>>        <name>hive.exec.compress.output</name>
>>        <value>true</value>
>>        <description>Compress output</description>
>>  </property>
>>
>>  <property>
>>        <name>mapred.output.compression.type</name>
>>        <value>BLOCK</value>
>>        <description>Block compression</description>
>>  </property>
>>
>>
>> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <br...@gmail.com>
>> wrote:
>> > Hello, I've seen issues similar to this one come up once or twice
>> > before,
>> > but I haven't ever seen a solution to the problem that I'm having. I was
>> > following the Compressed Storage page on the Hive
>> > Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that
>> > the
>> > sequence files that are created in the warehouse directory are actually
>> > uncompressed and larger than than the originals.
>> > For example, I have a table 'test1' who's input data looks something
>> > like:
>> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> > ...
>> > And after creating a second table 'test1_comp' that was crated with the
>> > STORED AS SEQUENCEFILE directive and the compression options SET as
>> > described in the wiki, I can look at the resultant sequence files and
>> > see
>> > that they're just plain (uncompressed) text:
>> > SEQ "org.apache.hadoop.io.BytesWritable
>> > org.apache.hadoop.io.Text+�c�!Y�M ��
>> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>> > ...
>> > I've tried messing around with different org.apache.hadoop.io.compress.*
>> > options, but the sequence files always come out uncompressed. Has
>> > anybody
>> > ever seen this or know away to keep the data compressed? Since the input
>> > text is so uniform, we get huge space savings from compression and would
>> > like to store the data this way if possible. I'm using Hadoop 20.1 and
>> > Hive
>> > that I checked out from SVN about a week ago.
>> > Thanks,
>> > Brent
>>
>>
>>
>> --
>> Adam J. O'Donnell, Ph.D.
>> Immunet Corporation
>> Cell: +1 (267) 251-0070
>
>



-- 
Yours,
Zheng

Re: Help with Compressed Storage

Posted by Brent Miller <br...@gmail.com>.
Thanks Adam, that works for me as well.

It seems that the value of hive.exec.compress.output is case sensitive,
and when it is set to TRUE (as it is on the compressed storage page on the
wiki) it is ignored by hive.
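
In other words (assuming nothing else about the setup changes), the value has
to be lowercase for hive to honor it, i.e.

SET hive.exec.compress.output=true;

rather than =TRUE.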

-Brent

On Tue, Feb 16, 2010 at 4:24 PM, Adam O'Donnell <ad...@immunet.com> wrote:

> Adding these to my hive-site.xml file worked fine:
>
>  <property>
>        <name>hive.exec.compress.output</name>
>        <value>true</value>
>        <description>Compress output</description>
>  </property>
>
>  <property>
>        <name>mapred.output.compression.type</name>
>        <value>BLOCK</value>
>        <description>Block compression</description>
>  </property>
>
>
> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <br...@gmail.com>
> wrote:
> > Hello, I've seen issues similar to this one come up once or twice before,
> > but I haven't ever seen a solution to the problem that I'm having. I was
> > following the Compressed Storage page on the Hive
> > Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that
> the
> > sequence files that are created in the warehouse directory are actually
> > uncompressed and larger than than the originals.
> > For example, I have a table 'test1' who's input data looks something
> like:
> > 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> > 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> > 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> > ...
> > And after creating a second table 'test1_comp' that was crated with the
> > STORED AS SEQUENCEFILE directive and the compression options SET as
> > described in the wiki, I can look at the resultant sequence files and see
> > that they're just plain (uncompressed) text:
> > SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M
> ��
> > Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> > 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> > 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> > 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> > ...
> > I've tried messing around with different org.apache.hadoop.io.compress.*
> > options, but the sequence files always come out uncompressed. Has anybody
> > ever seen this or know away to keep the data compressed? Since the input
> > text is so uniform, we get huge space savings from compression and would
> > like to store the data this way if possible. I'm using Hadoop 20.1 and
> Hive
> > that I checked out from SVN about a week ago.
> > Thanks,
> > Brent
>
>
>
> --
> Adam J. O'Donnell, Ph.D.
> Immunet Corporation
> Cell: +1 (267) 251-0070
>

Re: Help with Compressed Storage

Posted by Adam O'Donnell <ad...@immunet.com>.
Adding these to my hive-site.xml file worked fine:

  <property>
        <name>hive.exec.compress.output</name>
        <value>true</value>
        <description>Compress output</description>
  </property>

  <property>
        <name>mapred.output.compression.type</name>
        <value>BLOCK</value>
        <description>Block compression</description>
  </property>
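
If you also want to pin which codec gets used rather than taking the cluster
default, a property along these lines can be added as well - GzipCodec here is
just an example:

  <property>
        <name>mapred.output.compression.codec</name>
        <value>org.apache.hadoop.io.compress.GzipCodec</value>
        <description>Codec used for compressed job output</description>
  </property>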


On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <br...@gmail.com> wrote:
> Hello, I've seen issues similar to this one come up once or twice before,
> but I haven't ever seen a solution to the problem that I'm having. I was
> following the Compressed Storage page on the Hive
> Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the
> sequence files that are created in the warehouse directory are actually
> uncompressed and larger than than the originals.
> For example, I have a table 'test1' who's input data looks something like:
> 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> ...
> And after creating a second table 'test1_comp' that was crated with the
> STORED AS SEQUENCEFILE directive and the compression options SET as
> described in the wiki, I can look at the resultant sequence files and see
> that they're just plain (uncompressed) text:
> SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M ��
> Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> ...
> I've tried messing around with different org.apache.hadoop.io.compress.*
> options, but the sequence files always come out uncompressed. Has anybody
> ever seen this or know away to keep the data compressed? Since the input
> text is so uniform, we get huge space savings from compression and would
> like to store the data this way if possible. I'm using Hadoop 20.1 and Hive
> that I checked out from SVN about a week ago.
> Thanks,
> Brent



-- 
Adam J. O'Donnell, Ph.D.
Immunet Corporation
Cell: +1 (267) 251-0070

Re: Help with Compressed Storage

Posted by Brent Miller <br...@gmail.com>.
Thank you for the responses and I'm terribly sorry if I'm missing something
obvious here, but after going through google searches a second time and
reviewing your feedback, I'm still having issues with compressed storage not
working correctly.

The commands that I've been entering into the hive cli are:

SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET hive.exec.compress.output=TRUE;
SET io.seqfile.compression.type=BLOCK;

CREATE TABLE test1_comp_gz (busId TINYINT, uId BIGINT, dStamp STRING,
  tStamp STRING, canId STRING, dlc TINYINT, hexData STRING)
  PARTITIONED BY (bus TINYINT, day STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
  STORED AS SEQUENCEFILE;

INSERT OVERWRITE TABLE test1_comp_gz PARTITION (bus=0, day='2010-02-01')
SELECT busid, uid, dstamp, tstamp, canid, dlc, hexdata
FROM test1 WHERE bus=0 AND day='2010-02-01';

Is there something wrong here? I have the hive.exec.compress.output=true
line, and I had tried adding hive.exec.compress.intermediate=TRUE at one
point, thinking it may have had something to do with HIVE-794, but
that seemed to have no effect.
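
One quick sanity check (the warehouse path below is just where my tables
happen to live, adjust as needed) is to peek at the bytes hive actually wrote:

hadoop fs -ls /user/hive/warehouse/test1_comp_gz/bus=0/day=2010-02-01/
hadoop fs -cat '/user/hive/warehouse/test1_comp_gz/bus=0/day=2010-02-01/*' | head -c 200

A compressed SequenceFile should show the codec class name in its header
rather than plain CSV rows.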

Thanks again,
Brent

On Tue, Feb 16, 2010 at 2:32 PM, Yongqiang He <
heyongqiang@software.ict.ac.cn> wrote:

> Like Zheng said,
> Try set hive.exec.compress.output=true;
> "set hive.exec.compress.intermediate=true" is not recommended because of
> the
> cpu cost.
>
> Also in some cases, set hive.merge.mapfiles = false; will help getting a
> better compression.
>
>
> On 2/16/10 2:04 PM, "Zheng Shao" <zs...@gmail.com> wrote:
>
> > Try google "Hive compression":
> >
> > See
> >
> http://svn.apache.org/viewvc/hadoop/hive/trunk/common/src/java/org/apache/hado
> >
> op/hive/conf/HiveConf.java?p2=/hadoop/hive/trunk/common/src/java/org/apache/ha
> >
> doop/hive/conf/HiveConf.java&p1=/hadoop/hive/trunk/common/src/java/org/apache/
> >
> hadoop/hive/conf/HiveConf.java&r1=723687&r2=723686&view=diff&pathrev=723687
> >
> >     COMPRESSRESULT("hive.exec.compress.output", false),
> >     COMPRESSINTERMEDIATE("hive.exec.compress.intermediate", false),
> >
> > Hive uses different compression parameters than hadoop.
> >
> > Also, Hive support using different compressions for intermediate
> > results. See https://issues.apache.org/jira/browse/HIVE-759
> >
> >
> > Zheng
> >
> > On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <brentalanmiller@gmail.com
> >
> > wrote:
> >> Hello, I've seen issues similar to this one come up once or twice
> before,
> >> but I haven't ever seen a solution to the problem that I'm having. I was
> >> following the Compressed Storage page on the Hive
> >> Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that
> the
> >> sequence files that are created in the warehouse directory are actually
> >> uncompressed and larger than than the originals.
> >> For example, I have a table 'test1' who's input data looks something
> like:
> >> 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> >> 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> >> 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> >> ...
> >> And after creating a second table 'test1_comp' that was crated with the
> >> STORED AS SEQUENCEFILE directive and the compression options SET as
> >> described in the wiki, I can look at the resultant sequence files and
> see
> >> that they're just plain (uncompressed) text:
> >> SEQ "org.apache.hadoop.io.BytesWritable
> org.apache.hadoop.io.Text+�c�!Y�M ��
> >> Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> >> 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> >> 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> >> 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> >> ...
> >> I've tried messing around with different org.apache.hadoop.io.compress.*
> >> options, but the sequence files always come out uncompressed. Has
> anybody
> >> ever seen this or know away to keep the data compressed? Since the input
> >> text is so uniform, we get huge space savings from compression and would
> >> like to store the data this way if possible. I'm using Hadoop 20.1 and
> Hive
> >> that I checked out from SVN about a week ago.
> >> Thanks,
> >> Brent
> >
> >
>
>
>

Re: Help with Compressed Storage

Posted by Yongqiang He <he...@software.ict.ac.cn>.
Like Zheng said, try set hive.exec.compress.output=true;
"set hive.exec.compress.intermediate=true" is not recommended because of the
CPU cost.

Also, in some cases, set hive.merge.mapfiles = false; will help you get
better compression.


On 2/16/10 2:04 PM, "Zheng Shao" <zs...@gmail.com> wrote:

> Try google "Hive compression":
> 
> See 
> http://svn.apache.org/viewvc/hadoop/hive/trunk/common/src/java/org/apache/hado
> op/hive/conf/HiveConf.java?p2=/hadoop/hive/trunk/common/src/java/org/apache/ha
> doop/hive/conf/HiveConf.java&p1=/hadoop/hive/trunk/common/src/java/org/apache/
> hadoop/hive/conf/HiveConf.java&r1=723687&r2=723686&view=diff&pathrev=723687
> 
>     COMPRESSRESULT("hive.exec.compress.output", false),
>     COMPRESSINTERMEDIATE("hive.exec.compress.intermediate", false),
> 
> Hive uses different compression parameters than hadoop.
> 
> Also, Hive support using different compressions for intermediate
> results. See https://issues.apache.org/jira/browse/HIVE-759
> 
> 
> Zheng
> 
> On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <br...@gmail.com>
> wrote:
>> Hello, I've seen issues similar to this one come up once or twice before,
>> but I haven't ever seen a solution to the problem that I'm having. I was
>> following the Compressed Storage page on the Hive
>> Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the
>> sequence files that are created in the warehouse directory are actually
>> uncompressed and larger than than the originals.
>> For example, I have a table 'test1' who's input data looks something like:
>> 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
>> 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
>> 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
>> ...
>> And after creating a second table 'test1_comp' that was crated with the
>> STORED AS SEQUENCEFILE directive and the compression options SET as
>> described in the wiki, I can look at the resultant sequence files and see
>> that they're just plain (uncompressed) text:
>> SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M ��
>> Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
>> 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
>> 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
>> 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
>> ...
>> I've tried messing around with different org.apache.hadoop.io.compress.*
>> options, but the sequence files always come out uncompressed. Has anybody
>> ever seen this or know away to keep the data compressed? Since the input
>> text is so uniform, we get huge space savings from compression and would
>> like to store the data this way if possible. I'm using Hadoop 20.1 and Hive
>> that I checked out from SVN about a week ago.
>> Thanks,
>> Brent
> 
> 



Re: Help with Compressed Storage

Posted by Zheng Shao <zs...@gmail.com>.
Try googling "Hive compression":

See http://svn.apache.org/viewvc/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java?p2=/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java&p1=/hadoop/hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java&r1=723687&r2=723686&view=diff&pathrev=723687

    COMPRESSRESULT("hive.exec.compress.output", false),
    COMPRESSINTERMEDIATE("hive.exec.compress.intermediate", false),

Hive uses different compression parameters than hadoop.

Also, Hive supports using different compressions for intermediate
results. See https://issues.apache.org/jira/browse/HIVE-759
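
So from the hive CLI you would flip the hive-level flag, and optionally the
hadoop codec settings underneath it, e.g. (GzipCodec is just an example):

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
SET mapred.output.compression.type=BLOCK;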


Zheng

On Tue, Feb 16, 2010 at 1:43 PM, Brent Miller <br...@gmail.com> wrote:
> Hello, I've seen issues similar to this one come up once or twice before,
> but I haven't ever seen a solution to the problem that I'm having. I was
> following the Compressed Storage page on the Hive
> Wiki http://wiki.apache.org/hadoop/CompressedStorage and realized that the
> sequence files that are created in the warehouse directory are actually
> uncompressed and larger than than the originals.
> For example, I have a table 'test1' who's input data looks something like:
> 0,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43
> 0,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43
> 0,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341
> ...
> And after creating a second table 'test1_comp' that was crated with the
> STORED AS SEQUENCEFILE directive and the compression options SET as
> described in the wiki, I can look at the resultant sequence files and see
> that they're just plain (uncompressed) text:
> SEQ "org.apache.hadoop.io.BytesWritable org.apache.hadoop.io.Text+�c�!Y�M ��
> Z^��= 80,1369962224,2010/02/01,00:00:00.101,0C030301,4,0000BD43=
> 80,1369962225,2010/02/01,00:00:00.101,0C030501,4,66268E43=
> 80,1369962226,2010/02/01,00:00:00.101,0C030701,4,041F3341=
> 80,1369962227,2010/02/01,00:00:00.101,0C030901,4,11360141=
> ...
> I've tried messing around with different org.apache.hadoop.io.compress.*
> options, but the sequence files always come out uncompressed. Has anybody
> ever seen this or know away to keep the data compressed? Since the input
> text is so uniform, we get huge space savings from compression and would
> like to store the data this way if possible. I'm using Hadoop 20.1 and Hive
> that I checked out from SVN about a week ago.
> Thanks,
> Brent



-- 
Yours,
Zheng