Posted to user@hive.apache.org by Edward Capriolo <ed...@gmail.com> on 2011/01/24 21:30:28 UTC

Can compression be used with ColumnarSerDe ?

I am trying to explore some use cases that I believe are perfect for
the ColumnarSerDe: tables with 100+ columns where only one or two are
selected in a particular query.

CREATE TABLE (....)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
   STORED AS RCFile ;
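
For reference, the full DDL I have in mind is roughly the following (the
table and column names are made up; the real table has 100+ columns):

-- illustrative schema only; the real table is much wider
CREATE TABLE events_rc (
  event_id   BIGINT,
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
STORED AS RCFILE;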

My issue is that the data in our source table, stored as gzip-compressed
sequence files, is much smaller than the ColumnarSerDe table, and as a
result any performance gains are lost.

Any ideas?

Thank you,
Edward

Re: Can compression be used with ColumnarSerDe ?

Posted by yongqiang he <he...@gmail.com>.
@Zheng, right now no. Will open a new jira once we have some numbers.

On Tue, Jan 25, 2011 at 9:51 AM, Zheng Shao <zs...@gmail.com> wrote:
> Is there a jira about the design of the new file format?
>
> Sent from my iPhone
>
> On Jan 24, 2011, at 5:40 PM, yongqiang he <he...@gmail.com> wrote:
>
>> 4M is a good number based on a lot of experiments. Increasing the number
>> will reduce the file size, but the savings grow very slowly once the block
>> size goes beyond 4M, while the performance and memory cost increase a lot.
>>
>> FYI, we are trying a new columnar file format, and it seems 4M is no
>> longer the magic number for that format. We do not have numbers to
>> share yet.
>>
>> On Mon, Jan 24, 2011 at 5:03 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> On Mon, Jan 24, 2011 at 7:51 PM, yongqiang he <he...@gmail.com> wrote:
>>>> Yes. It only supports block compression (there is no record-level compression).
>>>> You can use the config 'hive.io.rcfile.record.buffer.size' to specify
>>>> the block size (before compression). The default is 4MB.
>>>>
>>>> Thanks
>>>> Yongqiang
>>>> On Mon, Jan 24, 2011 at 4:44 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>> On Mon, Jan 24, 2011 at 4:42 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>>> On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <he...@gmail.com> wrote:
>>>>>>> How did you upload the data to the new table?
>>>>>>> You can get the data compressed by doing an INSERT OVERWRITE into the
>>>>>>> destination table with "hive.exec.compress.output" set to true.
>>>>>>>
>>>>>>> Thanks
>>>>>>> Yongqiang
>>>>>>> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>>>>> I am trying to explore some use cases that I believe are perfect for
>>>>>>>> the ColumnarSerDe: tables with 100+ columns where only one or two are
>>>>>>>> selected in a particular query.
>>>>>>>>
>>>>>>>> CREATE TABLE (....)
>>>>>>>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>>>>>>>>   STORED AS RCFile ;
>>>>>>>>
>>>>>>>> My issue is that the data in our source table, stored as gzip-compressed
>>>>>>>> sequence files, is much smaller than the ColumnarSerDe table, and as a
>>>>>>>> result any performance gains are lost.
>>>>>>>>
>>>>>>>> Any ideas?
>>>>>>>>
>>>>>>>> Thank you,
>>>>>>>> Edward
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>> Thank you! That was an RTFM question.
>>>>>>
>>>>>>  set hive.exec.dynamic.partition.mode=nonstrict;
>>>>>> set hive.exec.compress.output=true;
>>>>>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>>>>
>>>>>> I was unclear about 'STORED AS RCFile', since normally you would need
>>>>>> to use 'STORED AS SEQUENCEFILE'.
>>>>>>
>>>>>> However, http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
>>>>>> explains this well: RCFILE is a special type of sequence file.
>>>>>>
>>>>>> I did get it working. Looks good: compression for my table was smaller
>>>>>> than with GZIP block-compressed sequence files. Query time was slightly
>>>>>> better in limited testing. Cool stuff.
>>>>>>
>>>>>> Edward
>>>>>>
>>>>>
>>>>> Do RCFiles support a block size setting for compression, like other compressed
>>>>> sequence files?
>>>>>
>>>>
>>>
>>> Great. Do you have any suggestions or hints on how to tune this? Any
>>> information on what the ceiling or the floor might be?
>>>
>>> Edward
>>>
>

Re: Can compression be used with ColumnarSerDe ?

Posted by Zheng Shao <zs...@gmail.com>.
Is there a jira about the design of the new file format?

Sent from my iPhone

On Jan 24, 2011, at 5:40 PM, yongqiang he <he...@gmail.com> wrote:

> 4M is a good number based on a lot of experiments. Increasing the number
> will reduce the file size, but the savings grow very slowly once the block
> size goes beyond 4M, while the performance and memory cost increase a lot.
>
> FYI, we are trying a new columnar file format, and it seems 4M is no
> longer the magic number for that format. We do not have numbers to
> share yet.
> 
> On Mon, Jan 24, 2011 at 5:03 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> On Mon, Jan 24, 2011 at 7:51 PM, yongqiang he <he...@gmail.com> wrote:
>>> Yes. It only supports block compression (there is no record-level compression).
>>> You can use the config 'hive.io.rcfile.record.buffer.size' to specify
>>> the block size (before compression). The default is 4MB.
>>> 
>>> Thanks
>>> Yongqiang
>>> On Mon, Jan 24, 2011 at 4:44 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>> On Mon, Jan 24, 2011 at 4:42 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>> On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <he...@gmail.com> wrote:
>>>>>> How did you upload the data to the new table?
>>>>>> You can get the data compressed by doing an INSERT OVERWRITE into the
>>>>>> destination table with "hive.exec.compress.output" set to true.
>>>>>> 
>>>>>> Thanks
>>>>>> Yongqiang
>>>>>> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>>>> I am trying to explore some use cases that I believe are perfect for
>>>>>>> the ColumnarSerDe: tables with 100+ columns where only one or two are
>>>>>>> selected in a particular query.
>>>>>>> 
>>>>>>> CREATE TABLE (....)
>>>>>>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>>>>>>>   STORED AS RCFile ;
>>>>>>> 
>>>>>>> My issue is that the data in our source table, stored as gzip-compressed
>>>>>>> sequence files, is much smaller than the ColumnarSerDe table, and as a
>>>>>>> result any performance gains are lost.
>>>>>>> 
>>>>>>> Any ideas?
>>>>>>> 
>>>>>>> Thank you,
>>>>>>> Edward
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> Thank you! That was an RTFM question.
>>>>> 
>>>>>  set hive.exec.dynamic.partition.mode=nonstrict;
>>>>> set hive.exec.compress.output=true;
>>>>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>>> 
>>>>> I was unclear about 'STORED AS RCFile', since normally you would need
>>>>> to use 'STORED AS SEQUENCEFILE'.
>>>>>
>>>>> However, http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
>>>>> explains this well: RCFILE is a special type of sequence file.
>>>>>
>>>>> I did get it working. Looks good: compression for my table was smaller
>>>>> than with GZIP block-compressed sequence files. Query time was slightly
>>>>> better in limited testing. Cool stuff.
>>>>> 
>>>>> Edward
>>>>> 
>>>> 
>>>> Do RCFiles support a block size setting for compression, like other compressed
>>>> sequence files?
>>>> 
>>> 
>> 
>> Great. Do you have any suggestions or hints on how to tune this? Any
>> information on what the ceiling or the floor might be?
>> 
>> Edward
>> 

Re: Can compression be used with ColumnarSerDe ?

Posted by yongqiang he <he...@gmail.com>.
4M is a good number based on a lot of experiments. Increasing the number
will reduce the file size, but the savings grow very slowly once the block
size goes beyond 4M, while the performance and memory cost increase a lot.

FYI, we are trying a new columnar file format, and it seems 4M is no
longer the magic number for that format. We do not have numbers to
share yet.
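
If you want to experiment, a sketch like the following (the table names are
only placeholders) lets you compare the output size at two block sizes:

set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- write the same data once with the default 4MB block size ...
set hive.io.rcfile.record.buffer.size=4194304;
INSERT OVERWRITE TABLE events_rc_4m SELECT * FROM events_source;

-- ... and once with 8MB, then compare the two warehouse directories
-- (for example with 'hadoop fs -du')
set hive.io.rcfile.record.buffer.size=8388608;
INSERT OVERWRITE TABLE events_rc_8m SELECT * FROM events_source;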

On Mon, Jan 24, 2011 at 5:03 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Mon, Jan 24, 2011 at 7:51 PM, yongqiang he <he...@gmail.com> wrote:
>> Yes. It only supports block compression (there is no record-level compression).
>> You can use the config 'hive.io.rcfile.record.buffer.size' to specify
>> the block size (before compression). The default is 4MB.
>>
>> Thanks
>> Yongqiang
>> On Mon, Jan 24, 2011 at 4:44 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> On Mon, Jan 24, 2011 at 4:42 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>> On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <he...@gmail.com> wrote:
>>>>> How did you upload the data to the new table?
>>>>> You can get the data compressed by doing an INSERT OVERWRITE into the
>>>>> destination table with "hive.exec.compress.output" set to true.
>>>>>
>>>>> Thanks
>>>>> Yongqiang
>>>>> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>>> I am trying to explore some use cases that I believe are perfect for
>>>>>> the ColumnarSerDe: tables with 100+ columns where only one or two are
>>>>>> selected in a particular query.
>>>>>>
>>>>>> CREATE TABLE (....)
>>>>>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>>>>>>   STORED AS RCFile ;
>>>>>>
>>>>>> My issue is that the data in our source table, stored as gzip-compressed
>>>>>> sequence files, is much smaller than the ColumnarSerDe table, and as a
>>>>>> result any performance gains are lost.
>>>>>>
>>>>>> Any ideas?
>>>>>>
>>>>>> Thank you,
>>>>>> Edward
>>>>>>
>>>>>
>>>>
>>>> Thank you! That was an RTFM question.
>>>>
>>>>  set hive.exec.dynamic.partition.mode=nonstrict;
>>>> set hive.exec.compress.output=true;
>>>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>>
>>>> I was unclear about 'STORED AS RCFile', since normally you would need
>>>> to use 'STORED AS SEQUENCEFILE'.
>>>>
>>>> However, http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
>>>> explains this well: RCFILE is a special type of sequence file.
>>>>
>>>> I did get it working. Looks good: compression for my table was smaller
>>>> than with GZIP block-compressed sequence files. Query time was slightly
>>>> better in limited testing. Cool stuff.
>>>>
>>>> Edward
>>>>
>>>
>>> Do RCFiles support a block size setting for compression, like other compressed
>>> sequence files?
>>>
>>
>
> Great. Do you have any suggestions or hints on how to tune this? Any
> information on what the ceiling or the floor might be?
>
> Edward
>

Re: Can compression be used with ColumnarSerDe ?

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Jan 24, 2011 at 7:51 PM, yongqiang he <he...@gmail.com> wrote:
> Yes. It only supports block compression (there is no record-level compression).
> You can use the config 'hive.io.rcfile.record.buffer.size' to specify
> the block size (before compression). The default is 4MB.
>
> Thanks
> Yongqiang
> On Mon, Jan 24, 2011 at 4:44 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> On Mon, Jan 24, 2011 at 4:42 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <he...@gmail.com> wrote:
>>>> How did you upload the data to the new table?
>>>> You can get the data compressed by doing an INSERT OVERWRITE into the
>>>> destination table with "hive.exec.compress.output" set to true.
>>>>
>>>> Thanks
>>>> Yongqiang
>>>> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>>> I am trying to explore some use cases that I believe are perfect for
>>>>> the ColumnarSerDe: tables with 100+ columns where only one or two are
>>>>> selected in a particular query.
>>>>>
>>>>> CREATE TABLE (....)
>>>>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>>>>>   STORED AS RCFile ;
>>>>>
>>>>> My issue is that the data in our source table, stored as gzip-compressed
>>>>> sequence files, is much smaller than the ColumnarSerDe table, and as a
>>>>> result any performance gains are lost.
>>>>>
>>>>> Any ideas?
>>>>>
>>>>> Thank you,
>>>>> Edward
>>>>>
>>>>
>>>
>>> Thank you! That was an RTFM question.
>>>
>>>  set hive.exec.dynamic.partition.mode=nonstrict;
>>> set hive.exec.compress.output=true;
>>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>>
>>> I was unclear about 'STORED AS RCFile', since normally you would need
>>> to use 'STORED AS SEQUENCEFILE'.
>>>
>>> However, http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
>>> explains this well: RCFILE is a special type of sequence file.
>>>
>>> I did get it working. Looks good: compression for my table was smaller
>>> than with GZIP block-compressed sequence files. Query time was slightly
>>> better in limited testing. Cool stuff.
>>>
>>> Edward
>>>
>>
>> Do RCFiles support a block size setting for compression, like other compressed
>> sequence files?
>>
>

Great. Do you have any suggestions or hints on how to tune this? Any
information on what the ceiling or the floor might be?

Edward

Re: Can compression be used with ColumnarSerDe ?

Posted by yongqiang he <he...@gmail.com>.
Yes. It only supports block compression (there is no record-level compression).
You can use the config 'hive.io.rcfile.record.buffer.size' to specify
the block size (before compression). The default is 4MB.
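
For example, to write one table with an 8MB block instead of the default
(the value is in bytes; the table names below are just placeholders):

set hive.exec.compress.output=true;
-- 8MB; the default is 4194304 (4MB)
set hive.io.rcfile.record.buffer.size=8388608;
INSERT OVERWRITE TABLE my_rc_table SELECT * FROM my_source_table;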

Thanks
Yongqiang
On Mon, Jan 24, 2011 at 4:44 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Mon, Jan 24, 2011 at 4:42 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <he...@gmail.com> wrote:
>>> How did you upload the data to the new table?
>>> You can get the data compressed by doing an INSERT OVERWRITE into the
>>> destination table with "hive.exec.compress.output" set to true.
>>>
>>> Thanks
>>> Yongqiang
>>> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>>> I am trying to explore some use cases that I believe are perfect for
>>>> the ColumnarSerDe: tables with 100+ columns where only one or two are
>>>> selected in a particular query.
>>>>
>>>> CREATE TABLE (....)
>>>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>>>>   STORED AS RCFile ;
>>>>
>>>> My issue is that the data in our source table, stored as gzip-compressed
>>>> sequence files, is much smaller than the ColumnarSerDe table, and as a
>>>> result any performance gains are lost.
>>>>
>>>> Any ideas?
>>>>
>>>> Thank you,
>>>> Edward
>>>>
>>>
>>
>> Thank you! That was an RTFM question.
>>
>>  set hive.exec.dynamic.partition.mode=nonstrict;
>> set hive.exec.compress.output=true;
>> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>>
>> I was unclear about 'STORED AS RCFile', since normally you would need
>> to use 'STORED AS SEQUENCEFILE'.
>>
>> However, http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
>> explains this well: RCFILE is a special type of sequence file.
>>
>> I did get it working. Looks good: compression for my table was smaller
>> than with GZIP block-compressed sequence files. Query time was slightly
>> better in limited testing. Cool stuff.
>>
>> Edward
>>
>
> Do RCFiles support a block size setting for compression, like other compressed
> sequence files?
>

Re: Can compression be used with ColumnarSerDe ?

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Jan 24, 2011 at 4:42 PM, Edward Capriolo <ed...@gmail.com> wrote:
> On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <he...@gmail.com> wrote:
>> How did you upload the data to the new table?
>> You can get the data compressed by doing an INSERT OVERWRITE into the
>> destination table with "hive.exec.compress.output" set to true.
>>
>> Thanks
>> Yongqiang
>> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
>>> I am trying to explore some use cases that I believe are perfect for
>>> the ColumnarSerDe: tables with 100+ columns where only one or two are
>>> selected in a particular query.
>>>
>>> CREATE TABLE (....)
>>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>>>   STORED AS RCFile ;
>>>
>>> My issue is that the data in our source table, stored as gzip-compressed
>>> sequence files, is much smaller than the ColumnarSerDe table, and as a
>>> result any performance gains are lost.
>>>
>>> Any ideas?
>>>
>>> Thank you,
>>> Edward
>>>
>>
>
> Thank you! That was an RTFM question.
>
>  set hive.exec.dynamic.partition.mode=nonstrict;
> set hive.exec.compress.output=true;
> set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
>
> I was unclear about 'STORED AS RCFile', since normally you would need
> to use 'STORED AS SEQUENCEFILE'.
>
> However, http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
> explains this well: RCFILE is a special type of sequence file.
>
> I did get it working. Looks good: compression for my table was smaller
> than with GZIP block-compressed sequence files. Query time was slightly
> better in limited testing. Cool stuff.
>
> Edward
>

Do RCFiles support a block size setting for compression, like other compressed
sequence files?

Re: Can compression be used with ColumnarSerDe ?

Posted by Edward Capriolo <ed...@gmail.com>.
On Mon, Jan 24, 2011 at 4:14 PM, yongqiang he <he...@gmail.com> wrote:
> How did you upload the data to the new table?
> You can get the data compressed by doing an INSERT OVERWRITE into the
> destination table with "hive.exec.compress.output" set to true.
>
> Thanks
> Yongqiang
> On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
>> I am trying to explore some use cases that I believe are perfect for
>> the ColumnarSerDe: tables with 100+ columns where only one or two are
>> selected in a particular query.
>>
>> CREATE TABLE (....)
>> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>>   STORED AS RCFile ;
>>
>> My issue is that the data in our source table, stored as gzip-compressed
>> sequence files, is much smaller than the ColumnarSerDe table, and as a
>> result any performance gains are lost.
>>
>> Any ideas?
>>
>> Thank you,
>> Edward
>>
>

Thank you! That was an RTFM question.

 set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.compress.output=true;
set mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;
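
The insert itself was along these lines (the table and column names here are
made up; note the dynamic partition column has to come last in the select):

-- may also need: set hive.exec.dynamic.partition=true;
INSERT OVERWRITE TABLE events_rc PARTITION (dt)
SELECT event_id, event_time, payload, dt
FROM events_seq;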

I was unclear about 'STORED AS RCFile', since normally you would need
to use 'STORED AS SEQUENCEFILE'.

However, http://hive.apache.org/docs/r0.6.0/api/org/apache/hadoop/hive/ql/io/RCFile.html
explains this well: RCFILE is a special type of sequence file.

I did get it working. Looks good: compression for my table was smaller
than with GZIP block-compressed sequence files. Query time was slightly
better in limited testing. Cool stuff.

Edward

Re: Can compression be used with ColumnarSerDe ?

Posted by yongqiang he <he...@gmail.com>.
How did you upload the data to the new table?
You can get the data compressed by doing an INSERT OVERWRITE into the
destination table with "hive.exec.compress.output" set to true.

Thanks
Yongqiang
On Mon, Jan 24, 2011 at 12:30 PM, Edward Capriolo <ed...@gmail.com> wrote:
> I am trying to explore some use cases that I believe are perfect for
> the ColumnarSerDe: tables with 100+ columns where only one or two are
> selected in a particular query.
>
> CREATE TABLE (....)
> ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"
>   STORED AS RCFile ;
>
> My issue is that the data in our source table, stored as gzip-compressed
> sequence files, is much smaller than the ColumnarSerDe table, and as a
> result any performance gains are lost.
>
> Any ideas?
>
> Thank you,
> Edward
>