Posted to dev@hive.apache.org by "Zheng Shao (JIRA)" <ji...@apache.org> on 2009/05/20 10:17:45 UTC

[jira] Issue Comment Edited: (HIVE-352) Make Hive support column based storage

    [ https://issues.apache.org/jira/browse/HIVE-352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12711054#action_12711054 ] 

Zheng Shao edited comment on HIVE-352 at 5/20/09 1:16 AM:
----------------------------------------------------------

Some more test results using real log data. I also tried two gzip levels by adding a new GzipCodec1 (level 1); the stock GzipCodec takes the default, which should be level 6 (a sketch of such a codec follows the results below). Gzip compression levels have a big impact on both running time and compressed size.

The time and output_file_size shown are the average map-task running time and output file size. I changed mapred.min.split.size to make sure a single mapper always processes the same data across the different tests:

{code}
InputFileFormat -> OutputFileFormat: time output_file_size
Seqfile GZIP 6 -> Seqfile GZIP 1: 1'25'' 182MB
Seqfile GZIP 1 -> Seqfile GZIP 6: 2'05'' 134MB

set hive.io.rcfile.record.buffer.size=4194304;
Rcfile GZIP 6 -> Rcfile GZIP 6: 2'50'' 104MB
Rcfile GZIP 6 -> Rcfile GZIP 1: 1'55'' 130MB
{code}
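For reference, here is a minimal sketch of what a level-1 codec like GzipCodec1 could look like. This is my illustration, not necessarily the patch's actual code; it assumes Hadoop's native-backed ZlibCompressor and its (level, strategy, header, bufferSize) constructor, so it needs the native zlib library loaded:

{code}
// Hypothetical sketch of a level-1 gzip codec; the GzipCodec1 used in the
// tests above may differ. ZlibCompressor is the native-backed implementation,
// so the native zlib library must be available.
import org.apache.hadoop.io.compress.Compressor;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.io.compress.zlib.ZlibCompressor;

public class GzipCodec1 extends GzipCodec {
  @Override
  public Compressor createCompressor() {
    // BEST_SPEED is deflate level 1; GZIP_FORMAT makes the output a real
    // gzip stream (header + trailer) rather than a bare zlib stream.
    return new ZlibCompressor(
        ZlibCompressor.CompressionLevel.BEST_SPEED,
        ZlibCompressor.CompressionStrategy.DEFAULT_STRATEGY,
        ZlibCompressor.CompressionHeader.GZIP_FORMAT,
        64 * 1024);
  }

  @Override
  public Class<? extends Compressor> getCompressorType() {
    // Keep CodecPool reuse consistent with what createCompressor returns.
    return ZlibCompressor.class;
  }
}
{code}

Such a codec would then be registered through io.compression.codecs and selected as the job's output compression codec.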

From this, Rcfile GZIP 1 beats Seqfile GZIP 6 in both time and space: 8% less time (1'55'' vs 2'05'') and 3% less space (130MB vs 134MB).
I believe Rcfile still has untapped time performance (removing synchronized methods, etc.) that we can exploit.

These results show that Rcfile can beat Seqfile even when we select all fields.
If we select only a subset of the fields, Rcfile will beat Seqfile by a much wider margin once HIVE-460 and HIVE-461 are in.



A local test with the same data (uncompressed text, about 910MB) shows that GZIP 1 compression takes about half the time of GZIP 6 compression.
Compression and decompression were done with the command-line utilities gzip and gunzip.
Note that I warmed up the cache first; the disk read time is less than 1'' (checked by doing cat file > /dev/null).
All times reported are wall time, but user time is within 1''-2'' of it.
The CPU is a dual AMD Opteron 270 (2 sockets x 2 cores at 2GHz).

{code}
GZIP 1 compression: 22'' decompression: 7.3''
GZIP 6 compression: 49'' decompression: 6.4''
time wc uncompressed: 9.3''
time awk 'END {print NR;}' uncompressed: 2.8''
{code}

These numbers are probably the lower bounds of the running time we can ever achieve.
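For anyone who wants to reproduce the level-1 vs level-6 gap without a cluster, here is a small JDK-only sketch. It is my illustration, not the tool used for the numbers above (those came from command-line gzip/gunzip); the input path is a placeholder argument, and it measures the zlib stream rather than the gzip container, which only adds a small header and trailer:

{code}
// JDK-only sketch comparing deflate level 1 vs level 6 on one input file.
import java.io.FileInputStream;
import java.util.zip.Deflater;

public class DeflateLevelBench {
  public static void main(String[] args) throws Exception {
    for (int level : new int[] {1, 6}) {
      Deflater def = new Deflater(level);
      byte[] in = new byte[1 << 20];
      byte[] out = new byte[1 << 20];
      long outBytes = 0;
      long start = System.nanoTime();
      // Timing includes the reads; with a warmed OS cache (as in the test
      // above) the read cost is negligible next to compression.
      try (FileInputStream fis = new FileInputStream(args[0])) {
        int n;
        while ((n = fis.read(in)) > 0) {
          def.setInput(in, 0, n);
          while (!def.needsInput()) {
            outBytes += def.deflate(out);
          }
        }
      }
      def.finish();
      while (!def.finished()) {
        outBytes += def.deflate(out);
      }
      def.end();
      System.out.printf("level %d: %.1f'' %d bytes%n",
          level, (System.nanoTime() - start) / 1e9, outBytes);
    }
  }
}
{code}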

> Make Hive support column based storage
> --------------------------------------
>
>                 Key: HIVE-352
>                 URL: https://issues.apache.org/jira/browse/HIVE-352
>             Project: Hadoop Hive
>          Issue Type: New Feature
>            Reporter: He Yongqiang
>            Assignee: He Yongqiang
>             Fix For: 0.4.0
>
>         Attachments: 4-22 performace2.txt, 4-22 performance.txt, 4-22 progress.txt, hive-352-2009-4-15.patch, hive-352-2009-4-16.patch, hive-352-2009-4-17.patch, hive-352-2009-4-19.patch, hive-352-2009-4-22-2.patch, hive-352-2009-4-22.patch, hive-352-2009-4-23.patch, hive-352-2009-4-27.patch, hive-352-2009-4-30-2.patch, hive-352-2009-4-30-3.patch, hive-352-2009-4-30-4.patch, hive-352-2009-5-1-3.patch, hive-352-2009-5-1.patch, HIve-352-draft-2009-03-28.patch, Hive-352-draft-2009-03-30.patch
>
>
> Column-based storage has been proven to be a better storage layout for OLAP.
> Hive does a great job on raw row-oriented storage. In this issue, we will enhance Hive to support column-based storage.
> Actually, we have already done some work on column-based storage on top of HDFS; I think it will need some review and refactoring to port it to Hive.
> Any thoughts?
