Posted to dev@hive.apache.org by "Krishna Kumar (JIRA)" <ji...@apache.org> on 2011/04/06 19:27:05 UTC

[jira] [Created] (HIVE-2097) Explore mechanisms for better compression with RC Files

Explore mechanisms for better compression with RC Files
-------------------------------------------------------

                 Key: HIVE-2097
                 URL: https://issues.apache.org/jira/browse/HIVE-2097
             Project: Hive
          Issue Type: Improvement
          Components: Query Processor, Serializers/Deserializers
            Reporter: Krishna Kumar
            Priority: Minor


Optimization of the compression mechanisms used by RCFile is to be explored.

Some initial ideas:

1. More efficient serialization/deserialization based on type-specific and storage-specific knowledge.

   For instance, storing sorted numeric values efficiently using delta coding techniques (a sketch follows this list).

2. More efficient compression based on type-specific and storage-specific knowledge.

   Enable compression codecs to be specified per type or per individual column.

3. Reordering the on-disk storage for better compression efficiency.
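
As an illustration of idea #1, here is a minimal sketch of delta coding a sorted, non-negative numeric column so that only small gaps (plus a variable-length byte encoding) need to be stored; the class name and exact encoding are hypothetical, not taken from any attached patch.

    // Illustrative only: delta-encode a sorted array of non-negative longs so the
    // stored gaps are small and compress well under any general-purpose codec.
    import java.io.ByteArrayOutputStream;

    public class DeltaCoder {
      // Writes each value as the gap to the previous value; for sorted input the
      // gaps are small non-negative numbers, which the varint below keeps short.
      public static byte[] encode(long[] sortedValues) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        long previous = 0;
        for (long value : sortedValues) {
          writeVLong(out, value - previous);   // store the gap, not the value
          previous = value;
        }
        return out.toByteArray();
      }

      // Minimal unsigned variable-length encoding: 7 bits per byte, high bit = "more".
      private static void writeVLong(ByteArrayOutputStream out, long v) {
        while ((v & ~0x7FL) != 0) {
          out.write((int) ((v & 0x7F) | 0x80));
          v >>>= 7;
        }
        out.write((int) v);
      }
    }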


--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "Krishna Kumar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13158484#comment-13158484 ] 

Krishna Kumar commented on HIVE-2097:
-------------------------------------

Thanks, Alex, for the suggestions.

Just to be sure we are on the same page: I believe you are talking about approach #3 given in the description above, which aligns with the ideas in He Yongqiang's comment. I am currently working on implementing #1 and #2.

Re the #3 approaches: column grouping and row reordering are the general idea, but I do not understand your point about column selectivity. Why should selectivity play a role here, where any grouping/reordering is done purely for better compression? There are two effects we can exploit within a column group: (a) when the values in the two columns are similar, and (b) when the values are correlated, i.e., using conditional probabilities for better compression. In either case, my hope is that we can create type-specific compressors for structs/maps etc. which exploit these features, i.e., a struct/map acts as a column group for compression purposes (a rough sketch follows).
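
To make the struct-as-column-group idea concrete, here is a hypothetical sketch that compresses the serialized buffers of all columns in one group as a single java.util.zip.Deflater stream, so the codec's window spans the correlated columns instead of being reset per column; the class and method names are invented for illustration.

    // Hypothetical illustration: compress a whole column group as one codec stream
    // so cross-column redundancy (similar or correlated values) is visible to it.
    import java.io.ByteArrayOutputStream;
    import java.util.zip.Deflater;

    public class ColumnGroupCompressor {
      public static byte[] compressGroup(byte[][] columnBuffers) {
        // Concatenate the serialized columns of the group into one buffer.
        ByteArrayOutputStream concatenated = new ByteArrayOutputStream();
        for (byte[] column : columnBuffers) {
          concatenated.write(column, 0, column.length);
        }
        // One Deflater stream over the whole group, instead of one per column.
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(concatenated.toByteArray());
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
          int n = deflater.deflate(buffer);
          out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
      }
    }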


                

[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "alex gemini (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160734#comment-13160734 ] 

alex gemini commented on HIVE-2097:
-----------------------------------

I'm not quite sure I explained the columnar database execution strategy clearly; I hope the following material will help:
#1 http://www.infoq.com/news/2011/09/nosqlnow-columnar-databases
#2 http://www.oscon.com/oscon2010/public/schedule/detail/15561
#3 http://www.vertica.com/2010/05/26/why-verticas-compression-is-better/
#4 http://www.vertica.com/2011/09/01/the-power-of-projections-part-1/
Good luck. 
                

[jira] [Assigned] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "Krishna Kumar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krishna Kumar reassigned HIVE-2097:
-----------------------------------

    Assignee: Krishna Kumar


[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "alex gemini (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160687#comment-13160687 ] 

alex gemini commented on HIVE-2097:
-----------------------------------

Selectivity plays an important role in columnar databases because they use run-length encoding (RLE) to compress most dimension-attribute columns. For example, take a log table: create table (gender, age, region, message). We know the selectivity order is gender = 1/2 > age = 1/20 > region = 1/300, and we can order the table columns like #1 (gender, age, region, message) or #2 (region, age, gender, message). For #1 we only need (2 + 2*20 + 2*20*300 + num_of_message) runs to store all the records in one DFS block, but if we organize the table like #2 we need (300 + 300*20 + 300*20*2 + num_of_message). Discarding num_of_message, #1 takes only about 66% of the space #2 requires; the only difference is that run-length encoding uses space more efficiently when the table is organized by selectivity (see the sketch below).
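
As a sketch of the arithmetic above (not code from this issue), the per-column run counts for the two orderings can be computed as running products of the column cardinalities; the cardinalities 2, 20 and 300 are the ones assumed in the comment.

    // Count RLE runs per column when rows are sorted lexicographically by the
    // given column order: column i has at most (product of cardinalities 0..i) runs.
    public class RunCountExample {
      static long totalRuns(long[] cardinalitiesInSortOrder) {
        long runsSoFar = 1;
        long total = 0;
        for (long cardinality : cardinalitiesInSortOrder) {
          runsSoFar *= cardinality;   // runs in this column
          total += runsSoFar;
        }
        return total;
      }

      public static void main(String[] args) {
        long ordering1 = totalRuns(new long[] {2, 20, 300});  // gender, age, region
        long ordering2 = totalRuns(new long[] {300, 20, 2});  // region, age, gender
        System.out.println(ordering1);                        // 2 + 40 + 12000 = 12042
        System.out.println(ordering2);                        // 300 + 6000 + 12000 = 18300
        System.out.printf("ratio = %.2f%n", (double) ordering1 / ordering2);  // ~0.66
      }
    }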
                

[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "Krishna Kumar (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13016451#comment-13016451 ] 

Krishna Kumar commented on HIVE-2097:
-------------------------------------

Comment hijacked from HIVE-2065:

He Yongqiang added a comment - 31/Mar/11 23:13

We examined column groups, and sorted the data internally based on one column within each column group. (But we did not try different compression across column groups.) We tried this with 3-4 tables, and saw ~20% storage savings on one table compared to the previous RCFile. The main problem with this approach is that it is hard to find the correct/most efficient column group definitions.
For example, table tbl_1 has 20 columns, and the user can define:

col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;

This puts col_1, col_2, col_11, col_13 into one column group and reorders that group by sorting on col_1 (the 0 refers to the first column in this column group); puts col_3, col_4, col_15, col_16 into another column group and reorders it by sorting on col_4 (index 1 in that group); and finally puts all other columns into the default column group in the original order (a parsing sketch follows below).
It should also be easy to allow a different compression codec for each column group.

The main blocking issue for this approach is having a full set of utilities to find the best column group definition.
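
Below is a minimal sketch of parsing the column-group definition format quoted above, assuming the "columns:sortIndex;" layout described in the comment; the ColumnGroupSpec/ColumnGroup names are illustrative only, not committed Hive code.

    // Minimal sketch of parsing "colA,colB,...:sortIndex;..." column-group specs.
    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;

    public class ColumnGroupSpec {
      static class ColumnGroup {
        final List<String> columns;
        final int sortColumnIndex;   // index within the group of the column to sort by
        ColumnGroup(List<String> columns, int sortColumnIndex) {
          this.columns = columns;
          this.sortColumnIndex = sortColumnIndex;
        }
      }

      static List<ColumnGroup> parse(String spec) {
        List<ColumnGroup> groups = new ArrayList<>();
        for (String groupSpec : spec.split(";")) {
          if (groupSpec.isEmpty()) {
            continue;                               // tolerate a trailing ';'
          }
          String[] parts = groupSpec.split(":");
          List<String> columns = Arrays.asList(parts[0].split(","));
          int sortIndex = Integer.parseInt(parts[1]);
          groups.add(new ColumnGroup(columns, sortIndex));
        }
        return groups;
      }

      public static void main(String[] args) {
        // The two explicit groups from the example; remaining columns would fall
        // into a default group kept in the original order.
        for (ColumnGroup g : parse("col_1,col_2,col_11,col_13:0;col_3,col_4,col_15,col_16:1;")) {
          System.out.println(g.columns + " sorted by " + g.columns.get(g.sortColumnIndex));
        }
      }
    }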




[jira] [Updated] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "Krishna Kumar (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Krishna Kumar updated HIVE-2097:
--------------------------------

    Attachment: datacomp.tar.gz

Unrefactored source for all the implemented compression codecs.
                

[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "alex gemini (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160726#comment-13160726 ] 

alex gemini commented on HIVE-2097:
-----------------------------------

Another issue concerns efficient serialization/deserialization. For the same example as above, assume every (gender, age, region) combination has 100 messages, stored evenly in one DFS block. In the gender column we then store values as runs like {'male'}[1-60k]{'female'}[60k+1 - 120k], the age column looks like {21}[1-3k]{22}[3k+1 - 6k]{23}[6k+1 - 9k], and the region column looks like {'LA'}[1-300]{'NY'}[301-600].
When we issue a query on a single table such as: select sum(age) from logs where region='LA' and age=30, we look at every column referenced in the select/where/group clauses, so we know which referenced column has the lowest selectivity (in this example, region). We find the region runs value={'LA'}[(1-300),(30k+1 - 30k+300),(60k+1 - 60k+300)....] and value={'NY'}[(301-600),(30k+301 - 30k+600),(60k+301 - 60k+600)....].
We only need to deserialize these run descriptors, not decompress the other columns, because we know the lowest-selectivity column. We can then organize the InputSplit's key like {[age='21'][region='LA']} and its value as the row ranges {(1-300),(30k+1 - 30k+300),(60k+1 - 60k+300)....}. This InputSplit key and value are unique per DFS block, because the columns are already sorted by selectivity, so the lowest-selectivity column referenced in the select/where/group clauses must be unique (a rough sketch follows).
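
As a rough, hypothetical sketch of the idea above: keep an RLE column as (value, firstRow, lastRow) run descriptors and answer a predicate such as region='LA' by scanning only those descriptors, without decompressing the other columns; all names below are invented for illustration.

    // Represent an RLE column as run descriptors and filter on them directly,
    // producing row ranges like {(1-300), (30k+1 - 30k+300), ...} from the comment.
    import java.util.ArrayList;
    import java.util.List;

    public class RunIndexExample {
      static class Run {
        final String value;
        final long firstRow;   // inclusive
        final long lastRow;    // inclusive
        Run(String value, long firstRow, long lastRow) {
          this.value = value;
          this.firstRow = firstRow;
          this.lastRow = lastRow;
        }
      }

      // Returns the row ranges matching a predicate like region = 'LA', using only
      // the run descriptors of that column; other columns stay compressed.
      static List<long[]> rowsMatching(List<Run> runs, String wanted) {
        List<long[]> ranges = new ArrayList<>();
        for (Run run : runs) {
          if (run.value.equals(wanted)) {
            ranges.add(new long[] {run.firstRow, run.lastRow});
          }
        }
        return ranges;
      }
    }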
                

[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

Posted by "alex gemini (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13157705#comment-13157705 ] 

alex gemini commented on HIVE-2097:
-----------------------------------

A few suggestions:
In columnar databases, the columns are usually organized in a "highest selectivity comes first" order; then, within each column, the values are stored sorted.
Case 1: if we already know the pattern of each column in a big dataset (for example, we can compute a sampled column distribution inside the database), we need to know the number of distinct values per column. In the create table statement:
create table a
(col1,col2,col3,col4,col5,col6,xxx)
TBLPROPERTIES (col1_sample=0.001,col2_sample=0.01,col3_sample=0.5,col4_sample=0.02,col5_sample=0.002,col6_sample=0.005)
When we organize column groups, we then know which column has the highest selectivity. In this example the selectivity order of table a is col3 > col4 > col2 > col6 > col5 > col1, so we can organize the column groups like (col3,col4,col2),(col6,col5),col1 (an ordering sketch follows at the end of this comment).
Case 2: if we do not know the table properties at create time, we can store the table normally and then provide a utility like hive --service rcfile_reorder 'some_hive_table_here'. Executing this command would submit several MapReduce jobs to calculate the selectivity of each column and store it in the metastore, then decompress each RCFile and reorganize it into more space-efficient column groups.
Hope this helps.
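
A small sketch of the case 1 idea above: given per-column selectivity samples like the hypothetical colN_sample table properties, order the columns from highest to lowest selectivity before forming column groups; the property names and this utility are illustrative, not an existing Hive feature.

    // Order columns by descending selectivity sample, reproducing the ordering
    // col3 > col4 > col2 > col6 > col5 > col1 from the example above.
    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    public class SelectivityOrdering {
      static List<String> orderBySelectivity(Map<String, Double> samples) {
        List<Map.Entry<String, Double>> entries = new ArrayList<>(samples.entrySet());
        entries.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));  // descending
        List<String> ordered = new ArrayList<>();
        for (Map.Entry<String, Double> entry : entries) {
          ordered.add(entry.getKey());
        }
        return ordered;
      }

      public static void main(String[] args) {
        Map<String, Double> samples = new LinkedHashMap<>();
        samples.put("col1", 0.001);
        samples.put("col2", 0.01);
        samples.put("col3", 0.5);
        samples.put("col4", 0.02);
        samples.put("col5", 0.002);
        samples.put("col6", 0.005);
        // Prints [col3, col4, col2, col6, col5, col1].
        System.out.println(orderBySelectivity(samples));
      }
    }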
                