Posted to dev@hive.apache.org by "alex gemini (Commented) (JIRA)" <ji...@apache.org> on 2011/12/01 07:49:40 UTC

[jira] [Commented] (HIVE-2097) Explore mechanisms for better compression with RC Files

    [ https://issues.apache.org/jira/browse/HIVE-2097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13160687#comment-13160687 ] 

alex gemini commented on HIVE-2097:
-----------------------------------

Selectivity plays an important role in columnar databases because they use run-length encoding to compress most dimension-attribute columns. For example, suppose we have a log table: create table (gender, age, region, message). We know that the selectivity order is gender = 1/2 > age = 1/20 > region = 1/300. We can order the table columns as #1 (gender, age, region, message) or #2 (region, age, gender, message). Assuming the rows in one DFS block are sorted by the columns in table order, #1 needs only (2 + 2*20 + 2*20*300 + num_of_message) = 12042 + num_of_message stored values to hold all the records, whereas #2 needs (300 + 300*20 + 300*20*2 + num_of_message) = 18300 + num_of_message. Ignoring num_of_message, #1 takes only about 66% of the space #2 requires. The only difference is that run-length encoding uses space more efficiently when the table is organized by selectivity.
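The run counts above can be checked with a minimal sketch. It assumes every combination of gender, age, and region occurs in the block and that rows are sorted by the columns in table order. The helper names count_runs and total_runs are hypothetical and are not part of Hive or RCFile; the message column is left out, as in the comparison above.

```python
from itertools import product


def count_runs(values):
    """Count run-length-encoding runs: maximal stretches of equal adjacent values."""
    runs = 0
    previous = object()  # sentinel that never equals a real column value
    for value in values:
        if value != previous:
            runs += 1
            previous = value
    return runs


def total_runs(cardinalities):
    """Sum the RLE runs over all columns of a fully populated, sorted block.

    Assumption: every combination of values appears exactly once and rows
    are sorted lexicographically in column order (product() yields them so).
    """
    rows = list(product(*(range(c) for c in cardinalities)))
    return sum(count_runs(column) for column in zip(*rows))


order_1 = total_runs([2, 20, 300])   # gender, age, region
order_2 = total_runs([300, 20, 2])   # region, age, gender
print(order_1, order_2, round(order_1 / order_2, 2))  # 12042 18300 0.66
```

Real data will rarely contain every combination, so the actual ratio may differ; the sketch only reproduces the arithmetic of the example.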
                
> Explore mechanisms for better compression with RC Files
> -------------------------------------------------------
>
>                 Key: HIVE-2097
>                 URL: https://issues.apache.org/jira/browse/HIVE-2097
>             Project: Hive
>          Issue Type: Improvement
>          Components: Query Processor, Serializers/Deserializers
>            Reporter: Krishna Kumar
>            Assignee: Krishna Kumar
>            Priority: Minor
>
> Optimization of the compression mechanisms used by RC File to be explored.
> Some initial ideas:
>
> 1. More efficient serialization/deserialization based on type-specific and storage-specific knowledge.
>    For instance, storing sorted numeric values efficiently using some delta coding techniques.
> 2. More efficient compression based on type-specific and storage-specific knowledge.
>    Enable compression codecs to be specified based on types or individual columns.
> 3. Reordering the on-disk storage for better compression efficiency.
