Posted to dev@hive.apache.org by "Yongqiang He (JIRA)" <ji...@apache.org> on 2009/06/26 07:53:07 UTC

[jira] Commented: (HIVE-461) Optimize RCFile reading by using column pruning results

    [ https://issues.apache.org/jira/browse/HIVE-461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724394#action_12724394 ] 

Yongqiang He commented on HIVE-461:
-----------------------------------

Please remove input20.q.out from the patch when testing and committing.
input20.q.out always has the wrong result in my local environment.

> Optimize RCFile reading by using column pruning results
> -------------------------------------------------------
>
>                 Key: HIVE-461
>                 URL: https://issues.apache.org/jira/browse/HIVE-461
>             Project: Hadoop Hive
>          Issue Type: Improvement
>          Components: Serializers/Deserializers
>    Affects Versions: 0.4.0
>            Reporter: Zheng Shao
>            Assignee: Yongqiang He
>         Attachments: hive-461-2009-05-26.patch, hive-461-2009-06-26.patch
>
>
> RCFile is a column-based file format introduced in HIVE-352. Column-based storage has shown a better compression ratio. On our internal data set (30 columns, most of them short integer strings), we are seeing gzip-compressed RCFile come out 20%+ smaller than gzip-compressed SequenceFile.
> RCFile also has the potential to improve reading efficiency considerably, since it compresses each column separately.
> We should integrate RCFile with the column pruning results from Hive to make the reading faster.
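
To illustrate why separately compressed columns combine well with column pruning, here is a minimal sketch in Python. It is not RCFile's actual on-disk format or Hive's reader API: the helper names write_columnar and read_columns, the newline-joined column payload, and the sample rows are all assumptions made up for this example. The point is only that a reader that knows which columns a query needs can skip decompressing the others.

```python
import gzip

def write_columnar(rows, num_columns):
    """Compress each column separately.

    Hypothetical layout for illustration only; this is NOT RCFile's real format.
    Assumes values are strings that contain no newline characters.
    """
    blocks = []
    for c in range(num_columns):
        payload = "\n".join(row[c] for row in rows).encode("utf-8")
        blocks.append(gzip.compress(payload))
    return blocks

def read_columns(blocks, needed):
    """Decompress only the columns listed in `needed`; all others are skipped."""
    result = {}
    for c in needed:
        text = gzip.decompress(blocks[c]).decode("utf-8")
        result[c] = text.split("\n")
    return result

# Made-up sample data: three rows, three columns.
rows = [("1", "alice", "10"), ("2", "bob", "20"), ("3", "carol", "30")]
blocks = write_columnar(rows, 3)

# A query that references only columns 0 and 2 never touches column 1.
pruned = read_columns(blocks, needed=[0, 2])
print(pruned)
```

With a row-oriented layout the whole record would have to be decompressed before any field could be dropped, so the saving here comes from the per-column compression that the description mentions.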
