Posted to issues@orc.apache.org by "Owen O'Malley (JIRA)" <ji...@apache.org> on 2016/04/08 18:33:25 UTC

[jira] [Commented] (ORC-41) Using referenced columns for improved compression

    [ https://issues.apache.org/jira/browse/ORC-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15232449#comment-15232449 ] 

Owen O'Malley commented on ORC-41:
----------------------------------

In general, cross-column compression is hard to do, because the columns need to be independent of each other so that a reader never has to read the bytes of columns it does not need. One option to consider is implementing dictionary encoding for structs, which would help a lot for denormalized data.
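
To make the struct dictionary idea concrete, here is a minimal sketch in Python. It is only an illustration of the general technique, not ORC's file format or any existing ORC API: the function names, the use of tuples as struct values, and the sample (id, name) data are all assumptions made for the example. Each distinct struct value is stored once, and every row keeps only a small integer index into that dictionary.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Struct = Tuple[Hashable, ...]  # hypothetical stand-in for a struct value


def dictionary_encode_structs(rows: Sequence[Struct]) -> Tuple[List[Struct], List[int]]:
    """Return (dictionary, indices) for a sequence of struct rows.

    The dictionary holds each distinct struct once, in first-seen order;
    indices holds, for every row, its position in the dictionary.
    """
    dictionary: List[Struct] = []
    positions: Dict[Struct, int] = {}
    indices: List[int] = []
    for row in rows:
        idx = positions.get(row)
        if idx is None:
            # First time this struct value is seen: add it to the dictionary.
            idx = len(dictionary)
            positions[row] = idx
            dictionary.append(row)
        indices.append(idx)
    return dictionary, indices


def dictionary_decode_structs(dictionary: Sequence[Struct], indices: Sequence[int]) -> List[Struct]:
    """Rebuild the original rows from the dictionary and the per-row indices."""
    return [dictionary[i] for i in indices]


# Denormalized sample data: the integer id always determines the string.
rows = [(1, "alice"), (2, "bob"), (1, "alice"), (1, "alice"), (2, "bob")]
dictionary, indices = dictionary_encode_structs(rows)
restored = dictionary_decode_structs(dictionary, indices)
```

In this sketch the repeated (id, name) pairs collapse to two dictionary entries plus one small index per row, which is where the gain for denormalized data would come from. The cost is that the fields of the struct share one dictionary, so reading a single field means reading the dictionary for the whole struct.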

> Using referenced columns for improved compression
> -------------------------------------------------
>
>                 Key: ORC-41
>                 URL: https://issues.apache.org/jira/browse/ORC-41
>             Project: Orc
>          Issue Type: Improvement
>            Reporter: Charles Pritchard
>
> Many data sets I work with have one column which essentially references another, with one column being a bigint and one column being a string. It is always the case that the value of the integer field determines the value of the string field.
> I also work with data sets where one bigint field is always going to determine the value of another bigint field, likely in a tree.
> There is an opportunity to achieve better compression by identifying these use cases and adding in additional logic for such cross-column/dictionary lookups.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)