Posted to dev@parquet.apache.org by "Fernando Pereira (JIRA)" <ji...@apache.org> on 2017/11/23 13:31:00 UTC

[jira] [Commented] (PARQUET-845) Efficient storage for several INT_8 and INT_16

    [ https://issues.apache.org/jira/browse/PARQUET-845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16264329#comment-16264329 ] 

Fernando Pereira commented on PARQUET-845:
------------------------------------------

I'm coming back to this issue so that we can hopefully close it, either as invalid or as a feature request :)
I didn't understand your comment regarding the API. Since we have logical types INT8, INT16, etc., that sounds fine to me.

My question was about efficient storage: does Parquet already choose efficient encoders +by default+?
If the user doesn't specify any encoding, does parquet-cpp use the more advanced delta encodings for, e.g., INT8 logical types? Or does it fall back to the PLAIN encoding and use 32 physical bits?
Thanks

> Efficient storage for several INT_8 and INT_16
> ----------------------------------------------
>
>                 Key: PARQUET-845
>                 URL: https://issues.apache.org/jira/browse/PARQUET-845
>             Project: Parquet
>          Issue Type: Wish
>            Reporter: Fernando Pereira
>            Priority: Minor
>
> In very large datasets, aggregating several INT8 into INT32 fields (or byte array) can make a big difference.
> In Parquet, efficient algorithms exist for INT32, so if the LogicalType is INT_8 the encoded int might take up only one byte.
> However, further optimizations could be made by allowing the user to better specify the types.
> What about BYTE_ARRAY logical type, backed by FIXED_LEN_BYTE_ARRAY type (or eventually INT_32)?
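
To make the "aggregating several INT8 into INT32" idea from the description concrete, here is a minimal sketch in plain Python. It does not use any Parquet API; the helper names pack_int8x4 and unpack_int8x4 are hypothetical and the little-endian byte order is an assumption made only for illustration. It shows the application-side packing that would happen before the values are written to an INT32 column.

```python
import struct


def pack_int8x4(values):
    """Pack four signed 8-bit integers into one signed 32-bit integer.

    Hypothetical helper for illustration; byte order is little-endian by assumption.
    """
    if len(values) != 4:
        raise ValueError("expected exactly four values")
    # The 'b' format code raises struct.error for values outside [-128, 127].
    raw = struct.pack("<4b", *values)
    return struct.unpack("<i", raw)[0]


def unpack_int8x4(packed):
    """Recover the four signed 8-bit integers from a packed 32-bit integer."""
    raw = struct.pack("<i", packed)
    return list(struct.unpack("<4b", raw))


if __name__ == "__main__":
    original = [1, -2, 127, -128]
    packed = pack_int8x4(original)
    # The round trip should give back the original four values.
    print(packed, unpack_int8x4(packed))
```

Whether this kind of manual packing beats Parquet's own encodings on a given dataset is exactly the open question in this issue and would need to be measured.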



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)