Posted to dev@parquet.apache.org by "Sergio Peña (JIRA)" <ji...@apache.org> on 2016/01/06 18:33:39 UTC
[jira] [Commented] (PARQUET-417) Questionable encoding decisions

    [ https://issues.apache.org/jira/browse/PARQUET-417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15085888#comment-15085888 ] 

Sergio Peña commented on PARQUET-417:
-------------------------------------

Hi [~banjiewen], several people were on vacation over the holidays, which is why you got a slow response on the dev@ list. The issue you're reporting is not a bug, but you might be using a different encoding version of Parquet.

Currently, Parquet has two encoding versions, PARQUET_1_0 and PARQUET_2_0. PARQUET_2_0 is an experimental feature in which different encodings are applied per column type, such as the ones you mention and those described in https://github.com/apache/parquet-format/blob/master/Encodings.md. Only Parquet 2.x versions have PARQUET_2_0 enabled by default. Parquet 1.x versions have PARQUET_1_0 enabled by default, but I think PARQUET_2_0 should be supported there as well.
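
To illustrate why a delta encoding suits a monotonic epoch column like the one in the report, here is a minimal Python sketch. It is a simplified, single-block model of the idea (deltas stored relative to the smallest delta), not the actual DELTA_BINARY_PACKED wire format and not how parquet-mr implements it. The function names and the sample timestamps are made up for the example.

```python
def delta_encode(values):
    """Simplified delta encoding over a single block (illustrative only).

    Returns the first value, the minimum delta, the deltas re-based on
    that minimum (so all are non-negative), and the bit width needed to
    store each re-based delta.
    """
    if not values:
        raise ValueError("values must not be empty")
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas) if deltas else 0
    adjusted = [d - min_delta for d in deltas]
    bit_width = max(adjusted).bit_length() if adjusted else 0
    return values[0], min_delta, adjusted, bit_width


def delta_decode(first, min_delta, adjusted):
    """Rebuild the original values from the simplified encoding."""
    out = [first]
    for a in adjusted:
        out.append(out[-1] + a + min_delta)
    return out


# Hypothetical epoch-millisecond timestamps, roughly one second apart.
timestamps = [1452105219000 + 1000 * i + (i % 3) for i in range(8)]

first, min_delta, adjusted, bit_width = delta_encode(timestamps)
roundtrip = delta_decode(first, min_delta, adjusted)
```

In this sample the re-based deltas fit in 2 bits each, whereas a plain encoding of INT64 spends 64 bits per value. That gap is why a steadily increasing column is a good candidate for a delta encoding.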

How are you writing your data to Parquet? Did you write your own application, or are you using Hive, Impala, or something else?

Btw, I will close this ticket and move the conversation to the dev@ list.

> Questionable encoding decisions
> -------------------------------
>
>                 Key: PARQUET-417
>                 URL: https://issues.apache.org/jira/browse/PARQUET-417
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Benjamin Anderson
>            Priority: Minor
>
> (Opening a ticket here because my mail to dev@ disappeared
> and there doesn't seem to be any other way to contact Parquet
> devs - feel free to redirect me somewhere else)
> I'm working on a small Parquet project and encountering
> some surprising results with regard to encoding decisions.
> My dataset consists of ~1.5MM log lines parsed to an Avro schema and
> written to a Parquet file via AvroParquetWriter. According to its log
> output, Parquet is writing all int/long columns out with either
> [BIT_PACKED, PLAIN] or [BIT_PACKED, PLAIN_DICTIONARY]. This surprised
> me - at least one of those columns is a monotonic epoch value that should be
> quite amenable to DELTA_BINARY_PACKED encoding. What's the best way to
> understand Parquet's encoding choices?
> Secondary question: Is DELTA_BINARY_PACKED supported for INT64
> columns? The documentation[1] says it is, but the code[2] suggests
> otherwise.
> Cheers.
> [1]: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
> [2]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L166-L168


