Posted to dev@parquet.apache.org by "Benjamin Anderson (JIRA)" <ji...@apache.org> on 2016/01/05 03:20:39 UTC

[jira] [Created] (PARQUET-417) Questionable encoding decisions

Benjamin Anderson created PARQUET-417:
-----------------------------------------

             Summary: Questionable encoding decisions
                 Key: PARQUET-417
                 URL: https://issues.apache.org/jira/browse/PARQUET-417
             Project: Parquet
          Issue Type: Bug
            Reporter: Benjamin Anderson
            Priority: Minor


(Opening a ticket here because my mail to dev@ disappeared
and there doesn't seem to be any other way to contact Parquet
devs - feel free to redirect me somewhere else)

I'm working on a small Parquet project and encountering
some surprising results with regard to encoding decisions.

My dataset consists of ~1.5 million log lines parsed to an Avro schema and
written to a Parquet file via AvroParquetWriter. According to its log
output, Parquet is writing all int/long columns out with either
[BIT_PACKED, PLAIN] or [BIT_PACKED, PLAIN_DICTIONARY]. This surprised
me - at least one of those columns is a monotonic epoch value that should be
quite amenable to DELTA_BINARY_PACKED encoding. What's the best way to
understand Parquet's encoding choices?
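
To illustrate why a monotonic epoch column looks like a good fit for delta
encoding, here is a rough sketch. It is not Parquet's actual
DELTA_BINARY_PACKED implementation (which works on blocks and miniblocks);
it only compares the bit width of the raw values with the bit width of the
deltas after subtracting the minimum delta. The timestamps and the helper
names (bits_needed, delta_bit_width) are made up for illustration.

```python
# Illustrative only: not Parquet's DELTA_BINARY_PACKED implementation,
# just a comparison of bit widths for raw values versus adjusted deltas.

def bits_needed(n):
    """Bits required to represent a non-negative integer n (at least 1)."""
    return max(1, n.bit_length())

def delta_bit_width(values):
    """Bit width of (delta - min_delta) across consecutive values."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    min_delta = min(deltas)
    return bits_needed(max(d - min_delta for d in deltas))

# Hypothetical epoch-millisecond timestamps, monotonically increasing.
timestamps = [
    1451960439000,
    1451960439120,
    1451960439250,
    1451960440010,
    1451960440500,
]

raw_width = max(bits_needed(v) for v in timestamps)
packed_width = delta_bit_width(timestamps)
print(raw_width, packed_width)
```

For these sample values the raw timestamps need far more bits each than the
adjusted deltas do, which is the saving I would have expected on a column
like this.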

Secondary question: Is DELTA_BINARY_PACKED supported for INT64
columns? The documentation[1] says it is, but the code[2] suggests
otherwise.

Cheers.

[1]: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
[2]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L166-L168


