Posted to dev@parquet.apache.org by Benjamin Anderson <b...@banjiewen.net> on 2015/12/31 02:34:47 UTC

Encoding decisions

Hi there - I'm working on a small Parquet project and encountering
some surprising results with regard to encoding decisions.

My dataset consists of ~1.5MM log lines parsed to an Avro schema and
written to a Parquet file via AvroParquetWriter. According to its log
output, Parquet is writing all int/long columns out with either
[BIT_PACKED, PLAIN] or [BIT_PACKED, PLAIN_DICTIONARY]. This surprised
me - at least one of those columns is an epoch value that should be
quite amenable to DELTA_BINARY_PACKED encoding. What's the best way to
understand Parquet's encoding choices?

Secondary question: Is DELTA_BINARY_PACKED supported for INT64
columns? The documentation[1] says it is, but the code[2] suggests
otherwise.

Cheers,
--
b

[1]: https://github.com/apache/parquet-format/blob/master/Encodings.md#delta-encoding-delta_binary_packed--5
[2]: https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/column/Encoding.java#L166-L168
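
As a rough illustration of why an increasing epoch column looks like a
good fit for delta encoding, here is a minimal sketch in plain Java. It
is not parquet-mr code and it ignores the block/miniblock layout that
DELTA_BINARY_PACKED actually uses; the class and method names
(DeltaSketch, bitsPerDelta) and the sample timestamps are made up for
the example. It only computes how many bits each delta would need once
the smallest delta is subtracted.

```java
import java.util.Arrays;

// Hypothetical illustration only; not parquet-mr code.
final class DeltaSketch {
    // Bits needed per delta after subtracting the smallest delta.
    // Overflow of the subtractions is ignored to keep the sketch short.
    static int bitsPerDelta(long[] values) {
        if (values.length < 2) {
            return 0;
        }
        long[] deltas = new long[values.length - 1];
        for (int i = 1; i < values.length; i++) {
            deltas[i - 1] = values[i] - values[i - 1];
        }
        long min = Arrays.stream(deltas).min().getAsLong();
        long maxAdjusted = Arrays.stream(deltas).map(d -> d - min).max().getAsLong();
        // Number of bits required to represent maxAdjusted (0 needs 0 bits).
        return 64 - Long.numberOfLeadingZeros(maxAdjusted);
    }

    public static void main(String[] args) {
        // Made-up epoch-millisecond timestamps, a few hundred ms apart.
        long[] epochMillis = {1451525687000L, 1451525687250L, 1451525688100L, 1451525689000L};
        System.out.println("plain: 64 bits per value");
        System.out.println("delta: " + bitsPerDelta(epochMillis) + " bits per value");
    }
}
```

For these sample values the deltas are 250, 850 and 900, so after
subtracting the minimum the largest remaining value is 650, which fits
in 10 bits instead of the 64 bits a plain INT64 value takes.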

Re: Encoding decisions

Posted by Sergio Pena <se...@cloudera.com>.

> 1. To clarify, in the PARQUET_1_0 encoding version _all_ columns in a
> schema must use the same encoding?

Each column has its own encoding; however, most columns written with
PARQUET_1_0 end up using the same encoding. When dictionary encoding is
enabled, it is used for a page only if the dictionary page (per row
group) has not grown larger than
ParquetProperties.DEFAULT_DICTIONARY_PAGE_SIZE.
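
The rule above can be sketched roughly as follows. This is a simplified
model in plain Java, not parquet-mr's writer: the class name
DictionaryFallbackSketch, the per-value size check, the 8-bytes-per-entry
cost, and the assumption that the writer falls back to PLAIN once the
limit is exceeded are all assumptions made for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified model of the size-limit rule.
// This is not parquet-mr code; names and the per-value check are assumptions.
final class DictionaryFallbackSketch {
    private final int maxDictionaryBytes;
    private final Map<Long, Integer> dictionary = new HashMap<>();
    private boolean fellBack = false;

    DictionaryFallbackSketch(int maxDictionaryBytes) {
        this.maxDictionaryBytes = maxDictionaryBytes;
    }

    // A new row group starts with an empty dictionary.
    void startRowGroup() {
        dictionary.clear();
        fellBack = false;
    }

    // Returns the name of the encoding this model would pick for the value.
    String write(long value) {
        if (!fellBack && !dictionary.containsKey(value)) {
            // Assume each distinct long costs 8 bytes in the dictionary page.
            long sizeIfAdded = (long) (dictionary.size() + 1) * Long.BYTES;
            if (sizeIfAdded > maxDictionaryBytes) {
                fellBack = true;
            } else {
                dictionary.put(value, dictionary.size());
            }
        }
        return fellBack ? "PLAIN" : "PLAIN_DICTIONARY";
    }

    public static void main(String[] args) {
        // A tiny 16-byte limit so the fallback is easy to trigger.
        DictionaryFallbackSketch column = new DictionaryFallbackSketch(16);
        for (long v : new long[] {1L, 2L, 1L, 3L, 1L}) {
            System.out.println(v + " -> " + column.write(v));
        }
    }
}
```

Under this model a column with many distinct values, such as a
timestamp, outgrows the limit quickly and ends up as PLAIN, while a
low-cardinality column stays on PLAIN_DICTIONARY.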

> 2. How can I enable the PARQUET_2_0 encoding version? Or
> alternatively, is there a maven repo with 2.x artifacts floating
> around?

You can select PARQUET_2_0 when creating the ParquetWriter. Just pass
WriterVersion.PARQUET_2_0 either as a constructor parameter:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L220

or as a builder parameter:
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetWriter.java#L482

Re: Encoding decisions

Posted by Benjamin Anderson <b...@banjiewen.net> on Wed, Jan 6, 2016 at 12:52 PM.

Hi Sergio - I'm writing my own application using the AvroParquetWriter
with Parquet@1.8.1. A gist of my application is at [1].

Two questions for you:

1. To clarify, in the PARQUET_1_0 encoding version _all_ columns in a
schema must use the same encoding?
2. How can I enable the PARQUET_2_0 encoding version? Or
alternatively, is there a maven repo with 2.x artifacts floating
around?

Cheers,
--
b

[1]: https://gist.github.com/banjiewen/c6a5d4af0854764d54d2

Re: Encoding decisions

Posted by Sergio Pena <se...@cloudera.com> on Wed, Jan 6, 2016 at 9:34 AM.

Hi Benjamin, several people were on vacation over the holidays, which is
why you got a slow response on the dev@ list. The issue you're reporting
is not a bug, but you might be using a different encoding version of
Parquet than you expect.

Currently, Parquet has two encoding versions, PARQUET_1_0 and PARQUET_2_0.
PARQUET_2_0 is an experimental feature in which different encodings are
applied per column type, such as the ones you mentioned and those
described in
https://github.com/apache/parquet-format/blob/master/Encodings.md. Only
Parquet 2.x versions have PARQUET_2_0 enabled by default. Parquet 1.x
versions have PARQUET_1_0 enabled by default, but I think PARQUET_2_0
should still be supported there.

How are you writing your data to Parquet? Did you write your own
application, or are you using Hive, Impala, or something else?

Re: Encoding decisions

Posted by Nong Li <no...@gmail.com> on Tue, Jan 5, 2016 at 4:39 PM.

Have we enabled the 2.0 encodings?
