Posted to dev@parquet.apache.org by Stefán Baxter <st...@activitystream.com> on 2016/02/04 22:35:45 UTC

MessageType :: Type :: Encoding options

Hi,

I'm using parquet-mr/parquet-avro to write parquet files.

I want to control/override the encoding type for a column and I find no
documentation or examples regarding that.

My schema (MessageType) is converted with AvroSchemaConverter and I wonder
how I can either set or hint columns to use a particular encoding option.
Is that possible?

Regards,
 -Stefán

Re: MessageType :: Type :: Encoding options

Posted by Ryan Blue <bl...@cloudera.com>.
Yes, I will make sure int64 delta makes it into master. I think Drill 
would need to update its Parquet version to take advantage.

rb

On 02/04/2016 11:07 PM, Stefán Baxter wrote:
> Hi Ryan,
>
> Can you tell me when the int64 delta encoding will be available as a part
> of your release and if Drill will need an updated Parquet version to read
> it?
>
> Regards,
>   -Stefan
>
> On Thu, Feb 4, 2016 at 11:25 PM, Stefán Baxter <st...@activitystream.com>
> wrote:
>
>>
>> great, and yes, I'm using the settings you provided me with :)
>>
>>   .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>
>>
>>
>> On Thu, Feb 4, 2016 at 11:24 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>
>>> Delta int64 encoding isn't released yet. We have a PR that I'm on the
>>> hook for getting in. :)
>>>
>>> Also, it's one of the 2.0 format encodings, so you'll need that option
>>> turned on.
>>>
>>> rb
>>>
>>>
>>> On 02/04/2016 03:21 PM, Stefán Baxter wrote:
>>>
>>>> thnx.
>>>>
>>>> This is a time-stamp field from a smaller sample using the new
>>>> settings:
>>>> Feb 4, 2016 11:06:43 PM INFO:
>>>> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
>>>> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
>>>> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
>>>> 7,058B comp}
>>>>
>>>> Any reason that comes to mind why this is not integer delta? (The time
>>>> between these entries is often a few seconds.)
>>>>
>>>> -Stefan
>>>>
>>>>
>>>> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>
>>>>> You should be getting the underlying data back instead of Timestamp
>>>>> objects. You can pull in Avro 1.8.0 and use the conversions yourself
>>>>> rather than waiting for them to be included in the library.
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>>>>
>>>>>> I'm not looking to turn it off, absolutely not, I'm looking to use it
>>>>>> in the most effective way :)
>>>>>>
>>>>>> Is there something I can do right now to force these fields to be
>>>>>> timestamp
>>>>>> fields in Parquet?
>>>>>>
>>>>>> Regards,
>>>>>>     -Stefan
>>>>>>
>>>>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>>>
>>>>>>> Got it. You can also turn off dictionary encoding with an option on
>>>>>>> the builder.
>>>>>>>
>>>>>>> For timestamp, the support was just released in Avro 1.8.0 and
>>>>>>> there's a
>>>>>>> pending pull request for adding the same logical types API to
>>>>>>> parquet-avro:
>>>>>>> https://github.com/apache/parquet-mr/pull/318
>>>>>>>
>>>>>>> Once that's merged, you'll just have to add conversions to your data
>>>>>>> model
>>>>>>> like this:
>>>>>>>
>>>>>>>      GenericData model = new GenericData();
>>>>>>>      model.addLogicalTypeConversion(
>>>>>>>          new TimeConversions.TimestampConversion());
>>>>>>>
>>>>>>> Then pass that model into the builder.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>>
>>>>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>>>>
>>>>>>> Hi Ryan,
>>>>>>>
>>>>>>>>
>>>>>>>> Thank you for taking the time.
>>>>>>>>
>>>>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>>>>>>> on the optional dictionary encoding it's used for almost everything.
>>>>>>>> I even have some time-stamp fields that are turned into a dictionary
>>>>>>>> (I would have guessed delta integer).
>>>>>>>>
>>>>>>>> I have ~5M entries in my test file; the dictionary-based file ends up
>>>>>>>> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>>>>>>>>
>>>>>>>> So I started wondering if I could affect these decisions to compare
>>>>>>>> size, speed, etc., but I understand the rationale behind automatic
>>>>>>>> selection; it just seemed somewhat naive in that Drill scenario.
>>>>>>>>
>>>>>>>> Another matter... can you point me to an example that shows me how to
>>>>>>>> deal with Avro having no timestamp fields and conversion to Parquet?
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>      -Stefán
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hi Stefán,
>>>>>>>>
>>>>>>>>
>>>>>>>>> The Schema converter will map Avro types to their Parquet
>>>>>>>>> equivalents, for which there aren't really choices or options. The
>>>>>>>>> mapping is straightforward, like long to int64.
>>>>>>>>>
>>>>>>>>> For the individual column encodings, Parquet chooses those
>>>>>>>>> automatically
>>>>>>>>> based on the column type and data. For example, dictionary encoding
>>>>>>>>> is
>>>>>>>>> used
>>>>>>>>> if it gets better results than plain encoding and integer columns
>>>>>>>>> always
>>>>>>>>> use the bit packing and run-length encoding hybrid. There aren't
>>>>>>>>> many
>>>>>>>>> choices you would make on a per-column basis here, either.
>>>>>>>>>
>>>>>>>>> There are two options you can control that affect encodings: the
>>>>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>>>>>> some
>>>>>>>>> older readers or by Apache Impala. They get great compression on
>>>>>>>>> certain
>>>>>>>>> types though. You can also control the maximum dictionary size,
>>>>>>>>> which
>>>>>>>>> could
>>>>>>>>> help if you have columns that should be dictionary-encoded but are
>>>>>>>>> falling
>>>>>>>>> back to plain encoding because the dictionary gets too big.
>>>>>>>>>
>>>>>>>>> Both of those options are exposed by the builder when you create a
>>>>>>>>> writer:
>>>>>>>>>
>>>>>>>>>       AvroParquetWriter.builder(outputPath)
>>>>>>>>>             .withSchema(schema)
>>>>>>>>>             .withDataModel(ReflectData.get())
>>>>>>>>>             .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>>>>             .withDictionaryPageSize(2*1024*1024)
>>>>>>>>>             .build();
>>>>>>>>>
>>>>>>>>> The default dictionary page size is 1MB.
>>>>>>>>>
>>>>>>>>> rb
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>>>>>
>>>>>>>>>> I want to control/override the encoding type for a column and I
>>>>>>>>>> find
>>>>>>>>>> no
>>>>>>>>>> documentation or examples regarding that.
>>>>>>>>>>
>>>>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>>>>>>> wonder
>>>>>>>>>> how I can either set or hint columns to use a particular encoding
>>>>>>>>>> option.
>>>>>>>>>> Is that possible?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>       -Stefán
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>> Ryan Blue
>>>>>>>>> Software Engineer
>>>>>>>>> Cloudera, Inc.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Cloudera, Inc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>>
>>
>>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: MessageType :: Type :: Encoding options

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Ryan,

Can you tell me when the int64 delta encoding will be available as a part
of your release and if Drill will need an updated Parquet version to read
it?

Regards,
 -Stefan

On Thu, Feb 4, 2016 at 11:25 PM, Stefán Baxter <st...@activitystream.com>
wrote:

>
> great, and yes, I'm using the settings you provided me with :)
>
>  .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>
>
>
> On Thu, Feb 4, 2016 at 11:24 PM, Ryan Blue <bl...@cloudera.com> wrote:
>
>> Delta int64 encoding isn't released yet. We have a PR that I'm on the
>> hook for getting in. :)
>>
>> Also, it's one of the 2.0 format encodings, so you'll need that option
>> turned on.
>>
>> rb
>>
>>
>> On 02/04/2016 03:21 PM, Stefán Baxter wrote:
>>
>>> thnx.
>>>
>>> This is a time-stamp field from a smaller sample using the new
>>> settings:
>>> Feb 4, 2016 11:06:43 PM INFO:
>>> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
>>> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
>>> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
>>> 7,058B comp}
>>>
>>> Any reason that comes to mind why this is not integer delta? (The time
>>> between these entries is often a few seconds.)
>>>
>>> -Stefan
>>>
>>>
>>> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>
>>>> You should be getting the underlying data back instead of Timestamp
>>>> objects. You can pull in Avro 1.8.0 and use the conversions yourself
>>>> rather than waiting for them to be included in the library.
>>>>
>>>> rb
>>>>
>>>>
>>>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>>>
>>>>> I'm not looking to turn it off, absolutely not, I'm looking to use it
>>>>> in the most effective way :)
>>>>>
>>>>> Is there something I can do right now to force these fields to be
>>>>> timestamp
>>>>> fields in Parquet?
>>>>>
>>>>> Regards,
>>>>>    -Stefan
>>>>>
>>>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>>
>>>>>> Got it. You can also turn off dictionary encoding with an option on
>>>>>> the builder.
>>>>>>
>>>>>> For timestamp, the support was just released in Avro 1.8.0 and
>>>>>> there's a
>>>>>> pending pull request for adding the same logical types API to
>>>>>> parquet-avro:
>>>>>> https://github.com/apache/parquet-mr/pull/318
>>>>>>
>>>>>> Once that's merged, you'll just have to add conversions to your data
>>>>>> model
>>>>>> like this:
>>>>>>
>>>>>>     GenericData model = new GenericData();
>>>>>>     model.addLogicalTypeConversion(
>>>>>>         new TimeConversions.TimestampConversion());
>>>>>>
>>>>>> Then pass that model into the builder.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>>
>>>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>>>
>>>>>> Hi Ryan,
>>>>>>
>>>>>>>
>>>>>>> Thank you for taking the time.
>>>>>>>
>>>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>>>>>> on the optional dictionary encoding it's used for almost everything.
>>>>>>> I even have some time-stamp fields that are turned into a dictionary
>>>>>>> (I would have guessed delta integer).
>>>>>>>
>>>>>>> I have ~5M entries in my test file; the dictionary-based file ends up
>>>>>>> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>>>>>>>
>>>>>>> So I started wondering if I could affect these decisions to compare
>>>>>>> size, speed, etc., but I understand the rationale behind automatic
>>>>>>> selection; it just seemed somewhat naive in that Drill scenario.
>>>>>>>
>>>>>>> Another matter... can you point me to an example that shows me how to
>>>>>>> deal with Avro having no timestamp fields and conversion to Parquet?
>>>>>>>
>>>>>>> Best regards,
>>>>>>>     -Stefán
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi Stefán,
>>>>>>>
>>>>>>>
>>>>>>>> The Schema converter will map Avro types to their Parquet
>>>>>>>> equivalents, for which there aren't really choices or options. The
>>>>>>>> mapping is straightforward, like long to int64.
>>>>>>>>
>>>>>>>> For the individual column encodings, Parquet chooses those
>>>>>>>> automatically
>>>>>>>> based on the column type and data. For example, dictionary encoding
>>>>>>>> is
>>>>>>>> used
>>>>>>>> if it gets better results than plain encoding and integer columns
>>>>>>>> always
>>>>>>>> use the bit packing and run-length encoding hybrid. There aren't
>>>>>>>> many
>>>>>>>> choices you would make on a per-column basis here, either.
>>>>>>>>
>>>>>>>> There are two options you can control that affect encodings: the
>>>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>>>>> some
>>>>>>>> older readers or by Apache Impala. They get great compression on
>>>>>>>> certain
>>>>>>>> types though. You can also control the maximum dictionary size,
>>>>>>>> which
>>>>>>>> could
>>>>>>>> help if you have columns that should be dictionary-encoded but are
>>>>>>>> falling
>>>>>>>> back to plain encoding because the dictionary gets too big.
>>>>>>>>
>>>>>>>> Both of those options are exposed by the builder when you create a
>>>>>>>> writer:
>>>>>>>>
>>>>>>>>      AvroParquetWriter.builder(outputPath)
>>>>>>>>            .withSchema(schema)
>>>>>>>>            .withDataModel(ReflectData.get())
>>>>>>>>            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>>>            .withDictionaryPageSize(2*1024*1024)
>>>>>>>>            .build();
>>>>>>>>
>>>>>>>> The default dictionary page size is 1MB.
>>>>>>>>
>>>>>>>> rb
>>>>>>>>
>>>>>>>>
>>>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>>
>>>>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>>>>
>>>>>>>>> I want to control/override the encoding type for a column and I
>>>>>>>>> find
>>>>>>>>> no
>>>>>>>>> documentation or examples regarding that.
>>>>>>>>>
>>>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>>>>>> wonder
>>>>>>>>> how I can either set or hint columns to use a particular encoding
>>>>>>>>> option.
>>>>>>>>> Is that possible?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>      -Stefán
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>> Ryan Blue
>>>>>>>> Software Engineer
>>>>>>>> Cloudera, Inc.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Cloudera, Inc.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>>>
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>
>

Re: MessageType :: Type :: Encoding options

Posted by Stefán Baxter <st...@activitystream.com>.
great, and yes, I'm using the settings you provided me with :)

 .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)



On Thu, Feb 4, 2016 at 11:24 PM, Ryan Blue <bl...@cloudera.com> wrote:

> Delta int64 encoding isn't released yet. We have a PR that I'm on the hook
> for getting in. :)
>
> Also, it's one of the 2.0 format encodings, so you'll need that option
> turned on.
>
> rb
>
>
> On 02/04/2016 03:21 PM, Stefán Baxter wrote:
>
>> thnx.
>>
>> This is a time-stamp field from a smaller sample using the new settings:
>> Feb 4, 2016 11:06:43 PM INFO:
>> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
>> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
>> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
>> 7,058B comp}
>>
>> Any reason that comes to mind why this is not integer delta? (The time
>> between these entries is often a few seconds.)
>>
>> -Stefan
>>
>>
>> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>
>>> You should be getting the underlying data back instead of Timestamp
>>> objects. You can pull in Avro 1.8.0 and use the conversions yourself
>>> rather than waiting for them to be included in the library.
>>>
>>> rb
>>>
>>>
>>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>>
>>>> I'm not looking to turn it off, absolutely not, I'm looking to use it
>>>> in the most effective way :)
>>>>
>>>> Is there something I can do right now to force these fields to be
>>>> timestamp
>>>> fields in Parquet?
>>>>
>>>> Regards,
>>>>    -Stefan
>>>>
>>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>
>>>>> Got it. You can also turn off dictionary encoding with an option on
>>>>> the builder.
>>>>>
>>>>> For timestamp, the support was just released in Avro 1.8.0 and there's
>>>>> a
>>>>> pending pull request for adding the same logical types API to
>>>>> parquet-avro:
>>>>> https://github.com/apache/parquet-mr/pull/318
>>>>>
>>>>> Once that's merged, you'll just have to add conversions to your data
>>>>> model
>>>>> like this:
>>>>>
>>>>>     GenericData model = new GenericData();
>>>>>     model.addLogicalTypeConversion(
>>>>>         new TimeConversions.TimestampConversion());
>>>>>
>>>>> Then pass that model into the builder.
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>>
>>>>> Hi Ryan,
>>>>>
>>>>>>
>>>>>> Thank you for taking the time.
>>>>>>
>>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>>>>> on the optional dictionary encoding it's used for almost everything.
>>>>>> I even have some time-stamp fields that are turned into a dictionary
>>>>>> (I would have guessed delta integer).
>>>>>>
>>>>>> I have ~5M entries in my test file; the dictionary-based file ends up
>>>>>> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>>>>>>
>>>>>> So I started wondering if I could affect these decisions to compare
>>>>>> size, speed, etc., but I understand the rationale behind automatic
>>>>>> selection; it just seemed somewhat naive in that Drill scenario.
>>>>>>
>>>>>> Another matter... can you point me to an example that shows me how to
>>>>>> deal with Avro having no timestamp fields and conversion to Parquet?
>>>>>>
>>>>>> Best regards,
>>>>>>     -Stefán
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>>>
>>>>>> Hi Stefán,
>>>>>>
>>>>>>
>>>>>>> The Schema converter will map Avro types to their Parquet
>>>>>>> equivalents, for which there aren't really choices or options. The
>>>>>>> mapping is straightforward, like long to int64.
>>>>>>>
>>>>>>> For the individual column encodings, Parquet chooses those
>>>>>>> automatically
>>>>>>> based on the column type and data. For example, dictionary encoding
>>>>>>> is
>>>>>>> used
>>>>>>> if it gets better results than plain encoding and integer columns
>>>>>>> always
>>>>>>> use the bit packing and run-length encoding hybrid. There aren't many
>>>>>>> choices you would make on a per-column basis here, either.
>>>>>>>
>>>>>>> There are two options you can control that affect encodings: the
>>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>>>> some
>>>>>>> older readers or by Apache Impala. They get great compression on
>>>>>>> certain
>>>>>>> types though. You can also control the maximum dictionary size, which
>>>>>>> could
>>>>>>> help if you have columns that should be dictionary-encoded but are
>>>>>>> falling
>>>>>>> back to plain encoding because the dictionary gets too big.
>>>>>>>
>>>>>>> Both of those options are exposed by the builder when you create a
>>>>>>> writer:
>>>>>>>
>>>>>>>      AvroParquetWriter.builder(outputPath)
>>>>>>>            .withSchema(schema)
>>>>>>>            .withDataModel(ReflectData.get())
>>>>>>>            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>>            .withDictionaryPageSize(2*1024*1024)
>>>>>>>            .build();
>>>>>>>
>>>>>>> The default dictionary page size is 1MB.
>>>>>>>
>>>>>>> rb
>>>>>>>
>>>>>>>
>>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>>
>>>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>>>
>>>>>>>> I want to control/override the encoding type for a column and I find
>>>>>>>> no
>>>>>>>> documentation or examples regarding that.
>>>>>>>>
>>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>>>>> wonder
>>>>>>>> how I can either set or hint columns to use a particular encoding
>>>>>>>> option.
>>>>>>>> Is that possible?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>      -Stefán
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>> Ryan Blue
>>>>>>> Software Engineer
>>>>>>> Cloudera, Inc.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>>>
>>>>>
>>>>>
>>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>>
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: MessageType :: Type :: Encoding options

Posted by Ryan Blue <bl...@cloudera.com>.
Delta int64 encoding isn't released yet. We have a PR that I'm on the 
hook for getting in. :)

Also, it's one of the 2.0 format encodings, so you'll need that option 
turned on.
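
For reference, a minimal end-to-end sketch of turning that option on (a
sketch only; the path and the tiny schema below are made-up placeholders):

   import org.apache.avro.Schema;
   import org.apache.avro.generic.GenericRecord;
   import org.apache.hadoop.fs.Path;
   import org.apache.parquet.avro.AvroParquetWriter;
   import org.apache.parquet.column.ParquetProperties;
   import org.apache.parquet.hadoop.ParquetWriter;

   // placeholder schema with a single int64 timestamp-like field
   Schema schema = new Schema.Parser().parse(
       "{\"type\":\"record\",\"name\":\"Event\",\"fields\":"
       + "[{\"name\":\"occurred_at\",\"type\":\"long\"}]}");

   ParquetWriter<GenericRecord> writer = AvroParquetWriter
       .<GenericRecord>builder(new Path("/tmp/events.parquet"))
       .withSchema(schema)
       // opt in to the 2.0 format encodings
       .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
       .build();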

rb

On 02/04/2016 03:21 PM, Stefán Baxter wrote:
> thnx.
>
> This is a time-stamp field from a smaller sample using the new settings:
> Feb 4, 2016 11:06:43 PM INFO:
> org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
> [occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
> encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
> 7,058B comp}
>
> Any reason that comes to mind why this is not integer delta? (The time
> between these entries is often a few seconds.)
>
> -Stefan
>
>
> On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>
>> You should be getting the underlying data back instead of Timestamp
>> objects. You can pull in Avro 1.8.0 and use the conversions yourself rather
>> than waiting for them to be included in the library.
>>
>> rb
>>
>>
>> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>>
>>> I'm not looking to turn it off, absolutely not, I'm looking to use it in
>>> the most effective way :)
>>>
>>> Is there something I can do right now to force these fields to be
>>> timestamp
>>> fields in Parquet?
>>>
>>> Regards,
>>>    -Stefan
>>>
>>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>
>>>> Got it. You can also turn off dictionary encoding with an option on
>>>> the builder.
>>>>
>>>> For timestamp, the support was just released in Avro 1.8.0 and there's a
>>>> pending pull request for adding the same logical types API to
>>>> parquet-avro:
>>>> https://github.com/apache/parquet-mr/pull/318
>>>>
>>>> Once that's merged, you'll just have to add conversions to your data
>>>> model
>>>> like this:
>>>>
>>>>     GenericData model = new GenericData();
>>>>     model.addLogicalTypeConversion(
>>>>         new TimeConversions.TimestampConversion());
>>>>
>>>> Then pass that model into the builder.
>>>>
>>>> rb
>>>>
>>>>
>>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>>
>>>> Hi Ryan,
>>>>>
>>>>> Thank you for taking the time.
>>>>>
>>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>>>> on the optional dictionary encoding it's used for almost everything.
>>>>> I even have some time-stamp fields that are turned into a dictionary
>>>>> (I would have guessed delta integer).
>>>>>
>>>>> I have ~5M entries in my test file; the dictionary-based file ends up
>>>>> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>>>>>
>>>>> So I started wondering if I could affect these decisions to compare
>>>>> size, speed, etc., but I understand the rationale behind automatic
>>>>> selection; it just seemed somewhat naive in that Drill scenario.
>>>>>
>>>>> Another matter... can you point me to an example that shows me how to
>>>>> deal with Avro having no timestamp fields and conversion to Parquet?
>>>>>
>>>>> Best regards,
>>>>>     -Stefán
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>>
>>>>> Hi Stefán,
>>>>>
>>>>>>
>>>>>> The Schema converter will map Avro types to their Parquet equivalents,
>>>>>> for which there aren't really choices or options. The mapping is
>>>>>> straightforward, like long to int64.
>>>>>>
>>>>>> For the individual column encodings, Parquet chooses those
>>>>>> automatically
>>>>>> based on the column type and data. For example, dictionary encoding is
>>>>>> used
>>>>>> if it gets better results than plain encoding and integer columns
>>>>>> always
>>>>>> use the bit packing and run-length encoding hybrid. There aren't many
>>>>>> choices you would make on a per-column basis here, either.
>>>>>>
>>>>>> There are two options you can control that affect encodings: the
>>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>>> some
>>>>>> older readers or by Apache Impala. They get great compression on
>>>>>> certain
>>>>>> types though. You can also control the maximum dictionary size, which
>>>>>> could
>>>>>> help if you have columns that should be dictionary-encoded but are
>>>>>> falling
>>>>>> back to plain encoding because the dictionary gets too big.
>>>>>>
>>>>>> Both of those options are exposed by the builder when you create a
>>>>>> writer:
>>>>>>
>>>>>>      AvroParquetWriter.builder(outputPath)
>>>>>>            .withSchema(schema)
>>>>>>            .withDataModel(ReflectData.get())
>>>>>>            .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>>            .withDictionaryPageSize(2*1024*1024)
>>>>>>            .build();
>>>>>>
>>>>>> The default dictionary page size is 1MB.
>>>>>>
>>>>>> rb
>>>>>>
>>>>>>
>>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>>>
>>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>>
>>>>>>> I want to control/override the encoding type for a column and I find
>>>>>>> no
>>>>>>> documentation or examples regarding that.
>>>>>>>
>>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>>>> wonder
>>>>>>> how I can either set or hint columns to use a particular encoding
>>>>>>> option.
>>>>>>> Is that possible?
>>>>>>>
>>>>>>> Regards,
>>>>>>>      -Stefán
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>> Ryan Blue
>>>>>> Software Engineer
>>>>>> Cloudera, Inc.
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>>>
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: MessageType :: Type :: Encoding options

Posted by Stefán Baxter <st...@activitystream.com>.
thnx.

This is a time-stamp field from a smaller sample using the new settings:
Feb 4, 2016 11:06:43 PM INFO:
org.apache.parquet.hadoop.ColumnChunkPageWriteStore: written 38,836B for
[occurred_at] INT64: 24,000 values, 38,783B raw, 38,783B comp, 1 pages,
encodings: [RLE_DICTIONARY, PLAIN], dic { 7,058 entries, 56,464B raw,
7,058B comp}

Any reason that comes to mind why this is not integer delta? (The time
between these entries is often a few seconds.)

-Stefan


On Thu, Feb 4, 2016 at 11:17 PM, Ryan Blue <bl...@cloudera.com> wrote:

> You should be getting the underlying data back instead of Timestamp
> objects. You can pull in Avro 1.8.0 and use the conversions yourself rather
> than waiting for them to be included in the library.
>
> rb
>
>
> On 02/04/2016 03:14 PM, Stefán Baxter wrote:
>
>> I'm not looking to turn it off, absolutely not, I'm looking to use it in
>> the most effective way :)
>>
>> Is there something I can do right now to force these fields to be
>> timestamp
>> fields in Parquet?
>>
>> Regards,
>>   -Stefan
>>
>> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>
>>> Got it. You can also turn off dictionary encoding with an option on
>>> the builder.
>>>
>>> For timestamp, the support was just released in Avro 1.8.0 and there's a
>>> pending pull request for adding the same logical types API to
>>> parquet-avro:
>>> https://github.com/apache/parquet-mr/pull/318
>>>
>>> Once that's merged, you'll just have to add conversions to your data
>>> model
>>> like this:
>>>
>>>    GenericData model = new GenericData();
>>>    model.addLogicalTypeConversion(
>>>        new TimeConversions.TimestampConversion());
>>>
>>> Then pass that model into the builder.
>>>
>>> rb
>>>
>>>
>>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>>
>>> Hi Ryan,
>>>>
>>>> Thank you for taking the time.
>>>>
>>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>>> on the optional dictionary encoding it's used for almost everything.
>>>> I even have some time-stamp fields that are turned into a dictionary
>>>> (I would have guessed delta integer).
>>>>
>>>> I have ~5M entries in my test file; the dictionary-based file ends up
>>>> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>>>>
>>>> So I started wondering if I could affect these decisions to compare
>>>> size, speed, etc., but I understand the rationale behind automatic
>>>> selection; it just seemed somewhat naive in that Drill scenario.
>>>>
>>>> Another matter... can you point me to an example that shows me how to
>>>> deal with Avro having no timestamp fields and conversion to Parquet?
>>>>
>>>> Best regards,
>>>>    -Stefán
>>>>
>>>>
>>>>
>>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>>
>>>> Hi Stefán,
>>>>
>>>>>
>>>>> The Schema converter will map Avro types to their Parquet equivalents,
>>>>> for which there aren't really choices or options. The mapping is
>>>>> straightforward, like long to int64.
>>>>>
>>>>> For the individual column encodings, Parquet chooses those
>>>>> automatically
>>>>> based on the column type and data. For example, dictionary encoding is
>>>>> used
>>>>> if it gets better results than plain encoding and integer columns
>>>>> always
>>>>> use the bit packing and run-length encoding hybrid. There aren't many
>>>>> choices you would make on a per-column basis here, either.
>>>>>
>>>>> There are two options you can control that affect encodings: the
>>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>>> encodings are delta binary and delta integer, which can't be read by
>>>>> some
>>>>> older readers or by Apache Impala. They get great compression on
>>>>> certain
>>>>> types though. You can also control the maximum dictionary size, which
>>>>> could
>>>>> help if you have columns that should be dictionary-encoded but are
>>>>> falling
>>>>> back to plain encoding because the dictionary gets too big.
>>>>>
>>>>> Both of those options are exposed by the builder when you create a
>>>>> writer:
>>>>>
>>>>>     AvroParquetWriter.builder(outputPath)
>>>>>           .withSchema(schema)
>>>>>           .withDataModel(ReflectData.get())
>>>>>           .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>>           .withDictionaryPageSize(2*1024*1024)
>>>>>           .build();
>>>>>
>>>>> The default dictionary page size is 1MB.
>>>>>
>>>>> rb
>>>>>
>>>>>
>>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>>>
>>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>>
>>>>>> I want to control/override the encoding type for a column and I find
>>>>>> no
>>>>>> documentation or examples regarding that.
>>>>>>
>>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>>> wonder
>>>>>> how I can either set or hint columns to use a particular encoding
>>>>>> option.
>>>>>> Is that possible?
>>>>>>
>>>>>> Regards,
>>>>>>     -Stefán
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>> Ryan Blue
>>>>> Software Engineer
>>>>> Cloudera, Inc.
>>>>>
>>>>>
>>>>>
>>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>>
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: MessageType :: Type :: Encoding options

Posted by Ryan Blue <bl...@cloudera.com>.
You should be getting the underlying data back instead of Timestamp 
objects. You can pull in Avro 1.8.0 and use the conversions yourself 
rather than waiting for them to be included in the library.
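
A rough sketch of doing that conversion by hand (assumes Avro 1.8.0 with
Joda-Time on the classpath, and a GenericRecord named record that was read
back from the file; the field name is a made-up placeholder):

   import org.apache.avro.LogicalTypes;
   import org.apache.avro.Schema;
   import org.apache.avro.data.TimeConversions;
   import org.joda.time.DateTime;

   // parquet-avro hands back the raw epoch-millis long; convert it yourself
   TimeConversions.TimestampConversion conversion =
       new TimeConversions.TimestampConversion();
   Schema tsSchema = LogicalTypes.timestampMillis()
       .addToSchema(Schema.create(Schema.Type.LONG));
   long raw = (Long) record.get("occurred_at");  // placeholder field name
   DateTime timestamp =
       conversion.fromLong(raw, tsSchema, LogicalTypes.timestampMillis());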

rb

On 02/04/2016 03:14 PM, Stefán Baxter wrote:
> I'm not looking to turn it off, absolutely not, I'm looking to use it in
> the most effective way :)
>
> Is there something I can do right now to force these fields to be timestamp
> fields in Parquet?
>
> Regards,
>   -Stefan
>
> On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <bl...@cloudera.com> wrote:
>
>> Got it. You can also turn off dictionary encoding with an option on the
>> builder.
>>
>> For timestamp, the support was just released in Avro 1.8.0 and there's a
>> pending pull request for adding the same logical types API to parquet-avro:
>> https://github.com/apache/parquet-mr/pull/318
>>
>> Once that's merged, you'll just have to add conversions to your data model
>> like this:
>>
>>    GenericData model = new GenericData();
>>    model.addLogicalTypeConversion(
>>        new TimeConversions.TimestampConversion());
>>
>> Then pass that model into the builder.
>>
>> rb
>>
>>
>> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>>
>>> Hi Ryan,
>>>
>>> Thank you for taking the time.
>>>
>>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>>> on the optional dictionary encoding it's used for almost everything.
>>> I even have some time-stamp fields that are turned into a dictionary
>>> (I would have guessed delta integer).
>>>
>>> I have ~5M entries in my test file; the dictionary-based file ends up
>>> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>>>
>>> So I started wondering if I could affect these decisions to compare
>>> size, speed, etc., but I understand the rationale behind automatic
>>> selection; it just seemed somewhat naive in that Drill scenario.
>>>
>>> Another matter... can you point me to an example that shows me how to
>>> deal with Avro having no timestamp fields and conversion to Parquet?
>>>
>>> Best regards,
>>>    -Stefán
>>>
>>>
>>>
>>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>>
>>> Hi Stefán,
>>>>
>>>> The Schema converter will map Avro types to their Parquet equivalents,
>>>> for which there aren't really choices or options. The mapping is
>>>> straightforward, like long to int64.
>>>>
>>>> For the individual column encodings, Parquet chooses those automatically
>>>> based on the column type and data. For example, dictionary encoding is
>>>> used
>>>> if it gets better results than plain encoding and integer columns always
>>>> use the bit packing and run-length encoding hybrid. There aren't many
>>>> choices you would make on a per-column basis here, either.
>>>>
>>>> There are two options you can control that affect encodings: the
>>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>>> encodings are delta binary and delta integer, which can't be read by some
>>>> older readers or by Apache Impala. They get great compression on certain
>>>> types though. You can also control the maximum dictionary size, which
>>>> could
>>>> help if you have columns that should be dictionary-encoded but are
>>>> falling
>>>> back to plain encoding because the dictionary gets too big.
>>>>
>>>> Both of those options are exposed by the builder when you create a
>>>> writer:
>>>>
>>>>     AvroParquetWriter.builder(outputPath)
>>>>           .withSchema(schema)
>>>>           .withDataModel(ReflectData.get())
>>>>           .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>>           .withDictionaryPageSize(2*1024*1024)
>>>>           .build();
>>>>
>>>> The default dictionary page size is 1MB.
>>>>
>>>> rb
>>>>
>>>>
>>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>>
>>>> Hi,
>>>>>
>>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>>
>>>>> I want to control/override the encoding type for a column and I find no
>>>>> documentation or examples regarding that.
>>>>>
>>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>>> wonder
>>>>> how I can either set or hint columns to use a particular encoding
>>>>> option.
>>>>> Is that possible?
>>>>>
>>>>> Regards,
>>>>>     -Stefán
>>>>>
>>>>>
>>>>>
>>>> --
>>>> Ryan Blue
>>>> Software Engineer
>>>> Cloudera, Inc.
>>>>
>>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: MessageType :: Type :: Encoding options

Posted by Stefán Baxter <st...@activitystream.com>.
I'm not looking to turn it off, absolutely not, I'm looking to use it in
the most effective way :)

Is there something I can do right now to force these fields to be timestamp
fields in Parquet?

Regards,
 -Stefan

On Thu, Feb 4, 2016 at 11:03 PM, Ryan Blue <bl...@cloudera.com> wrote:

> Got it. You can also turn off dictionary encoding with an option on the
> builder.
>
> For timestamp, the support was just released in Avro 1.8.0 and there's a
> pending pull request for adding the same logical types API to parquet-avro:
> https://github.com/apache/parquet-mr/pull/318
>
> Once that's merged, you'll just have to add conversions to your data model
> like this:
>
>   GenericData model = new GenericData();
>   model.addLogicalTypeConversion(
>       new TimeConversions.TimestampConversion());
>
> Then pass that model into the builder.
>
> rb
>
>
> On 02/04/2016 02:54 PM, Stefán Baxter wrote:
>
>> Hi Ryan,
>>
>> Thank you for taking the time.
>>
>> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
>> on the optional dictionary encoding it's used for almost everything.
>> I even have some time-stamp fields that are turned into a dictionary
>> (I would have guessed delta integer).
>>
>> I have ~5M entries in my test file; the dictionary-based file ends up
>> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>>
>> So I started wondering if I could affect these decisions to compare
>> size, speed, etc., but I understand the rationale behind automatic
>> selection; it just seemed somewhat naive in that Drill scenario.
>>
>> Another matter... can you point me to an example that shows me how to
>> deal with Avro having no timestamp fields and conversion to Parquet?
>>
>> Best regards,
>>   -Stefán
>>
>>
>>
>> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>>
>> Hi Stefán,
>>>
>>> The Schema converter will map Avro types to their Parquet equivalents,
>>> for which there aren't really choices or options. The mapping is
>>> straightforward, like long to int64.
>>>
>>> For the individual column encodings, Parquet chooses those automatically
>>> based on the column type and data. For example, dictionary encoding is
>>> used
>>> if it gets better results than plain encoding and integer columns always
>>> use the bit packing and run-length encoding hybrid. There aren't many
>>> choices you would make on a per-column basis here, either.
>>>
>>> There are two options you can control that affect encodings: the
>>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>>> encodings are delta binary and delta integer, which can't be read by some
>>> older readers or by Apache Impala. They get great compression on certain
>>> types though. You can also control the maximum dictionary size, which
>>> could
>>> help if you have columns that should be dictionary-encoded but are
>>> falling
>>> back to plain encoding because the dictionary gets too big.
>>>
>>> Both of those options are exposed by the builder when you create a
>>> writer:
>>>
>>>    AvroParquetWriter.builder(outputPath)
>>>          .withSchema(schema)
>>>          .withDataModel(ReflectData.get())
>>>          .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>>          .withDictionaryPageSize(2*1024*1024)
>>>          .build();
>>>
>>> The default dictionary page size is 1MB.
>>>
>>> rb
>>>
>>>
>>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>>
>>> Hi,
>>>>
>>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>>
>>>> I want to control/override the encoding type for a column and I find no
>>>> documentation or examples regarding that.
>>>>
>>>> My schema (MessageType) is converted with AvroSchemaConverter and I
>>>> wonder
>>>> how I can either set or hint columns to use a particular encoding
>>>> option.
>>>> Is that possible?
>>>>
>>>> Regards,
>>>>    -Stefán
>>>>
>>>>
>>>>
>>> --
>>> Ryan Blue
>>> Software Engineer
>>> Cloudera, Inc.
>>>
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: MessageType :: Type :: Encoding options

Posted by Ryan Blue <bl...@cloudera.com>.
Got it. You can also turn off dictionary encoding with an option on the 
builder.

For timestamp, the support was just released in Avro 1.8.0 and there's a 
pending pull request for adding the same logical types API to 
parquet-avro: https://github.com/apache/parquet-mr/pull/318

Once that's merged, you'll just have to add conversions to your data 
model like this:

   GenericData model = new GenericData();
   model.addLogicalTypeConversion(
       new TimeConversions.TimestampConversion());

Then pass that model into the builder.
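
A sketch of that wiring, reusing the builder shown elsewhere in this thread
(outputPath and schema are placeholders for your own values):

   ParquetWriter<GenericRecord> writer = AvroParquetWriter
       .<GenericRecord>builder(outputPath)
       .withSchema(schema)
       .withDataModel(model)  // the GenericData with the conversion added
       .build();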

rb

On 02/04/2016 02:54 PM, Stefán Baxter wrote:
> Hi Ryan,
>
> Thank you for taking the time.
>
> I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
> on the optional dictionary encoding it's used for almost everything.
> I even have some time-stamp fields that are turned into a dictionary
> (I would have guessed delta integer).
>
> I have ~5M entries in my test file; the dictionary-based file ends up
> 550MB and the non-dictionary-based one ends up 790MB (still faster).
>
> So I started wondering if I could affect these decisions to compare
> size, speed, etc., but I understand the rationale behind automatic
> selection; it just seemed somewhat naive in that Drill scenario.
>
> Another matter... can you point me to an example that shows me how to
> deal with Avro having no timestamp fields and conversion to Parquet?
>
> Best regards,
>   -Stefán
>
>
>
> On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com> wrote:
>
>> Hi Stefán,
>>
>> The Schema converter will map Avro types to their Parquet equivalents,
>> for which there aren't really choices or options. The mapping is
>> straightforward, like long to int64.
>>
>> For the individual column encodings, Parquet chooses those automatically
>> based on the column type and data. For example, dictionary encoding is used
>> if it gets better results than plain encoding and integer columns always
>> use the bit packing and run-length encoding hybrid. There aren't many
>> choices you would make on a per-column basis here, either.
>>
>> There are two options you can control that affect encodings: the
>> dictionary page size and whether to use the 2.0 encodings. The 2.0
>> encodings are delta binary and delta integer, which can't be read by some
>> older readers or by Apache Impala. They get great compression on certain
>> types though. You can also control the maximum dictionary size, which could
>> help if you have columns that should be dictionary-encoded but are falling
>> back to plain encoding because the dictionary gets too big.
>>
>> Both of those options are exposed by the builder when you create a writer:
>>
>>    AvroParquetWriter.builder(outputPath)
>>          .withSchema(schema)
>>          .withDataModel(ReflectData.get())
>>          .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>>          .withDictionaryPageSize(2*1024*1024)
>>          .build();
>>
>> The default dictionary page size is 1MB.
>>
>> rb
>>
>>
>> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>>
>>> Hi,
>>>
>>> I'm using parquet-mr/parquet-avro to write parquet files.
>>>
>>> I want to control/override the encoding type for a column and I find no
>>> documentation or examples regarding that.
>>>
>>> My schema (MessageType) is converted with AvroSchemaConverter and I wonder
>>> how I can either set or hint columns to use a particular encoding option.
>>> Is that possible?
>>>
>>> Regards,
>>>    -Stefán
>>>
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Cloudera, Inc.
>>
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.

Re: MessageType :: Type :: Encoding options

Posted by Stefán Baxter <st...@activitystream.com>.
Hi Ryan,

Thank you for taking the time.

I'm using Drill (1.5-SNAPSHOT) and I have noticed that when I turn
on the optional dictionary encoding it's used for almost everything.
I even have some time-stamp fields that are turned into a dictionary
(I would have guessed delta integer).

I have ~5M entries in my test file; the dictionary-based file ends up
550MB and the non-dictionary-based one ends up 790MB (still faster).

So I started wondering if I could affect these decisions to compare
size, speed, etc., but I understand the rationale behind automatic
selection; it just seemed somewhat naive in that Drill scenario.

Another matter... can you point me to an example that shows me how to
deal with Avro having no timestamp fields and conversion to Parquet?

Best regards,
 -Stefán



On Thu, Feb 4, 2016 at 10:17 PM, Ryan Blue <bl...@cloudera.com> wrote:

> Hi Stefán,
>
> The Schema converter will map Avro types to their Parquet equivalents,
> for which there aren't really choices or options. The mapping is
> straightforward, like long to int64.
>
> For the individual column encodings, Parquet chooses those automatically
> based on the column type and data. For example, dictionary encoding is used
> if it gets better results than plain encoding and integer columns always
> use the bit packing and run-length encoding hybrid. There aren't many
> choices you would make on a per-column basis here, either.
>
> There are two options you can control that affect encodings: the
> dictionary page size and whether to use the 2.0 encodings. The 2.0
> encodings are delta binary and delta integer, which can't be read by some
> older readers or by Apache Impala. They get great compression on certain
> types though. You can also control the maximum dictionary size, which could
> help if you have columns that should be dictionary-encoded but are falling
> back to plain encoding because the dictionary gets too big.
>
> Both of those options are exposed by the builder when you create a writer:
>
>   AvroParquetWriter.builder(outputPath)
>         .withSchema(schema)
>         .withDataModel(ReflectData.get())
>         .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
>         .withDictionaryPageSize(2*1024*1024)
>         .build();
>
> The default dictionary page size is 1MB.
>
> rb
>
>
> On 02/04/2016 01:35 PM, Stefán Baxter wrote:
>
>> Hi,
>>
>> I'm using parquet-mr/parquet-avro to write parquet files.
>>
>> I want to control/override the encoding type for a column and I find no
>> documentation or examples regarding that.
>>
>> My schema (MessageType) is converted with AvroSchemaConverter and I wonder
>> how I can either set or hint columns to use a particular encoding option.
>> Is that possible?
>>
>> Regards,
>>   -Stefán
>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Cloudera, Inc.
>

Re: MessageType :: Type :: Encoding options

Posted by Ryan Blue <bl...@cloudera.com>.
Hi Stefán,

The Schema converter will map Avro types to their Parquet equivalents,
for which there aren't really choices or options. The mapping is
straightforward, like long to int64.

For the individual column encodings, Parquet chooses those automatically 
based on the column type and data. For example, dictionary encoding is 
used if it gets better results than plain encoding and integer columns 
always use the bit packing and run-length encoding hybrid. There aren't 
many choices you would make on a per-column basis here, either.

There are two options you can control that affect encodings: the 
dictionary page size and whether to use the 2.0 encodings. The 2.0 
encodings are delta binary and delta integer, which can't be read by 
some older readers or by Apache Impala. They get great compression on 
certain types though. You can also control the maximum dictionary size, 
which could help if you have columns that should be dictionary-encoded 
but are falling back to plain encoding because the dictionary gets too big.

Both of those options are exposed by the builder when you create a writer:

   AvroParquetWriter.builder(outputPath)
         .withSchema(schema)
         .withDataModel(ReflectData.get())
         .withWriterVersion(ParquetProperties.WriterVersion.PARQUET_2_0)
         .withDictionaryPageSize(2*1024*1024)
         .build();

The default dictionary page size is 1MB.
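
As noted elsewhere in the thread, the builder also has a switch for
dictionary encoding itself, which is handy for size/speed comparisons (a
sketch; outputPath and schema are placeholders):

   ParquetWriter<GenericRecord> writer = AvroParquetWriter
       .<GenericRecord>builder(outputPath)
       .withSchema(schema)
       .withDataModel(ReflectData.get())
       .withDictionaryEncoding(false)  // disable dictionary encoding
       .build();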

rb

On 02/04/2016 01:35 PM, Stefán Baxter wrote:
> Hi,
>
> I'm using parquet-mr/parquet-avro to write parquet files.
>
> I want to control/override the encoding type for a column and I find no
> documentation or examples regarding that.
>
> My schema (MessageType) is converted with AvroSchemaConverter and I wonder
> how I can either set or hint columns to use a particular encoding option.
> Is that possible?
>
> Regards,
>   -Stefán
>


-- 
Ryan Blue
Software Engineer
Cloudera, Inc.