Posted to user@arrow.apache.org by Louis C <lc...@outlook.fr> on 2022/07/12 15:02:01 UTC

Using Parquet adapter with type DURATION : field type loss

Hello,

I integrated the Arrow library into a larger project and was testing exports/imports of the same tables to see whether it behaved well. In doing so, I noticed that Arrow DURATION types are exported as INT64 (as the corresponding number of µs, if I remember correctly) in the Parquet export, and then imported back as INT64. So the Parquet export loses the type of the DURATION fields.
Would it not be better to export the DURATION type as the Parquet logical type "TIME_MICROS" (meaning TIME with microsecond precision, as TIME_MICROS seems to be somewhat deprecated (https://apache.googlesource.com/parquet-format/+/refs/heads/bloom-filter/LogicalTypes.md)), as MATLAB does (see https://fr.mathworks.com/help/matlab/import_export/datatype-mappings-matlab-parquet.html)?

Best regards,
Louis C

Re: Using Parquet adapter with type DURATION : field type loss

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Thu, 14 Jul 2022 at 08:34, Micah Kornfield <em...@gmail.com> wrote:

> Hi Louis,
> I would lean against doing this. First, Parquet doesn't seem to be
> prescriptive, but I understand the Time type to have a max value of at most
> 1 day (i.e. 86400 seconds; this is how Arrow defines the type, at least
> [1]). Durations can be larger, and that can lead to ambiguity in handling.
> Second, the Arrow schema should be preserved by default when writing the
> Parquet file, so it should be recoverable. I understand this doesn't help
> for non-Arrow-based systems, but it potentially gives a workaround in some
> contexts.
>

Small note: what Micah mentions here about preserving this information in
the Arrow schema (stored in the Parquet file metadata), so that roundtrips
from/to Arrow-based systems work for duration, was implemented earlier this
year and has been available since 8.0.0 (
https://issues.apache.org/jira/browse/ARROW-6780).


>
> I think the more appropriate solution is to see if there is interest in
> extending Parquet's type system for this type OR figuring out conventions
> that are more universal for logical types that aren't in Parquet's type
> system.
>
> Thanks,
> Micah
>
> [1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L222

RE: Using Parquet adapter with type DURATION : field type loss

Posted by Louis C <lc...@outlook.fr>.
Ok, thanks for your answer.

After a bit of investigation, I figured out that my problem was that I was not calling the experimental "store_schema()" method of parquet::ArrowWriterProperties::Builder when opening the Parquet writer in the C++ implementation. If this method is not called, DURATION fields are exported as INT64, and the information about the real type of the field is lost. In pyarrow, it works directly with the WriteTable method, probably because "store_schema" (or an equivalent) is called inside it.
Now it works correctly, but maybe the "store_schema" boolean should be set to "true" by default, since we rely on it to recover the field type (or an error should be raised when a DURATION type is found and the schema is not stored, as was the case before).

Regards,
Louis C


Re: Using Parquet adapter with type DURATION : field type loss

Posted by Joris Van den Bossche <jo...@gmail.com>.
On Mon, 18 Jul 2022 at 10:38, Louis C <lc...@outlook.fr> wrote:

> Hello Micah and Joris,
>
> Thanks for your answers. I understand that using the "TIME" fields of
> Parquet can be problematic in some instances.
> But I still find it strange that this is the only case (I think) where
> exporting/importing an Arrow table in a particular format (Feather, ORC,
> Parquet) changes the type of the field (there are other cases where the
> type is not supported at all, but that gives a plain error during the export).
> I will try to look up the Arrow schema in the Parquet file. Is there
> anything particular to do when reading back the Parquet file so that the
> type of the DURATION field is correctly inferred?
>

If you are using the Arrow C++ implementation or one of its bindings (R
arrow, pyarrow, ..), this should be done automatically.



RE: Using Parquet adapter with type DURATION : field type loss

Posted by Louis C <lc...@outlook.fr>.
Hello Micah and Joris,

Thanks for your answers. I understand that using the "TIME" fields of Parquet can be problematic in some instances.
But I still find it strange that this is the only case (I think) where exporting/importing an Arrow table in a particular format (Feather, ORC, Parquet) changes the type of the field (there are other cases where the type is not supported at all, but that gives a plain error during the export).
I will try to look up the Arrow schema in the Parquet file. Is there anything particular to do when reading back the Parquet file so that the type of the DURATION field is correctly inferred?

Regards,
Louis C

Re: Using Parquet adapter with type DURATION : field type loss

Posted by Micah Kornfield <em...@gmail.com>.
Hi Louis,
I would lean against doing this. First, Parquet doesn't seem to be
prescriptive, but I understand the Time type to have a max value of at most
1 day (i.e. 86400 seconds; this is how Arrow defines the type, at least
[1]). Durations can be larger, and that can lead to ambiguity in handling.
Second, the Arrow schema should be preserved by default when writing the
Parquet file, so it should be recoverable. I understand this doesn't help
for non-Arrow-based systems, but it potentially gives a workaround in some
contexts.

I think the more appropriate solution is to see if there is interest in
extending Parquet's type system for this type OR figuring out conventions
that are more universal for logical types that aren't in Parquet's type
system.

Thanks,
Micah

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L222
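
The range argument above can be made concrete with a small standard-library sketch. It assumes, as stated in the message, that a time-of-day value must stay below one day (86400 seconds); the helper names to_micros and fits_time_of_day are hypothetical and not part of Arrow or Parquet.

```python
from datetime import timedelta

# Microseconds in one day: the exclusive upper bound assumed here for a
# time-of-day value stored with microsecond precision (86400 seconds).
MICROS_PER_DAY = 86_400 * 1_000_000


def to_micros(delta: timedelta) -> int:
    """Return the exact number of microseconds in a timedelta."""
    return delta // timedelta(microseconds=1)


def fits_time_of_day(micros: int) -> bool:
    """True if the value would be a valid time of day in microseconds."""
    return 0 <= micros < MICROS_PER_DAY


short = to_micros(timedelta(hours=3, minutes=30))
long = to_micros(timedelta(days=2, hours=1))
negative = to_micros(timedelta(seconds=-5))

# Only the first value could be reinterpreted as a time of day.
for label, value in [("3h30m", short), ("2d1h", long), ("-5s", negative)]:
    print(label, value, fits_time_of_day(value))
```

A duration longer than a day, or a negative one, falls outside that range, which is the ambiguity that mapping DURATION onto a TIME logical type would introduce.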
