Posted to issues@arrow.apache.org by "TP Boudreau (JIRA)" <ji...@apache.org> on 2019/07/10 18:04:00 UTC

[jira] [Comment Edited] (ARROW-5889) [Python][C++] Parquet backwards compat for timestamps without timezone broken

    [ https://issues.apache.org/jira/browse/ARROW-5889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16882313#comment-16882313 ] 

TP Boudreau edited comment on ARROW-5889 at 7/10/19 6:03 PM:
-------------------------------------------------------------

I can think of two possible approaches to correcting this on an interim basis before the parquet.thrift gets changed (if it does get changed), but neither is perfect:

1. Add a new boolean member to the parquet::TimestampLogicalType class, named fromConvertedType, that is set to true if the object was constructed from a converted type and false if the user explicitly constructed the object.  While in memory, the Arrow conversions can interrogate this property and, if true, imitate the old TIMESTAMP converted type logic.  On writing a schema, if the property is true, the writer would NOT write a TimestampLogicalType for that field/column, but would instead write just the TIMESTAMP converted type (as it already does); the original converted type semantics would be retained both in use and on disk.

This would require changes to the recently released public API for the TimestampLogicalType class (new creator functions, accessors, etc.).  It would also result in a parquet file mixing converted type and LogicalType annotations (which seems legal, but probably wasn't intended).
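
To make option (1.) concrete, below is a standalone toy sketch of the writer-side decision.  The from_converted_type flag and every other name in it are hypothetical illustrations, not the released parquet-cpp API:

    // Toy model of option (1.): a hypothetical from_converted_type flag on
    // the timestamp logical type controls whether the writer emits the new
    // LogicalType annotation or falls back to the legacy converted type.
    // All names are illustrative; this is not the parquet-cpp API.
    #include <iostream>

    struct TimestampType {
      bool is_adjusted_to_utc;
      bool from_converted_type;  // true if built from TIMESTAMP_MILLIS/MICROS
    };

    void WriteSchemaAnnotation(const TimestampType& ts) {
      if (ts.from_converted_type) {
        // Preserve the old on-disk representation: converted type only.
        std::cout << "ConvertedType: TIMESTAMP_MICROS\n";
      } else {
        // Explicitly constructed by the user: write the full LogicalType.
        std::cout << "LogicalType: TIMESTAMP(isAdjustedToUTC="
                  << (ts.is_adjusted_to_utc ? "true" : "false")
                  << ", unit=MICROS)\n";
      }
    }

    int main() {
      WriteSchemaAnnotation({true, true});    // reconstructed from a converted type
      WriteSchemaAnnotation({false, false});  // explicitly user-constructed
      return 0;
    }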

2. Use file-level key-value metadata to store the fact that the field came from a converted type (as will be done for timezones).  This requires changes to the Arrow public API (converting an Arrow schema would produce both a Parquet schema and a K-V metadata object).  Also, given that field names are not unique, it might be difficult to produce unique keys (knowable on both the Arrow and Parquet sides).  But both of these problems will have to be addressed eventually if timezones are to be saved this way.
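
As a rough illustration of option (2.), the sketch below models file-level K-V metadata keyed by the full dotted column path, which is one possible answer to the name-uniqueness problem.  The key scheme and the column name are entirely hypothetical:

    // Toy model of option (2.): record converted-type provenance in
    // file-level key-value metadata.  Keys use the full dotted column path
    // rather than the leaf name, since leaf names need not be unique.
    // The key scheme is hypothetical, not an agreed convention.
    #include <iostream>
    #include <map>
    #include <string>

    int main() {
      // K-V metadata produced alongside the Parquet schema on write.
      std::map<std::string, std::string> kv_metadata;
      kv_metadata["ARROW:timestamp_from_converted_type:event.ts"] = "true";

      // On read, consult the metadata before interpreting the column's
      // timestamp annotation.
      auto it = kv_metadata.find("ARROW:timestamp_from_converted_type:event.ts");
      if (it != kv_metadata.end() && it->second == "true") {
        std::cout << "apply legacy converted-type semantics to event.ts\n";
      }
      return 0;
    }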

I'd lean toward option (1.), but there might be gotchas that I'm not considering for either option.  Does either of these sound like it's worth pursuing? If so, I can work on this.



> [Python][C++] Parquet backwards compat for timestamps without timezone broken
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-5889
>                 URL: https://issues.apache.org/jira/browse/ARROW-5889
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.14.0
>            Reporter: Florian Jetter
>            Priority: Minor
>              Labels: parquet
>             Fix For: 0.14.1
>
>         Attachments: 0.12.1.parquet, 0.13.0.parquet
>
>
> When reading a parquet file that has timestamp fields, they are read as timestamps with timezone UTC if the file was written by pyarrow 0.13.0 and/or 0.12.1.
> The expected behavior would be that they are loaded as timestamps without any timezone information.
> The attached files contain one row covering all basic types and a few nested types; the timestamp fields are called datetime64 and datetime64_tz.
> see also [https://github.com/JDASoftwareGroup/kartothek/tree/master/reference-data/arrow-compat]
> [https://github.com/JDASoftwareGroup/kartothek/blob/c47e52116e2dc726a74d7d6b97922a0252722ed0/tests/serialization/test_arrow_compat.py#L31]
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)