Posted to jira@arrow.apache.org by "nero (Jira)" <ji...@apache.org> on 2022/01/28 02:18:00 UTC

[jira] [Created] (ARROW-15492) [Python] handle timestamp type in parquet file for compatibility with older HiveQL

nero created ARROW-15492:
----------------------------

             Summary: [Python] handle timestamp type in parquet file for compatibility with older HiveQL
                 Key: ARROW-15492
                 URL: https://issues.apache.org/jira/browse/ARROW-15492
             Project: Apache Arrow
          Issue Type: New Feature
    Affects Versions: 6.0.1
            Reporter: nero


Hi there,


I ran into an issue when writing a Parquet file with PyArrow.

Older versions of Hive only recognize timestamps stored as INT96, so I use pyarrow.parquet.write_to_dataset with the `use_deprecated_int96_timestamps=True` option to save the Parquet file. However, Hive skips the timestamp conversion when the created_by field in the Parquet file metadata is not "parquet-mr".

[hive/ParquetRecordReaderBase.java at f1ff99636a5546231336208a300a114bcf8c5944 · apache/hive (github.com)|https://github.com/apache/hive/blob/f1ff99636a5546231336208a300a114bcf8c5944/ql/src/java/org/apache/hadoop/hive/ql/io/parquet/ParquetRecordReaderBase.java#L137-L139]
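
The check at the linked lines can be paraphrased roughly as follows. This is a hypothetical Python rendering, not Hive's actual code: the function name is invented, the prefix match is an assumption about how the comparison is done, and the two created_by strings are only examples.

```python
def hive_skips_timestamp_conversion(created_by):
    """Hypothetical paraphrase of the created_by check described above."""
    # Assumption: the writer is identified by the start of the created_by
    # string, and a missing value is treated like an empty string.
    return not (created_by or "").startswith("parquet-mr")


# Example writer strings (illustrative only).
print(hive_skips_timestamp_conversion("parquet-mr version 1.10.1"))        # False
print(hive_skips_timestamp_conversion("parquet-cpp-arrow version 6.0.1"))  # True
```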

 

So instead I have to save the timestamp columns with timezone info (adjusted to UTC+8).

But when pyarrow.parquet reads from a directory that contains Parquet files created by both PyArrow and parquet-mr, the resulting Arrow Table ignores the timezone info for the parquet-mr files.

 

Maybe PyArrow could expose the created_by option in Python (preferred; parquet::WriterProperties::created_by is already available in C++).

Or could PyArrow handle timezone-aware timestamp columns in files created by parquet-mr?

 

Maybe related to https://issues.apache.org/jira/browse/ARROW-14422



--
This message was sent by Atlassian Jira
(v8.20.1#820001)