Posted to jira@arrow.apache.org by "Antoine Pitrou (Jira)" <ji...@apache.org> on 2021/04/27 16:30:00 UTC

[jira] [Comment Edited] (ARROW-12096) [Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when converting to Arrow Array (timestamp[ns])

    [ https://issues.apache.org/jira/browse/ARROW-12096?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17333367#comment-17333367 ] 

Antoine Pitrou edited comment on ARROW-12096 at 4/27/21, 4:29 PM:
------------------------------------------------------------------

[~isichei] You first need to solve the issue on the C++ side. There are probably several places to look into:

* {{ArrowReaderProperties}} in {{cpp/src/parquet/properties.h}}: you'll need to add the new option here
* {{GetArrowType}} in {{cpp/src/parquet/arrow/schema_internal.cc}}: you'll need to take the option into account here
* {{TransferInt96}} in {{cpp/src/parquet/arrow/reader_internal.cc}}: ditto

You'll also need to expand the existing tests in {{cpp/src/parquet/arrow/arrow_reader_writer_test.cc}}.

Once that is done you'll be able to expose the option in Python.
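
To illustrate what taking such an option into account could mean when converting INT96 values, here is a minimal Python sketch. It is not Arrow's C++ code: {{decode_int96}}, {{int96_for}} and the {{unit}} parameter are hypothetical names used only for this example. It assumes the usual INT96 layout (8 bytes of little-endian nanoseconds within the day followed by a 4-byte little-endian Julian day number) and checks whether the result fits in an int64 for the requested unit.

```python
import datetime
import struct

JULIAN_DAY_OF_UNIX_EPOCH = 2440588  # Julian day number of 1970-01-01
NANOS_PER_DAY = 86_400 * 10**9
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
# Nanoseconds per tick of each target unit
NANOS_PER_UNIT = {"s": 10**9, "ms": 10**6, "us": 10**3, "ns": 1}


def decode_int96(raw: bytes, unit: str = "ns") -> int:
    """Decode a 12-byte INT96 timestamp into ticks of `unit` since the Unix epoch.

    Hypothetical helper for illustration only, not part of pyarrow.
    """
    # 8 bytes of nanoseconds within the day, then 4 bytes of Julian day number
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    days = julian_day - JULIAN_DAY_OF_UNIX_EPOCH
    # Python ints are arbitrary precision, so this sum cannot overflow here
    total_nanos = days * NANOS_PER_DAY + nanos_of_day
    ticks = total_nanos // NANOS_PER_UNIT[unit]
    if not INT64_MIN <= ticks <= INT64_MAX:
        raise OverflowError(f"value does not fit in int64 as timestamp[{unit}]")
    return ticks


def int96_for(date: datetime.date) -> bytes:
    """Encode midnight of `date` as INT96 bytes (hypothetical helper for the example)."""
    days = (date - datetime.date(1970, 1, 1)).days
    return struct.pack("<qi", 0, days + JULIAN_DAY_OF_UNIX_EPOCH)


year_1000 = int96_for(datetime.date(1000, 1, 1))
print(decode_int96(year_1000, unit="us"))  # fits comfortably in int64
try:
    decode_int96(year_1000, unit="ns")
except OverflowError as exc:
    print("ns:", exc)
```

With a coarser unit such as microseconds, the year 1000 value is representable, whereas the nanosecond conversion is rejected instead of silently wrapping.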


> [Python][C++] Pyarrow Parquet reader overflows INT96 timestamps when converting to Arrow Array (timestamp[ns])
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-12096
>                 URL: https://issues.apache.org/jira/browse/ARROW-12096
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 2.0.0, 3.0.0
>         Environment: macos mojave 10.14.6
> Python 3.8.3
> pyarrow 3.0.0
> pandas 1.2.3
>            Reporter: Karik Isichei
>            Priority: Major
>
> When reading Parquet data with timestamps stored as INT96, pyarrow assumes that the timestamp unit is nanoseconds. Converting such a column into an Arrow table overflows if it holds values outside the range representable as timestamp[ns] (roughly 1677-09-21 to 2262-04-11).
> {code:python}
> # Round Trip Example
> import datetime
> import pandas as pd
> import pyarrow as pa
> from pyarrow import parquet as pq
> df = pd.DataFrame({"a": [datetime.datetime(1000,1,1), datetime.datetime(2000,1,1), datetime.datetime(3000,1,1)]})
> a_df = pa.Table.from_pandas(df)
> a_df.schema # a: timestamp[us] 
> pq.write_table(a_df, "test_round_trip.parquet", use_deprecated_int96_timestamps=True, version="1.0")
> pfile = pq.ParquetFile("test_round_trip.parquet")
> pfile.schema_arrow # a: timestamp[ns]
> pq.read_table("test_round_trip.parquet").to_pandas()
> # Results in values (the first and last have overflowed):
> # 2169-02-08 23:09:07.419103232
> # 2000-01-01 00:00:00
> # 1830-11-23 00:50:52.580896768
> {code}
> The example above only demonstrates the bug by having pyarrow write a Parquet file in a state similar to the original file where the bug was discovered. The bug was originally found when reading Parquet output from Amazon Athena with pyarrow (where we cannot control the format of the Parquet output) [Context|https://github.com/awslabs/aws-data-wrangler/issues/592].
> I found some existing issues that might also be related:
> * [ARROW-10444|https://issues.apache.org/jira/browse/ARROW-10444] 
> * [ARROW-6779|https://issues.apache.org/jira/browse/ARROW-6779] (this shows similar behaviour, although testing it on pyarrow v3 raises an out-of-bounds error instead)
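
The wrong values in the round trip example are consistent with the nanosecond count wrapping around a signed 64-bit integer. The sketch below reproduces them in plain Python, assuming two's complement wraparound; {{wrap_int64}}, {{nanos_since_epoch}} and {{wrapped_timestamp}} are names made up for this illustration and are not pyarrow functions.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)


def wrap_int64(value: int) -> int:
    """Reduce a Python int to a signed 64-bit value (two's complement wraparound)."""
    return (value + 2**63) % 2**64 - 2**63


def nanos_since_epoch(dt: datetime.datetime) -> int:
    """Exact nanoseconds between the Unix epoch and `dt` (can exceed int64)."""
    delta = dt - EPOCH
    return (delta.days * 86_400 + delta.seconds) * 10**9 + delta.microseconds * 1000


def wrapped_timestamp(dt: datetime.datetime):
    """Return (datetime at microsecond precision, leftover nanoseconds) after wraparound."""
    wrapped = wrap_int64(nanos_since_epoch(dt))
    micros, nanos = divmod(wrapped, 1000)
    return EPOCH + datetime.timedelta(microseconds=micros), nanos


for year in (1000, 2000, 3000):
    value, nanos = wrapped_timestamp(datetime.datetime(year, 1, 1))
    # Print the microsecond-precision datetime followed by the last three nanosecond digits
    print(f"{value.isoformat(sep=' ', timespec='microseconds')}{nanos:03d}")
```

Only the year 2000 value survives unchanged, because it is the only one of the three that fits in an int64 nanosecond count.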


