Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/29 13:52:00 UTC
[jira] [Commented] (ARROW-14891) [parquet] 9999-12-31 date is wrapped to 1816
[ https://issues.apache.org/jira/browse/ARROW-14891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17450470#comment-17450470 ]
Joris Van den Bossche commented on ARROW-14891:
-----------------------------------------------
A small reproducer (without relying on the provided file):
{code:python}
import datetime

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'col': [datetime.datetime(9999, 12, 31)]})
pq.write_table(table, "test_int96.parquet", use_deprecated_int96_timestamps=True)

pq.read_table("test_int96.parquet")
# pyarrow.Table
# col: timestamp[ns]
# ----
# col: [[1816-03-29 05:56:08.066277376]]
{code}
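The reported 1816 timestamp is consistent with plain two's-complement wraparound when the nanosecond count is stored in a signed 64-bit integer. A minimal sketch of that arithmetic in pure Python (the variable names are illustrative, not Arrow internals):

{code:python}
import datetime

epoch = datetime.datetime(1970, 1, 1)

# Nanoseconds since the epoch for 9999-12-31 00:00:00, using exact integer math
seconds = (datetime.datetime(9999, 12, 31) - epoch) // datetime.timedelta(seconds=1)
ns = seconds * 10**9  # 253402214400000000000 -- far above i64 max (2**63 - 1)

# Simulate storing that count in a signed 64-bit integer (two's-complement wrap)
wrapped = (ns + 2**63) % 2**64 - 2**63
print(wrapped)  # -4852202631933722624

# Interpreting the wrapped count as a timestamp lands in 1816
print(epoch + datetime.timedelta(microseconds=wrapped // 1000))
# 1816-03-29 05:56:08.066277
{code}

The sub-microsecond part (.066277376) is truncated here because datetime only carries microseconds, but the wrapped value matches the timestamp in the reproducer above.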
> [parquet] 9999-12-31 date is wrapped to 1816
> --------------------------------------------
>
> Key: ARROW-14891
> URL: https://issues.apache.org/jira/browse/ARROW-14891
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Jorge Leitão
> Priority: Major
>
> Given a parquet file with an int96 date of 9999-12-31 (which does not fit in an i64 of nanoseconds), the value is read back "wrapped", resulting in the date 1816-03-29 05:56:08.066277376.
> Spark seems to discard the nanoseconds and read int96 values only to microsecond resolution, which gives it 1000x the range of dates (which happens to cover the year 9999, but not all values). There is a long discussion of this issue here: https://github.com/apache/arrow-rs/issues/982 including a MWE for pyarrow.
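For context on the "1000x" remark above: at microsecond resolution an i64 reaches roughly a thousand times further than at nanosecond resolution, which is why a micros reading covers the year 9999 while a nanosecond reading overflows around the year 2262. A quick sanity check (approximate year arithmetic, not Arrow code):

{code:python}
import datetime

I64_MAX = 2**63 - 1  # maximum signed 64-bit value
epoch = datetime.datetime(1970, 1, 1)

# At nanosecond resolution the i64 range tops out in the year 2262...
max_ns = epoch + datetime.timedelta(microseconds=I64_MAX // 1000)
print(max_ns)  # 2262-04-11 23:47:16.854775

# ...while at microsecond resolution it reaches roughly the year 294000,
# comfortably past 9999 (too far for datetime itself, so estimate in years)
max_us_year = 1970 + (I64_MAX // 10**6) // 31556952  # ~31556952 s per mean year
print(max_us_year)
{code}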
--
This message was sent by Atlassian Jira
(v8.20.1#820001)