Posted to jira@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2020/06/24 22:50:00 UTC
[jira] [Commented] (ARROW-9211) [Python] ArrowInvalid error raised
when deserialising pandas with pd.NaT values in object column
[ https://issues.apache.org/jira/browse/ARROW-9211?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17144465#comment-17144465 ]
Wes McKinney commented on ARROW-9211:
-------------------------------------
Well, raising an exception is strictly better than returning garbage to you. The problem is in the conversion of NaT to null, which is not handled correctly:
{code}
In [29]: v
Out[29]:
                  bar
0 2020-06-22 06:54:56
1                 NaT

In [30]: pa.table(v)[0].chunk(0)
Out[30]:
<pyarrow.lib.TimestampArray object at 0x7f4b4070e528>
[
  2020-06-22 06:54:56.000000,
  0001-01-01 00:00:00.000000
]
{code}
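The arithmetic behind the two symptoms can be checked with the standard library alone. The sketch below takes the microsecond value quoted in the ArrowInvalid message from the issue description, shows that it is exactly 0001-01-01 00:00:00, and shows that scaling it to nanoseconds no longer fits in a signed 64-bit integer. The last step assumes that an unchecked cast wraps modulo 2**64; that is an assumption about what 0.16.0 did, not something confirmed in this thread. All variable names are illustrative.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1

# Value quoted in the ArrowInvalid message, in microseconds since the epoch.
us = -62135596800000000

# It is exactly 0001-01-01 00:00:00, the timestamp shown in place of NaT.
as_datetime = EPOCH + timedelta(microseconds=us)

# Scaling microseconds to nanoseconds leaves the int64 range,
# which is what the 0.17.x bounds check reports.
ns = us * 1000
fits_int64 = INT64_MIN <= ns <= INT64_MAX

# Assumption: an unchecked cast wraps modulo 2**64 (two's complement).
wrapped_ns = (ns - INT64_MIN) % 2**64 + INT64_MIN

# datetime only has microsecond resolution, so the last three digits are dropped.
wrapped_datetime = EPOCH + timedelta(microseconds=wrapped_ns // 1000)

print(as_datetime)       # 0001-01-01 00:00:00
print(fits_int64)        # False
print(wrapped_ns)        # -6795364578871345152
print(wrapped_datetime)  # 1754-08-30 22:43:41.128654
```

Under that wraparound assumption the result lands on 1754-08-30 22:43:41.128654, which is consistent with the value pyarrow 0.16.0 returned in the issue description below.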
pandas's NaT value is a bit weird:
{code}
In [31]: type(pd.NaT).mro()
Out[31]:
[pandas._libs.tslibs.nattype.NaTType,
 pandas._libs.tslibs.nattype._NaT,
 datetime.datetime,
 datetime.date,
 object]
{code}
I'm gonna quickly see if I can get pyarrow to recognize pandas.NaT, so I'm closing this as a duplicate of ARROW-842.
> [Python] ArrowInvalid error raised when deserialising pandas with pd.NaT values in object column
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-9211
> URL: https://issues.apache.org/jira/browse/ARROW-9211
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.17.0, 0.17.1
> Reporter: Lawrence Ling
> Assignee: Wes McKinney
> Priority: Major
> Fix For: 1.0.0
>
>
> In pyarrow 0.17.x when deserialising a pandas dataframe which has pd.NaT values in an object column, an ArrowInvalid error is raised:
> {code:java}
> pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: -62135596800000000
> {code}
> Reproducible code (using pyarrow==0.17.1 and pandas==1.0.3):
> {code:java}
> import pandas as pd
> import pyarrow.ipc as ipc
> import pyarrow as pa
> v = pd.DataFrame({
>     "bar": [1592808896000000000, pd.NaT]
> })
> # works fine as datetime64[ns] but not as object type
> v = v.astype({"bar": "datetime64[ns]"}).astype({"bar": "object"})
> bs = ipc.serialize_pandas(v).to_pybytes()
> df = ipc.deserialize_pandas(bs)  # error
> {code}
> In pyarrow 0.16.0 no error occurs and df is returned as:
> {code:java}
>                             bar
> 0 2020-06-22 06:54:56.000000000
> 1 1754-08-30 22:43:41.128654848
> {code}
> Was the change in 0.17.x to raise an error an intentional behaviour change? The previous behaviour in 0.16.0 already looked like undefined behaviour: it converted NaT to 1754-08-30, which seems to be due to the -62135596800000000 timestamp mentioned in the error above.
> Also note that when serialized as datetime64[ns] rather than object, the code works fine in both 0.17.x and 0.16.0, returning:
> {code:java}
>                   bar
> 0 2020-06-22 06:54:56
> 1                 NaT
> {code}
>