Posted to jira@arrow.apache.org by "Alenka Frim (Jira)" <ji...@apache.org> on 2022/10/27 13:24:00 UTC

[jira] [Commented] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas

    [ https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17625122#comment-17625122 ] 

Alenka Frim commented on ARROW-8816:
------------------------------------

Closing this as it is no longer relevant: Arrow now raises {{ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp}} when converting to pandas.

I created a new issue, https://issues.apache.org/jira/browse/ARROW-18175, to track work on using the information stored in the metadata:
{quote}Ah, if there is pandas metadata present and it indicates object dtype, we could indeed use that to avoid conversion to datetime64[ns], but keep datetime objects. That sounds as it should be possible in principle.
{quote}

> [Python] Year 2263 or later datetimes get mangled when written using pandas
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8816
>                 URL: https://issues.apache.org/jira/browse/ARROW-8816
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.17.0
>         Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, ubuntu 20.04 (linux).
>            Reporter: Rauli Ruohonen
>            Priority: Major
>
> Using pyarrow 0.17.0, this
>  
> {code:python}
> import datetime
> import pandas as pd
> def try_with_year(year):
>     print(f'Year {year:_}:')
>     df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
>     df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
>     try:
>         print(pd.read_parquet('foo.parquet', engine='pyarrow'))
>     except Exception as exc:
>         print(repr(exc))
>     print()
> try_with_year(2_263)
> try_with_year(2_262)
> {code}
>  
> prints
>  
> {noformat}
> Year 2_263:
> ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 9246182400000')
> Year 2_262:
>            x
> 0 2262-01-01{noformat}
> and using pyarrow 0.16.0, it prints
>  
>  
> {noformat}
> Year 2_263:
>                               x
> 0 1678-06-12 00:25:26.290448384
> Year 2_262:
>            x
> 0 2262-01-01{noformat}
> The issue is that 2263-01-01 is out of bounds for a timestamp stored using epoch nanoseconds, but not out of bounds for a Python datetime.
> While pyarrow 0.17.0 refuses to read the erroneous output, other parquet readers (e.g. pyarrow 0.16.0 or fastparquet) can still read it and yield the same result as 0.16.0 above; that is, only reading has changed in 0.17.0, not writing. It would be better if an error were raised when attempting to write the file instead of silently producing erroneous output.
> The reason I suspect this is a pyarrow issue instead of a pandas issue is this modified example:
>  
> {code:python}
> import datetime
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
> table = pa.Table.from_pandas(df)
> print(table[0])
> try:
>     print(table.to_pandas())
> except Exception as exc:
>     print(repr(exc))
> {code}
> which prints
>  
>  
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
> ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 9246182400000000'){noformat}
> on pyarrow 0.17.0 and
>  
>  
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
>                               x
> 0 1678-06-12 00:25:26.290448384{noformat}
> on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, and pyarrow prints the correct timestamp when asked to produce it as a string, so the value was not lost inside pandas. Nevertheless, the pa.Table.from_pandas(df).to_pandas() round-trip fails.
>  
>  
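
For readers wondering where the 1678-06-12 00:25:26.290448384 value in the 0.16.0 output comes from, here is a minimal sketch using only the Python standard library. It assumes the old conversion simply overflowed a signed 64-bit nanosecond counter; that is an inference from the printed value, not something confirmed from the pyarrow source. The helper names (to_epoch_ns, wrap_int64, from_epoch_ns) are made up for illustration and are not pyarrow or pandas API.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MAX = 2**63 - 1  # largest value a signed 64-bit nanosecond counter can hold


def to_epoch_ns(dt):
    """Exact nanoseconds since the Unix epoch for a naive datetime (integer math only)."""
    delta = dt - EPOCH
    return ((delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds) * 1_000


def wrap_int64(value):
    """Reinterpret an arbitrary Python int as a two's-complement signed 64-bit value."""
    return (value + 2**63) % 2**64 - 2**63


def from_epoch_ns(ns):
    """Split epoch nanoseconds into a datetime (whole seconds) and the leftover nanoseconds."""
    seconds, nanos = divmod(ns, 1_000_000_000)
    return EPOCH + datetime.timedelta(seconds=seconds), nanos


# 2263-01-01 needs more nanoseconds than int64 can represent ...
ns_2263 = to_epoch_ns(datetime.datetime(2263, 1, 1))
print(ns_2263, ns_2263 > INT64_MAX)

# ... and wrapping it around lands in the year 1678.
wrapped_dt, wrapped_nanos = from_epoch_ns(wrap_int64(ns_2263))
print(wrapped_dt, wrapped_nanos)

# 2262-01-01 still fits, which is why that year round-trips correctly.
print(to_epoch_ns(datetime.datetime(2262, 1, 1)) <= INT64_MAX)
```

The wrapped result is 1678-06-12 00:25:26 with 290448384 leftover nanoseconds, which matches the mangled timestamp reported above. The same comparison against INT64_MAX could in principle serve as a pre-write bounds check, though whether and where pyarrow should perform such a check is outside what this sketch shows.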



--
This message was sent by Atlassian Jira
(v8.20.10#820010)