Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/05/15 18:49:00 UTC
[jira] [Commented] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas

    [ https://issues.apache.org/jira/browse/ARROW-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17108548#comment-17108548 ] 

Joris Van den Bossche commented on ARROW-8816:
----------------------------------------------

> It would be better if an error was raised when attempting to write the file instead of silently producing erroneous output.

The file is correct (so we shouldn't error when writing); it is only after reading it back in that the conversion to pandas causes the issue, given pandas' limitation on the range of timestamps.
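
To make that range limitation concrete, here is a small standard-library sketch of the arithmetic. It assumes the pandas dtype stores timestamps as signed 64-bit epoch nanoseconds; the helper names (ns_from_datetime, wrap_int64, describe_ns) are made up for illustration and are not pyarrow or pandas APIs.

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)


def ns_from_datetime(dt):
    """Exact epoch nanoseconds for a naive datetime, as an unbounded Python int."""
    delta = dt - EPOCH
    whole_seconds = delta.days * 86_400 + delta.seconds
    return whole_seconds * 1_000_000_000 + delta.microseconds * 1_000


def wrap_int64(value):
    """Reinterpret an integer as a two's-complement signed 64-bit value."""
    return (value + 2**63) % 2**64 - 2**63


def describe_ns(ns):
    """Render epoch nanoseconds as text, keeping all nine fractional digits."""
    seconds, frac = divmod(ns, 1_000_000_000)
    moment = EPOCH + datetime.timedelta(seconds=seconds)
    return f"{moment.isoformat(sep=' ')}.{frac:09d}"


ns_2263 = ns_from_datetime(datetime.datetime(2263, 1, 1))
print(ns_2263 > INT64_MAX)               # True: does not fit in int64 nanoseconds
print(describe_ns(INT64_MAX))            # 2262-04-11 23:47:16.854775807
print(describe_ns(wrap_int64(ns_2263)))  # 1678-06-12 00:25:26.290448384
```

The last line reproduces the garbage date reported for pyarrow 0.16.0, which suggests that value came from an unchecked 64-bit overflow during the cast to nanoseconds rather than from bad data in the file.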

As you can see, in pyarrow 0.17 this was at least fixed so that it no longer produces garbage dates; an error is raised instead (which I would say is better than garbage). But it is a known issue that there should still be a way to convert to pandas, using datetime objects instead of the datetime64[ns] dtype. This is covered by ARROW-5359, with the idea of adding a {{timestamp_as_object}} keyword.
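
As a rough illustration of what converting to datetime objects would mean, the sketch below maps raw epoch integers to datetime.datetime values using only the standard library. The function name timestamps_to_objects and its unit argument are hypothetical; this is not how pyarrow implements or would implement the keyword, only a picture of the idea.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)

# Microseconds per tick for the timestamp units this sketch handles.
# Nanosecond input is left out because datetime.datetime only has
# microsecond resolution.
_US_PER_TICK = {"s": 1_000_000, "ms": 1_000, "us": 1}


def timestamps_to_objects(values, unit="us"):
    """Hypothetical helper: turn epoch integers into datetime objects.

    None entries are passed through unchanged to stand in for nulls.
    """
    scale = _US_PER_TICK[unit]
    return [
        None if v is None else EPOCH + datetime.timedelta(microseconds=v * scale)
        for v in values
    ]


# The microsecond value from the error message in the report:
print(timestamps_to_objects([9246182400000000], unit="us"))
# [datetime.datetime(2263, 1, 1, 0, 0)]
```

Because Python datetime objects cover years 1 through 9999, a column converted this way keeps out-of-range values intact, at the cost of an object-dtype column in pandas.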





> [Python] Year 2263 or later datetimes get mangled when written using pandas
> ---------------------------------------------------------------------------
>
>                 Key: ARROW-8816
>                 URL: https://issues.apache.org/jira/browse/ARROW-8816
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.17.0
>         Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, ubuntu 20.04 (linux).
>            Reporter: Rauli Ruohonen
>            Priority: Major
>
> Using pyarrow 0.17.0, this
>  
> {code:python}
> import datetime
> import pandas as pd
> def try_with_year(year):
>     print(f'Year {year:_}:')
>     df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
>     df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
>     try:
>         print(pd.read_parquet('foo.parquet', engine='pyarrow'))
>     except Exception as exc:
>         print(repr(exc))
>     print()
> try_with_year(2_263)
> try_with_year(2_262)
> {code}
>  
> prints
>  
> {noformat}
> Year 2_263:
> ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 9246182400000')
> Year 2_262:
>            x
> 0 2262-01-01
> {noformat}
> and using pyarrow 0.16.0, it prints
>  
>  
> {noformat}
> Year 2_263:
>                               x
> 0 1678-06-12 00:25:26.290448384
> Year 2_262:
>            x
> 0 2262-01-01
> {noformat}
> The issue is that 2263-01-01 is out of bounds for a timestamp stored using epoch nanoseconds, but not out of bounds for a Python datetime.
> While pyarrow 0.17.0 refuses to read the erroneous output, it is still possible to read it using other parquet readers (e.g. pyarrow 0.16.0 or fastparquet), yielding the same result as with 0.16.0 above (i.e. only reading has changed in 0.17.0, not writing). It would be better if an error was raised when attempting to write the file instead of silently producing erroneous output.
> The reason I suspect this is a pyarrow issue instead of a pandas issue is this modified example:
>  
> {code:python}
> import datetime
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
> table = pa.Table.from_pandas(df)
> print(table[0])
> try:
>     print(table.to_pandas())
> except Exception as exc:
>     print(repr(exc))
> {code}
> which prints
>  
>  
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
> ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 9246182400000000')
> {noformat}
> on pyarrow 0.17.0 and
>  
>  
> {noformat}
> [
>   [
>     2263-01-01 00:00:00.000000
>   ]
> ]
>                               x
> 0 1678-06-12 00:25:26.290448384
> {noformat}
> on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, and pyarrow prints the correct timestamp when asked to produce it as a string (so it was not lost inside pandas), yet the pa.Table.from_pandas(df).to_pandas() round-trip fails.
>  
>  


