Posted to dev@arrow.apache.org by "Rauli Ruohonen (Jira)" <ji...@apache.org> on 2020/05/15 13:59:00 UTC
[jira] [Created] (ARROW-8816) [Python] Year 2263 or later datetimes get mangled when written using pandas

Rauli Ruohonen created ARROW-8816:
-------------------------------------

             Summary: [Python] Year 2263 or later datetimes get mangled when written using pandas
                 Key: ARROW-8816
                 URL: https://issues.apache.org/jira/browse/ARROW-8816
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.17.0, 0.16.0
         Environment: Tested using pyarrow 0.17.0 and 0.16.0, pandas 1.0.3, python 3.7.5, mojave (macos). Also tested using pyarrow 0.16.0, pandas 1.0.3, python 3.8.2, ubuntu 20.04 (linux).
            Reporter: Rauli Ruohonen


Using pyarrow 0.17.0, this

 
{code:python}
import datetime
import pandas as pd

def try_with_year(year):
    # Write a one-row frame holding January 1 of the given year, then read it back.
    print(f'Year {year:_}:')
    df = pd.DataFrame({'x': [datetime.datetime(year, 1, 1)]})
    df.to_parquet('foo.parquet', engine='pyarrow', compression=None)
    try:
        print(pd.read_parquet('foo.parquet', engine='pyarrow'))
    except Exception as exc:
        print(repr(exc))
    print()

try_with_year(2_263)  # past the range of epoch nanoseconds
try_with_year(2_262)  # January 1, 2262 still fits
{code}
 

prints

 
{noformat}
Year 2_263:
ArrowInvalid('Casting from timestamp[ms] to timestamp[ns] would result in out of bounds timestamp: 9246182400000')

Year 2_262:
           x
0 2262-01-01{noformat}
and using pyarrow 0.16.0, it prints

 

 
{noformat}
Year 2_263:
                              x
0 1678-06-12 00:25:26.290448384

Year 2_262:
           x
0 2262-01-01{noformat}
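The issue is that 2263-01-01 is out of bounds for a timestamp stored as nanoseconds since the epoch in a signed 64-bit integer (the largest representable value is 2262-04-11 23:47:16.854775807), but it is not out of bounds for a Python datetime.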
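
To make the bound concrete, here is a small standard-library sketch of the arithmetic. The helper names (epoch_nanos, wrap_int64, nanos_to_text) are hypothetical and are not part of pyarrow or pandas. The wraparound step is an inference rather than something stated in the report: it only shows that the value printed by pyarrow 0.16.0 is what signed 64-bit overflow of the nanosecond count would produce.

```python
import datetime

INT64_MAX = 2**63 - 1
EPOCH = datetime.datetime(1970, 1, 1)

def epoch_nanos(dt):
    """Exact nanoseconds since the Unix epoch for a naive datetime (integer math only)."""
    delta = dt - EPOCH
    micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
    return micros * 1_000

def wrap_int64(n):
    """Reduce n into the signed 64-bit range, mimicking two's-complement overflow."""
    return (n + 2**63) % 2**64 - 2**63

def nanos_to_text(ns):
    """Render epoch nanoseconds as a timestamp string with nine fractional digits."""
    secs, frac = divmod(ns, 1_000_000_000)
    dt = EPOCH + datetime.timedelta(seconds=secs)
    return f"{dt:%Y-%m-%d %H:%M:%S}.{frac:09d}"

nanos = epoch_nanos(datetime.datetime(2263, 1, 1))
print(nanos)                             # 9246182400000000000
print(nanos > INT64_MAX)                 # True: does not fit in int64
print(nanos_to_text(wrap_int64(nanos)))  # 1678-06-12 00:25:26.290448384
```

The first number, divided by 10^6 and 10^3, gives the millisecond and microsecond values quoted in the two ArrowInvalid messages.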

While pyarrow 0.17.0 refuses to read the erroneous output, other parquet readers (e.g. pyarrow 0.16.0 or fastparquet) can still read it, and they yield the same result as 0.16.0 above. In other words, only reading has changed in 0.17.0, not writing. It would be better if an error were raised when attempting to write the file, instead of silently producing erroneous output.
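
As a rough illustration of a write-time check, a caller could validate datetimes before handing them to the writer. This is only a sketch using the standard library. The function check_fits_epoch_nanos is hypothetical, and it is not how pyarrow or pandas performs (or would perform) this validation.

```python
import datetime

EPOCH = datetime.datetime(1970, 1, 1)
INT64_MIN = -2**63
INT64_MAX = 2**63 - 1

def check_fits_epoch_nanos(values):
    """Raise ValueError if any naive datetime cannot be stored as int64 epoch nanoseconds."""
    for index, dt in enumerate(values):
        delta = dt - EPOCH
        micros = (delta.days * 86_400 + delta.seconds) * 1_000_000 + delta.microseconds
        nanos = micros * 1_000
        # The lower bound is strict because NumPy and pandas reserve INT64_MIN for NaT.
        if not INT64_MIN < nanos <= INT64_MAX:
            raise ValueError(
                f"value at index {index} ({dt.isoformat()}) is out of bounds "
                "for nanosecond-resolution timestamps"
            )

check_fits_epoch_nanos([datetime.datetime(2262, 1, 1)])  # passes silently

try:
    check_fits_epoch_nanos([datetime.datetime(2263, 1, 1)])
except ValueError as exc:
    print(repr(exc))
```

Failing early like this would surface the problem at the point where the out-of-range value enters the pipeline, rather than when the file is read back.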

The reason I suspect this is a pyarrow issue instead of a pandas issue is this modified example:

 
{code:python}
import datetime
import pandas as pd
import pyarrow as pa

df = pd.DataFrame({'x': [datetime.datetime(2_263, 1, 1)]})
# Convert to an Arrow table and print the column as pyarrow renders it.
table = pa.Table.from_pandas(df)
print(table[0])
try:
    # Convert back to pandas to complete the round trip.
    print(table.to_pandas())
except Exception as exc:
    print(repr(exc))
{code}
which prints

 

 
{noformat}
[
  [
    2263-01-01 00:00:00.000000
  ]
]
ArrowInvalid('Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 9246182400000000'){noformat}
on pyarrow 0.17.0 and

 

 
{noformat}
[
  [
    2263-01-01 00:00:00.000000
  ]
]
                              x
0 1678-06-12 00:25:26.290448384{noformat}
on pyarrow 0.16.0. Both from_pandas() and to_pandas() are pyarrow methods, and pyarrow prints the correct timestamp when asked to produce it as a string (so the value was not lost inside pandas), yet the pa.Table.from_pandas(df).to_pandas() round trip fails.

 

 


