Posted to jira@arrow.apache.org by "Tim Ryles (Jira)" <ji...@apache.org> on 2021/05/07 11:21:00 UTC

[jira] [Updated] (ARROW-12680) [Python] Converting StructScalar Timestamp using .to_pandas() loses/converts type

     [ https://issues.apache.org/jira/browse/ARROW-12680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Ryles updated ARROW-12680:
------------------------------
    Summary: [Python] Converting StructScalar Timestamp using .to_pandas() loses/converts type  (was: [Python] Pyarrow - Converting StructScalar Timestamp using .to_pandas() loses/converts type)

> [Python] Converting StructScalar Timestamp using .to_pandas() loses/converts type
> ---------------------------------------------------------------------------------
>
>                 Key: ARROW-12680
>                 URL: https://issues.apache.org/jira/browse/ARROW-12680
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 3.0.0
>            Reporter: Tim Ryles
>            Priority: Major
>
> Hi,
> We're seeing an issue where type and formatting are lost when converting a particular dataset of ours to a pandas dataframe. The dataset contains a struct column, and the underlying type of the struct's child is Timestamp rather than datetime.datetime (which, going by the pandas documentation, we believed to be synonymous).
>  
> Inside the StructArray the timestamp values display correctly, but when we call .to_pandas() on the table we end up with raw epoch integers for the date child.
> {code:java}
> import pyarrow.parquet as pq
> tbl=pq.read_table("part-00009-47f62157-cb6f-41a8-9ad6-ace65df94c6e-c000.snappy.parquet")
> tbl.column("observations").chunk(0).values
> <pyarrow.lib.StructArray object at 0x7fc8eb0cab40>
> -- is_valid: all not null
> -- child 0 type: timestamp[ns]
> [
> 2000-01-01 00:00:00.000000000,
> 2001-01-01 00:00:00.000000000,
> 2002-01-01 00:00:00.000000000,
> 2003-01-01 00:00:00.000000000,
> 2004-01-01 00:00:00.000000000,
> 2005-01-01 00:00:00.000000000,
> 2006-01-01 00:00:00.000000000,
> 2007-01-01 00:00:00.000000000,
> 2008-01-01 00:00:00.000000000,
> 2009-01-01 00:00:00.000000000,
> ...
> 2018-07-01 00:00:00.000000000,
> 2018-10-01 00:00:00.000000000,
> 2019-01-01 00:00:00.000000000,
> 2019-04-01 00:00:00.000000000,
> 2019-07-01 00:00:00.000000000,
> 2019-10-01 00:00:00.000000000,
> 2020-01-01 00:00:00.000000000,
> 2020-04-01 00:00:00.000000000,
> 2020-07-01 00:00:00.000000000,
> 2020-10-01 00:00:00.000000000
> ]
> -- child 1 type: double
> [
> -2.69685,
> 9.27988,
> 7.26902,
> -7.55753,
> -1.62137,
> 6.84773,
> -8.21204,
> -8.97041,
> -1.14405,
> -0.710153,
> ...
> 2.1658,
> 3.05588,
> 2.3868,
> 2.10805,
> 2.39984,
> 2.54855,
> -7.26804,
> -2.35179,
> -0.867518,
> 0.150593
> ]
> {code}
> {code:java}
> tbl.to_pandas()['observations']
> 0      [{'date': 946684800000000000, 'value': -2.6968...
> 1      [{'date': 946684800000000000, 'value': 57.9608...
> 2      [{'date': 1483228800000000000, 'value': 95.904...
> 3      [{'date': 1214870400000000000, 'value': 19.021...
> 4      [{'date': 1199145600000000000, 'value': 1.2011...
>                              ...
> 636    [{'date': 1072915200000000000, 'value': 5.418}...
> 637    [{'date': 946684800000000000, 'value': 110.695...
> 638    [{'date': 1009843200000000000, 'value': 3.0094...
> 639    [{'date': 1222819200000000000, 'value': 48.365...
> 640    [{'date': 1199145600000000000, 'value': 1.5600...
> Name: observations, Length: 641, dtype: object
> In [12]: tbl.to_pandas()["observations"].iloc[0][0]
> Out[12]: {'date': 1041379200000000000, 'value': 249.523242}
> # date is now type int{code}
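>  
> For reference, here is a minimal in-memory sketch of the shape of our data. Treat it as an assumption on our part: we have only observed the problem with our Parquet file, and this hand-built table may not reproduce it.
> {code:java}
> import pyarrow as pa
> # Hypothetical stand-in for our dataset: a list<struct> column whose
> # 'date' child is timestamp[ns] (plain integers are interpreted as
> # epoch nanoseconds when building the array).
> typ = pa.list_(pa.struct([("date", pa.timestamp("ns")),
>                           ("value", pa.float64())]))
> arr = pa.array([[{"date": 946684800000000000, "value": -2.69685}]],
>                type=typ)
> tbl = pa.table({"observations": arr})
> # On 3.0.0 we would expect 'date' to come back as a plain int here,
> # mirroring what we see with the real file.
> print(tbl.to_pandas()["observations"].iloc[0][0])
> {code}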
>  
> We notice that if we take the same table, write it back out to a file, and re-read it, the underlying type of the child (checked via chunk(0).values as above) changes from *Timestamp* to *datetime.datetime*, and .to_pandas() then converts correctly.
> {code:java}
> pq.write_table(tbl, "output.parquet")
> tbl2=pq.read_table("output.parquet")
> tbl2.column("observations").chunk(0).values[0]
> Out[17]: <pyarrow.StructScalar: {'date': datetime.datetime(2003, 1, 1, 0, 0), 'value': 249.523242}>
> tbl2.column("observations").chunk(0).to_pandas()
> Out[18]: 
> 0        [{'date': 2003-01-01 00:00:00, 'value': 249.52...
> 1        [{'date': 2008-01-01 00:00:00, 'value': 29.741...
> 2        [{'date': 2000-01-01 00:00:00, 'value': 2.3454...
> 3        [{'date': 2006-01-01 00:00:00, 'value': 1.2048...
> 4        [{'date': 2008-01-01 00:00:00, 'value': 196546...
>                                ...                        
> 29489    [{'date': 2010-01-01 00:00:00, 'value': 19.155...
> 29490    [{'date': 2012-04-30 00:00:00, 'value': 0.0}, ...
> 29491    [{'date': 2012-04-30 00:00:00, 'value': 0.0}, ...
> 29492    [{'date': 2012-04-30 00:00:00, 'value': 0.0}, ...
> 29493    [{'date': 2012-04-30 00:00:00, 'value': 10.0},...
> Length: 29494, dtype: object
> tbl2.to_pandas()["observations"].iloc[0][0]
> Out[8]: {'date': datetime.datetime(2003, 1, 1, 0, 0), 'value': 249.523242}
> # date remains as datetime.datetime{code}
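>  
> Based on that, a round trip through Parquet looks like a usable workaround. Below is a sketch using an in-memory buffer rather than a file on disk; we have only tried the on-disk variant ourselves, so the buffer version is an assumption.
> {code:java}
> import io
> import pyarrow as pa
> import pyarrow.parquet as pq
> def roundtrip(tbl: pa.Table) -> pa.Table:
>     # Write the table to an in-memory buffer and read it back; as shown
>     # above, the re-read table's nested 'date' child then converts to
>     # datetime.datetime under .to_pandas().
>     buf = io.BytesIO()
>     pq.write_table(tbl, buf)
>     buf.seek(0)
>     return pq.read_table(buf)
> {code}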
>  
> Thanks in advance, and apologies if I have not followed the issue protocol on this board.
> If there is simply a parameter we need to pass to .to_pandas() to get the correct conversion (I can see date_as_object/timestamp_as_object, but these have no effect), we would like to know.
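>  
> For completeness, this is how we passed those options; neither changed the nested struct behaviour for us:
> {code:java}
> # Both keyword arguments are accepted by Table.to_pandas(), but the
> # 'date' child of the struct still comes back as a plain int on 3.0.0.
> df = tbl.to_pandas(date_as_object=True, timestamp_as_object=True)
> print(df["observations"].iloc[0][0])
> {code}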



--
This message was sent by Atlassian Jira
(v8.3.4#803005)