Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/02/07 16:33:00 UTC

[jira] [Updated] (ARROW-3650) [Python] Mixed column indexes are read back as strings

     [ https://issues.apache.org/jira/browse/ARROW-3650?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney updated ARROW-3650:
--------------------------------
    Fix Version/s: 0.14.0

> [Python] Mixed column indexes are read back as strings 
> -------------------------------------------------------
>
>                 Key: ARROW-3650
>                 URL: https://issues.apache.org/jira/browse/ARROW-3650
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.11.1
>            Reporter: Armin Berres
>            Priority: Major
>              Labels: parquet
>             Fix For: 0.14.0
>
>
> Consider the following example: 
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['a string', pd.to_datetime('2018/01/02')])
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'test.parquet')
> ref_df = pq.read_pandas('test.parquet').to_pandas()
> print(df.columns)
> # Index(['a string', 2018-01-02 00:00:00], dtype='object')
> print(ref_df.columns)
> # Index(['a string', '2018-01-02 00:00:00'], dtype='object')
> {code}
> The serialized data frame has a column index containing both a string and a datetime label (this happened when resetting the index of a formerly datetime only column).
> When reading the file back, the datetime label is converted into a string.
> When looking at the schema I find {{"pandas_type": "mixed", "numpy_type": "object"}} before serializing and {{"pandas_type": "unicode", "numpy_type": "object"}} after reading back. So the schema was aware of the mixed type but did not store the actual types.
> The same happens with other types like numbers as well. One can produce interesting situations:
> {{pd.DataFrame(1, index=[pd.to_datetime('2018/01/01')], columns=['1', 1])}} can be written but fails to be read back, as the column index is no longer unique, with '1' showing up twice.
> If this is not a bug but expected behavior, maybe the user should somehow be warned that information is lost? For example with a {{NotImplemented}} exception.
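
Until such a warning exists, a caller could check the column labels before writing. The following is only a minimal sketch: check_column_labels is a hypothetical helper, not part of pandas or pyarrow, and it assumes that non-string labels end up stored as their str() form, which matches the output shown in the report but is not confirmed here as the exact conversion used. It takes any iterable of labels (for example df.columns) and returns the labels that would lose their type, plus the string names that would collide.

```python
from collections import Counter
from datetime import datetime


def check_column_labels(labels):
    """Report column labels that would not survive being stored as strings.

    Hypothetical helper, not part of pandas or pyarrow. `labels` is any
    iterable of column labels, for example `df.columns`.
    Assumption: non-string labels are stored as str(label).
    """
    labels = list(labels)
    # Labels that are not already strings lose their type when stringified.
    non_string = [label for label in labels if not isinstance(label, str)]
    # Distinct labels can collide after conversion, e.g. '1' and 1.
    counts = Counter(str(label) for label in labels)
    collisions = sorted(name for name, n in counts.items() if n > 1)
    return non_string, collisions


# First example from the report: a string label next to a datetime label.
non_string, collisions = check_column_labels(["a string", datetime(2018, 1, 2)])
print(non_string)   # [datetime.datetime(2018, 1, 2, 0, 0)]
print(collisions)   # []

# Second example from the report: '1' and 1 become the same name.
non_string, collisions = check_column_labels(["1", 1])
print(non_string)   # [1]
print(collisions)   # ['1']
```

A caller could raise or warn when either returned list is non-empty, which is roughly the behavior the report asks the library to provide.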



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)