Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/08/20 19:05:00 UTC
[jira] [Commented] (ARROW-6302) [Python] parquet categorical support doesn't preserve order

    [ https://issues.apache.org/jira/browse/ARROW-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911662#comment-16911662 ] 

Joris Van den Bossche commented on ARROW-6302:
----------------------------------------------

cc [~wesmckinn] This was caught while adding tests in pandas (https://github.com/pandas-dev/pandas/pull/28018)

So it seems that the `ordered` attribute, which is correctly set when converting pandas to Arrow, is not restored when reading from Parquet. Given that this flag is also stored in the pandas metadata (see `"ordered": true` in the output below), it should be possible to preserve it.

{code}
In [45]: df = pd.DataFrame({"a": pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)})

In [46]: table = pa.table(df)

In [47]: table
Out[47]:
pyarrow.Table
a: dictionary<values=string, indices=int8, ordered=1>
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "categorical", "numpy_type": "in'
            b't8", "metadata": {"num_categories": 3, "ordered": true}}], "crea'
            b'tor": {"library": "pyarrow", "version": "0.14.1.dev304+g7478fac1'
            b'b.d20190816"}, "pandas_version": "0.25.0+193.gd7c4f0d15"}'}

In [48]: import pyarrow.parquet as pq

In [49]: pq.write_table(table, 'test_categorical_ordered.parquet')

In [50]: result = pq.read_table('test_categorical_ordered.parquet')

In [51]: result
Out[51]:
pyarrow.Table
a: dictionary<values=string, indices=int32, ordered=0>
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
            b'ield_name": "a", "pandas_type": "categorical", "numpy_type": "in'
            b't8", "metadata": {"num_categories": 3, "ordered": true}}], "crea'
            b'tor": {"library": "pyarrow", "version": "0.14.1.dev304+g7478fac1'
            b'b.d20190816"}, "pandas_version": "0.25.0+193.gd7c4f0d15"}'}
{code}
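
As a possible stopgap until the reader restores the flag itself, the `"ordered": true` entry that survives in the `b'pandas'` metadata above could be read back by hand. The following is only a sketch under assumptions: the helper name `ordered_categorical_columns` is hypothetical (it is not part of pyarrow or pandas), it uses just the standard library, and it relies only on the JSON layout visible in the output above.

```python
import json


def ordered_categorical_columns(schema_metadata):
    """Return the names of columns that the b'pandas' metadata marks as
    ordered categoricals. Hypothetical helper, not a pyarrow/pandas API."""
    raw = (schema_metadata or {}).get(b"pandas")
    if raw is None:
        return []
    pandas_meta = json.loads(raw.decode("utf-8"))
    ordered = []
    for col in pandas_meta.get("columns", []):
        col_meta = col.get("metadata") or {}
        if col.get("pandas_type") == "categorical" and col_meta.get("ordered"):
            ordered.append(col["name"])
    return ordered


# Metadata abridged from the output above: only the "columns" entry is kept.
example_metadata = {
    b"pandas": (
        b'{"columns": [{"name": "a", "field_name": "a", '
        b'"pandas_type": "categorical", "numpy_type": "int8", '
        b'"metadata": {"num_categories": 3, "ordered": true}}]}'
    )
}

print(ordered_categorical_columns(example_metadata))  # prints ['a']
```

With a real table, `result.schema.metadata` would be the dict to pass in, and after `result.to_pandas()` each returned column could be marked ordered again with `Series.cat.as_ordered()`. This only works around the symptom; the actual fix would be for the Parquet reader to set `ordered` on the dictionary type.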

> [Python] parquet categorical support doesn't preserve order
> -----------------------------------------------------------
>
>                 Key: ARROW-6302
>                 URL: https://issues.apache.org/jira/browse/ARROW-6302
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.0
>            Reporter: Galuh Sahid
>            Priority: Major
>
> In pandas, I tried roundtripping to parquet with {{to_parquet}} and {{read_parquet}}. It preserves categorical dtypes but does not preserve their order.
> {code:python}
> import pandas as pd
> from pandas.io.parquet import read_parquet, to_parquet
> df = pd.DataFrame()
> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
> df.to_parquet(<path>)
> actual = read_parquet(<path>)
> df["a"]
> 0    NaN
> 1      b
> 2      c
> 3    NaN
> Name: a, dtype: category
> Categories (3, object): [b < c < d]
> actual["a"]
> 0    NaN
> 1      b
> 2      c
> 3    NaN
> Name: a, dtype: category
> Categories (3, object): [b, c, d]
> {code}
>  


