Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/08/20 19:05:00 UTC
[jira] [Commented] (ARROW-6302) [Python] parquet categorical support doesn't preserve order
[ https://issues.apache.org/jira/browse/ARROW-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16911662#comment-16911662 ]
Joris Van den Bossche commented on ARROW-6302:
----------------------------------------------
cc [~wesmckinn] This was caught while adding tests in pandas (https://github.com/pandas-dev/pandas/pull/28018).
So it seems that the `ordered` attribute, which is correctly set when converting from pandas to Arrow, is not restored when reading from Parquet. Given that the flag is also stored in the pandas metadata (`"ordered": true` in the output below), it should be possible to preserve it.
{code}
In [45]: df = pd.DataFrame({"a": pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)})
In [46]: table = pa.table(df)
In [47]: table
Out[47]:
pyarrow.Table
a: dictionary<values=string, indices=int8, ordered=1>
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "categorical", "numpy_type": "in'
b't8", "metadata": {"num_categories": 3, "ordered": true}}], "crea'
b'tor": {"library": "pyarrow", "version": "0.14.1.dev304+g7478fac1'
b'b.d20190816"}, "pandas_version": "0.25.0+193.gd7c4f0d15"}'}
In [48]: import pyarrow.parquet as pq
In [49]: pq.write_table(table, 'test_categorical_ordered.parquet')
In [50]: result = pq.read_table('test_categorical_ordered.parquet')
In [51]: result
Out[51]:
pyarrow.Table
a: dictionary<values=string, indices=int32, ordered=0>
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 4, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "categorical", "numpy_type": "in'
b't8", "metadata": {"num_categories": 3, "ordered": true}}], "crea'
b'tor": {"library": "pyarrow", "version": "0.14.1.dev304+g7478fac1'
b'b.d20190816"}, "pandas_version": "0.25.0+193.gd7c4f0d15"}'}
{code}
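To illustrate why preserving the flag looks feasible, here is a minimal sketch that reads the `ordered` value back out of the pandas metadata shown above. It uses only the standard library; the helper name `ordered_flags` is hypothetical, and the metadata bytes are a trimmed copy of the `"columns"` entry from the output, not the full blob. This is not how pyarrow handles it internally, just a demonstration that the information survives the roundtrip.

```python
import json

# Pandas metadata as it appears under the b'pandas' key in the output above,
# trimmed to the "columns" entry that matters here (assumption: the other
# keys are irrelevant for this check).
pandas_metadata = (
    b'{"columns": [{"name": "a", "field_name": "a", '
    b'"pandas_type": "categorical", "numpy_type": "int8", '
    b'"metadata": {"num_categories": 3, "ordered": true}}]}'
)


def ordered_flags(raw_metadata):
    """Map each categorical column name to its stored `ordered` flag.

    Hypothetical helper for illustration only.
    """
    columns = json.loads(raw_metadata.decode("utf-8"))["columns"]
    return {
        col["name"]: bool((col.get("metadata") or {}).get("ordered", False))
        for col in columns
        if col["pandas_type"] == "categorical"
    }


flags = ordered_flags(pandas_metadata)
print(flags)
```

With a real table, the bytes would presumably come from `result.schema.metadata[b'pandas']`, and a user-side workaround could re-apply the flag after conversion with something like `df["a"].cat.as_ordered()`.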
> [Python] parquet categorical support doesn't preserve order
> -----------------------------------------------------------
>
> Key: ARROW-6302
> URL: https://issues.apache.org/jira/browse/ARROW-6302
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.15.0
> Reporter: Galuh Sahid
> Priority: Major
>
> In pandas, I tried roundtripping to parquet with {{to_parquet}} and {{read_parquet}}. The roundtrip preserves the categorical dtype but not its {{ordered}} flag.
> {code:python}
> import pandas as pd
> from pandas.io.parquet import read_parquet, to_parquet
> df = pd.DataFrame()
> df["a"] = pd.Categorical(["a", "b", "c", "a"], categories=["b", "c", "d"], ordered=True)
> df.to_parquet(<path>)
> actual = read_parquet(<path>)
> df["a"]
> 0 NaN
> 1 b
> 2 c
> 3 NaN
> Name: a, dtype: category
> Categories (3, object): [b < c < d]
> actual["a"]
> 0 NaN
> 1 b
> 2 c
> 3 NaN
> Name: a, dtype: category
> Categories (3, object): [b, c, d]
> {code}
>