Posted to issues@arrow.apache.org by "Karl Dunkle Werner (JIRA)" <ji...@apache.org> on 2019/06/02 16:59:00 UTC

[jira] [Created] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

Karl Dunkle Werner created ARROW-5480:
-----------------------------------------

             Summary: [Python] Pandas categorical type doesn't survive a round-trip through parquet
                 Key: ARROW-5480
                 URL: https://issues.apache.org/jira/browse/ARROW-5480
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.13.0, 0.11.1
         Environment: python: 3.7.3.final.0
python-bits: 64
OS: Linux
OS-release: 5.0.0-15-generic
machine: x86_64
processor: x86_64
byteorder: little
pandas: 0.24.2
numpy: 1.16.4
pyarrow: 0.13.0

            Reporter: Karl Dunkle Werner


Writing a string categorical column from pandas to parquet and reading it back yields a string (object dtype) column. I expected it to be read back as category.
The same thing happens if the category is numeric -- a numeric category is read back as int64.

In the code below, I first tried an in-memory Arrow Table, which successfully converts categories back to pandas. However, when I write to a parquet file and read it back, the category dtype is lost.

In the scheme of things, this isn't a big deal, but it's a small surprise.


{code:python}
import pandas as pd
import pyarrow as pa


df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
df.dtypes  # category

# This works:
pa.Table.from_pandas(df).to_pandas().dtypes  # category

df.to_parquet("categories.parquet")
# This reads back object, but I expected category
pd.read_parquet("categories.parquet").dtypes  # object


# Numeric categories have the same issue:
df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
df_num.dtypes  # category

pa.Table.from_pandas(df_num).to_pandas().dtypes  # category

df_num.to_parquet("categories_num.parquet")
# This reads back int64, but I expected category
pd.read_parquet("categories_num.parquet").dtypes  # int64
{code}
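
A possible workaround, sketched here under assumptions rather than offered as a fix for the underlying issue: record the categorical dtypes before writing and re-apply them after reading. The names `saved_dtypes` and `read_back` are made up for this illustration, and `read_back` is a hand-built stand-in for the result of `pd.read_parquet(...)` so the snippet runs without a parquet engine.

```python
import pandas as pd

original = pd.DataFrame({"x": pd.Categorical(["a", "a", "b", "b"])})

# Remember the dtype of every categorical column before writing.
categorical_columns = original.select_dtypes(include="category").columns
saved_dtypes = {name: original[name].dtype for name in categorical_columns}

# Stand-in for pd.read_parquet("categories.parquet"): the category dtype
# is gone and the column holds plain strings (assumption for illustration).
read_back = pd.DataFrame({"x": ["a", "a", "b", "b"]})

# Re-apply the saved dtypes; DataFrame.astype accepts a column -> dtype mapping
# and returns a new DataFrame.
restored = read_back.astype(saved_dtypes)
restored.dtypes
```

Re-applying the saved dtype, rather than calling `astype("category")`, should also keep unused categories and any ordering, since those are part of the dtype. The obvious cost is that the dtypes have to be stored somewhere alongside the file.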
