You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@arrow.apache.org by "Wes McKinney (JIRA)" <ji...@apache.org> on 2019/08/02 02:05:00 UTC

[jira] [Commented] (ARROW-5480) [Python] Pandas categorical type doesn't survive a round-trip through parquet

    [ https://issues.apache.org/jira/browse/ARROW-5480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16898482#comment-16898482 ] 

Wes McKinney commented on ARROW-5480:
-------------------------------------

One slightly higher level issue is the extent to which we store Arrow schema information in the Parquet metadata. I have been thinking that we should actually store the whole serialized schema in the Parquet footer as an IPC message, so that we can refer to it when reading the file to set various read options

> [Python] Pandas categorical type doesn't survive a round-trip through parquet
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-5480
>                 URL: https://issues.apache.org/jira/browse/ARROW-5480
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>    Affects Versions: 0.11.1, 0.13.0
>         Environment: python: 3.7.3.final.0
> python-bits: 64
> OS: Linux
> OS-release: 5.0.0-15-generic
> machine: x86_64
> processor: x86_64
> byteorder: little
> pandas: 0.24.2
> numpy: 1.16.4
> pyarrow: 0.13.0
>            Reporter: Karl Dunkle Werner
>            Priority: Minor
>
> Writing a string categorical variable to from pandas parquet is read back as string (object dtype). I expected it to be read as category.
> The same thing happens if the category is numeric -- a numeric category is read back as int64.
> In the code below, I tried out an in-memory arrow Table, which successfully translates categories back to pandas. However, when I write to a parquet file, it's not.
> In the scheme of things, this isn't a big deal, but it's a small surprise.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({'x': pd.Categorical(['a', 'a', 'b', 'b'])})
> df.dtypes  # category
> # This works:
> pa.Table.from_pandas(df).to_pandas().dtypes  # category
> df.to_parquet("categories.parquet")
> # This reads back object, but I expected category
> pd.read_parquet("categories.parquet").dtypes  # object
> # Numeric categories have the same issue:
> df_num = pd.DataFrame({'x': pd.Categorical([1, 1, 2, 2])})
> df_num.dtypes # category
> pa.Table.from_pandas(df_num).to_pandas().dtypes  # category
> df_num.to_parquet("categories_num.parquet")
> # This reads back int64, but I expected category
> pd.read_parquet("categories_num.parquet").dtypes  # int64
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)