Posted to issues@arrow.apache.org by "Chris Roat (Jira)" <ji...@apache.org> on 2021/01/07 01:25:00 UTC

[jira] [Created] (ARROW-11157) Consistent handling of categoricals

Chris Roat created ARROW-11157:
----------------------------------

             Summary: Consistent handling of categoricals
                 Key: ARROW-11157
                 URL: https://issues.apache.org/jira/browse/ARROW-11157
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 2.0.0
            Reporter: Chris Roat


What is the current state of categoricals in pyarrow? The `categories` parameter mentioned in [this GitHub issue|https://github.com/apache/arrow/issues/1688] no longer seems to be accepted by `pd.read_parquet`. In the test below, a pyarrow write/read round trip preserves `str` categoricals but not `int` categoricals, and pyarrow preserves neither when reading a file written by fastparquet.

Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following handling of categoricals:

 
{code:python}
import os

import pandas as pd

fname = '/tmp/tst'

# One categorical column with integer categories, one with string categories.
data = {
    'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
}
df = pd.DataFrame(data)

# Round-trip the frame through every writer/reader engine combination.
for write in ['fastparquet', 'pyarrow']:
    for read in ['fastparquet', 'pyarrow']:
        # Start from a clean file for each combination.
        if os.path.exists(fname):
            os.remove(fname)
        df.to_parquet(fname, engine=write, compression=None)
        df_read = pd.read_parquet(fname, engine=read)

        print()
        print('write:', write, 'read:', read)
        # True means the column's dtype compares equal before and after the round trip.
        for t in data.keys():
            print(t, df[t].dtype == df_read[t].dtype)
{code}

{noformat}
write: fastparquet read: fastparquet
int True
str True
write: fastparquet read: pyarrow
int False
str False
write: pyarrow read: fastparquet
int True
str True
write: pyarrow read: pyarrow
int False
str True
{noformat}
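
The `True`/`False` values above only say whether the dtypes compare equal, not what dtype actually came back. A minimal sketch of a helper that reports both sides could look like the following. The names `describe_dtype` and `dtype_report` are hypothetical and not part of pandas or pyarrow; the code assumes only that a categorical dtype exposes `categories` and `ordered` attributes, as `pd.CategoricalDtype` does, and that a frame supports `.columns` and `frame[name].dtype`.

```python
def describe_dtype(dtype):
    """Return a short description of a dtype, expanding categoricals.

    Assumption: categorical dtypes expose `categories` (an index with its
    own `dtype`) and `ordered`; anything else falls back to str(dtype).
    """
    categories = getattr(dtype, "categories", None)
    if categories is None:
        return str(dtype)
    return "category[{}, ordered={}]".format(categories.dtype, dtype.ordered)


def dtype_report(original, roundtripped):
    """Map each column name to (dtype before writing, dtype after reading)."""
    return {
        name: (
            describe_dtype(original[name].dtype),
            describe_dtype(roundtripped[name].dtype),
        )
        for name in original.columns
    }
```

In the loop above, `print(dtype_report(df, df_read))` after each read would show, per column, the dtype that was written next to the dtype that was read back, which may help narrow down where the categorical information is lost.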


